Perl Exercises (5)


These exercises are part of a Perl course taught at CNTS - Computational Linguistics at the University of Antwerp.

If you are a course participant, please send the solutions to these exercises to zavrel@uia.ua.ac.be before or on Wednesday March 8th, 2000. Note that only the first three exercises are obligatory. When you submit your results, please include the Perl code you have written, the result of at least one test, and the answers to the questions mentioned in the exercise.

Exercise 5.1 (set difference and intersection)

Write a program that reads two lines of words (further called A and B), stores the words from each line in a separate hash, and then prints a sorted list of words, each marked with one of the following symbols: "A" (occurs only in line A), "B" (occurs only in line B), or "AB" (occurs in both A and B). A possible solution sketch follows the example output below.

Input line A: the corpus is a collection of conversations in British English
Input line B: transcripts of the conversations are also included
Output: 
A   British
A   English
A   a
B   also
B   are
A   collection
AB  conversations
A   corpus
A   in
B   included
A   is
AB  of
AB  the
B   transcripts
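
If you want to check your approach, here is one possible sketch; it assumes whitespace-separated words and relies on Perl's default ASCII sort, which puts capitalized words before lowercase ones, exactly as in the example output.

#!/usr/bin/perl -w
use strict;

# Read line A and line B; store each line's words as hash keys.
print "Input line A: ";
my %a; $a{$_} = 1 foreach split ' ', scalar <STDIN>;
print "Input line B: ";
my %b; $b{$_} = 1 foreach split ' ', scalar <STDIN>;

# The union of the two key sets yields every word to be printed.
my %all = (%a, %b);

foreach my $word (sort keys %all) {
    # Membership in %a and/or %b determines the marker.
    my $mark = ($a{$word} ? 'A' : '') . ($b{$word} ? 'B' : '');
    print "$mark\t$word\n";
}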

Exercise 5.2a (word for word translation)

This exercise is an extension of exercise 3.2. Write a program that stores a list of word translations between two languages in a hash and translates in both directions in a word-for-word fashion. Make a small lexicon of about twenty unambiguous words that allows you to translate some simple sentences.
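
One possible sketch is given below; the lexicon entries are merely illustrative (English-Dutch), and the translation direction is guessed from the first word of the input sentence.

#!/usr/bin/perl -w
use strict;

# A tiny sample lexicon; extend it to about twenty unambiguous pairs.
my %en2nl = (
    the   => 'de',
    a     => 'een',
    man   => 'man',
    woman => 'vrouw',
    sees  => 'ziet',
    house => 'huis',
);

# The reverse direction comes for free by swapping keys and values;
# this only works because the entries are unambiguous.
my %nl2en = reverse %en2nl;

print "Sentence: ";
chomp(my $sentence = <STDIN>);
my @words = split ' ', lc $sentence;

# Guess the translation direction from the first word.
my $lex = exists $en2nl{$words[0]} ? \%en2nl : \%nl2en;

# Translate word for word; unknown words are left unchanged.
print join(' ', map { exists $lex->{$_} ? $lex->{$_} : $_ } @words), "\n";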

Exercise 5.2b*

This is a starred exercise, which means that you may skip it. Do this exercise only if you think it is interesting and you have some time left.

Extend the translation program from the previous exercise to handle ambiguous words. Hint: use multidimensional hashes and a few simple rules that look at the context of a word.
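
A possible starting point is sketched below; the data and the context rule are purely illustrative. For each ambiguous word, a hash of hashes maps context words to senses, with a default translation as fallback.

# For an ambiguous word, a nested hash lets context words pick a sense.
my %ambiguous = (
    bank => {
        river   => 'oever',   # "the bank of the river"
        money   => 'bank',    # "money in the bank"
        DEFAULT => 'bank',
    },
);

sub translate_word {
    my ($word, @context) = @_;
    return $word unless exists $ambiguous{$word};
    my $senses = $ambiguous{$word};
    # Simple context rule: the first context word with a known sense wins.
    foreach my $c (@context) {
        return $senses->{$c} if exists $senses->{$c};
    }
    return $senses->{DEFAULT};
}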

Exercise 5.3 (bigram statistics)

This is an extension of Exercise 4.4*. Write a program that takes a chunk of text as input and outputs a list of letter bigrams and unigrams from that text together with their frequencies, reverse sorted by frequency. Use hashes! A unigram is a single character; a letter bigram is a sequence of two adjacent characters. E.g. "bigram" contains the bigrams "bi ig gr ra am". Ignore case and whitespace. A possible sketch follows the example below.

Example:

Give some input text: this is the uni- and bigram   count!
Unigram frequencies:
i 4
n 3
t 3
u 2
a 2
h 2
s 2
! 1
b 1
c 1
d 1
g 1
o 1
r 1
m 1
e 1
- 1
Bigram frequencies:
un 2
is 2
th 2
bi 1
he 1
am 1
an 1
ig 1
hi 1
nd 1
co 1
ra 1
t! 1
i- 1
ni 1
gr 1
nt 1
ou 1
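
One possible sketch follows; note that entries with equal frequencies may come out in a different order than in the example, since only the counts are compared.

#!/usr/bin/perl -w
use strict;

print "Give some input text: ";
chomp(my $text = <STDIN>);

my (%uni, %bi);
# Lowercasing handles "ignore case"; splitting on whitespace guarantees
# that no bigram spans a space.
foreach my $token (split ' ', lc $text) {
    my @chars = split //, $token;
    $uni{$_}++ foreach @chars;
    $bi{ $chars[$_] . $chars[$_ + 1] }++ foreach 0 .. $#chars - 1;
}

print "Unigram frequencies:\n";
print "$_ $uni{$_}\n" foreach sort { $uni{$b} <=> $uni{$a} } keys %uni;
print "Bigram frequencies:\n";
print "$_ $bi{$_}\n" foreach sort { $bi{$b} <=> $bi{$a} } keys %bi;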

Exercise 5.4* (bigram language model)

This is a starred exercise, which means that you may skip it. Do this exercise only if you think it is interesting and you have some time left.

Extend the program from exercise 5.3 to compute the probability of a word, assuming that it is the product of all letter bigram transition probabilities in the word, as given in the following formula:

P(Word) = product_i P(Char_i | Char_{i-1})

The transition probability for a bigram xy is defined as P(y|x) = freq(xy)/freq(x), where freq(xy) is the frequency of the bigram xy and freq(x) that of the unigram x. Your program should first read some amount of text to estimate the probabilities and then ask for words and compute their probabilities.
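
One way to implement the computation is sketched below; it reuses the %uni and %bi hashes from exercise 5.3, and the helper name word_probability is just illustrative. Any word containing an unseen bigram gets probability zero (a real language model would smooth the counts to avoid this).

sub word_probability {
    my ($word, $uni, $bi) = @_;
    my @chars = split //, lc $word;
    my $p = 1;
    foreach my $i (1 .. $#chars) {
        my $x  = $chars[$i - 1];          # previous character
        my $xy = $x . $chars[$i];         # the bigram xy
        return 0 unless $uni->{$x} and $bi->{$xy};
        $p *= $bi->{$xy} / $uni->{$x};    # P(y|x) = freq(xy)/freq(x)
    }
    return $p;
}

# Example use, after counting the training text:
# print word_probability('gram', \%uni, \%bi), "\n";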


Last update: March 6, 2000. zavrel@uia.ua.ac.be