Previous | Home | Lecture Notes | Exercises | Next

 

Perl Solutions (5)


These exercise solutions are part of a Perl course taught at CNTS - Computational Linguistics at the University of Antwerp.

The programs corresponding with these exercises can be found in the appendix.

Exercise 5.1 (set difference and intersection)

Write a program that reads two lines of words (further called A and B), stores the words from each line in separate hash, and then prints a sorted list of words marked with the following symbols: "A" (occurs only in line A), "B" (occurs only in B), and "AB" (occurs both in A and B).

To do this, we must read two lines and split them into words. We put each of the lines in a hash. There are many ways to print the appropriate symbols, but in the solution given below we choose to merge the two hashes first. Then we can iterate over all the words that are in the merge and print an A if they are in the first line, and a B if they are in the second line. Because we append the two results together, we get the intersection symbol "AB" for free.

Exercise 5.2a (word for word translation)

Write a program that stores a list of word translations between two languages in a hash, and translates in both directions in a word for word fashion. Make a small lexicon of about twenty words that allows you to translate some simple sentences.

We get the reverse translation by reversing the first translation hash. Then we read lines, split them into words, and look up each word in the hashes. For unknown words we print the original string between brackets. It is a word for word translation with little elegance.

Exercise 5.2b*

Extend the translation program from the previous exercise to handle ambiguous words. Hint: use multidimensional hashes and a few simple rules that look at the context of a word.

We indicate the senses of a word in the dictionary with a "_number" suffix. If we do not know a word's direct translation, we first look it up in a hash of ambiguous words and decide on the translation on the basis of a very simple heuristic, looking at the word to the right of it. If nothing matches we give the DEFAULT sense. Of course this is not how you want to do some serious translation, but there are in fact products which do little more.

Exercise 5.3 (bigram statitics)

Write a program that takes a chunk of text as input, and outputs a list of letter bigrams and unigrams from that text together with their frequency, reverse sorted by alphabet. Use hashes! A unigram is a single letter character, a letter bigram is a sequence of two adjacent characters. E.g. "bigram" contains the bigrams "bi ig gr ra am". Ignore case and whitespace.

The program below is a straightforward extension of last week's solution with hashes subsituted for lists.

Exercise 5.4* Bigram language model

Extend the program from exercise 5.3 to compute the probability of a word, assuming the probability of a word is the product of all letter bigram transition probabilities, as given in the following formula:

P(Word) = product_i P(Char_i | Char_i-1)

The bigram transition probabilities for a bigram xy are defined as: P(y|x) = freq(xy)/freq(x). Where freq(xy) is the frequency of a bigram xy, and freq(x) that of unigram x. Your program should first read some amount of text to estimate the probabilities and then ask for words and compute their probabilities.

To make a language model we also need transition probabilities from the start of a word to its first letter, so we need to keep track of those as well. Once we have all probabilities, the value for a word is a simple product. Note that an unseen bigram will produce a zero term, and we want to avoid this, so we add one to all frequencies. The solution below is stolen from Gert Durieux :-)

Appendix

Exercise 5.1

# exercise5.1.pl
# 2000-03-10 zavrel@uia.ua.ac.be

print "Input line A:";
$line_A = <>;                     # read A
@linea  = split /\s+/, $line_A;   # get the words
while($word = pop @linea){        # iterate over words
    if($word =~ /\w/){            # filter out all non-words
       $hash_A{$word}=1;          # put into hash
    }
}

print "Input line B:";            # same as A
$line_B = <>;
@lineb  = split /\s+/, $line_B;
while($word = pop @lineb){
    $hash_B{$word}=1;
}

%hash_merge = ( %hash_A , %hash_B ); # do the merge

foreach $key ( sort keys %hash_merge ){

    if(exists($hash_A{$key})){
	print "A";
    }
    if(exists($hash_B{$key})){
	print "B";
    }    
    print "\t$key\n";
}

Exercise 5.2

# exercise5.2.pl
# 2000-03-10 zavrel@uia.ua.ac.be

%dutch2english = (
		  Jan => "John",
		  ziet => "sees",
		  Marie => "Mary",
		  de => "the",
		  kinderen => "children"
		  );

%english2dutch = reverse %dutch2english;

do{

    print "Give a sentence in Dutch or English for translation:\n> ";
    $line = <>;
    if($line =~ /\w/){
	
	$line =~ s/^\s*//;
	$line =~ s/\s*$//;

	@words = split /\s+/, $line;
	
	foreach $word ( @words ){
	    if(exists($dutch2english{$word})){
		print "$dutch2english{$word} ";
	    }
	    elsif(exists($english2dutch{$word})){
		print "$english2dutch{$word} ";
	    }
	    else{
		print "[$word] "; # unknown
	    }
	}
	print "\n";
    }
}while($line =~ /\w/);

Exercise 5.2b

# exercise5.2b.pl
# 2000-03-10 zavrel@uia.ua.ac.be

%dutch2english = (
		  Jan => "John",
		  ziet => "sees",
		  Marie => "Mary",
		  kinderen => "children",
		  boek => "book",
		  de => "the_1",   # the subscripts denote various senses
		  het => "the_2",
		  haar_1 => "hair",
		  haar_2 => "her",
		  );

%ambiword_english = (
		     the => { 
			      DEFAULT => "de",  # default sense
			      book => "het"     # simple look-right heuristics
			      }
		     );

%ambiword_dutch = (
		   haar => { 
		       DEFAULT => "hair",
		       kinderen => "her" ,
		       boek => "her",
		       haar => "her",
		       }
		   );

%english2dutch = reverse %dutch2english;

do{

    print "Give a sentence in Dutch or English for translation:\n> ";
    $line = <>;
    if($line =~ /\w/){
	
	$line =~ s/^\s*//;
	$line =~ s/\s*$//;

	@words = split /\s+/, $line;

	$position = 0;
	foreach $word ( @words ){
	    if(exists($dutch2english{$word})){
		$trans = $dutch2english{$word};
		$trans =~ s/_\d+//; # strip ambiguity marker
		print "$trans ";
	    }
	    elsif(exists($english2dutch{$word})){
		$trans = $english2dutch{$word};
		$trans =~ s/_\d+//; 
		print "$trans ";
	    }
	    else{

		# look up whether it's not one of these
		#
		if(exists($ambiword_dutch{$word})){

		    # resolve the ambiguity if possible
		    if(exists($ambiword_dutch{$word}{$words[$position+1]})){
			$trans = $ambiword_dutch{$word}{$words[$position+1]};
			print "$trans ";
		    }
		    else{
			$trans = $ambiword_dutch{$word}{"DEFAULT"};
			print "$trans ";
		    }
		}
		# or one of those
		elsif(exists($ambiword_english{$word})){

		    # resolve the ambiguity if possible
		    if(exists($ambiword_english{$word}{$words[$position+1]})){
			$trans = $ambiword_english{$word}{$words[$position+1]};
			print "$trans ";
		    }
		    else{
			$trans = $ambiword_english{$word}{"DEFAULT"};
			print "$trans ";
		    }		    
		}
		else{
		    print "[$word] "; # unknown
		}
	    }
	    $position++;
	}
	print "\n";
    }
}while($line =~ /\w/);



Exercise 5.3

# exercise5.3.pl: 
# 2000-03-10 zavrel@uia.ua.ac.be

print "Give some input text:";

while(defined($line = <>)){

    chomp $line;
    $line = lc($line);
    $line =~ s/^\s*//;
    $line =~ s/\s*$//;

    @wordline = split /\s+/, $line;
    foreach $word (@wordline){
	
	@letterline = split //, $word;
    
	for($i=0; $i <= $#letterline; $i++){
	    $unigramfrequency{$letterline[$i]}++;
	    if($i < $#letterline ){
		$bigram = $letterline[$i] . $letterline[$i+1];
		$bigramfrequency{$bigram}++;
	    }
	}
    }
}

print "Unigrams frequencies:\n";
foreach $key (sort { $unigramfrequency{$b} <=> $unigramfrequency{$a} } keys %unigramfrequency){
    print "$key $unigramfrequency{$key}\n";
}
print "Bigrams frequencies:\n";
foreach $key (sort { $bigramfrequency{$b} <=> $bigramfrequency{$a} }keys %bigramfrequency){
    print "$key $bigramfrequency{$key}\n";
}

Exercise 5.4*

# Exercise 5.4 -- bigram language mode
#
# usage:	perl exc54.pl textfile
#
# (c) Gert Durieux


# read textfile from command line
while (<>) {
	@line = map lc, split //, $_;
	for ($i = 0; $i < @line; $i++) {

		# ignore spaces
		if ($line[$i] =~ /\S/) {
			$unigram = $line[$i];
			$unigrams{$unigram}++;
			if ($line[$i+1] =~ /\S/) {
				$bigram = $line[$i] . $line[$i+1];
				$bigrams{$bigram}++;
			}
		}

		# but keep track of word beginnings
		if ($i+1 < @line) {
			$bigram = $line[$i] . $line[$i+1];
			if ($bigram =~ /^.\b/) {
				$unigram = "^";
				$unigrams{$unigram}++;
				$bigram = "^" . $line[$i+1];
				$bigrams{$bigram}++;
			}
		}

	}
}

# process user input
print "Enter some words, line by line. Finish with ^D\n> ";
while (<>) {
	$line = $_;
	$line =~ s/^\s+//;
	$line =~ s/\s+$//;

	@words = split /\s+/, $line;
	foreach $word (@words) {

		# split words into letters and add BOS
		@letters = map lc, split //, $word;
		@letters = ("^", @letters);

		$probability = 1;
		for ($i = 0; $i < $#letters; $i++) {

			$unigram = $letters[$i];
			$bigram = $letters[$i] . $letters[$i+1];

			# smooth by adding 1 to each uni- and bigram freq
			if (exists($unigrams{$unigram})) {
				$ufreq = $unigrams{$unigram} + 1;
			} else {
				$ufreq = 1;
			}

			if (exists($bigrams{$bigram})) {
				$bfreq = $bigrams{$bigram} + 1;
			} else {
				$bfreq = 1;
			}

			$probability *= $bfreq / $ufreq;
		}
		print "$probability\n> ";
	}
}

print "\n";
exit(0);


Previous | Home | Lecture Notes | Exercises | Next
Last update: March 10, 2000. zavrel@uia.ua.ac.be