
 

Perl Solutions (9)


These exercise solutions are part of a Perl course taught at CNTS - Language Technology Group at the University of Antwerp.

The programs corresponding to these exercises can be found in the appendix.

Exercise 9.1

Write a non-empty program that prints itself.

The program is stored in a file. It reads this file from disk, line by line, and prints each line. I have tested the program and it worked: it printed itself.
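The appendix version opens its own file under a fixed name, so it only works when the file is called exercise91. As a small sketch of an alternative, the script below asks Perl for its own name through the special variable $0. The subroutine name read_lines is made up for this illustration, and the -f test is an added precaution for the case where the code is not started from a regular file.

```perl
use strict;
use warnings;

# Return the lines of a file as a list; used here to read the script itself.
sub read_lines {
    my ($path) = @_;
    open(my $fh, '<', $path) or die("cannot open $path: $!");
    my @lines = <$fh>;
    close($fh);
    return @lines;
}

# $0 holds the name under which the script was started.
# Skip printing when that is not a regular file (for example with perl -e).
print read_lines($0) if -f $0;
```

Because the file name is no longer hard-coded, the script keeps printing itself after it has been renamed.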

Exercise 9.2

Write a program that reads a file and counts how many characters, words and lines are in the file.

The program reads the file, line by line, and increments the three count variables for each line. The number of words is determined by using the tokenize code from exercise 8.5*. Output of the program for oliver.txt:

   Found 1000 lines, 8198 words and 47742 characters.
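The counting step can be sketched as a subroutine. The tokenize code of exercise 8.5* is not repeated here, so this sketch splits each line on white space instead. That is an assumption: punctuation stays attached to the words, and the counts may therefore differ from the ones reported above. The name count_text and the two sample lines are made up for illustration.

```perl
use strict;
use warnings;

# Count lines, words and characters in a list of lines.
# ASSUMPTION: splitting on white space stands in for the tokenize routine
# of exercise 8.5*, so the word count can differ from the course program.
sub count_text {
    my @lines = @_;
    my ($chars, $words, $lines) = (0, 0, 0);
    foreach my $line (@lines) {
        $chars += length($line);    # the newline character is counted too
        foreach my $token (split(' ', $line)) {
            # only tokens that contain a letter count as words
            $words++ if $token =~ /[a-zA-Z]/;
        }
        $lines++;
    }
    return ($lines, $words, $chars);
}

my ($l, $w, $c) = count_text("Oliver Twist\n", "asked for more .\n");
print "Found $l lines, $w words and $c characters.\n";
```

For the two sample lines this prints: Found 2 lines, 5 words and 30 characters.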

Exercise 9.3

Write a program that reads a file and stores a frequency list of the words in the file freq.txt.

The program reads the file and selects words just like in the previous exercise. Words are stored in a hash as keys and the values in the hash are numbers which indicate how often the words were found. We need to sort the hash but we only have a sort routine for lists. The solution is to store the hash keys in a list and sort this list on the hash values. When the list is sorted, the program prints the hash in the order determined by the list. Here are the first ten lines of the file freq.txt after processing oliver.txt:

   488 the
   242 and
   239 a
   233 of
   206 to
   146 in
   139 was
   104 his
   97 he
   86 that

The input text contains 2188 different words.
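The sorting idea described above, putting the hash keys in a list and sorting that list by the hash values, can be tried out on a tiny example. This is only a sketch: the subroutine name by_frequency and the sample words are made up, and the alphabetical tie-break is an addition that the course program does not have. Without it, words with equal counts come out in an unpredictable order.

```perl
use strict;
use warnings;

# Return the keys of a count hash, most frequent first.
# The alphabetical tie-break is an addition for reproducible output.
sub by_frequency {
    my ($count) = @_;    # reference to a hash: word => count
    return sort { $count->{$b} <=> $count->{$a} or $a cmp $b } keys %$count;
}

my %count;
$count{$_}++ foreach qw(the boy and the man and the dog);
foreach my $word (by_frequency(\%count)) {
    print "$count{$word} $word\n";
}
```

The sample prints the lines 3 the, 2 and, 1 boy, 1 dog and 1 man, in that order.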

Exercise 9.4*

Write a program that uses the file train10.txt to create a unigram model of part-of-speech tag assignment and stores this model in the file model.txt.

My program reads the file and stores the word-tag pairs as keys in a hash with the number of occurrences as values. For each word it counts how many options there are for part-of-speech tag assignment. In the output part, the program distinguishes two cases. First, only one tag can be assigned to a word. If that is the case, it puts the word-tag pair in the output file. Second, more than one tag can be assigned to the word. In that case the remaining pairs have to be checked to find the most frequently occurring tag assignment for the word.

There was only one ambiguous word in train10.txt: "have", which could receive tag VBP (5 times) or VB (2 times). The file contains 440 different words.
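The selection of the most frequent tag per word can also be written with a hash of hashes, which avoids the second pass over all word-tag pairs. The following is a sketch of that variant, not the program from the appendix. The name unigram_model and the four sample pairs are made up. Ties are broken by overall tag frequency, as in the appendix program, and then alphabetically, which is an addition.

```perl
use strict;
use warnings;

# Build a unigram model from "word tag" lines: for each word keep the tag
# it received most often; ties go to the tag that is more frequent overall,
# and remaining ties are broken alphabetically (an addition in this sketch).
sub unigram_model {
    my @pairs = @_;
    my (%count, %tagFreq, %model);
    foreach my $pair (@pairs) {
        my ($word, $tag) = split(/ /, $pair);
        $count{$word}{$tag}++;    # how often this word received this tag
        $tagFreq{$tag}++;         # how often the tag occurred at all
    }
    foreach my $word (keys %count) {
        my ($best) = sort {
            $count{$word}{$b} <=> $count{$word}{$a}
                or $tagFreq{$b} <=> $tagFreq{$a}
                or $a cmp $b
        } keys %{ $count{$word} };
        $model{$word} = $best;
    }
    return %model;
}

# made-up sample data, not taken from train10.txt
my %model = unigram_model("have VBP", "have VB", "have VBP", "the DT");
print "$_ $model{$_}\n" foreach sort keys %model;
```

For the sample data this prints have VBP followed by the DT.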

Exercise 9.5*

Write a program that uses the unigram model generated in the previous exercise for tagging the text in test10.txt.

The program reads the model generated in the previous exercise and stores it in a hash. It counts the tags and selects the most frequent one as the one to be assigned to unseen words. Then it reads the text, assigns tags to words and counts how many were correct. This is the output for file test10.txt:

   Seen 303 words; correct 217 (71.6%)

The program does not perform very well. The reason for this is that the model is small (440 words). Therefore there are many words in the text that do not occur in the model. If we remove these new words from the text, the performance of the program rises to 97.9%.
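The difference between the score on all words and the score on known words can be measured in one pass by keeping two pairs of counters. The sketch below shows one way to do this. The name evaluate, the small model and the sample pairs are made up for illustration and are not taken from model.txt or test10.txt.

```perl
use strict;
use warnings;

# Tag "word tag" lines with a model and count correct tags twice:
# over all words and over the words that occur in the model.
sub evaluate {
    my ($model, $defaultTag, @pairs) = @_;
    my ($seen, $correct, $known, $knownCorrect) = (0, 0, 0, 0);
    foreach my $pair (@pairs) {
        my ($word, $tag) = split(/ /, $pair);
        my $isKnown = exists $model->{$word};
        # unseen words receive the default tag
        my $newTag = $isKnown ? $model->{$word} : $defaultTag;
        $seen++;
        $known++ if $isKnown;
        if ($newTag eq $tag) {
            $correct++;
            $knownCorrect++ if $isKnown;
        }
    }
    return ($seen, $correct, $known, $knownCorrect);
}

# made-up model and test pairs
my %model = (the => "DT", dog => "NN", runs => "VBZ");
my @result = evaluate(\%model, "NN", "the DT", "dog NN", "runs NNS", "fast RB");
printf "all: %d/%d, known: %d/%d\n", @result[1, 0, 3, 2];
```

For the sample data this prints all: 2/4, known: 2/3.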

Appendix

Exercise 9.1

# exercise91: print this file
# usage: exercise91
# 2000-04-03 erikt@uia.ua.ac.be

use strict;
open(INFILE,"exercise91") or die("cannot open input file");
while (<INFILE>) { print; }
close(INFILE);
exit(0);

Exercise 9.2

# exercise92: count characters, words and lines in one file
# usage: exercise92
# 2000-04-03 erikt@uia.ua.ac.be

use strict;
my $chars = 0;
my $words = 0;
my $lines = 0;
my ($token,@tokens);
open(INFILE,"oliver.txt") or die("cannot open input file");
while (<INFILE>) { 
   $chars += length;
   @tokens = &tokenize($_);
   foreach $token (@tokens) {
      if ($token =~ /[a-zA-Z]/) { $words++; }
   }
   $lines++;
}
close(INFILE);
print "Found $lines lines, $words words and $chars characters.\n";
exit(0);

# tokenize subroutine originates from exercise 8.5*

Exercise 9.3

# exercise93: make word frequency list and store it in file freq.txt
# usage: exercise93
# 2000-04-03 erikt@uia.ua.ac.be

use strict;
my (%count,$i,$key,@keys,$token,@tokens);
open(INFILE,"oliver.txt") or die("cannot open input file");
while (<INFILE>) { 
   @tokens = &tokenize($_);
   foreach $token (@tokens) {
      if ($token =~ /[a-zA-Z]/) { 
         # the question mark operator works as follows:
         # A ? B : C
         # if A is true then it will return B and else it returns C
         # Here it will increment $count{$token} if it exists and
         # if it doesn't, it will fill it with 1
         $count{$token} = $count{$token} ? $count{$token}+1 : 1;
      }
   }
}
close(INFILE);

# we need to sort a hash but we can only sort lists 
# so put hash keys in list and sort the list
@keys = keys %count;
@keys = sort { $count{$b} <=> $count{$a} } @keys;
open(OUTFILE,">freq.txt") or die("cannot open output file");
for ($i=0;$i<=$#keys;$i++) { 
   print OUTFILE "$count{$keys[$i]} $keys[$i]\n";
}
close(OUTFILE);  

exit(0);

# tokenize subroutine originates from exercise 8.5*

Exercise 9.4*

# exercise94: make unigram part-of-speech assignment model
# usage: exercise94
# 2000-04-03 erikt@uia.ua.ac.be

use strict;
my ($bestCount,$bestTag,%count,$key1,$key2,%options,
    $tag1,$tag2,%tagFreq,$word1,$word2);

# read training file
open(INFILE,"train10.txt") or die("cannot open input file");
while (<INFILE>) {
   chomp($_);
   ($word1,$tag1) = split(/ /);
   if (not $count{$_}) {
      # question operator is explained in previous exercise
      $options{$word1} = $options{$word1} ? $options{$word1}+1 : 1;
   }
   $count{$_} = $count{$_} ? $count{$_}+1 : 1;
   $tagFreq{$tag1} = $tagFreq{$tag1} ? $tagFreq{$tag1}+1 : 1;
}
close(INFILE);

# generate output
open(OUTFILE,">model.txt") or die("cannot open output file");
# easy cases: only one possible tag
foreach $key1 (keys %count) {
   ($word1,$tag1) = split(/ /,$key1);
   if ($options{$word1} == 1) {
      print OUTFILE "$key1\n";
      delete($count{$key1});
   }
}
# hard cases: more than one possible tag
foreach $key1 (keys %count) {
   if ($count{$key1} > 0) {
      ($word1,$tag1) = split(/ /,$key1);
      $bestTag = $tag1;
      $bestCount = $count{$key1};
      $count{$key1} = 0;
      foreach $key2 (keys %count) {
         ($word2,$tag2) = split(/ /,$key2);
         if ($word2 eq $word1) {
            if ($count{$key2} > $bestCount or
                ($count{$key2} == $bestCount and 
                 $tagFreq{$tag2} > $tagFreq{$bestTag})) {
               $bestTag = $tag2;
               $bestCount = $count{$key2};
            }
            $count{$key2} = 0;
         }
      }
      print OUTFILE "$word1 $bestTag\n";
   }      
}
close(OUTFILE);
exit(0);

Exercise 9.5*

# exercise95: tag a text with the unigram part-of-speech model
# usage: exercise95
# 2000-04-03 erikt@uia.ua.ac.be

use strict;
my ($bestCount,$bestTag,$correct,%model,$newTag,$seen,
    $tag,%tagFreq,$word);

# read model
open(INFILE,"model.txt") or die("cannot open model file");
while (<INFILE>) {
   chomp($_);   
   ($word,$tag) = split(/ /);
   $model{$word} = $tag;
   $tagFreq{$tag} = $tagFreq{$tag} ? $tagFreq{$tag}+1 : 1;
}
close(INFILE);

# find most frequent tag (for unseen words)
$bestCount = -1;
foreach $tag (keys %tagFreq) {
   if ($tagFreq{$tag} > $bestCount) {
      $bestCount = $tagFreq{$tag};
      $bestTag = $tag;
   }
}

# process input text
open(INFILE,"test10.txt") or die("cannot open input file");
$seen = 0;
$correct = 0;
while (<INFILE>) {
   chomp($_);
   ($word,$tag) = split(/ /);
   $newTag = $model{$word} ? $model{$word} : $bestTag;
   if ($newTag eq $tag) { $correct++; }
   $seen++;
}
close(INFILE);
printf "Seen $seen words; correct $correct (%4.1f%%)\n",
   100*$correct/$seen;

exit(0);


Last update: April 06, 2000. erikt@uia.ua.ac.be