

Perl Exercises (9)

These exercises are part of a Perl course taught at CNTS - Language Technology Group at the University of Antwerp.

If you are a course participant and you want to submit your answers, please send your solutions to these exercises on or before Wednesday, April 5, 2000. Note that only the first three exercises are obligatory. When you submit your results, please include the Perl code you have written, the output of at least one test run and the answers to any questions asked in the exercise.

Exercise 9.1

All the programs you construct for these exercises must contain the line use strict at the top.

Write a non-empty program that prints itself. You may assume that the program knows the name of the file it is stored in but you may not use system. Unfortunately we cannot give an example of the required output for this exercise.

Exercise 9.2

Write a program that reads a file and counts how many characters, words and lines are in the file. It should count all characters, including newline characters, and all lines, including empty lines. Words are strings containing at least one of the characters a-zA-Z. Write your own code for tokenization or use the code posted as an answer for exercise 3.5*. Your program should be applied to the first three chapters of the novel Oliver Twist by Charles Dickens. You can find a temporary copy of this text in the file oliver.txt at:

Download this text to your own system. It was obtained from Project Gutenberg, an online database for electronic text. Here is an example of the output of the program:

   Found 100 lines, 1000 words and 10000 characters.
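
The counting rules above can be tried out on a short string before reading a real file. The following is only a minimal sketch: the function name count_text is made up for this illustration, and it assumes that tokens are separated by whitespace, which may give other word counts than the tokenizer of exercise 3.5*. Reading the text from oliver.txt is left out.

```perl
use strict;
use warnings;

# Count lines, words and characters in a text held in memory.
# Assumption: tokens are separated by whitespace.
sub count_text {
    my ($text) = @_;
    my $chars = length $text;           # includes newline characters
    my $lines = () = $text =~ /\n/g;    # one line per newline, so empty lines count
    # A token is a word if it contains at least one of a-zA-Z.
    my $words = grep { /[a-zA-Z]/ } split ' ', $text;
    return ($lines, $words, $chars);
}

my ($lines, $words, $chars) =
    count_text("Oliver Twist\n\nby Charles Dickens - 1838\n");
print "Found $lines lines, $words words and $chars characters.\n";
```

Because lines are counted by their newline characters, a final line that does not end in a newline is not counted in this sketch.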

Exercise 9.3

Write a program that reads a file and stores a frequency list of its words in the file freq.txt. Each line of the frequency list should consist of the number of times a word occurs, followed by the word itself, and the list should be sorted so that the most frequent words appear at the top. As in the previous exercise, a word should contain at least one character in the range a-zA-Z. Here is an output example:

   134 the
   89 a
   87 in

Your program should be applied to the file oliver.txt used in exercise 9.2. Mention the ten most frequent words of this file with their frequencies in the results you submit for this exercise. How many different words did your program find?
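
Counting with a hash and sorting the keys by their counts could look roughly like the sketch below. It is not a full solution: the function name frequency_lines is invented for this example, the text is a short string instead of oliver.txt, and the result is printed instead of being stored in freq.txt. It also assumes whitespace tokenization, case-sensitive counting and alphabetical order among words with equal counts, none of which is prescribed by the exercise.

```perl
use strict;
use warnings;

# Build "count word" lines, most frequent word first.
# Assumptions: whitespace tokenization, case-sensitive counting,
# alphabetical order among words with the same count.
sub frequency_lines {
    my ($text) = @_;
    my %freq;
    # Only tokens with at least one of a-zA-Z are counted as words.
    $freq{$_}++ for grep { /[a-zA-Z]/ } split ' ', $text;
    my @words = sort { $freq{$b} <=> $freq{$a} or $a cmp $b } keys %freq;
    return map { sprintf "%d %s", $freq{$_}, $_ } @words;
}

my @lines = frequency_lines("the cat saw the dog and the bird saw 2 cats");
print "$_\n" for @lines;
printf "%d different words\n", scalar @lines;
```

The number of different words is simply the number of keys in the hash, which is the same as the number of lines in the list.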

Exercise 9.4*

This is a starred exercise, which means that you may skip it. Do this exercise only if you think it is interesting and you have some time left.

The file train10.txt contains 1012 tokens of the first chapter of Oliver Twist together with their part-of-speech tags. Write a program that uses this file to create a unigram model of part-of-speech tag assignment and stores this model in the file model.txt.

Background information: a unigram model for part-of-speech tag assignment defines which part-of-speech tag has most often been assigned to a word in the training file. Your program should count how often each tag has been assigned to a word and output the word together with its most frequently assigned tag. If two or more tags occur with a word at the same frequency, the program should choose the tag which occurs most frequently with any word. If even that criterion fails to provide a winner, any of the remaining tags may be chosen. Example of a part of the output:

   have VBP
   the DT
   there EX
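
The selection rule described above can be sketched as follows. The function name build_model and the small training list are made up for this illustration, and the layout of train10.txt (assumed here to be one word and one tag per line, separated by whitespace) is a guess. The sketch reads "occurs most frequently with any word" as the total number of times a tag appears in the training data, and it resolves any remaining tie alphabetically, which is an arbitrary choice.

```perl
use strict;
use warnings;

# Choose the most frequent tag for each word. Ties go to the tag that is
# more frequent in the training data as a whole; any tie left after that
# is resolved alphabetically (an arbitrary choice).
sub build_model {
    my @pairs = @_;                  # list of [word, tag] array references
    my (%count, %tag_total);
    for my $pair (@pairs) {
        my ($word, $tag) = @$pair;
        $count{$word}{$tag}++;       # how often this tag occurs with this word
        $tag_total{$tag}++;          # how often this tag occurs at all
    }
    my %model;
    for my $word (keys %count) {
        my $tags = $count{$word};
        ($model{$word}) = sort {
               $tags->{$b} <=> $tags->{$a}
            or $tag_total{$b} <=> $tag_total{$a}
            or $a cmp $b
        } keys %$tags;
    }
    return %model;
}

# Hypothetical training data, assuming "word tag" on each line.
my @training = map { [split ' '] }
    ("the DT", "have VBP", "have VB", "are VBP", "there EX");
my %model = build_model(@training);
print "$_ $model{$_}\n" for sort keys %model;
```

In this toy data the word have occurs once with VBP and once with VB, so the overall tag counts decide in favour of VBP.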

Exercise 9.5*

The file test10.txt contains a text with 303 tokens of the first chapter of Oliver Twist together with their part-of-speech tags. Write a program that uses the unigram model generated in the previous exercise for tagging this text. The program should assign the tag mentioned in the unigram model to the corresponding word in the text. If the model does not contain an entry for a word in the text, the program should assign the most frequent tag in the model to the word. How many tags did your program get right?
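
Tagging with a fallback for unknown words could be sketched as shown below. The function name tag_and_score, the toy model and the toy test tokens are all invented for this example; reading model.txt and test10.txt is left out. The sketch takes "the most frequent tag in the model" to mean the tag that is attached to the largest number of model entries, with alphabetical order as an arbitrary tie-breaker.

```perl
use strict;
use warnings;

# Tag each word with its model entry. Unknown words receive the tag that
# labels the most entries in the model. Returns the number of tags that
# agree with the reference tags, followed by the assigned tags.
sub tag_and_score {
    my ($model, $tokens) = @_;   # hash ref; array ref of [word, reference tag]
    my %seen;
    $seen{$_}++ for values %$model;
    my ($default) = sort { $seen{$b} <=> $seen{$a} or $a cmp $b } keys %seen;
    my $correct = 0;
    my @assigned;
    for my $token (@$tokens) {
        my ($word, $reference) = @$token;
        my $tag = exists $model->{$word} ? $model->{$word} : $default;
        push @assigned, $tag;
        $correct++ if $tag eq $reference;
    }
    return ($correct, @assigned);
}

# Hypothetical model and test tokens.
my %toy_model = (the => 'DT', a => 'DT', have => 'VBP');
my @toy_test  = (['the', 'DT'], ['boys', 'NNS'], ['have', 'VBP']);
my ($right, @tags) = tag_and_score(\%toy_model, \@toy_test);
printf "%d of %d tags correct\n", $right, scalar @toy_test;
```

The word boys is not in the toy model, so it receives the default tag DT and is counted as an error against its reference tag NNS.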

Note: the tags present in the training and test files used in the last two exercises have been generated automatically and have not been checked. This means that they may contain some errors.

When you have completed exercises 9.4* and 9.5*, you have created your own part-of-speech tagger.

Last update: March 31, 2000.