
Perl Solutions (3)

These exercise solutions are part of a Perl course taught at CNTS - Computational Linguistics at the University of Antwerp.

The programs corresponding to these exercises can be found in the appendix.

Exercise 3.1

Write a program that reads a text and outputs the rot13 equivalent.

This requires a program that reads lines, stores them in a variable and replaces the characters with tr///. Here is a test example with the encrypted text of the example in the assignment:

   Enter a text, line by line. Finish with an empty line.
   > Guvf grkg fubhyq or vzcbffvoyr gb ernq
   > orpnhfr vg unf orra rapelcgrq jvgu n svar
   > rapelcgvba zrgubq.
   > 
   This text should be impossible to read
   because it has been encrypted with a fine
   encryption method.
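
The character replacement described above can be sketched as follows. This is a minimal illustration, not the program from the appendix: the function name rot13 is an assumption, and the interactive reading loop is left out so that only the tr/// step is shown.

```perl
use strict;
use warnings;

# Rotate every ASCII letter by 13 positions; all other characters are unchanged.
sub rot13 {
    my ($text) = @_;
    # Copy first, because tr/// modifies the variable it is applied to.
    (my $rotated = $text) =~ tr/A-Za-z/N-ZA-Mn-za-m/;
    return $rotated;
}

print rot13("Guvf grkg fubhyq or vzcbffvoyr gb ernq"), "\n";
# prints: This text should be impossible to read
```

Because rot13 is its own inverse, the same function both encrypts and decrypts.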

Exercise 3.2

Write a program that translates English text to text in some other language. It should at least be able to generate something reasonable for the words the, man, woman, boy, girl, telescope, sees, walks and with regardless of the context in which these words appear.

My program performs a word-by-word translation with one substitution per word. Two extra substitutions were needed because the word "the" can take more than one gender in the target language. Here is an example run:

   Enter a text, line by line. Finish with an empty line.
   > The man sees the boy with the telescope.
   > The woman walks with the girl.
   > 
   el hombre ve el chico con el telescopio.
   la mujer va con la chica.
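
A word-by-word translation of this kind might look like the sketch below. It is not the program from the appendix: the function name translate is an assumption, the Spanish words are taken from the example run, and the assumption is that the two extra substitutions handle "the woman" and "the girl" before the general rule for "the" applies.

```perl
use strict;
use warnings;

sub translate {
    my ($line) = @_;

    # Two extra substitutions: feminine nouns need "la" instead of "el",
    # so they are handled before the general rule for "the".
    $line =~ s/\bthe\s+woman\b/la mujer/gi;
    $line =~ s/\bthe\s+girl\b/la chica/gi;

    # One substitution per word.
    $line =~ s/\bthe\b/el/gi;
    $line =~ s/\bman\b/hombre/gi;
    $line =~ s/\bwoman\b/mujer/gi;
    $line =~ s/\bboy\b/chico/gi;
    $line =~ s/\bgirl\b/chica/gi;
    $line =~ s/\btelescope\b/telescopio/gi;
    $line =~ s/\bsees\b/ve/gi;
    $line =~ s/\bwalks\b/va/gi;
    $line =~ s/\bwith\b/con/gi;

    return $line;
}

print translate("The man sees the boy with the telescope."), "\n";
print translate("The woman walks with the girl."), "\n";
```

The \b anchors keep the rule for "man" from firing inside "woman", and the order of the substitutions matters: once "the" has been replaced, the gender-specific rules can no longer match.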

Exercise 3.3

Make regular expressions for the following string descriptions:

1. Strings containing only a's and b's
2. Strings that do not contain white space
3. Strings with exactly one word regardless of white space
4. Strings that end with the same character they start with
5. Like 1, but the number of a's should be even
6. Any string
7. No string, not even the empty string

I solved the fifth by requiring that the string consists of zero or more occurrences of (b*ab*ab*) and nothing else. This makes sure that the number of a's is even while any number of b's can appear in between them. Since every string contains zero or more a's, the expression a* was an appropriate solution for the sixth question. Finally, any character is either a word character (\w) or not a word character (\W), so [\w\W] matches every character, which means that [^\w\W] is a regular expression that will never match. Here are the results of a test run:

   "" matches expressions: 1 2 5 6 
   "a" matches expressions: 1 2 3 4 6 
   "ab" matches expressions: 1 2 3 6 
   "bab" matches expressions: 1 2 3 4 6 
   "aba" matches expressions: 1 2 3 4 5 6 
   "c" matches expressions: 2 3 4 6 
   "cd" matches expressions: 2 3 6 
   " cd" matches expressions: 3 6 
   "  cd " matches expressions: 3 4 6 
   "the man" matches expressions: 6

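One set of expressions that reproduces the test run above is sketched below. Expressions 5, 6 and 7 follow the description in the text; expressions 1 to 4 are assumptions, since the text does not show them, and the names @expressions and matching_expressions are hypothetical.

```perl
use strict;
use warnings;

my @expressions = (
    qr/^[ab]*$/,            # 1. only a's and b's (assumed)
    qr/^\S*$/,              # 2. no white space (assumed)
    qr/^\s*\S+\s*$/,        # 3. exactly one word, ignoring surrounding white space (assumed)
    qr/^(.)(?:.*\1)?$/s,    # 4. ends with the character it starts with (assumed)
    qr/^(?:b*ab*ab*)*$/,    # 5. zero or more occurrences of b*ab*ab*
    qr/a*/,                 # 6. any string
    qr/[^\w\W]/,            # 7. never matches
);

# Return the numbers (1-based) of the expressions that match the string.
sub matching_expressions {
    my ($string) = @_;
    return grep { $string =~ $expressions[$_ - 1] } 1 .. scalar @expressions;
}

for my $string ('', 'a', 'ab', 'bab', 'aba', 'c', 'cd', ' cd', '  cd ', 'the man') {
    my @matches = matching_expressions($string);
    print qq("$string" matches expressions: @matches\n);
}

# A possible variant of expression 5 that also accepts strings of b's only.
my $even_a_variant = qr/^b*(?:ab*ab*)*$/;
```

Note that expression 5 as described accepts the empty string but rejects a string such as "b", which has zero a's: a single occurrence of b*ab*ab* already needs two a's. The variant at the end of the sketch allows leading b's outside the repeated group and so covers that case as well.
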
Exercise 3.4*

Write a program that asks ten questions, reads the answers and shows the number of answers that were right. The questions should be like: Please enter n things in which n is an arbitrary number from 1 to 10 and things is one of the three words dollars, stars or commas. The questions should be chosen arbitrarily by the program.

I would have liked the program to perform a test like /$char{$stringLength}/, but this did not work: Perl complained about an uninitialized value. The reason is that $char{$stringLength} inside a pattern is interpolated as an element of the hash %char, not as $char followed by a quantifier (note: $x{y} is the notation for a hash element). So the program computes the answer string itself and compares it with the input with ne. Here is a test run:

    1. Please enter 8 dollars: $$$$$$$$
       Correct!
    2. Please enter 9 dollars: $$
       Wrong.
    3. Please enter 4 commas: ,,,,
       Correct!
    4. Please enter 1 star: *
       Correct!
    5. Please enter 10 stars: **
       Wrong.
    6. Please enter 6 dollars: ,,,,,,
       Wrong.
    7. Please enter 1 dollar: 
       Wrong.
    8. Please enter 8 commas: ,,,,,,,,
       Correct!
    9. Please enter 10 stars: **********
       Correct!
   10. Please enter 8 dollars: $$$$$$$$
       Correct!
   You have 6 correct answers on 10 questions.
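
The checking step can be sketched in two ways. The first function follows the approach described above: build the expected string with the repetition operator x and compare (here with eq rather than ne). The second shows one way the regular expression test could be made to work, by grouping and quoting the character so that no brace directly follows the variable name. All names (%symbol, is_correct, is_correct_regex, random_question) are assumptions, not taken from the program in the appendix.

```perl
use strict;
use warnings;

my %symbol = (dollar => '$', star => '*', comma => ',');

# Approach from the text: compute the answer string and compare it.
sub is_correct {
    my ($thing, $count, $answer) = @_;
    my $expected = $symbol{$thing} x $count;
    return $answer eq $expected;
}

# Regex variant: \Q...\E quotes $ and *, which are metacharacters, and the
# group keeps the { of the quantifier away from the variable name.
sub is_correct_regex {
    my ($thing, $count, $answer) = @_;
    my $char = $symbol{$thing};
    return $answer =~ /\A(?:\Q$char\E){$count}\z/ ? 1 : 0;
}

# Pick an arbitrary question: a number from 1 to 10 and one of the three things.
sub random_question {
    my @things = sort keys %symbol;
    my $thing  = $things[int rand @things];
    my $count  = 1 + int rand 10;
    my $label  = $count == 1 ? $thing : "${thing}s";
    return ($thing, $count, "Please enter $count $label: ");
}

my ($thing, $count, $prompt) = random_question();
print $prompt, $symbol{$thing} x $count, "\n";
```

The input line would need to be chomped before either test, because the trailing newline is not part of the expected answer.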

Exercise 3.5*

Write a word tokenizer: a program which reads a text and outputs its words in the order that they appear in the text.

This is an open-ended exercise: work on it could continue indefinitely, so I had to stop somewhere. I have taken a slightly different approach from the one shown in the example in the assignment: my program keeps the punctuation marks but puts them on separate lines. I have used the tokenizing method of the Wall Street Journal corpus as an example. This means that I have also regarded n't, 're and 's as separate tokens. The program performs reasonably: I have tested it on a text of 3500 words and it made only three errors:

Abbreviations like Messrs lose their trailing period. Solving this is easy but the solution works only for that abbreviation (see the program for an example for Mr).
Three consecutive periods are put on separate lines. However, it is better to keep them together in one token. This error was made twice.
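
A tokenizer along these lines could be sketched as below. It is a minimal illustration, not the program from the appendix: the names $TOKEN and tokenize are assumptions, and the abbreviation list is hypothetical and would have to be extended by hand. It shows one possible way to treat the two errors above, by trying an ellipsis and a short list of abbreviations before the general word and punctuation rules.

```perl
use strict;
use warnings;

# One token per match; the alternatives are tried in the order listed.
my $TOKEN = qr{
      \.\.\.                   # three periods stay together as one token
    | \b(?:Mr|Mrs|Messrs)\.    # hypothetical list of abbreviations that keep their period
    | [A-Za-z]+(?=n't\b)       # word stem in front of n't
    | n't\b                    # the clitic n't
    | '(?:re|s)\b              # the clitics 're and 's
    | \w+                      # any other word
    | [^\w\s]                  # any other punctuation mark
}x;

# Return the tokens of a text in the order in which they appear.
sub tokenize {
    my ($text) = @_;
    return $text =~ /$TOKEN/g;
}

print "$_\n" for tokenize("Messrs. Smith and Jones don't know... They're the firm's lawyers.");
```

As the text notes, the abbreviation rule only helps for the abbreviations that are listed; a period after any other abbreviation still becomes a separate token.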

Appendix