Språkstatistik HT97:18

Practical exercise 2

This is the second practical exercise in the course Språkstatistik HT97. It deals with the application of Perl scripts for the correction of corrupted texts.

Assignments: 2.1 | 2.2 | 2.3 | 2.4 | 2.5

Deadline

The deadline for handing in reports for this exercise is Tuesday October 28, 1997. Late reports will receive one penalty point per extra day.

General

In this practical exercise you will apply the noisy channel model to correct a corrupted text. The exercise contains five assignments. There are no optional assignments. Write a report about the assignments under the same conditions as in the first practical exercise.

In this second exercise you will work with a programming language called Perl. Your task will be to examine Perl programs and make some small changes in them. You do not have to understand the programs completely. Try to concentrate on the parts that are important for your assignments.

You will process a text with a program that simulates a noisy channel. The program will generate a text that contains errors. You will attempt to correct this text by using a unigram model. The text we will use is the first file of the Press65 corpus: /corpora/Press65/UnixAscii/p65.001. The programs can be found in the directory /home/staff/erikt/P/ss97/pex2.
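
In the usual textbook formulation of the noisy channel model (stated here as background, not taken from the course scripts), the corrector picks for each observed character $o$ the source character $c$ that maximizes the product of the channel probability and the language model probability. With a unigram language model, every character is corrected independently of its neighbors:

```latex
\hat{c} = \operatorname*{arg\,max}_{c} \; P(o \mid c) \, P(c)
```

Here $P(o \mid c)$ is the channel model (assignment 2.1) and $P(c)$ is the unigram language model (assignment 2.4).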

TIP: The important parts in the Perl programs used in this exercise are accompanied by comments starting with three # characters. Skip the rest of the programs.

Assignment 2.1

There is a Perl script available that simulates a noisy channel. The script can be found in the file /home/staff/erikt/P/ss97/pex2/messUp. Examine the script. Your first assignment is to describe a channel model for the script.
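
For intuition, a character-substitution channel can be sketched as follows. This is a Python illustration, not the messUp script: the confusion classes in CLASSES, the function name mess_up, and the default change probability of 0.5 are all assumptions made for the example. The classes and probabilities that messUp really uses are what you are asked to read off from the script.

```python
import random

# Hypothetical confusion classes: a character in a class may leave the
# channel as another member of the same class. The real classes used by
# messUp must be read from the script itself.
CLASSES = ["ao", "ei"]

# Map every affected character to the class it belongs to.
CLASS_OF = {ch: cls for cls in CLASSES for ch in cls}


def mess_up(text, p_change=0.5, rng=None):
    """Pass text through a simulated noisy channel.

    A character that belongs to a confusion class is replaced, with
    probability p_change, by a different member of its class. All other
    characters pass through unchanged, so the text length is preserved.
    """
    rng = rng or random.Random()
    out = []
    for ch in text:
        cls = CLASS_OF.get(ch)
        if cls is not None and rng.random() < p_change:
            alternatives = [c for c in cls if c != ch]
            out.append(rng.choice(alternatives))
        else:
            out.append(ch)
    return "".join(out)
```

A channel model for such a program would list, for every input character, the possible output characters and their probabilities.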

NOTE: The programs used in this exercise may take some time to complete (between one and two minutes). Please be patient.

Assignment 2.2

Now create a directory for the files of this exercise (mkdir command) and copy all the files of /home/staff/erikt/P/ss97/pex2 to this new directory (cp /home/staff/erikt/P/ss97/pex2/* yourDirectory). You might want to deny other people access to this directory, as you probably do not want them to borrow your exercise results (chmod 700 yourDirectory).

Apply messUp to the first file of the Press65 corpus, which you can find in the file /corpora/Press65/UnixAscii/p65.001, and generate a corrupted version of this file (./messUp < /corpora/Press65/UnixAscii/p65.001 > yourFile). List the first sentence of the corrupted version in your report (from James Broom to kraft'.+) and show which characters are wrong.

TIP: To express "a is an element of the set {a,b,c}" in Perl you can use a character class match such as "a" =~ /[abc]/.

Assignment 2.3

The shell script count takes two texts as input and counts the characters that are different. Apply this script to your corrupted version of the Press65 file and the original version. What percentage of correct characters do you measure?

In the channel model that you developed in assignment 2.1 you will probably have used the percentage 50%. Yet count will measure a percentage of correct characters that is much higher than 50%. Can you explain the difference?

Describe what should be changed in the Perl part of the count script so that the script computes a percentage only for the characters that could have been changed by messUp. Try to change the script to achieve this behavior. If you have problems with the programming language, ask the teacher for help.

After this change, the script should output a percentage correct of about 50% when you compare the corrupted text with the original one. What percentage does it report for your corrupted file?
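
The two measurements can be illustrated with a small Python sketch. It is not the count script: the function name percent_correct and the parameter candidates are made up for this example, and it assumes that both texts have the same length and are compared position by position.

```python
def percent_correct(original, corrupted, candidates=None):
    """Return the percentage of positions where both texts agree.

    If candidates is given, only positions whose original character is
    in that set are counted, i.e. the characters that the channel could
    have changed.
    """
    if len(original) != len(corrupted):
        raise ValueError("texts must have the same length")
    pairs = [
        (o, c)
        for o, c in zip(original, corrupted)
        if candidates is None or o in candidates
    ]
    if not pairs:
        raise ValueError("no characters to compare")
    correct = sum(1 for o, c in pairs if o == c)
    return 100.0 * correct / len(pairs)


# Toy data: one of the three a's in "banana" has been changed to o.
overall = percent_correct("banana", "bonana")
restricted = percent_correct("banana", "bonana", candidates={"a", "o"})
print(round(overall, 1), round(restricted, 1))
```

On the toy data the overall score counts all six positions, while the restricted score counts only the three positions that hold a character from the candidate set.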

Assignment 2.4

Create a unigram language model for your text. In your report, list the frequency values of the language model that are important for correcting texts produced by messUp.

In this case a unigram model is nothing more than a character frequency list. You have already created a frequency list for words in assignment 1.1. You can use the commands of the counting part of that assignment if you manage to put every character on a separate line. Commands for getting every character on a separate line can be found in the count script.
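
As an illustration of what such a frequency list contains, here is a minimal Python sketch. It is an alternative to the shell commands mentioned above, not a replacement for them; the function name char_frequencies and the sample text are made up for the example.

```python
from collections import Counter


def char_frequencies(text):
    """Count how often each character occurs in text (a unigram model)."""
    return Counter(text)


freqs = char_frequencies("a banana")

# List the characters from most to least frequent.
for ch, n in freqs.most_common():
    print(repr(ch), n)
```

For the assignment you would read the text from your file instead of using a literal string.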

Assignment 2.5

Make a unigram corrector based on the channel model of assignment 2.1 and the unigram language model of assignment 2.4. Apply your unigram corrector to the corrupted text. List the percentage of correct characters of the resulting text in your report. Use the modified version of count for computing the percentage. Also present the first sentence of this improved text (from James Broom to kraft'.+) and show which characters are wrong.

A unigram corrector for characters is a program that replaces each character with the most frequent character in its class. You can create a unigram corrector by modifying the messUp script or by using the command tr. The unigram corrector will be problem-specific; it does not need to compute the unigram language model.
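
The replacement rule can be sketched in Python as follows. This is an illustration under assumptions, not the expected solution: the confusion classes "ao" and "ei", the training sentence, and the function names build_corrector and correct are all hypothetical. The sketch derives the frequencies itself only to stay self-contained; your corrector may simply hard-code the most frequent character of each class.

```python
from collections import Counter


def build_corrector(classes, freqs):
    """Build a translation table that maps every character of a class
    to the most frequent member of that class according to freqs."""
    table = {}
    for cls in classes:
        best = max(cls, key=lambda ch: freqs.get(ch, 0))
        for ch in cls:
            table[ord(ch)] = best
    return table


def correct(text, table):
    """Replace each character by the most frequent one in its class."""
    return text.translate(table)


# Hypothetical classes and a toy training text.
classes = ["ao", "ei"]
freqs = Counter("a banana is eaten here")
table = build_corrector(classes, freqs)

corrected = correct("o bonona is iatin hiri", table)
print(corrected)
```

Note that a unigram corrector always chooses the same character for a class, so it also rewrites characters that were right to begin with: in the toy example the word "is" comes out as "es".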

Extra optional assignment for people who like programming: Make a bigram corrector and apply it to your corrupted text. This comes down to repeating assignments 2.4 and 2.5 for bigrams. In 2.5 you need to modify the messUp script. You may assume that you can obtain a good bigram correction without looking at character groups larger than two characters, so you do not have to build a Hidden Markov Model.

TIP: If you want to write a nicely formatted report then try using LaTeX. Start with this example report file. Save it in your directory and change it according to your needs. After processing the example file it looked like this.

Last update: November 05, 1997. erikt@stp.ling.uu.se