Språkstatistik HT96:19

In this practical exercise you will apply the noisy channel model to correct a corrupted text. The exercise contains six assignments, which you can do under the same conditions as in the first practical exercise. Five assignments are compulsory and one is optional. Those who do the first five assignments can get a mark between zero and nine, and those who do all six can get a mark between zero and ten.

In this exercise we will mess up a text and attempt to correct it with a unigram model. The text we will use is the tenth chapter of De lycksaliges ö by August Strindberg which was published in Svenska öden och äfventyr (obtained from Project Runeberg). You can find all the texts and programs for this exercise in the directory /usr/users/staff/erikt/P/ss96/pex2

Assignment 2.1

There is a Perl script available for messing up text. The script can be found in the file /usr/users/staff/erikt/P/ss96/pex2/messUp. Copy the script to one of your directories and examine it. Your first assignment is to find out what error model the script uses for messing up text.

Assignment 2.2

Now make a subdirectory in your home directory for this exercise (mkdir command) and copy all the files from /usr/users/staff/erikt/P/ss96/pex2 to this directory (cp * yourDirectory). You might want to deny other people access to this directory, as you probably do not want them to borrow your exercise results (chmod 700 yourDirectory). Now apply messUp to chapter 10 of Strindberg's De lycksaliges ö, which you can find in the file strindberg.txt (messUp < strindberg.txt). List the first sentence of your corrupted version in your report (from Sonen, som to förutsättningars skull) and show which characters are wrong.

Important! If you work in a pair, generate one corrupted file per person. Give these files names you can remember because you will need them in the rest of the exercise. Everyone has to present their own data in their own report.

Assignment 2.3

The shell script count takes two texts as input and counts the characters that are different. Apply this script to your corrupted version of the Strindberg chapter and compare it with the correct version. What is the error percentage that you have measured?

The error model that you worked out in assignment 2.1 will probably contain the percentage 50%. Yet count will measure an error percentage that is much lower than 50%. Can you explain the difference?

Change the Perl part in the count script in such a way that the script only computes an error percentage for the characters that could have been changed by messUp. The script should now output an error percentage of about 50% when you compare the corrupted text with the original one. What error percentage does it report for your corrupted file? Hints
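The measurement described above can be illustrated outside the count script. The following is only a minimal Python sketch, not the count script itself: the function name error_percentage, the candidates parameter and the example character set are assumptions made for illustration, and the sketch assumes that the corrupted text has exactly the same length as the original. Which characters belong in the candidate set depends on what you found in assignment 2.1.

```python
def error_percentage(original, corrupted, candidates=None):
    """Percentage of compared positions where the two texts differ.

    If `candidates` is given, only positions whose ORIGINAL character is in
    that set are compared; all other positions are ignored.
    Assumption: both texts have the same length (character-for-character
    substitution, no insertions or deletions).
    """
    if len(original) != len(corrupted):
        raise ValueError("texts must have the same length")
    total = 0
    wrong = 0
    for o, c in zip(original, corrupted):
        if candidates is not None and o not in candidates:
            continue
        total += 1
        if o != c:
            wrong += 1
    return 100.0 * wrong / total if total else 0.0


# Toy data; the candidate set below is hypothetical, not taken from messUp.
original = "får där över"
corrupted = "far där over"
overall = error_percentage(original, corrupted)
restricted = error_percentage(original, corrupted, set("åäö"))
print(overall, restricted)
```

In the toy data two of the twelve characters differ, but only three characters belong to the candidate set, so the restricted percentage is much higher than the overall one.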

Assignment 2.4

The fourth assignment is to make a unigram message model of the original text. For this purpose you can use the complete text of Strindberg's Svenska öden och äfventyr, which you will find in the file afventyr.txt. In your report you have to list the values of the language model that are important for correcting texts produced by messUp. Hint
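A character unigram model is nothing more than the relative frequency of each character in the training text. The Python sketch below shows the idea on a toy string; the function name unigram_model is an assumption, and the commented-out line for reading afventyr.txt assumes a Latin-1 encoding, which you should check for yourself.

```python
from collections import Counter


def unigram_model(text):
    """Return a dict mapping each character to its relative frequency."""
    counts = Counter(text)
    total = sum(counts.values())
    return {ch: n / total for ch, n in counts.items()}


# Toy training text. For the real model you would read the file instead:
# text = open("afventyr.txt", encoding="latin-1").read()  # encoding assumed
sample = "här är en ö"
model = unigram_model(sample)

# Print the probabilities of a few characters (hypothetical selection).
for ch in "aäoö":
    print(repr(ch), model.get(ch, 0.0))
```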

Assignment 2.5

Make a unigram corrector script that combines our channel model (assignment 2.1) with our message model (assignment 2.4). Hint

Apply your unigram corrector to the corrupted text. List the error percentage of the resulting text in your report. Also present the first sentence of this improved text (from Sonen, som to förutsättningars skull) and show which characters are wrong.
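In a noisy channel corrector, each observed character c is replaced by the original character o that maximizes P(o) * P(c | o), where P(o) comes from the message model and P(c | o) from the channel model. The Python sketch below shows this decision rule on toy data. The channel table, its probabilities and the artificial training string are all hypothetical and are not taken from messUp; fill in the model you found yourself.

```python
from collections import Counter


def unigram_model(text):
    """Relative frequency of each character in the training text."""
    counts = Counter(text)
    total = sum(counts.values())
    return {ch: n / total for ch, n in counts.items()}


def correct(corrupted, prior, channel):
    """Unigram noisy channel correction, one character at a time.

    prior[o]      = P(o), the message model
    channel[o][c] = P(c | o), the channel model (hypothetical values here)
    Characters that no original can produce are left unchanged.
    """
    out = []
    for c in corrupted:
        best, best_score = c, 0.0
        for o, emissions in channel.items():
            score = prior.get(o, 0.0) * emissions.get(c, 0.0)
            if score > best_score:
                best, best_score = o, score
        out.append(best)
    return "".join(out)


# Hypothetical channel: some characters lose their diacritics half the time.
channel = {
    "ä": {"ä": 0.5, "a": 0.5},
    "ö": {"ö": 0.5, "o": 0.5},
    "a": {"a": 1.0},
    "o": {"o": 1.0},
}
# Artificial training text, chosen only to make the example easy to follow.
prior = unigram_model("ööö o aaa ä")
result = correct("o a ö ä x", prior, channel)
print(result)
```

Note that a unigram corrector makes the same choice for every occurrence of an observed character, because it never looks at the context. Keep this in mind when you interpret the error percentage of the corrected text.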

Assignment 2.6

The sixth assignment is optional. You do not have to do it in order to pass this exercise.

In this assignment you attempt to make a bigram corrector for your corrupted text. You will have to repeat assignments four and five and produce a bigram message model and a bigram corrector. This optional assignment requires answers to the same questions as in assignments four and five, but applied to bigrams instead of unigrams. Hints
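A bigram message model stores, for each character, the probability of the character that follows it. The Python sketch below covers only this model, not the corrector; the function name bigram_model and the toy string are assumptions made for illustration. A corrector built on such a model has to weigh each candidate against its neighbours, for example with a search over candidate sequences, and that part is left to you.

```python
from collections import Counter


def bigram_model(text):
    """Return model[prev][cur] = P(cur | prev), estimated from `text`."""
    pair_counts = Counter(zip(text, text[1:]))
    # Count each character as a first element of a pair (last char excluded).
    first_counts = Counter(text[:-1])
    model = {}
    for (prev, cur), n in pair_counts.items():
        model.setdefault(prev, {})[cur] = n / first_counts[prev]
    return model


# Toy training text; the real model would be built from afventyr.txt.
model = bigram_model("för följande")
print(model["f"])
print(model["ö"])
```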

Last update: October 17, 1996. erikt@stp.ling.uu.se