
Språkstatistik HT95:11

Class:   11
Date:    951005
Topic:   Practical Exercise 1

Practical Exercise 1

In this practical exercise you will apply the noisy channel model to correct a corrupted text. The goals of this exercise are to learn how to apply the noisy channel model and how to write useful little UNIX scripts. You will have to write a short report about this practical exercise and send it by e-mail to erikt@strindberg.ling.uu.se. Reports are due by Monday, October 23. Reports sent in after that date will receive a half-point penalty per extra day.

In this exercise we will mess up a text and attempt to correct it with a unigram model. The text we will use is the second chapter of Alice's Adventures in Wonderland by Lewis Carroll (obtained from Project Runeberg). By the way, you can find all the text and programs I describe here in the directory /usr/users/staff/erikt/P/ss95/pex1
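Correcting with a unigram model means choosing, for each observed character, the most probable original character. As a sketch in standard noisy channel notation (the symbols m for an original character and o for an observed character are introduced here for illustration and do not come from the scripts):

```latex
\hat{m} \;=\; \arg\max_{m} P(m \mid o)
        \;=\; \arg\max_{m} \frac{P(m)\,P(o \mid m)}{P(o)}
        \;=\; \arg\max_{m} P(m)\,P(o \mid m)
```

Here P(m) is the unigram message model, estimated from character frequencies in uncorrupted text, and P(o | m) is the channel (error) model. P(o) is the same for every candidate m, so it can be dropped from the maximization.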

You can print this exercise by choosing the print option from your web browser while viewing this text.

  1. We will start with the program that messes up the text: messUp. This is a shell script. Examine it and try to understand the commands that are used in the program.

    Assignment 1 for your report is: Find out which noisy channel model (error model) the program uses for creating corrupted text.

  2. Now make a sub-directory in your home directory for this exercise (mkdir) and copy all the files of /usr/users/staff/erikt/P/ss95/pex1 to this directory (cp). You might want to disallow access to this directory for other people as you probably do not want them to borrow your exercise results (chmod 700 directory). Now apply messUp to the second chapter of Alice which you can find in the file alice30.ch2.

    Assignment 2 for your report is to list the first sentence of your corrupted version of the chapter (from `Curiouser to English);) and show which characters are wrong.

  3. Now we will look at a second shell script: count. This script takes two texts as input and counts the characters that differ. However, when you apply count to the original and the corrupted version of your text, you will find that the error percentage is lower than the 50% we would expect from the error model.

    Assignment 3 for your report is to modify count in such a way that it only takes into account the characters which can be corrupted in our channel. After modification the output of the script should be close to 50%. Put a copy of the modified count script in your report. Furthermore, list both the original error rate and the error rate computed by the modified script in your report.

  4. The fourth assignment is to make a unigram message model of your original text. For this purpose you can use the complete story of Alice's Adventures in Wonderland, which you will find in the file alice30.txt. Use the unigramModel script for making the unigram model. In your report you have to list the values of the message model that are important for correcting texts produced by our channel.

  5. Make a unigram corrector script based on our channel model and our message model. List the script in your report. HINT: You can start with the messUp script. You will only have to make some small modifications to it in order to make it work like a unigram corrector.

  6. Apply your unigram corrector to the corrupted text. List the error percentage of the resulting text in your report. Also present the first sentence of this improved text (from `Curiouser to English);) and show which characters are wrong.

  7. The final assignment is optional. You do not have to do it in order to pass this exercise. In this assignment you attempt to make a bigram corrector for your corrupted text. You will have to repeat steps four, five and six and produce a bigram message model and a bigram corrector. This will involve making modifications to the program addNewlines.c which is written in C. If you need any help with that please consult me. For this optional assignment, you will produce the same items for your report as in steps four, five and six applied to a bigram corrector.
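Steps 1 to 6 above can be sketched end to end as follows. This is an illustration in Python, not in the shell scripts of the exercise, and it rests on loudly stated assumptions: the error model (the characters in CONFUSABLE are swapped for one another with probability P_ERR) is hypothetical and is not taken from messUp, and all function names are invented here. It does not reproduce count or unigramModel.

```python
import random
from collections import Counter

# ASSUMPTION: hypothetical error model, NOT taken from messUp. Each character
# in CONFUSABLE is replaced by another confusable character with
# probability P_ERR; all other characters pass through unchanged.
CONFUSABLE = "ae"
P_ERR = 0.5


def corrupt(text, seed=0):
    """Pass text through the hypothetical channel described above."""
    rng = random.Random(seed)
    out = []
    for ch in text:
        if ch in CONFUSABLE and rng.random() < P_ERR:
            out.append(rng.choice([c for c in CONFUSABLE if c != ch]))
        else:
            out.append(ch)
    return "".join(out)


def error_rate(original, other, relevant=None):
    """Percentage of differing positions. If `relevant` is given, only
    positions whose original character is in `relevant` are counted."""
    if len(original) != len(other):
        raise ValueError("texts must have the same length")
    pairs = [(o, c) for o, c in zip(original, other)
             if relevant is None or o in relevant]
    if not pairs:
        return 0.0
    return 100.0 * sum(o != c for o, c in pairs) / len(pairs)


def unigram_model(text):
    """Relative frequency of each character: a unigram message model."""
    counts = Counter(text)
    total = sum(counts.values())
    return {ch: n / total for ch, n in counts.items()}


def channel_prob(observed, original):
    """P(observed | original) under the hypothetical channel."""
    if original not in CONFUSABLE:
        return 1.0 if observed == original else 0.0
    if observed == original:
        return 1.0 - P_ERR
    if observed in CONFUSABLE:
        return P_ERR / (len(CONFUSABLE) - 1)
    return 0.0


def correct(text, model):
    """Replace each confusable character o by argmax over m of P(m) * P(o | m)."""
    table = {
        o: max(CONFUSABLE, key=lambda m: model.get(m, 0.0) * channel_prob(o, m))
        for o in CONFUSABLE
    }
    return "".join(table.get(ch, ch) for ch in text)


if __name__ == "__main__":
    original = "curiouser and curiouser cried alice"
    noisy = corrupt(original)
    fixed = correct(noisy, unigram_model(original))
    print("all characters:     ", error_rate(original, noisy))
    print("confusable only:    ", error_rate(original, noisy, CONFUSABLE))
    print("after correction:   ", error_rate(original, fixed, CONFUSABLE))
```

Because a unigram model looks at one character at a time, the corrector reduces to a fixed lookup table per observed character. In this hypothetical setup, with two confusable characters and an error probability of 0.5, the observed character carries no information, so the table always picks the more frequent of the two.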

Send your reports to erikt@strindberg.ling.uu.se by Monday, October 23. If you have any questions, please ask me.


Last update: October 17, 1995. erikt@strindberg.ling.uu.se