Språkstatistik HT95:11
Class: 11
Date: 951005
Topic: Practical Exercise 1
Practical Exercise 1
In this practical exercise you will apply the noisy channel model to
correct a corrupted text.
The goals of this exercise are learning to apply the noisy channel
model and learning to write useful little UNIX scripts.
You will have to write a short report about this practical exercise
and send it by e-mail to erikt@strindberg.ling.uu.se.
The deadline for the report is Monday, October 23.
Reports sent in after that date will receive a half-point penalty for
each day they are late.
In this exercise we will mess up a text and attempt to correct it
with a unigram model.
The text we will use is the second chapter of Alice's Adventures in
Wonderland by Lewis Carroll
(obtained from Project Runeberg).
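For reference, the decision rule behind the noisy channel model can be
written as follows. The symbols are my own notation, not taken from the
scripts: m stands for the original message and o for the observed
(corrupted) text. The second line assumes that the channel changes each
character independently of the others and that the message model is a
unigram model, so the choice can be made one character at a time.

```latex
\hat{m} = \arg\max_{m} P(m \mid o)
        = \arg\max_{m} \frac{P(m)\, P(o \mid m)}{P(o)}
        = \arg\max_{m} P(m)\, P(o \mid m)

\hat{m}_i = \arg\max_{m_i} P(m_i)\, P(o_i \mid m_i)
```

Here P(m) is the message model and P(o | m) is the channel (error) model.
P(o) can be dropped because it does not depend on m.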
By the way, you can find all the text and programs I describe here in
the directory
/usr/users/staff/erikt/P/ss95/pex1
You can print this exercise by choosing the print option from your
web browser while viewing this text.
-
We will start with the program that messes up the text:
messUp.
This is a shell script.
Examine it and try to understand the commands that are used in the
program.
Assignment 1 for your report is: Find out which noisy channel model
(error model) the program is using for creating corrupted text.
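To make the idea of an error model concrete, here is a minimal Python
sketch of a toy channel. Everything in it is an assumption made for
illustration: the CONFUSION table, the function name corrupt and the
parameter flip_prob are hypothetical and are not taken from messUp.
Finding out what messUp really does is still Assignment 1.

```python
import random

# Hypothetical confusion pairs, chosen for illustration only.
# They are NOT taken from the messUp script.
CONFUSION = {"a": "e", "e": "a"}


def corrupt(text, flip_prob=0.5, seed=None):
    """Pass text through a toy noisy channel.

    Every character that has an entry in CONFUSION is replaced by its
    partner with probability flip_prob. All other characters pass
    through unchanged.
    """
    rng = random.Random(seed)
    out = []
    for ch in text:
        if ch in CONFUSION and rng.random() < flip_prob:
            out.append(CONFUSION[ch])
        else:
            out.append(ch)
    return "".join(out)
```

A channel of this kind never changes the length of the text, which is
what makes a character-by-character comparison with the original
possible.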
-
Now make a sub-directory in your home directory for this exercise
(mkdir) and copy all the files of
/usr/users/staff/erikt/P/ss95/pex1
to this directory (cp).
You might want to disallow access to this directory for other people
as you probably do not want them to borrow your exercise results
(chmod 700 directory).
Now apply messUp to the second chapter of Alice which you can
find in the file
alice30.ch2.
Assignment 2 for your report is to list the first sentence of your
corrupted version of the chapter
(from "`Curiouser" to "English);") and show which
characters are wrong.
-
Now we will look at a second shell script:
count.
This script takes two texts as input and counts the characters that
are different.
However, when you apply count to the original and the
corrupted version of your text, you will find that the error
percentage is lower than the 50% we would expect from the
error model.
Assignment 3 for your report is to modify count in such a way
that it only takes into account the characters which can be corrupted
in our channel.
After the modification, the output of the script should be close to
50%.
Put a copy of the modified count script in your report.
Furthermore, list both the original error rate and the error rate of
the modified script in your report.
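The counting idea can be sketched in Python as follows. This is not
the count script itself. The function name error_rate and the
parameter relevant are hypothetical, and the sketch assumes that the
channel replaces characters one for one, so that both texts have the
same length.

```python
def error_rate(original, corrupted, relevant=None):
    """Return the percentage of positions at which the two texts differ.

    If relevant is given, only positions whose ORIGINAL character is in
    that set are counted. Otherwise every position is counted.
    """
    if len(original) != len(corrupted):
        raise ValueError("texts must have the same length")
    pairs = [(o, c) for o, c in zip(original, corrupted)
             if relevant is None or o in relevant]
    if not pairs:
        return 0.0
    wrong = sum(1 for o, c in pairs if o != c)
    return 100.0 * wrong / len(pairs)
```

For example, error_rate("abcd", "abxd") counts all four positions,
while error_rate("abcd", "abxd", relevant={"c"}) counts only the one
position that holds a "c" in the original.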
-
The fourth assignment will be to make a unigram message model
of your original text.
For this purpose you can use the complete story of Alice's Adventures
in Wonderland which you will find in the file:
alice30.txt.
Use the unigramModel script to make the unigram model.
In your report you have to list the values of the message model that
are important for correcting texts produced by our channel.
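As a rough illustration of what a unigram message model contains, the
Python sketch below estimates relative character frequencies from a
text. It assumes that the model is built over single characters. The
function name unigram_model is hypothetical, and the sketch does not
reproduce the unigramModel script.

```python
from collections import Counter


def unigram_model(text):
    """Estimate P(character) as a relative frequency in text."""
    counts = Counter(text)
    total = sum(counts.values())
    # An empty text gives an empty model; no division takes place.
    return {ch: n / total for ch, n in counts.items()}
```

Applied to the string "aab", this gives a probability of 2/3 for "a"
and 1/3 for "b".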
-
Make a unigram corrector script based on our channel model and our
message model.
List the script in your report.
HINT:
You can start with the messUp script.
You will only have to make some small modifications to it in order to make
it work like a unigram corrector.
-
Apply your unigram corrector to the corrupted text.
List the error percentage of the resulting text in your report.
Also present the first sentence of this improved text
(from "`Curiouser" to "English);") and show which
characters are wrong.
-
The final assignment is optional.
You do not have to do it in order to pass this exercise.
In this assignment you attempt to make a bigram corrector for your
corrupted text.
You will have to repeat steps four, five and six and produce a
bigram message model and a bigram corrector.
This will involve making modifications to the program
addNewlines.c which is written in C.
If you need any help with that please consult me.
For this optional assignment, you will produce the same items for your
report as in steps four, five and six applied to a bigram corrector.
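For orientation, a bigram message model over characters could be
estimated as in the Python sketch below. This only illustrates the
counting. It is not a replacement for modifying addNewlines.c, and the
function name bigram_model is hypothetical.

```python
from collections import Counter, defaultdict


def bigram_model(text):
    """Estimate P(current | previous) for adjacent character pairs."""
    pair_counts = defaultdict(Counter)
    for prev, cur in zip(text, text[1:]):
        pair_counts[prev][cur] += 1
    model = {}
    for prev, counter in pair_counts.items():
        total = sum(counter.values())
        model[prev] = {cur: n / total for cur, n in counter.items()}
    return model
```

A bigram corrector would then score each candidate character with
P(current | previous) instead of the unigram probability.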
Send your reports to erikt@strindberg.ling.uu.se by Monday, October
23.
If you have any questions please ask me.
Last update: October 17, 1995.
erikt@strindberg.ling.uu.se