This is the third and the final practical exercise in the course Språkstatistik HT97. It deals with the correction of corrupted texts by using stochastic context free grammars.
The deadline for handing in reports for this exercise is Tuesday November 11, 1997. Late reports will receive one penalty point per extra day.
In this practical exercise we will compare the results of a unigram corrector with a corrector based on a stochastic context-free grammar. We will apply these two algorithms to corrupted text that was originally generated with the bc grammar presented in Liberman&Schabes 1993.
You can find all the grammar and programs for this exercise in the directory /home/staff/erikt/P/ss97/pex3 Copy all these files to one of your own directories.
This practical exercise does not contain optional parts.
The grammar that we will use in this exercise can be found in the file grammarA. Use a text editor like emacs for changing this grammar in a grammar that generates strings which may contain any of the characters [, +, and - instead of the left square bracket with approximately equal probability. Save the new grammar in a file grammarB and list the modified part in your report.
Use the program parse for parsing the expression f+x+f] with the corrupt grammar. How many parses did parse generate? What is the most probable one? What rules were used in the most probable parse?
The program parse parses expressions and returns the most probable one. It can be started with different command line options:
Using the -g option is obligatory but otherwise any combination of the options can be used. Start the program as ./parse -g grammarB -a -s. It will display its lexicon and the grammar it is using. The lexicon and grammar are obtained from the file grammarB. Now type in the string f+x+f]. The program will display all possible parses of the string with line format: string, probability, parse result and rules used. Use this output for answering the questions for this assignment. After this you can enter more strings or stop the program by typing control-D or control-C.
Use the program generate program for generating a text of 1000 correct strings. Save the result in a file. List the first ten strings in your report.
The program generate can be used for generating strings. This program can be invoked with different command line options:
Using the -g option is obligatory but otherwise any combination of the options can be used.
Generate a version of the file with 1000 strings which is corrupted according to the error model you have used in assignment 3.1. List 10 strings with errors in your report. Compute the number of incorrect characters and the percentage of correct characters in the corrupted version.
The corrupted version can be generated with the messUp program of exercise 2. Change the program in such a way that it generates corrupted text according to the error model you have used for creating grammar B in assignment 3.1. Apply the program to the file with 1000 strings and check its output to make sure that it works as you expect. Note: Use the messUp version of 971022.
For finding out which strings are incorrect in the corrupted version you can use the Unix command diff. This command processes two files which are almost the same. It will show what parts each file differs from the other. Run the command as diff correctFile corruptedFile. In the output the lines starting with < appear only in correctFile and the lines starting with > appear only in corruptedFile.
Counting the number of incorrect characters and the percentage of correct characters can be done with the count program of exercise 2. Adapt this program to this exercise in the same way as you have done in assignment 2.3. Be aware of the fact that almost all the characters you are working with in this assignment have a special meaning for Perl. It is best to put a backslash \ before all these characters in the Perl program. Note: The count program will give a percentage correct which is higher than 33%.
Create a problem specific unigram corrector like in assignment 2.5 and correct your corrupted text with this corrector. What percentage correct characters do you obtain?
Correct the corrupted text with the program parse. What percentage correct characters do you obtain? List all the remaining strings with errors in your report.
The parse program has an option for not expanding a nonterminal in its results. For example: if we parse the string f-x] with grammar B then the standard parse result (field 3 in the output) will be f-x]. This is a sequence of all the tokens in the leaves of the parse tree. However we can also run parse with the option -n LSQ. Then LSQ nodes will not be expanded and the parse result becomes fLSQx].
We will use the -n option of parse for correcting the corrupted version of our text. If we run parse with the options -n LSQ and -g grammarB then it will parse all strings and show the best parse without expanding the LSQ nonterminal. If we extract the parsed string (field 3) from the result and replace all occurrences of LSQ by the correct character then we have obtained a corrected version of the file.