Språkstatistik HT97:21

Practical exercise 3

This is the third and the final practical exercise in the course Språkstatistik HT97. It deals with the correction of corrupted texts by using stochastic context free grammars.

Assignments: 3.1 | 3.2 | 3.3 | 3.4 | 3.5 | 3.6

Deadline

The deadline for handing in reports for this exercise is Tuesday November 11, 1997. Late reports will receive one penalty point per extra day.

General

In this practical exercise we will compare the results of a unigram corrector with a corrector based on a stochastic context-free grammar. We will apply these two algorithms to corrupted text that was originally generated with the bc grammar presented in Liberman&Schabes 1993.

You can find all the grammar and programs for this exercise in the directory /home/staff/erikt/P/ss97/pex3 Copy all these files to one of your own directories.

This practical exercise does not contain optional parts.

Assignments: 3.1 | 3.2 | 3.3 | 3.4 | 3.5 | 3.6

Assignment 3.1

The grammar that we will use in this exercise can be found in the file grammarA. Use a text editor like emacs for changing this grammar in a grammar that generates strings which may contain any of the characters [, +, and - instead of the left square bracket with approximately equal probability. Save the new grammar in a file grammarB and list the modified part in your report.

TIP: When you want to run a program xxx that is in the current directory then you should start it as ./xxx
You can get around of having to specify the ./ sequence after having issued the command: PATH="$PATH:."

Assignments: 3.1 | 3.2 | 3.3 | 3.4 | 3.5 | 3.6

Assignment 3.2

Use the program parse for parsing the expression f+x+f] with the corrupt grammar. How many parses did parse generate? What is the most probable one? What rules were used in the most probable parse?

Background information on assignment 3.2

The program parse parses expressions and returns the most probable one. It can be started with different command line options:

./parse -g grammarFile
Parse with grammar file grammarFile and show the best parse only.
./parse -g grammarFile -a
Parse with grammar file grammarFile and show all possible parses.
./parse -g grammarFile -n tokenName
Don't expand token tokenName in parses. This is useful for correcting texts.
./parse -g grammarFile -s
Show the grammar and the lexicon.

Using the -g option is obligatory but otherwise any combination of the options can be used. Start the program as ./parse -g grammarB -a -s. It will display its lexicon and the grammar it is using. The lexicon and grammar are obtained from the file grammarB. Now type in the string f+x+f]. The program will display all possible parses of the string with line format: string, probability, parse result and rules used. Use this output for answering the questions for this assignment. After this you can enter more strings or stop the program by typing control-D or control-C.

NOTE: A number like 1.234e-05 in the output of parse means 1.234 times 10 to the power of -5 (=0.00001234).

Assignments: 3.1 | 3.2 | 3.3 | 3.4 | 3.5 | 3.6

Assignment 3.3

Use the program generate program for generating a text of 1000 correct strings. Save the result in a file. List the first ten strings in your report.

Background information on assignment 3.3

The program generate can be used for generating strings. This program can be invoked with different command line options:

./generate -g grammarFile
Generate 100 strings by using the grammar in the file grammarFile.
./generate -g grammarFile -n nbrOfStrings
Generate nbrOfStrings strings.
./generate -g grammarFile -s
Show the grammar and the lexicon.

Using the -g option is obligatory but otherwise any combination of the options can be used.

TIP: If you change a part of a Perl script then you only have to list the modified part and its original version in your report.

Assignments: 3.1 | 3.2 | 3.3 | 3.4 | 3.5 | 3.6

Assignment 3.4

Generate a version of the file with 1000 strings which is corrupted according to the error model you have used in assignment 3.1. List 10 strings with errors in your report. Compute the number of incorrect characters and the percentage of correct characters in the corrupted version.

Background information on assignment 3.4

The corrupted version can be generated with the messUp program of exercise 2. Change the program in such a way that it generates corrupted text according to the error model you have used for creating grammar B in assignment 3.1. Apply the program to the file with 1000 strings and check its output to make sure that it works as you expect. Note: Use the messUp version of 971022.

For finding out which strings are incorrect in the corrupted version you can use the Unix command diff. This command processes two files which are almost the same. It will show what parts each file differs from the other. Run the command as diff correctFile corruptedFile. In the output the lines starting with < appear only in correctFile and the lines starting with > appear only in corruptedFile.

Counting the number of incorrect characters and the percentage of correct characters can be done with the count program of exercise 2. Adapt this program to this exercise in the same way as you have done in assignment 2.3. Be aware of the fact that almost all the characters you are working with in this assignment have a special meaning for Perl. It is best to put a backslash \ before all these characters in the Perl program. Note: The count program will give a percentage correct which is higher than 33%.

NOTE: We cannot generate a corrupted text of 1000 strings with the program generate and grammar B because there would be no correspondence between the corrupted text and the correct text. The corrupted text would be the corrupted version of some other text.

Assignments: 3.1 | 3.2 | 3.3 | 3.4 | 3.5 | 3.6

Assignment 3.5

Create a problem specific unigram corrector like in assignment 2.5 and correct your corrupted text with this corrector. What percentage correct characters do you obtain?

Assignments: 3.1 | 3.2 | 3.3 | 3.4 | 3.5 | 3.6

Assignment 3.6

Correct the corrupted text with the program parse. What percentage correct characters do you obtain? List all the remaining strings with errors in your report.

Background information on assignment 3.6

The parse program has an option for not expanding a nonterminal in its results. For example: if we parse the string f-x] with grammar B then the standard parse result (field 3 in the output) will be f-x]. This is a sequence of all the tokens in the leaves of the parse tree. However we can also run parse with the option -n LSQ. Then LSQ nodes will not be expanded and the parse result becomes fLSQx].

We will use the -n option of parse for correcting the corrupted version of our text. If we run parse with the options -n LSQ and -g grammarB then it will parse all strings and show the best parse without expanding the LSQ nonterminal. If we extract the parsed string (field 3) from the result and replace all occurrences of LSQ by the correct character then we have obtained a corrected version of the file.

TIP: Extracting the x-th field from a text table with cell delimiter character y can be done with the Unix command cut -fx -d"y".

Last update: November 13, 1997. erikt@stp.ling.uu.se