
Språkstatistik HT96:22

In this practical exercise we will compare the results of a unigram corrector with those of a corrector based on a stochastic context-free grammar. We will apply these two algorithms to corrupted text that was originally generated with the bc grammar presented in Liberman & Schabes (1993).

You can find the grammar and the programs for this exercise in the directory /usr/users/staff/erikt/P/ss96/pex3. Copy all these files to one of your own directories.

This practical exercise does not contain optional parts.


Assignment 3.1

The grammar that we will use in this exercise can be found in the file grammar. This will be grammar A. Examine the grammar and try to understand its parts. The grammar contains only one rule for LSQ. We will use the same error model as in Liberman & Schabes (1993). Change the grammar in such a way that the sentences it generates are corrupted according to this error model. This corrupt grammar will be called grammar B.
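
The grammar format itself is best inspected directly in the file or with parse -s (see assignment 3.2). Purely as an illustration, in an informal notation that is not necessarily the one used in the file: grammar A has a single rule that lets LSQ produce the [-character, and if the corruption is uniform over the five characters, grammar B would replace it with five rules of probability 0.2 each:

    LSQ -> [    (probability 0.2)
    LSQ -> +    (probability 0.2)
    LSQ -> -    (probability 0.2)
    LSQ -> /    (probability 0.2)
    LSQ -> *    (probability 0.2)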

Assignment 3.2

We will now try out a program that parses with a stochastic grammar. The name of the program is parse. You can invoke the program with different options:

  1. parse
    (no options) parse with grammar file grammar and show the best parse only.
  2. parse -a
    show all possible parses.
  3. parse -g grammarFile
    use the grammar file grammarFile instead of grammar.
  4. parse -n tokenName
    don't expand the token tokenName in parses. This is useful when we correct texts later on.
  5. parse -s
    show the grammar and the lexicon.

Any combination of these options can be used. Start the program as parse -a -s. It will display the lexicon and the grammar it is using; both are obtained from the file grammar. Make sure that this file contains your grammar B. Now type in the sentence x-f-x]. The program will display all possible parses of the string, one per line, in the format: sentence, probability, parse result and rules used. How many parses did it generate? Which one is the most probable? Which rules were used in the most probable parse? You can enter more sentences, or stop the program by typing control-D or control-C.

Note: a number like 1.234e-05 in the output of parse means 1.234 times 10 to the power -5 (=0.00001234).

Assignment 3.3

The program generate can be used for generating sentences. This program can be invoked with the following options:

  1. generate
    (no options) generate 100 sentences by using the grammar in the file grammar.
  2. generate -g grammarFile
    use the grammar file grammarFile instead of grammar.
  3. generate -n nbrOfSentences
    generate nbrOfSentences sentences.
  4. generate -s
    show the grammar and the lexicon.

Any combination of the options can be used. Start the program as generate -n 10 -s to verify that it generates 10 sentences according to grammar A. Run the program once with the corrupt grammar B. List 10 sentences generated by the correct grammar and 10 sentences generated by the corrupt grammar in your report.
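
If you want to save the sentences for your report, you can redirect the output to files, for example like this (the file names are only suggestions, and grammarB stands for whatever file you have put grammar B in):

    generate -n 10 > tenCorrect
    generate -n 10 -g grammarB > tenCorrupt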

Assignment 3.4

In this exercise we will compare a unigram corrector with a corrector based on a stochastic grammar. In the remainder of the exercise we will first generate a correct text, then corrupt it with the messUp program of exercise 2, and finally attempt to correct the corrupted text both with a unigram model and with a stochastic grammar.

The corrupted text will contain 100 sentences. Generate a 100-sentence text with the correct grammar A. We also need a text for building a unigram model: generate a correct 1000-sentence text with the same grammar.
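
These two texts can be generated in the same way as the sentence lists above, for example (the file names are again only suggestions and will be used in the sketches below):

    generate -n 100 > correct100
    generate -n 1000 > train1000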

Now change the messUp script of practical exercise 2 so that it simulates our error model. This means that the script should replace each [-character at random by one of the characters of the set { [ , + , - , / , * }. Use this script to generate a corrupted version of the 100-sentence text.
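
The messUp script from exercise 2 is not shown here, so as a sketch only, a small awk script along the following lines would simulate the error model (using the file names suggested above):

    awk 'BEGIN { srand(); n = split("[ + - / *", repl, " ") }
         {
           out = ""
           for (i = 1; i <= length($0); i++) {
             c = substr($0, i, 1)
             # replace every [ by a random character from the confusion set
             if (c == "[") c = repl[int(rand() * n) + 1]
             out = out c
           }
           print out
         }' correct100 > corrupt100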

Modify the count program of practical exercise 2 so that it computes the error rate for the characters that can be messed up by this error model and for the characters that can be changed by the unigram corrector of assignment 3.5. What is the error percentage of the corrupted text?
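
The count program is not shown here either; the awk sketch below is one possible way to do the comparison. It reads the corrupted (or corrected) text, compares it character by character with the original text, and only counts positions where the original character is one of { [ , + , - , / , * }:

    awk -v orig=correct100 '{
      getline oline < orig
      for (i = 1; i <= length(oline); i++) {
        oc = substr(oline, i, 1)
        # only look at characters that the error model or the corrector can touch
        if (index("[+-/*", oc) > 0) {
          total++
          if (substr($0, i, 1) != oc) errors++
        }
      }
    } END { printf "%d errors out of %d characters (%.1f%%)\n",
            errors, total, 100 * errors / total }' corrupt100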

Assignment 3.5

Now create a unigram corrector and correct your corrupted text with it. What error percentage do you obtain?
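
Your unigram corrector from exercise 2 may be built differently; as one possible sketch, the awk script below estimates character unigram counts from the 1000-sentence text and then, for every character that may be a corrupted [, decides between keeping it and replacing it by [, depending on which original character is more likely (each of the five characters has probability 1/5 of coming from a corrupted [):

    awk 'NR == FNR {
           # first file: collect character unigram counts from the training text
           for (i = 1; i <= length($0); i++) freq[substr($0, i, 1)]++
           next
         }
         {
           out = ""
           for (i = 1; i <= length($0); i++) {
             c = substr($0, i, 1)
             # a +, -, / or * may be a corrupted [; pick the more likely source
             if (index("+-/*", c) > 0 && freq["["] / 5 > freq[c]) c = "["
             out = out c
           }
           print out
         }' train1000 corrupt100 > uniCorrected100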

Assignment 3.6

Now we will correct the same corrupted text by using the stochastic grammar. Execute this command:

parse -n LSQ -g corruptGrammar < yourCorruptedText

corruptGrammar needs to contain the grammar B for generating corrupted text. parse will parse the text and print the best parse without expanding the LSQ tokens (this is the result of the -n LSQ option). The parsed sentence appears as the third word on each output line. Select this word from the output and replace all occurrences of LSQ by the [-character. Now you have obtained a text that has been corrected by a stochastic grammar. List the commands that you used for this in your report.
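
Using the file names suggested earlier, the selection and replacement could for example be done like this (parseOut is just an arbitrary name for the saved parser output, and the sketch assumes that the parse result really is the third whitespace-separated field):

    parse -n LSQ -g corruptGrammar < corrupt100 > parseOut
    awk '{ gsub(/LSQ/, "[", $3); print $3 }' parseOut > grammarCorrected100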

Compare the stochastically corrected version of the text with the original text by using the count program. What error percentage have you achieved? List all sentences with errors in your report. Has this correction method produced a better result than the unigram method?


Last update: November 25, 1996. erikt@stp.ling.uu.se