
Språkstatistik HT95:15

Class:   15
Date:    951017
Topic:   Practical Exercise 3

Practical Exercise 3

In this practical exercise we will compare the results of a unigram corrector with a corrector based on a stochastic context-free grammar. We will apply both algorithms to corrupted text that was originally generated with the bc grammar presented in Liberman & Schabes 1993. You will have to write a report about this practical exercise and send it by e-mail to erikt@strindberg.ling.uu.se. You have until Friday November 17 to send in your report. Reports sent in after that date will receive a half-point penalty per extra day.

You can find the grammar and all programs described here in the directory /usr/users/staff/erikt/P/ss95/pex3. You can print this exercise by choosing the print option of your web browser while viewing this text.

  1. The grammar that we will use in this exercise can be found in the file grammar. We will call this grammar (A). Examine the grammar and try to understand its various parts. The grammar contains only one instruction for LSQ. We will use the same error model as in Liberman & Schabes 1993. Change the grammar in such a way that the sentences it generates are corrupted according to this error model. We will call this new grammar (B).
  2. On page 42 of Liberman & Schabes 1993 you will find two trees for x-f-x]. One of these trees cannot be produced by our grammar. Which tree is that, and why can't it be produced?
  3. Liberman & Schabes 1993 assume that their original grammar (A) can generate both x-f[x] and x[f-x]. In fact it can only generate one of these two. Change the grammar (A) in such a way that it is able to generate both sentences. We will call this grammar (C). Apply the same change to grammar (B); we will call the resulting grammar (D). With this new grammar you should be able to generate the sentence x-f-x] by using different trees. We will check this in the next assignment:
  4. We will now try out a program for parsing with a stochastic grammar. The program is called parse. You can invoke it with different options:

    1. parse
      (no options) parse with the grammar file grammar and show only the best parse.
    2. parse -a
      show all possible parses.
    3. parse -g grammarFile
      use the grammar file grammarFile instead of grammar.
    4. parse -n tokenName
      don't expand the token tokenName in parses. This is useful when correcting texts.
    5. parse -s
      show the grammar and the lexicon.

    Any combination of the options can be used. Start the program as parse -a -s. It will display the lexicon and the grammar it is using; both are obtained from the file grammar. Make sure that this file contains your grammar (D). Now type in the sentence x-f-x]. The program will display all possible parses of the string in the line format: sentence, probability, parse result and rules used. How many parses did it generate? Which one is the most probable? What rules were used in the most probable parse? Is it correct? You can enter more sentences, or leave parse by typing control-D or control-C.

  5. The program generate can be used for generating sentences. This program can be invoked with three options:

    1. generate
      (no options) generate 100 sentences using the grammar in the file grammar.
    2. generate -g grammarFile
      use the grammar file grammarFile instead of grammar.
    3. generate -n nbrOfSentences
      generate nbrOfSentences sentences.
    4. generate -s
      show the grammar and the lexicon.

    Any combination of the options can be used. Start the program as generate -n 10 -s to verify that it generates 10 sentences according to the grammar.

In this exercise we will compare a unigram corrector with a corrector based on a stochastic grammar. In the remainder of the exercise we will first generate a correct text, then corrupt it with the messUp program of practical exercise 1, and finally attempt to correct the corrupted text both with a unigram model and with the stochastic grammar.

Up until here you have used four grammars:

  (A) the original grammar from the file grammar
  (B) grammar (A) changed so that it generates sentences corrupted according to the error model
  (C) grammar (A) changed so that it can generate both x-f[x] and x[f-x]
  (D) grammar (B) with the same change as in (C)

From here on we will use only grammars (C) and (D). Now we proceed with the exercise:

  1. The corrupted text will contain 100 sentences. Generate a perfect 100-sentence text with grammar (C). We also need a text for building a unigram data model. Generate a perfect 1000-sentence text with the same grammar.
  2. Now change the messUp script of practical exercise 1 so that it simulates our error model. This means that the script should replace each [ character by a character chosen at random from the set { [ , + , - , / , * }. Use this script to generate a corrupted version of the 100-sentence text of the previous assignment.
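    One possible sketch of such a messUp script, written as an awk program wrapped in a shell function (the function name and the one-sentence-per-line input format are my own assumptions; adapt this to your actual messUp from exercise 1):

```shell
# messUp: simulate the error model by replacing every '[' with a
# character drawn at random from { [ + - / * }.
# Assumes one sentence per line on standard input.
messUp() {
  awk 'BEGIN { srand(); split("[ + - / *", repl, " ") }
  {
    out = ""
    for (i = 1; i <= length($0); i++) {
      c = substr($0, i, 1)
      if (c == "[") c = repl[int(rand() * 5) + 1]   # pick one of the 5 replacements
      out = out c
    }
    print out
  }'
}

# Example: messUp < perfect100 > corrupted100
```

    Note that since [ itself is in the replacement set, some brackets survive corruption unchanged, exactly as the error model prescribes.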
  3. Use the unigramModel script of practical exercise 1 to build a unigram model of the 1000-sentence text. What unigram model do you infer from this data?
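    The core of a character-level unigram model is counting how often each character occurs in the training text. A minimal counting sketch (the function name is my own; your unigramModel from exercise 1 may additionally normalize the counts into probabilities):

```shell
# unigramCounts: print each character of the input together with its
# frequency, one "char count" pair per line. Newlines are not counted,
# because awk strips them before $0 is seen.
unigramCounts() {
  awk '{ for (i = 1; i <= length($0); i++) count[substr($0, i, 1)]++ }
       END { for (c in count) printf "%s %d\n", c, count[c] }'
}

# Example: unigramCounts < perfect1000 | sort
```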
  4. Use the unigram model of assignment 8 to change messUp into a unigram corrector, just like in practical exercise 1. Apply the corrector to the corrupted text and compare the result with the original text by using the count program. What error percentage do you achieve? Is it lower than the error percentage of the corrupted text?
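    If your unigram model shows that +, * and / never occur in the perfect text while - and [ do, then the best a unigram corrector can do is map the impossible characters back to [ and leave - alone, since it cannot tell a genuine - from a corrupted [. A sketch under that assumption (the function name is my own):

```shell
# unigramCorrect: map characters ruled out by the unigram model
# (+, * and /) back to '[', the only character the error model corrupts.
# A '-' is left untouched: a unigram model cannot distinguish a genuine
# '-' from a corrupted '['.
unigramCorrect() {
  sed 's,[+*/],[,g'
}

# Example: unigramCorrect < corrupted100 > unigramCorrected100
```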
  5. Now we will correct the corrupted text by using the stochastic grammar. We will use the same corrupted text. Execute this command:

    parse -n LSQ -g corruptedGrammar < yourCorruptedText
    

    corruptedGrammar needs to contain the grammar (D) for parsing corrupted text. parse will parse the text and print the best parse without expanding the LSQ tokens (this is the result of the -n LSQ option). The parsed sentences will appear as the third word on a line of output from parse (parse result). Since we are only interested in this third word, we will remove the rest of each line with an awk command:

    parse -n LSQ -g corruptedGrammar < yourCorruptedText |\
       awk '{ printf "%s\n",$3 }'
    

    The awk command passes on only the third word of each line. Some of these words contain the LSQ token. We expand this token as the original grammar defines it: LSQ becomes [:

    parse -n LSQ -g corruptedGrammar < yourCorruptedText |\
       awk '{ printf "%s\n",$3 }' |\
       sed 's/LSQ/[/g'
    

    The sed command will replace all occurrences of LSQ by a left square bracket. This sequence of commands will produce a corrected version of the corrupted text.

  6. Compare the stochastically corrected version of the text with the original text by using count. What error percentage have you achieved? List all sentences with errors in your report. Has this correction method produced a better result than the unigram method?
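If you want to double-check what count reports, a rough character-level stand-in can be sketched in awk. This assumes count reports the percentage of differing character positions between two line-aligned files of equal-length lines; the real count program from exercise 1 may measure errors differently:

```shell
# errorRate FILE1 FILE2: percentage of character positions where the
# two files differ, assuming the files are line-aligned and each pair
# of corresponding lines has equal length.
errorRate() {
  awk 'NR == FNR { orig[FNR] = $0; next }       # first file: remember its lines
       { line = orig[FNR]
         for (i = 1; i <= length(line); i++) {
           total++
           if (substr(line, i, 1) != substr($0, i, 1)) wrong++
         } }
       END { printf "%.1f\n", 100 * wrong / total }' "$1" "$2"
}

# Example: errorRate perfect100 stochasticCorrected100
```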

Send your report to erikt@strindberg.ling.uu.se by Friday November 17. If you have any questions, please ask me.


Last update: November 18, 1995. erikt@strindberg.ling.uu.se