Språkstatistik HT95:15
Class: 15
Date: 951017
Topic: Practical Exercise 3
Practical Exercise 3
In this practical exercise we will compare the results of a unigram
corrector with a corrector based on a stochastic context-free grammar.
We will apply these two algorithms to corrupted text that was
originally generated with the bc grammar presented in
Liberman & Schabes 1993.
You will have to write a report about this practical exercise
and send it by e-mail to erikt@strindberg.ling.uu.se
Your report is due by Friday, November 17.
Reports sent in after that date will receive a half-point
penalty per extra day.
You can find the grammar and all the programs described here in
the directory
/usr/users/staff/erikt/P/ss95/pex3
You can print this exercise by choosing the print option from your
web browser while viewing this text.
-
The grammar that we will use in this exercise can be found in the file
grammar.
We will call this grammar (A).
Examine the grammar and try to understand the various parts of it.
The grammar contains only one rule for the token LSQ.
We will use the same error model as in Liberman & Schabes 1993.
Change the grammar in such a way that the sentences it generates are
corrupted according to this error model.
We will call this new grammar (B).
-
On page 42 of Liberman & Schabes 1993 you will find two
trees for x-f-x].
One of these trees cannot actually be produced by our grammar.
Which tree is that?
And why can't it be produced?
-
Liberman & Schabes 1993 assume that their original grammar (A)
can generate both x-f[x] and x[f-x].
In fact it can only generate one of these two.
Change grammar (A) in such a way that it can generate both
sentences.
We will call this grammar (C).
Apply the same change to grammar (B).
We will call the resulting grammar (D).
With this new grammar you should be able to generate
the sentence x-f-x] by using different trees.
We will check this in the next assignment:
-
Next we will look at a program for parsing with a stochastic grammar.
The name of the program is parse.
You can invoke the program with different options:
-
parse
(no options) parse with grammar file grammar
and show the best parse only.
-
parse -a
show all possible parses.
-
parse -g grammarFile
use the grammar file grammarFile instead of grammar.
-
parse -n tokenName
don't expand token tokenName in parses.
This will be useful when we correct texts.
-
parse -s
show the grammar and the lexicon.
Any combination of the options can be used.
Start the program as parse -a -s.
It will display its lexicon and the grammar it is using.
The lexicon and grammar are obtained from the file grammar.
Make sure that this file contains your grammar (D).
Now type in the sentence x-f-x].
The program will display all possible parses of the string, one per
line, in the format: sentence, probability, parse result and rules used.
How many parses did it generate?
Which one is the most probable?
What rules were used in the most probable parse?
Is it a correct one?
You can enter more sentences or leave parse by typing
control-D or control-C.
-
The program generate can be used for generating sentences.
This program can be invoked with three options:
-
generate
(no options) generate 100 sentences by using the
grammar in the file grammar
-
generate -g grammarFile
use the grammar file grammarFile instead of grammar
-
generate -n nbrOfSentences
generate nbrOfSentences sentences.
-
generate -s
show the grammar and the lexicon
Any combination of the options can be used.
Start the program as generate -n 10 -s to verify that it
generates 10 sentences according to the grammar.
In this exercise we will compare a unigram corrector with a corrector
based on a stochastic grammar.
In the remainder of the exercise we will first generate a correct
text, then corrupt the text with the messUp program of exercise 1 and
after this we will attempt to correct the corrupted text both with a
unigram model and a stochastic grammar.
Up until here you have used four grammars:
-
The original grammar (A), as provided with the software in the
file grammar.
We will not use this grammar in the remainder of the exercise.
-
A corrupted grammar (B) that was the result of the first assignment of
this exercise.
This grammar should be able to generate a parse for the
string x-f-x].
We will not use this grammar in the remainder of the exercise.
-
An improved grammar (C) that was the result of the third assignment of
this exercise.
This grammar should be able to parse both x-f[x] and
x[f-x].
We will use this grammar for generating correct texts.
-
A grammar (D) that was the result of the third assignment of
this exercise.
This grammar should be able to generate at least two parses for the
string x-f-x].
We will use this grammar for correcting the corrupted text.
So we will use only grammars (C) and (D) in the rest of this exercise.
Now we proceed with the exercise:
-
The corrupted text will be a text containing 100 sentences.
Generate such a correct 100-sentence text with grammar (C).
We also need a text for building a unigram data model.
Generate a correct 1000-sentence text with the same grammar.
-
Now change the messUp script of practical
exercise 1 to make it simulate our error model.
This means that the script should replace each [ character
by one of the characters of the set { [ , + , - , / , * } at random.
Use this script to generate a corrupted version of the 100-sentence
text of assignment 7.
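As a sketch of what the changed script might do (an assumption on my part, not the exercise's prescribed implementation: it supposes the text arrives on standard input and that [ is the only character the error model touches), the random replacement can be written in awk:

```shell
# Hypothetical corruptor sketch: replace every '[' by a random member of
# the set { [ + - / * }; all other characters pass through unchanged.
awk 'BEGIN { srand(); n = split("[ + - / *", repl, " ") }
     { out = ""
       for (i = 1; i <= length($0); i++) {
         c = substr($0, i, 1)
         if (c == "[") c = repl[int(rand() * n) + 1]
         out = out c
       }
       print out
     }'
```

Piping the 100-sentence text through this filter yields a corrupted version; each of the five replacement characters is equally likely, which is one reading of "at random" in the error model above.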
-
Use the unigramModel script of practical
exercise 1 to build a unigram model of the 1000-sentence text.
What unigram model do you infer from these data?
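If you want to cross-check the output of unigramModel, character frequencies can also be counted with standard tools (a sketch; model1000.txt is a hypothetical name for the 1000-sentence text):

```shell
# Split the text into one character per line, then sort and count
# duplicates; the most frequent characters end up at the top.
fold -w1 < model1000.txt | sort | uniq -c | sort -rn
```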
-
Use the unigram model of assignment 8 to change
messUp
into a unigram corrector, just as in practical exercise 1.
Apply the corrector to the text and compare the result with the
original text by using the count program.
What error percentage do you achieve?
Is this lower than the error percentage for the corrupted text?
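The count program from exercise 1 does the real comparison; as a rough self-contained approximation (file names correct100.txt and corrected.txt are assumptions), a character error rate can be computed in awk:

```shell
# Read the reference text first (NR == FNR), then compare the second
# file character by character and report the fraction of mismatches.
awk 'NR == FNR { ref[FNR] = $0; next }
     { for (i = 1; i <= length($0); i++) {
         total++
         if (substr($0, i, 1) != substr(ref[FNR], i, 1)) errors++
       }
     }
     END { printf "%.1f%% errors\n", 100 * errors / total }' \
    correct100.txt corrected.txt
```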
-
Now we will correct the corrupted text by using the stochastic
grammar.
We will use the same corrupted text.
Execute this command:
parse -n LSQ -g corruptedGrammar < yourCorruptedText
corruptedGrammar needs to contain the grammar (D) for parsing
corrupted text.
parse will parse the text and print the best parse without
expanding the LSQ tokens (this is the result of the -n LSQ
option).
The parsed sentences will appear as the third word on a line of output
from parse (parse result).
Since we are only interested in this third word, we will remove the
rest of each line with an awk command:
parse -n LSQ -g corruptedGrammar < yourCorruptedText |\
awk '{ printf "%s\n",$3 }'
The awk command will only pass the third word of each line.
Some of these words contain the LSQ token.
We will expand this token by using the original grammar: LSQ will
become [:
parse -n LSQ -g corruptedGrammar < yourCorruptedText |\
awk '{ printf "%s\n",$3 }' |\
sed 's/LSQ/[/g'
The sed command will replace all occurrences of LSQ by a left square
bracket.
This sequence of commands will produce a corrected version of the
corrupted text.
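As a quick sanity check, the sed step on its own behaves like this:

```shell
printf 'xLSQf-x]\n' | sed 's/LSQ/[/g'
# prints x[f-x]
```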
-
Compare the stochastically corrected version of the text with the
original text by using count.
What error percentage have you achieved?
List all sentences with errors in your report.
Has this correction method produced a better result than the unigram
method?
Send your report to erikt@strindberg.ling.uu.se by Friday, November 17.
If you have any questions please ask me.
Last update: November 18, 1995.
erikt@strindberg.ling.uu.se