Statistical NLP: Exercise 2.6


2.6 Correction with a unigram model

A statistical grammar is a useful tool for correcting arithmetic expressions. We could use such a grammar here, but a tool like this is often unavailable, and creating one is a lot of work. It is often easier to construct an n-gram model, so we will now attempt to solve our correction problem with a unigram model.

Correction with a unigram model works as follows:

1. For each symbol in the data that may contain an error, determine what it could have been in the original data.
2. Replace each of these symbols with the most probable original symbol, that is, the candidate that occurs most often in the data.
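
The two steps above can be sketched in Python. This is only a minimal illustration, not the implementation behind this exercise: it assumes that expressions are strings of whitespace-separated symbols, and the function names, the toy training data, and the confusion table (which original symbols each suspect symbol could stand for) are all hypothetical.

```python
from collections import Counter

def unigram_counts(expressions):
    """Count how often each symbol occurs in a list of expressions."""
    counts = Counter()
    for expression in expressions:
        # Assumption: symbols are separated by whitespace.
        counts.update(expression.split())
    return counts

def build_corrections(counts, confusions):
    """Map each suspect symbol to its most frequent candidate original."""
    corrections = {}
    for observed, candidates in confusions.items():
        # A unigram model ignores context, so the choice depends on counts only.
        corrections[observed] = max(candidates, key=lambda s: counts[s])
    return corrections

def correct(expression, corrections):
    """Replace every suspect symbol with its chosen original symbol."""
    return " ".join(corrections.get(s, s) for s in expression.split())

# Toy data, invented for illustration only.
training = ["x + x", "f ( x )", "x - f"]
counts = unigram_counts(training)

# Hypothetical confusion table: observed symbol -> possible original symbols.
# The observed symbol is listed as well, since it may have been correct.
confusions = {"+": ["+", "x"], "-": ["-", "f"]}
corrections = build_corrections(counts, confusions)

print(correct("f + x - f", corrections))  # prints "f x x f f"
```

Because the model looks at symbol frequencies only, a suspect symbol is always replaced by the same symbol wherever it occurs, including positions where it was already correct.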

To determine how often each symbol appears in the data, we counted the symbols in a list of 10,000 expressions.

The symbol x occurs most often, followed by f, and so on. In our error model, three symbols are involved in the errors. We can try to correct these errors by replacing one or more of these symbols with another. Specify which symbols in the list of incorrect expressions you want to replace, and with which symbol.

When you press the Start correction button, your unigram model will be used to correct a list of 1,000 expressions.

Last update: November 20, 2003. erikt@uia.ua.ac.be