LRT in STD98

This file contains information about the labs in the LRT part of the course Språkteknologiska Delområden VT98. In this lab you will design a part of a spelling checker (1-5), a grammar checker (6) or a hyphenation program (7). The assignments that you can work on are:

Preprocessing
Checking Words Against a Dictionary
Recognizing Names
Recognizing Compounds
Spelling Correction
Grammar Checking
Hyphenation

Note: you only have to do one of these assignments! You will have to write a report for the assignment according to the report guidelines. The deadline for handing in this report is Tuesday March 17, 1998.

You may combine two assignments if you think this is useful and that you can manage both. Your main task is to design, on paper, a system that handles the topic: define what tasks it should perform and how it should perform them. No programming is required, although you may program if you want to. For most topics there are small programs available which you can use as a starting point for your own software.

The students have been divided into two lab groups: an early group and a late group. The number after each user name is the assignment that student has chosen, if any. Early group: annaek (2), annano (2), camilla (7 Spanish), camilof (4), karin (7), karinsi (1), sofia (7) and viestam (1). Late group: anders (1), gertrude, gustav (1), hakan (1), mathias, natalia (4), patrik (2), perjo (5), sten (7) and stina (1 French).

The extra software and data files for this lab can be found in the directory:

/home/staff/erikt/P/st98/lrtlab/

If you want to use the software then first copy it to a directory of your own. The main file in the directory is a Swedish word list containing 84740 words (887kB). You don't need to copy this file to your own directory. The word list was extracted from Göran Andersson's Swedish files for the Ispell spelling program.

1. Preprocessing

Texts usually contain formatting codes that you want to get rid of before you start spell checking. We will assume that we need to spell check HTML files, so we need to remove codes like <p> and change entities like &auml; into ä. Another task the preprocessor could perform is tokenization: dividing the text into words. The process of making the text ready for spell checking is called preprocessing.

In this assignment your task is to define the tasks of a preprocessor for a spelling program. Assuming that we have to check Swedish HTML texts, you can try to answer the following questions in your report:

What HTML code has to be removed from the files by the preprocessor?
How can one define this HTML code format?
What kind of word definition will the preprocessor use?

There is a simple script for HTML code removal available for people who want to program. This program removes HTML code from a file. However, if you test it on an HTML file you will see that it is not perfect. You can try to improve the script if you want.
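To make the preprocessing steps concrete, here is a minimal Python sketch. It is not the html2ascii script, and it rests on two assumptions of its own: that a tag is anything between "<" and the next ">", and that a word is a run of letters, optionally joined by hyphens. Both assumptions are exactly the kind of definition the questions above ask you to examine.

```python
import html
import re

# Naive tag pattern (an assumption): everything from "<" up to the next ">".
TAG_RE = re.compile(r"<[^>]*>")
# One possible word definition: runs of letters, optionally joined by hyphens.
WORD_RE = re.compile(r"[^\W\d_]+(?:-[^\W\d_]+)*")


def strip_html(text):
    """Replace tags with spaces, then decode entities such as &auml;."""
    return html.unescape(TAG_RE.sub(" ", text))


def tokenize(text):
    """Split plain text into word tokens according to WORD_RE."""
    return WORD_RE.findall(text)


sample = "<p>Tiden &auml;r din.</p>"
print(tokenize(strip_html(sample)))  # ['Tiden', 'är', 'din']
```

Like the available script, this sketch is not perfect: the naive pattern will go wrong on, for example, comments or attribute values that contain ">", and the word definition drops numbers and abbreviations with periods.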

Additional information

Gregory Grefenstette and Pasi Tapanainen, What is a Word, What is a Sentence? Problems of Tokenization. In Proceedings of the 3rd Conference on Computational Lexicography and Text Research, COMPLEX'94, Budapest, 1994.
Theo Vosse, The word connection, grammar-based spelling error correction in Dutch, PhD thesis University of Leiden, 1994. ISBN 90-75296-01-0. Pages 177-181.
Preprocessing program html2ascii: Perl variant; Korn shell variant.
HTML test text: Tiden är din.

2. Checking Words Against a Dictionary

Comparing words with the contents of a dictionary is an easy task. However, it will not generate perfect results, since we cannot include all words in the dictionary (neither all compounds nor all names). In this lab variant you will test a small dictionary checking program and examine an incomplete structured dictionary. The questions you have to answer in your report are:

Can you discover regularities in the false alarms that the program produces? Are some word classes misclassified more often than others?
What kind of improvements can be made in the structured dictionary?
Is there extra information, besides word form, grammatical category and features, that would be useful to have in the dictionary?

The program that you can test is the script checkWords. It can be used in combination with the preprocessor html2ascii. Copy these two files to a directory of your own and process a Swedish text file with the command sequence:

./html2ascii.perl YourFile | ./checkWords | more

It will return a list of words that have not been recognized. The main resource in this spelling task is the dictionary. The checkWords script uses the dictionary of the Ispell spelling program. There are two versions of the dictionary: a big word list and a small word list with extra affix information together with an affix list. Inspect the small word list and the affix file and suggest a few affixes for words which do not contain one. There is a program available for converting the affix data files to a word list that can be used by the checkWords script.
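The core of a dictionary check can be sketched in a few lines of Python. This is not the checkWords script; the function names and the toy word list are hypothetical, and lowercasing every word before lookup is an assumption that you may want to question, since it hides capitalization errors.

```python
def load_word_list(lines):
    """Build a lookup set from an iterable of lines, one word per line."""
    return {line.strip().lower() for line in lines if line.strip()}


def unknown_words(tokens, dictionary):
    """Return tokens not found in the dictionary, in text order, no repeats."""
    seen = set()
    unknown = []
    for token in tokens:
        key = token.lower()  # assumption: lookup ignores case
        if key not in dictionary and key not in seen:
            seen.add(key)
            unknown.append(token)
    return unknown


# Toy word list for illustration only; the real list has one word per line.
toy_dictionary = load_word_list(["tiden", "är", "din"])
print(unknown_words(["Tiden", "är", "dinn", "Uppsala", "dinn"], toy_dictionary))
# ['dinn', 'Uppsala']
```

Note that the correctly spelled name "Uppsala" is reported together with the real error "dinn": this is the kind of false alarm the first question asks about.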

Additional information

Dictionary checking program checkWords (Korn shell script).
Preprocessing program html2ascii: Perl variant; Korn shell variant.
Dictionary files: only words (887kB); words (269kB) with morphological affixes (6kB).
Program for converting Ispell affix data to word list: makeIWL (Korn shell).
Suggested text for testing checkWords: Tiden är din.
Advanced test text: /corpora/Press65/UnixCorrect/p65.001

3. Recognizing Names

Names pose a special problem for a spelling checker. Often they will not be present in the dictionary, but it should be possible to recognize them, for example when the first character is a capital letter and the word has been spelled in the same way many times in the text. Your task in this assignment is to design a name recognizing system which can be used in a spelling checker. Try to answer the following questions in your report:

How can names be recognized?
Test your recognition procedure on some text. How does it perform?
Can your recognition system also be used for handling frequent unknown words? If so, how? Will it increase the risk of missing hits?

The last question points at the possibility of using frequency information for recognizing correct words that do not appear in the dictionary. You can get examples of names by extracting them from a large file, for example from one or more files of the Press65 corpus. Your recognition procedure will not be perfect. List a few cases in which it performs well and a few cases in which it fails.
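One possible starting point, combining the capitalization and frequency cues mentioned above, is sketched here in Python. The function name, the threshold min_count and the decision to ignore sentence-initial words are all assumptions of this sketch, not a prescribed method.

```python
from collections import Counter


def name_candidates(sentences, dictionary, min_count=2):
    """Return capitalized unknown words seen at least min_count times.

    sentences is a list of token lists; dictionary is a set of lowercase words.
    """
    counts = Counter()
    for tokens in sentences:
        for position, token in enumerate(tokens):
            if position == 0:
                continue  # a sentence-initial capital says nothing about names
            if token[:1].isupper() and token.lower() not in dictionary:
                counts[token] += 1
    return {word for word, n in counts.items() if n >= min_count}


# Toy data for illustration only.
sentences = [
    ["Igår", "kom", "Erik", "till", "Uppsala"],
    ["Sedan", "reste", "Erik", "hem"],
]
dictionary = {"igår", "kom", "till", "sedan", "reste", "hem"}
print(name_candidates(sentences, dictionary))               # {'Erik'}
print(sorted(name_candidates(sentences, dictionary, 1)))    # ['Erik', 'Uppsala']
```

The example already shows a failure: with the default threshold, "Uppsala" is missed because it occurs only once. Names that are also ordinary words, such as Björn, and names that only occur at the start of a sentence are missed as well.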

Additional information

Theo Vosse, The word connection, grammar-based spelling error correction in Dutch, PhD thesis University of Leiden, 1994. ISBN 90-75296-01-0. Pages 183-189.
Suggested text for extracting names from: /corpora/Press65/UnixCorrect/p65.001

4. Recognizing Compounds

In Swedish, words can be combined to form a compound. It is not possible to put all compounds in the dictionary because there are too many of them, so it is desirable to have a function that checks whether a word is a compound. In this assignment you will design a compound recognizer for Swedish. Answer the following questions in your report:

What word classes can be combined with each other in Swedish compounds?
Is it possible to design a simple rule which decides whether the binding morpheme s is necessary in a Swedish compound or not? Or should the dictionary contain additional information to be able to solve this problem?
Is it possible to design a compound recognizer in such a way that language dependent information would be easy to modify (for example by changing some rule files)?

You can use the Prolog program compound for checking if a certain word is a compound or not. Take a look at that Prolog program. Notice that it is very simple: it only checks for combinations of two words and it has a limited vocabulary. Many extra features can be added to it.
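The idea of a two-word compound check can also be sketched in Python. This is not a translation of the Prolog program compound; the function name, the toy lexicon and the linkers parameter are hypothetical. Passing the lexicon and the binding morphemes as arguments is one way to keep the language dependent information outside the code.

```python
def split_compound(word, lexicon, linkers=("", "s")):
    """Return all (first, linker, second) analyses of a two-part compound.

    Both parts must be in the lexicon; linker is a binding morpheme
    ("" means that the parts are joined directly).
    """
    word = word.lower()
    analyses = []
    for i in range(1, len(word)):
        first = word[:i]
        if first not in lexicon:
            continue
        rest = word[i:]
        for linker in linkers:
            if rest.startswith(linker):
                second = rest[len(linker):]
                if second in lexicon:
                    analyses.append((first, linker, second))
    return analyses


# Toy lexicon for illustration only.
lexicon = {"fot", "boll", "land", "lag", "slag"}
print(split_compound("fotboll", lexicon))   # [('fot', '', 'boll')]
print(split_compound("landslag", lexicon))
# [('land', '', 'slag'), ('land', 's', 'lag')]
```

The second example is ambiguous: without extra information the sketch cannot tell whether the s belongs to the second word or is a binding morpheme. It also accepts a binding s after any first part, which is the over-generation the second question asks you to think about.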

Additional information

Theo Vosse, The word connection, grammar-based spelling error correction in Dutch, PhD thesis University of Leiden, 1994. ISBN 90-75296-01-0. Pages 62-68.
Compound recognition program compound (Prolog).

5. Spelling Correction

Spelling correction consists of two tasks: splitting the dictionary into smaller parts and generating from these parts the best alternative for a misspelling. In this assignment you will work on the second task. You may use the string similarity measure suggested by Theo Vosse (handout page 14), the one used in the example program (see below) or make a function yourself. Answer the following questions in your report:

What distance function did you choose and why?
Test your function on a few spelling errors that you can think of yourself. How does your function perform?
Does your function have a preference for mistypings, competence errors or for specific subsets of these?

In this assignment you will only use a small dictionary with words chosen by yourself. Still a lot of computations may be required to decide how the string distance function performs. If you like to program you may write a little program to perform the computations for you, for example in Prolog, Perl or C. There is an example Prolog program available for testing. It computes the Levenshtein distance between two strings (another variant of this algorithm is described in Vosse's book).
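For those who prefer another language than Prolog, the Levenshtein distance can be sketched in Python as follows. This is the standard dynamic programming formulation with unit costs for insertion, deletion and substitution, not a copy of the example program; the helper best_alternatives and the toy dictionary are hypothetical.

```python
def levenshtein(source, target):
    """Minimum number of insertions, deletions and substitutions."""
    previous = list(range(len(target) + 1))  # distances from the empty prefix
    for i, s_char in enumerate(source, start=1):
        current = [i]
        for j, t_char in enumerate(target, start=1):
            cost = 0 if s_char == t_char else 1
            current.append(min(previous[j] + 1,          # delete s_char
                               current[j - 1] + 1,       # insert t_char
                               previous[j - 1] + cost))  # substitute or match
        previous = current
    return previous[-1]


def best_alternatives(misspelling, dictionary):
    """Return the dictionary words at the smallest distance, sorted."""
    scored = sorted((levenshtein(misspelling, w), w) for w in dictionary)
    if not scored:
        return []
    best = scored[0][0]
    return [word for distance, word in scored if distance == best]


print(levenshtein("cyckel", "cykel"))  # 1
print(levenshtein("hte", "the"))       # 2
print(best_alternatives("cyckel", ["cykel", "nyckel", "cirkel"]))
# ['cykel', 'nyckel']
```

Two things are worth noticing when you evaluate a distance function like this one: a swap of two neighbouring letters, a common mistyping, costs 2, and ties between alternatives are frequent, so the distance alone does not always select one best correction.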

Additional information

Theo Vosse, The word connection, grammar-based spelling error correction in Dutch, PhD thesis University of Leiden, 1994. ISBN 90-75296-01-0. Pages 70-81, 88-91.
Example string similarity measure program: levenshtein (Prolog).

6. Grammar Checking

In this assignment you choose a specific grammatical topic in which people frequently make errors and design some system that can recognize these errors. Examples of topics you can choose are: agreement (subject - verb or determiner - adjective - noun), reflexive pronouns, split compounds or some other error that you consider important. Try to answer the following questions in your report:

What grammatical topic did you choose? Give a few error examples of this topic.
How does your recognition system work?
What errors does it recognize? Where does it fail?

There is an example Prolog grammar checking program available. It detects simple agreement errors in Swedish noun phrases. You may try to extend it if you want. Note that grammar checking is a hard problem and your recognition system will probably miss more errors than it will detect. Try to have reasonable expectations about the results of this assignment.
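To give an idea of what detecting simple agreement errors can mean, here is a minimal Python sketch for indefinite singular noun phrases of the form determiner, adjectives, noun. It is not the grammarCheck program. The three small word tables are toy data, and treating the gender of the noun as the correct one is an assumption of the sketch.

```python
# Toy lexicon for illustration only: word -> grammatical gender.
DETERMINERS = {"en": "common", "ett": "neuter"}
ADJECTIVES = {"stor": "common", "stort": "neuter",
              "röd": "common", "rött": "neuter"}
NOUNS = {"bil": "common", "hus": "neuter"}


def agreement_errors(tokens):
    """Return the words that disagree in gender with the final noun.

    Returns None when the tokens do not have the form
    determiner adjective* noun with all words in the toy lexicon.
    """
    if len(tokens) < 2:
        return None
    *modifiers, noun = [token.lower() for token in tokens]
    if noun not in NOUNS or modifiers[0] not in DETERMINERS:
        return None
    expected = NOUNS[noun]  # assumption: the noun decides the gender
    errors = []
    for position, word in enumerate(modifiers):
        table = DETERMINERS if position == 0 else ADJECTIVES
        if word not in table:
            return None
        if table[word] != expected:
            errors.append(word)
    return errors


print(agreement_errors(["ett", "stort", "hus"]))  # []
print(agreement_errors(["en", "stort", "hus"]))   # ['en']
print(agreement_errors(["en", "stor", "hus"]))    # ['en', 'stor']
```

The last example shows a limitation of the design: if the writer meant another noun, for instance "en stor bil", the error is in the noun and not in the two words that are reported. Definite and plural forms are not covered at all.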

Additional information

Theo Vosse, The word connection, grammar-based spelling error correction in Dutch, PhD thesis University of Leiden, 1994. ISBN 90-75296-01-0. Pages 99-116.
Example grammar checking program: grammarCheck (Prolog).

7. Hyphenation

In this assignment you will design a system for determining the hyphenation points in Swedish words. Find some literature about this if you are not sure about the hyphenation rules used in Swedish. In your report you can answer the following questions:

Are phonotactic and morphological hyphenation rules usable for Swedish? If so, do you need both rule types or would one of them cover almost all words? If not, what other rules should you use?
How does your hyphenation system work?
Test your hyphenation system. How does it perform? What errors does it make?

Hyphenation systems can use different programming methodologies. You will probably use some rule-based technique. An example Prolog program which uses phonotactic rules is available for testing. The program is not perfect. If you choose to work with it then try to extend its rule set (validOnset/1 predicate). If you are interested in hyphenation systems which are not based on sets of rules then take a look at the paper by Walter Daelemans and Antal van den Bosch.
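A rule-based approach with phonotactic rules can be sketched in Python as follows. This is not the hyphenate program: the set VALID_ONSETS is a small hypothetical stand-in for a rule set like validOnset/1, and the choice to move the longest valid onset to the next syllable is an assumption of this sketch. Other conventions move fewer consonants, and the output is not claimed to follow the official Swedish rules.

```python
VOWELS = set("aeiouyåäö")
# Toy onset set for illustration only; a real rule set would be larger.
VALID_ONSETS = {
    "b", "d", "f", "g", "h", "j", "k", "l", "m", "n", "p", "r", "s", "t", "v",
    "bl", "br", "dr", "fl", "fr", "gl", "gr", "kl", "kn", "kr", "pl", "pr",
    "sk", "sl", "sm", "sn", "sp", "st", "sv", "tr", "skr", "spr", "str",
}


def hyphenate(word):
    """Insert hyphens before the longest valid onset between two vowels."""
    word = word.lower()
    vowel_positions = [i for i, ch in enumerate(word) if ch in VOWELS]
    breaks = []
    for left, right in zip(vowel_positions, vowel_positions[1:]):
        cluster = word[left + 1:right]  # consonants between the two vowels
        # Try the whole cluster first, then ever shorter suffixes of it.
        # Adjacent vowels give an empty cluster and therefore no break.
        for start in range(len(cluster)):
            if cluster[start:] in VALID_ONSETS:
                breaks.append(left + 1 + start)
                break
    pieces = []
    previous = 0
    for position in breaks:
        pieces.append(word[previous:position])
        previous = position
    pieces.append(word[previous:])
    return "-".join(pieces)


print(hyphenate("vinter"))  # vin-ter
print(hyphenate("mamma"))   # mam-ma
```

Because the sketch only looks at letters, it knows nothing about morpheme boundaries in compounds or about letter combinations that stand for a single sound. Testing it on such words is a quick way to find material for the first and third questions.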

Additional information

Walter Daelemans and Antal van den Bosch. Generalization Performance of Backpropagation Learning on a Syllabification Task. In: M.F.J. Drossaers and A. Nijholt (eds.) Connectionism and Natural Language Processing. Proceedings Third Twente Workshop on Language Technology, 27-38, 1992.
Theo Vosse, The word connection, grammar-based spelling error correction in Dutch, PhD thesis University of Leiden, 1994. ISBN 90-75296-01-0. Pages 157-176.
Example hyphenation program: hyphenate (Prolog).

8. Report

You will have to write a report for this lab. The report may be written by one or by two persons. Your report should contain at least the following parts:

Introduction: a description of the assignment and a summary of the task(s) you have performed.
Answers to the questions in this exercise.
A description of your design.
Your expectations about its performance in general, and the results of tests if you have performed any.
Other comments on the assignments or the results.
In the appendix: software that you have developed for the assignment, if you have written any.

You may write your report in English or in Swedish.

Last update: April 11, 1998. erikt@stp.ling.uu.se