This file contains information about the labs in the LRT part of the course Språkteknologiska Delområden VT98. In this lab you will design a part of a spelling checker (1-6), a grammar checker (7) or a hyphenation program (8). The assignments that you can work on are:
Note: you only have to do one of these assignments! You will have to write a report for the assignment according to the report guidelines. The deadline for handing in this report is Tuesday March 17, 1998.
You may combine two assignments if you think that is useful and that you can manage both. Your main task is to design, on paper, a system that handles the topic: define what tasks it should perform and how it should perform them. No programming is required, although you may program if you want to. For most topics there are small programs available that you can use as a starting point for your own software.
The students have been divided into two lab groups: an early group and a late group. The number after each user name is the assignment that student has chosen, if any. Early group: annaek (2), annano (2), camilla (7 Spanish), camilof (4), karin (7), karinsi (1), sofia (7) and viestam (1). Late group: anders (1), gertrude, gustav (1), hakan (1), mathias, natalia (4), patrik (2), perjo (5), sten (7) and stina (1 French).
The extra software and data files for this lab can be found in the directory:
/home/staff/erikt/P/st98/lrtlab/
If you want to use the software, first copy it to a directory of your own. The main file in the directory is a Swedish word list containing 84740 words (887 kB). You do not need to copy this file to your own directory. The word list was extracted from Göran Andersson's Swedish files for the Ispell spelling program.
Texts usually contain formatting codes that you want to get rid of before you start spell checking. We will assume that we need to spell check HTML files, so we need to remove codes like <p> and change entities like &auml; into ä. Another task the preprocessor could perform is tokenization: dividing the text into words. The process of making the text ready for spell checking is called preprocessing.
In this assignment your task is to define the tasks of a preprocessor for a spelling program. Assuming that we have to check Swedish HTML texts, you can try to answer the following questions in your report:
There is a simple script for HTML code removal available for people who want to program. This program removes HTML code from a file. However, if you test it on an HTML file you will see that it is not perfect. You can try to improve the script if you want.
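To make the preprocessing steps concrete, here is a minimal Python sketch of tag removal, entity decoding and tokenization. It is not the course script; the function names strip_html and tokenize and the token definition are assumptions made for illustration only.

```python
import html
import re

# Anything between "<" and ">" is treated as a tag. This is deliberately
# naive: comments, scripts and ">" inside attribute values are not handled.
TAG_RE = re.compile(r"<[^>]*>")

# Assumed token definition: a run of letters (including å, ä, ö),
# optionally joined by internal hyphens or apostrophes.
TOKEN_RE = re.compile(r"[^\W\d_]+(?:[-'][^\W\d_]+)*")

def strip_html(text):
    """Replace tags by spaces, then decode entities such as &auml;."""
    without_tags = TAG_RE.sub(" ", text)
    return html.unescape(without_tags)

def tokenize(text):
    """Return the list of word tokens in the text."""
    return TOKEN_RE.findall(text)

sample = "<p>Tiden &auml;r din.</p>"
print(tokenize(strip_html(sample)))  # ['Tiden', 'är', 'din']
```

Numbers, abbreviations and e-mail addresses are dropped or split by this token definition, which is exactly the kind of decision the report should discuss.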
Additional information
Comparing words with the contents of a dictionary is an easy task. However it will not generate perfect results since we cannot include all words in the dictionary (neither all compounds nor all names). In this lab variant you will test a small dictionary checking program and examine an incomplete structured dictionary. The questions you have to answer in your report are:
The program that you can test is the script checkWords. It can be used in combination with the preprocessor html2ascii. Copy these two files to a directory of your own and process a Swedish text file with the command sequence:
./html2ascii.perl YourFile | ./checkWords | more
It will return a list of words that have not been recognized.
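The idea behind such a dictionary check can be sketched in a few lines of Python. This is not the checkWords script itself; the function names and the choice to compare words in lower case are assumptions made for illustration.

```python
def load_words(lines):
    """Build a lookup set from an iterable of word-list lines."""
    return {line.strip().lower() for line in lines if line.strip()}

def unknown_words(tokens, dictionary):
    """Return each token that is not in the dictionary, once, in text order."""
    seen = set()
    result = []
    for token in tokens:
        key = token.lower()
        if key not in dictionary and key not in seen:
            seen.add(key)
            result.append(token)
    return result

# Tiny hand-made word list; a real run would read the word list file.
dictionary = load_words(["tiden", "är", "din"])
print(unknown_words(["Tiden", "är", "dinn", "dinn"], dictionary))  # ['dinn']
```

A lookup like this reports every compound and name that is missing from the word list, which is why the later assignments treat those cases separately.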
The main resource in this spelling task is the dictionary. The checkWords script uses the dictionary of the Ispell spelling program. There are two versions of the dictionary: a big word list and a small word list with extra affix information together with an affix list. Inspect the small word list and the affix file and suggest a few affixes for words which do not contain one. There is a program available for converting the affix data files to a word list that can be used by the checkWords script.
Additional information
checkWords: Tiden är din.
Names pose a special problem for a spelling checker. Often they will not be present in the dictionary, but it should be possible to recognize them, for example when the first character is a capital letter and the word has been spelled in the same way many times in the text. Your task in this assignment is to design a name recognition system which can be used in a spelling checker. Try to answer the following questions in your report:
The last question points to the possibility of using frequency information for recognizing correct words that do not appear in the dictionary. You can get examples of names by extracting them from a large file, for example from one or more files of the Press65 corpus. Your recognition procedure will not be perfect. List a few cases in which it performs well and a few cases in which it fails.
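One possible way to combine capitalization and frequency is sketched below in Python. The function name, the threshold min_count and the rule of skipping sentence-initial words are all assumptions chosen for illustration, not a prescribed solution.

```python
from collections import Counter

def name_candidates(sentences, dictionary, min_count=2):
    """Return capitalized, non-dictionary words seen at least min_count times.

    sentences is a list of token lists. The first token of each sentence
    is skipped because it is capitalized whether or not it is a name.
    """
    counts = Counter()
    for tokens in sentences:
        for token in tokens[1:]:
            if token[:1].isupper() and token.lower() not in dictionary:
                counts[token] += 1
    return {word for word, n in counts.items() if n >= min_count}

dictionary = {"igår", "kom", "hem", "sedan", "ringde", "till"}
sentences = [
    ["Igår", "kom", "Gustav", "hem"],
    ["Sedan", "ringde", "Gustav", "till", "Stina"],
]
print(name_candidates(sentences, dictionary))  # {'Gustav'}
```

In this toy run Stina is missed because it occurs only once, and a name that always opens a sentence, or that is also an ordinary dictionary word, would never be found.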
Additional information
In Swedish one can combine words to form a compound. It is not possible to put all the compounds in the dictionary because there are too many of them, and therefore it is desirable to have a function that checks if a word is a compound. In this assignment you will design a compound recognizer for Swedish. Answer the following questions in your report:
You can use the Prolog program compound for checking if a certain word is a compound or not. Take a look at that Prolog program. Notice that it is very simple: it only checks for combinations of two words and it has a limited vocabulary. Many extra features can be added to it.
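The two-word check can be sketched in Python as follows. This is not a translation of the Prolog program; the function name, the minimum part length and the optional linking s (as in kvällstidning) are assumptions that illustrate one possible extra feature.

```python
def split_compound(word, vocabulary, min_part=2, allow_link_s=True):
    """Return all (first, second) pairs of known words that form the word.

    With allow_link_s, a first part ending in a linking "s" is accepted
    when the part without that "s" is a known word.
    """
    word = word.lower()
    splits = []
    for i in range(min_part, len(word) - min_part + 1):
        first, second = word[:i], word[i:]
        if second not in vocabulary:
            continue
        if first in vocabulary:
            splits.append((first, second))
        elif allow_link_s and first.endswith("s") and first[:-1] in vocabulary:
            splits.append((first[:-1], second))
    return splits

vocabulary = {"bil", "dörr", "kväll", "tidning"}
print(split_compound("bildörr", vocabulary))        # [('bil', 'dörr')]
print(split_compound("kvällstidning", vocabulary))  # [('kväll', 'tidning')]
```

An empty result means the word is not recognized as a two-part compound; compounds of three or more words and stem changes in the first part are not covered by this sketch.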
Additional information
Spelling correction consists of two tasks: splitting the dictionary into smaller parts and generating from these parts the best alternative for a misspelling. In this assignment you will work on the second task. You may use the string similarity measure suggested by Theo Vosse (handout page 14), the one used in the example program (see below) or make a function yourself. Answer the following questions in your report:
In this assignment you will only use a small dictionary with words chosen by yourself. Still a lot of computations may be required to decide how the string distance function performs. If you like to program you may write a little program to perform the computations for you, for example in Prolog, Perl or C. There is an example Prolog program available for testing. It computes the Levenshtein distance between two strings (another variant of this algorithm is described in Vosse's book).
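For those who prefer to experiment outside Prolog, the Levenshtein distance can be sketched in Python as below. This is the standard dynamic programming formulation with unit costs for insertion, deletion and substitution, not necessarily the variant used in the example program or in Vosse's book; the helper suggestions and its max_distance threshold are assumptions.

```python
def levenshtein(a, b):
    """Minimum number of insertions, deletions and substitutions from a to b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute or match
            ))
        previous = current
    return previous[-1]

def suggestions(word, dictionary, max_distance=2):
    """Return (distance, candidate) pairs within max_distance, closest first."""
    scored = [(levenshtein(word, candidate), candidate) for candidate in dictionary]
    return sorted(pair for pair in scored if pair[0] <= max_distance)

print(levenshtein("kitten", "sitting"))           # 3
print(suggestions("huss", {"hus", "mus", "bil"}))  # [(1, 'hus'), (2, 'mus')]
```

Unit costs treat every error as equally likely; weighting transpositions or keyboard-adjacent substitutions differently is one way to change how the function ranks alternatives.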
Additional information
In this assignment you choose a specific grammatical topic in which people frequently make errors and design some system that can recognize these errors. Examples of topics you can choose are: agreement (subject - verb or determiner - adjective - noun), reflexive pronouns, split compounds or some other error that you consider important. Try to answer the following questions in your report:
There is an example Prolog grammar checking program available. It detects simple agreement errors in Swedish noun phrases. You may try to extend it if you want. Note that grammar checking is a hard problem and your recognition system will probably miss more errors than it will detect. Try to have reasonable expectations about the results of this assignment.
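As an illustration of what detecting an agreement error can mean in practice, here is a small Python sketch for indefinite singular noun phrases. It is not the Prolog program; the toy lexicon, the gender labels utr and neu, and the pattern determiner, adjectives, noun are assumptions made for this example.

```python
# Toy lexicon (illustrative only): word -> (category, gender).
LEXICON = {
    "en": ("det", "utr"), "ett": ("det", "neu"),
    "stor": ("adj", "utr"), "stort": ("adj", "neu"),
    "bil": ("noun", "utr"), "hus": ("noun", "neu"),
}

def category(token):
    """Return the category of a token, or an empty string if unknown."""
    return LEXICON.get(token.lower(), ("", ""))[0]

def agreement_errors(tokens):
    """Return phrases of the form det adj* noun whose genders disagree."""
    errors = []
    i = 0
    while i < len(tokens):
        if category(tokens[i]) != "det":
            i += 1
            continue
        j = i + 1
        while j < len(tokens) and category(tokens[j]) == "adj":
            j += 1
        if j < len(tokens) and category(tokens[j]) == "noun":
            phrase = tokens[i:j + 1]
            genders = {LEXICON[t.lower()][1] for t in phrase}
            if len(genders) > 1:
                errors.append(" ".join(phrase))
            i = j + 1
        else:
            i += 1
    return errors

print(agreement_errors("en stort hus".split()))   # ['en stort hus']
print(agreement_errors("ett stort hus".split()))  # []
```

Words outside the lexicon interrupt the pattern, so errors in such phrases go unnoticed, which matches the expectation that a simple checker misses more errors than it finds.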
Additional information
In this assignment you will design a system for determining the hyphenation points in Swedish words. Find some literature about this if you are not sure about the hyphenation rules used in Swedish. In your report you can answer the following questions:
Hyphenation systems can use different programming methodologies. You will probably use some rule-based technique. An example Prolog program which uses phonotactic rules is available for testing. The program is not perfect. If you choose to work with it then try to extend its rule set (the validOnset/1 predicate). If you are interested in hyphenation systems which are not based on sets of rules then take a look at the paper by Walter Daelemans and Antal van den Bosch.
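One rule-based approach can be sketched in Python as follows: between two vowel groups, the longest consonant sequence that is a permitted syllable onset is moved to the next syllable. The set VALID_ONSETS below is a small hypothetical stand-in and not the rule set of the Prolog program's validOnset/1 predicate, and the function name is likewise an assumption.

```python
VOWELS = set("aeiouyåäö")

# Hypothetical, incomplete list of consonant sequences allowed at the
# start of a syllable.
VALID_ONSETS = {
    "b", "d", "f", "g", "h", "j", "k", "l", "m", "n", "p", "r", "s", "t", "v",
    "bl", "br", "dr", "fl", "fr", "gl", "gr", "kl", "kr", "pl", "pr",
    "sk", "sl", "sm", "sn", "sp", "st", "sv", "tr", "skr", "spr", "str",
}

def hyphenate(word):
    """Insert hyphens so that each new syllable starts with a valid onset."""
    lower = word.lower()
    n = len(lower)
    points = []
    i = 0
    while i < n and lower[i] not in VOWELS:   # skip the initial consonants
        i += 1
    while i < n:
        while i < n and lower[i] in VOWELS:   # skip a group of vowels
            i += 1
        start = i
        while i < n and lower[i] not in VOWELS:
            i += 1
        if i == n:                            # final consonants: no new syllable
            break
        cluster = lower[start:i]
        for k in range(len(cluster) + 1):
            # Take the longest suffix of the cluster that is a valid onset;
            # if none is, break directly before the vowel.
            if k == len(cluster) or cluster[k:] in VALID_ONSETS:
                points.append(start + k)
                break
    parts, previous = [], 0
    for point in points:
        parts.append(word[previous:point])
        previous = point
    parts.append(word[previous:])
    return "-".join(parts)

print(hyphenate("tidning"))  # tid-ning
print(hyphenate("skriva"))   # skri-va
```

The sketch treats adjacent vowels as one group and ignores compound boundaries, so it will split some compounds and vowel sequences incorrectly.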
Additional information
You will have to write a report for this lab. The report may be written by one or two people. Your report should contain at least the following parts:
You may write your report in English or in Swedish.