This file contains information about the labs in the LRT part of the course Språkteknologiska Delområden VT97. You can choose from two assignments:
The requirements for the reports are the same for both assignments. There is no limit on the number of persons that choose a particular exercise.
In this lab session you will design a part of a spelling checker (1-6), a grammar checker (7) or a hyphenation program (8). The parts that you can work on are:
Note: you only have to do one of these topics! Parts 3 and 4 may be combined in one assignment if you think that is necessary and that you can manage both. Your principal task is to design, on paper, a system that handles the topic: define what tasks it should perform and how it should perform them. No programming is required, although you may program if you want to. For some topics there are small programs available which you can use as a starting point if you want to program.
The extra software and data files for this lab can be found in the directory:
/usr/users/staff/erikt/P/st97/lrtlab/
If you want to use the software then first copy it to a directory of your own. The main file in the directory is a Swedish word list containing 84740 words (887kB). You don't need to copy this file to your own directory. The word list was extracted from Göran Andersson's Swedish files for the Ispell spelling program.
Texts usually contain formatting codes that you want to get rid of before you start spell checking. We will assume that we need to spell check HTML files, so we need to remove codes like <p> and change entities like &auml; into ä. Another task the preprocessor could perform is tokenization: dividing the text into words. The process of making the text ready for spell checking is called preprocessing.
In this assignment your task is to define the tasks of a preprocessor for a spelling program. Assuming that we have to check Swedish HTML texts, you can try to answer the following questions in your report:
There is a simple script for HTML code removal available for people who want to program: html2ascii. This program removes HTML code from a file. However, if you test it on a file like Tiden är din (save it first as an HTML file and run html2ascii test.html|more), you will see that it is not perfect. You can try to improve the script if you want.
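The preprocessing steps described above (tag removal, entity conversion and tokenization) can be sketched in Python as follows. This is only an illustration and not the html2ascii script; the names TextExtractor, strip_html and tokenize are made up for this example.

```python
import re
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect the text between tags and drop the tags themselves."""

    def __init__(self):
        # convert_charrefs=True makes the parser decode entities such as &auml;
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def strip_html(html_text):
    """Return the text of an HTML string with every tag replaced by a space."""
    parser = TextExtractor()
    parser.feed(html_text)
    parser.close()
    return " ".join(parser.chunks)


def tokenize(text):
    """Split text into words: runs of letters, including å, ä and ö."""
    return re.findall(r"[^\W\d_]+", text)
```

For example, tokenize(strip_html("<p>Tiden &auml;r <b>din</b>.</p>")) gives the words Tiden, är and din. Note the design choices hidden in this sketch: every tag becomes a space, so a word with a tag inside it is split in two, and digits and punctuation are thrown away. Decisions like these are what the report should discuss.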
1.2. Checking Words Against a Dictionary
Comparing words with dictionary words is an easy task. However, it will not generate perfect results, since we cannot include all words in the dictionary (neither all compounds nor all names). In this lab variant you will test a small dictionary check program. The questions you have to answer in your report are:
In the answer to the second question I am most interested in error types which occur frequently and which can be solved in the dictionary.
The program that you can test is the script checkWords. It can be used in combination with the preprocessor html2ascii. Copy these two files to a directory of your own and process the file Tiden är din with the command sequence:
html2ascii YourFile.html | checkWords | more
The main resource in this spelling task is the dictionary. The checkWords script uses the dictionary of the Ispell spelling program. There are two versions of the dictionary: a big word list (887kB) and a small word list with extra affix information (269kB) together with an affix list (6kB). Inspect the small word list and the affix file and suggest a few affixes for words which do not contain one.
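The idea of a word list combined with affix information can be sketched in Python as follows. This is not how checkWords or Ispell work internally and it does not read the Ispell affix file format; the function names and the suffixes argument are assumptions made for this example.

```python
def load_words(lines):
    """Build a lookup set from an iterable with one word per line."""
    return {line.strip() for line in lines if line.strip()}


def unknown_words(tokens, dictionary, suffixes=()):
    """Return the tokens that are neither dictionary words nor a
    dictionary word followed by one of the allowed suffixes."""
    rejected = []
    for token in tokens:
        if token in dictionary:
            continue
        # Hypothetical affix handling: strip a suffix and look up the stem.
        if any(s and token.endswith(s) and token[: -len(s)] in dictionary
               for s in suffixes):
            continue
        rejected.append(token)
    return rejected
```

With the word list tid, din, är and the single suffix en, the tokens tiden, är, dinn produce only dinn as unknown. The sketch is case-sensitive, so it has the same trouble with sentence-initial words as described in the next section, and applying every suffix to every word will accept forms that do not exist.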
1.3. Handling Capital Characters
One of the problems for dictionary-based spelling checking methods is that sentence-initial words, which start with a capital character, can be classified as misspellings because they do not exist in the dictionary. For an example you can run the checkWords script of the dictionary assignment. You will see that the program has difficulty handling sentence-initial words. There are several solutions for this:
In your report you can answer the following questions:
Names pose a special problem for a spelling checker. Often they will not be present in the dictionary but it should be possible to recognize them for example by considering the first character and possible other appearances in the text. Your task in this assignment is to design a name recognizing system which can be used in a spelling checker. Try to answer the following questions in your report:
The last question points at the possibility of using frequency information about words that do not appear in the dictionary for classifying them as correct.
In Swedish one has the possibility of combining words into a longer word and thus obtaining a compound. It is not possible to put all compounds in the dictionary, and therefore it is desirable to have a function that checks whether a word is a compound. In this assignment you will design a compound recognizer for Swedish. Answer the following questions in your report:
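The two clues mentioned for names (the first character and other appearances in the text) can be combined in a small Python sketch. The function name likely_names, the input format (one list of tokens per sentence) and the threshold min_count are assumptions made for this example, not part of the lab software.

```python
from collections import Counter


def likely_names(sentences, dictionary, min_count=2):
    """Guess which unknown capitalized words are names.

    sentences: list of token lists, one list per sentence.
    A capitalized word that is not in the dictionary is accepted when it
    occurs somewhere other than at the start of a sentence, or when it
    occurs at least min_count times (an arbitrary threshold).
    """
    counts = Counter()
    seen_inside = set()
    for sentence in sentences:
        for position, token in enumerate(sentence):
            if not token[:1].isupper() or token.lower() in dictionary:
                continue
            counts[token] += 1
            if position > 0:
                seen_inside.add(token)
    return {t for t, n in counts.items() if t in seen_inside or n >= min_count}
```

For the sentences "Erik bor i Uppsala", "Uppsala är fint" and "Gammla hus" this accepts Uppsala but neither Erik nor the misspelling Gammla, since both appear only once and only in sentence-initial position. The missed name Erik shows the kind of trade-off your design will have to address.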
You can use the script compoundCheck for checking if a certain word is a compound or not. This script makes use of the Prolog program compound.p. Take a look at that Prolog program. Notice that it is very simple: it only checks for combinations of two words and it has a limited vocabulary. Yet it is very slow.
You can add a few words to the Prolog program, expand the wellformedness table and test it for some compounds in the following way:
echo YourCompound | compoundCheck
If the word is echoed back by the program then it did not accept it. If the compound checker remains quiet then it has accepted the compound.
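For comparison, a two-part compound check of the same kind can be sketched in Python. This is not a translation of compound.p: the function name split_compound and the links argument are assumptions, and the only linking element considered is the Swedish linking s (as in fotbollslag).

```python
def split_compound(word, dictionary, links=("", "s")):
    """Return all (first, link, second) splits of word in which both
    parts are dictionary words and link is an allowed linking element."""
    splits = []
    for i in range(1, len(word)):
        first = word[:i]
        if first not in dictionary:
            continue
        rest = word[i:]
        for link in links:
            second = rest[len(link):]
            if rest.startswith(link) and second in dictionary:
                splits.append((first, link, second))
    return splits
```

If the dictionary is a Python set, each lookup is cheap, so every split point of a word can be tried. With the words fotboll, lag, fot, boll and slag in the dictionary, fotbollslag gets two analyses: fotboll + slag and fotboll + s + lag. A recognizer that accepts every such split will also accept many misspellings, which is why a wellformedness table or a similar restriction is needed.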
Spelling correction consists of two tasks: splitting the dictionary into smaller parts and generating from these parts the best alternative for a misspelling. In this assignment you will work on the second task. You can either use the string distance function suggested by Theo Vosse (handout page 14) or make a function yourself. Answer the following questions in your report:
In this assignment you will only use a small dictionary with words chosen by yourself. Still a lot of computations may be required to decide how the string distance function performs. Ideally you write a little program to perform the computations for you, for example in Prolog.
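As an illustration of such a program, here is a Python sketch that uses the standard Levenshtein edit distance. This is not the function from the handout; it is a well-known alternative swapped in for the example, and the names edit_distance and best_alternatives are made up here.

```python
def edit_distance(a, b):
    """Levenshtein distance: the minimal number of single-character
    insertions, deletions and substitutions that turn a into b."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # delete ca
                current[j - 1] + 1,            # insert cb
                previous[j - 1] + (ca != cb),  # substitute, or match at no cost
            ))
        previous = current
    return previous[-1]


def best_alternatives(misspelling, dictionary):
    """Return the dictionary words closest to the misspelling, sorted.
    The dictionary is assumed to be non-empty."""
    scored = [(edit_distance(misspelling, word), word) for word in dictionary]
    best = min(distance for distance, _ in scored)
    return sorted(word for distance, word in scored if distance == best)
```

With the small dictionary tid, tiden, din the misspelling tidn has two best alternatives, tid and tiden, both at distance 1. A plain edit distance cannot choose between them, which is one reason to consider weighting the operations differently in your own function.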
In this assignment you choose a specific grammatical topic in which people frequently make errors and design some system that can recognize these errors. Examples of topics you can choose are: agreement (subject - verb or determiner - adjective - noun), reflexive pronouns, split compounds or some other error that you consider important. Try to answer the following questions in your report:
Note that grammar checking is a hard problem and your recognizer will probably have fewer hits than misses. Try to have reasonable expectations about the results of this assignment.
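To give an impression of how small such a recognizer can start, here is a Python sketch for one narrow case of agreement: an indefinite article (en or ett) directly followed by a noun. The mini-lexicon and the function name agreement_errors are hypothetical; a real system would need a full lexicon and would have to handle adjectives between the article and the noun.

```python
# Hypothetical mini-lexicon: noun -> gender ("utrum" or "neutrum").
NOUN_GENDER = {"bil": "utrum", "bok": "utrum", "hus": "neutrum", "barn": "neutrum"}
# Indefinite articles and the gender they require.
ARTICLE_GENDER = {"en": "utrum", "ett": "neutrum"}


def agreement_errors(tokens):
    """Return the (article, noun) pairs of adjacent tokens whose genders disagree."""
    errors = []
    lowered = [token.lower() for token in tokens]
    for article, noun in zip(lowered, lowered[1:]):
        if article in ARTICLE_GENDER and noun in NOUN_GENDER:
            if ARTICLE_GENDER[article] != NOUN_GENDER[noun]:
                errors.append((article, noun))
    return errors
```

For the tokens of "Jag har en hus och ett barn" the sketch reports the pair en hus and nothing else. It misses every error that involves a word outside the lexicon, which illustrates the note above about hits and misses.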
In this assignment you will design a system for determining the hyphenation points in Swedish words. Find some literature about this if you are not absolutely sure about the hyphenation rules. In your report you can answer the following questions:
Hyphenation systems can use just about any programming methodology. You will probably use some rule-based technique. For other possibilities see the paper by Walter Daelemans and Antal van den Bosch.
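A rule-based technique can be as simple as the following Python sketch. It uses one deliberately naive rule: allow a break before any consonant that is directly followed by a vowel, as long as a vowel has already occurred in the word. This rule is an assumption made for illustration only. It ignores compound boundaries and every special case, so check it against the literature before relying on it. The function names are made up for this example.

```python
VOWELS = set("aeiouyåäö")


def hyphenation_points(word):
    """Return the indices i such that a hyphen may be placed before word[i]."""
    lower = word.lower()
    points = []
    for i in range(1, len(lower) - 1):
        # Break before a consonant that is directly followed by a vowel,
        # provided that some vowel occurs earlier in the word.
        if (lower[i] not in VOWELS and lower[i + 1] in VOWELS
                and any(c in VOWELS for c in lower[:i])):
            points.append(i)
    return points


def hyphenate(word):
    """Return the word with a hyphen inserted at every hyphenation point."""
    pieces, start = [], 0
    for i in hyphenation_points(word):
        pieces.append(word[start:i])
        start = i
    pieces.append(word[start:])
    return "-".join(pieces)
```

The sketch turns tala into ta-la and hoppa into hop-pa, and leaves bil unchanged. Testing it on compounds will quickly show where a single rule of this kind breaks down.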
One of the standard spelling programs on UNIX is Ispell. This program was originally written for English, but it can now deal with many languages, including Swedish. You can test the Ispell program by choosing Spell in the Edit menu in the emacs editor.
Your task in this assignment is to evaluate Ispell for Swedish and test how well it performs with respect to the theoretical issues we discussed in the lectures. You can submit Ispell to tests that you design yourself. Your report only has to answer one question:
You can find more information about Ispell on the two web pages and via the manuals (man ispell and man 4 ispell). The dictionary files of Ispell are stored in the directory /usr/local/lib/ispell. Please note that these files are not quite readable for humans. If you think you understand how a spelling subtopic is handled by Ispell then you can try to describe it in your report.
You will have to write a report for this lab. The report can be written by one or by two persons. It needs to contain at least two pages (at least three if you work in a pair). Your report should contain at least the following parts:
You can write your report in English or in Swedish. The appendix can be included in the page count. The deadline for handing in the reports is Sunday April 27, 1997. Reports handed in after that day will receive a 1 point penalty per extra day.