Classes: 22 & 23 & 24
Dates: 960312 & 960314 & 960315
Topic: Practical exercise Language Revision Tools
In this exercise you will work with a small modular spelling correction program for Swedish html files. The most important goal of this exercise is not building a perfect Swedish spelling checker but learning to recognize some of the practical problems that pop up when designing a spelling checker. If you like programming you can do some programming work on the spelling checker but that is optional.
You will write a small report about this exercise. The mark you will receive for this report will be your mark for the LRT part of this course. You can either do this exercise on your own or work in a pair.
All the software for this exercise can be found in the directory:
/home/staff/web/priv/st96/stava/
The word list that I have used comes from different sources. The most important one is the Swedish word list that comes with ispell. The files for the word list can be found at:
ftp://ftp.ida.liu.se/pub/bibframe/svordlista
Spelling checking can be performed by several modules:
Texts usually contain formatting codes that you want to get rid of before you start spelling checking. We need to spell check html files, so we need to remove codes like <p> and convert character tokens like &auml; into ä. The process of making the text ready for spelling checking is called preprocessing.
html2ascii is a simple shell script that attempts to convert html to ascii. Take a look at the script and try to understand as many of its parts as possible. The script relies heavily on the substitute command of sed. For example:
sed 's/A/B/g'
means substitute (s) all the occurrences (g=global) of A in a text by B. One sed command in the script performs a conversion like that and then passes the text to the next sed command which performs another conversion.
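A chain of such commands could look like the following minimal sketch. It is not the html2ascii script itself: the function name strip_html and the particular substitutions are assumptions chosen for illustration, and the tag pattern only handles tags that open and close on the same line.

```shell
# Illustrative sketch only; strip_html is a hypothetical name, not html2ascii.
strip_html() {
  sed 's/<[^>]*>//g' |     # remove tags such as <p> (single-line tags only)
  sed 's/&auml;/ä/g' |     # each later sed receives the output of the previous one
  sed 's/&ouml;/ö/g' |
  sed 's/&aring;/å/g'
}

printf '%s\n' '<p>K&auml;ra l&auml;sare</p>' | strip_html
```

Run on the sample line, this prints Kära läsare. Every additional character token needs one more substitution in the chain.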
The script is not perfect. You can test it by running it on the file Om Uppsala universitet (save it first as an html file, for example test.html, and run html2ascii test.html | more). Your assignment here is to change html2ascii so that it also removes the extra html code from this file. Note: you can get é in emacs by typing Control-q followed by 351 and É by typing Control-q followed by 311.
checkWords will extract the words from a text and compare them with the words in a dictionary. This is equivalent to performing isolated spelling checks: each word is checked on its own, without looking at its context. Read the program and try to understand as many of its parts as possible. Then run the program on the output of html2ascii in the following way:
html2ascii YourFile.html | checkWords
You will find out that this program generates many false alarms. Try to find the cause of as many of these as possible and write the causes down. If you think that you can solve a few of them by changing the checkWords program, then change the program. If you have problems converting your ideas into programming code, ask the teacher for help.
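The general idea of extracting words and comparing them with a dictionary can be sketched as a short pipeline. This is not the checkWords program: the function name unknown_words, the one-word-per-line dictionary format and the tiny sample dictionary are all assumptions made for illustration.

```shell
# Illustrative sketch only; unknown_words is a hypothetical stand-in for checkWords.
# Reads text on stdin and prints every word that is not a line of the
# dictionary file given as $1 (assumed format: one word per line).
unknown_words() {
  LC_ALL=C tr -s '[:space:][:punct:][:digit:]' '\n' |  # one token per line
  grep -v '^$' |                                        # drop empty lines
  LC_ALL=C sort -u |                                    # report each word once
  grep -v -x -F -f "$1"                                 # keep words missing from the dictionary
}

dict=$(mktemp)
printf '%s\n' en liten text > "$dict"
printf '%s\n' 'En liten text.' | unknown_words "$dict"
rm -f "$dict"
```

With this sample dictionary the only word reported is En, which is a false alarm rather than a real spelling error.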
One of the problems that you will see in the output of checkWords is that sentence-initial words start with a capital letter, and words that start with a capital letter can in general not be found in the dictionary. There are several solutions for this:
Which of these solutions do you prefer? Or can you think of another good solution for handling capitalized words? You don't need to implement this part of the spelling checker. Writing down your ideas is enough.
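To make the discussion concrete, here is a minimal sketch of one conceivable approach: retry the lookup with the first letter lowercased. It is an illustration under stated assumptions, not a recommended answer. The function name and the dictionary format (one word per line) are hypothetical, and the lowercasing only covers ASCII letters, so Å, Ä and Ö would need extra handling.

```shell
# Illustrative sketch only; the function name is hypothetical.
# Succeeds if word $1 is a line of dictionary file $2, either as written
# or with its first letter lowercased.
known_ignoring_initial_capital() {
  if grep -q -x -F -e "$1" "$2"; then
    return 0
  fi
  # ASCII only: a leading Å, Ä or Ö is left unchanged by this tr call.
  kc_first=$(printf '%s\n' "$1" | cut -c1 | LC_ALL=C tr 'A-Z' 'a-z')
  kc_rest=$(printf '%s\n' "$1" | cut -c2-)
  grep -q -x -F -e "$kc_first$kc_rest" "$2"
}

dict=$(mktemp)
printf '%s\n' en liten text > "$dict"
for w in En liten TExt; do
  known_ignoring_initial_capital "$w" "$dict" || printf 'unknown: %s\n' "$w"
done
rm -f "$dict"
```

The sketch accepts En and liten and reports TExt. Its weakness is that it also accepts a capitalized word in the middle of a sentence, so it cannot tell a sentence-initial capital from a capitalization error.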
In Swedish one has the possibility of combining words into a longer word and thus obtaining a compound. It is not possible to put all the compounds in the dictionary, and therefore it is desirable to have a function that checks whether a word is a compound. You can use the script compoundCheck for checking whether a certain word is a compound or not. This script makes use of the Prolog program compound.p. Take a look at that Prolog program. Notice that it is very simple: it only checks for combinations of two words and it has a limited vocabulary. Yet it is very slow.
Add a few words to the Prolog program and test it for some compounds in the following way:
echo YourCompound | compoundCheck
If the word is echoed back by the program then it did not accept it. If the compound checker remains quiet then it has accepted the compound.
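The idea of accepting a word when it splits into two known words can be sketched in shell as follows. This is neither the compoundCheck script nor a translation of compound.p: the function name, the lexicon file format (one word per line) and the sample words are assumptions. The split points are counted in bytes, so the sketch is only reliable for words written with ASCII letters.

```shell
# Illustrative sketch only; is_two_word_compound is a hypothetical name and
# this is neither compoundCheck nor a translation of compound.p.
# Succeeds if $1 can be split into two parts that are both lines of the
# lexicon file $2 (assumed format: one word per line).
is_two_word_compound() {
  tc_len=$(( $(printf '%s' "$1" | wc -c) ))   # length in bytes
  tc_i=1
  while [ "$tc_i" -lt "$tc_len" ]; do
    tc_head=$(printf '%s\n' "$1" | cut -b "1-$tc_i")
    tc_tail=$(printf '%s\n' "$1" | cut -b "$((tc_i + 1))-")
    if grep -q -x -F -e "$tc_head" "$2" && grep -q -x -F -e "$tc_tail" "$2"; then
      return 0
    fi
    tc_i=$((tc_i + 1))
  done
  return 1
}

lexicon=$(mktemp)
printf '%s\n' bok hylla > "$lexicon"
# Same convention as described above: echo rejected words, stay quiet otherwise.
for w in bokhylla bokhyllan; do
  is_two_word_compound "$w" "$lexicon" || printf '%s\n' "$w"
done
rm -f "$lexicon"
```

With this two-word lexicon the loop stays quiet for bokhylla (bok + hylla) and echoes bokhyllan. Every candidate split costs two dictionary lookups, which hints at why even a simple compound checker can become slow.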
You may have noticed that the compounds made by the program contain no binding morpheme, such as the s that joins forskning and ingenjör in forskningsingenjör. Do you think that it is possible to make a general rule for Swedish that states when the binding morpheme is necessary and when it is not? Or should this information be encoded in the dictionary?
You will have to write a report of at least two pages (three if you work in a pair) about this assignment. Your report should contain at least the following parts:
You can write your report in English or in Swedish. Your report will be graded with a mark between 1 and 10 (inclusive). The deadline for handing in the reports is Thursday April 4, 1996. Reports handed in after that day will receive a penalty of 1 point for each day they are late.