
Språkteknologiska delområden VT96:22

Classes: 22 & 23 & 24
Dates:   960312 & 960314 & 960315
Topic:   Practical exercise Language Revision Tools

Practical exercise Language Revision Tools

In this exercise you will work with a small modular spelling correction program for Swedish html files. The most important goal of this exercise is not building a perfect Swedish spelling checker but learning to recognize some of the practical problems that pop up when designing a spelling checker. If you like programming you can do some programming work on the spelling checker but that is optional.

You will write a small report about this exercise. The mark you will receive for this report will be your mark for the LRT part of this course. You can either do this exercise on your own or work in a pair.

All the software for this exercise can be found in the directory:

/home/staff/web/priv/st96/stava/

The word list that I have used comes from several sources. The most important one is the Swedish word list that comes with ispell. The files for the word list can be found at:
ftp://ftp.ida.liu.se/pub/bibframe/svordlista

Introduction

Spelling checking can be performed by several modules:

Preprocessing

Texts usually contain formatting codes that you want to get rid of before you start spelling checking. We need to spell check html files, so we need to remove codes like <p> and change character tokens like &auml; into ä. The process of making the text ready for spelling checking is called preprocessing.

html2ascii is a simple shell script that attempts to convert html to ascii. Take a look at the script and try to understand as many of its parts as possible. The script makes heavy use of the substitute command of sed. For example:

sed 's/A/B/g'

means substitute (s) all the occurrences (g=global) of A in a text with B. Each sed command in the script performs one conversion like that and then passes the text to the next sed command, which performs another conversion.
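The chaining idea can be sketched in Python, used here for illustration only, since html2ascii itself is a shell script. The patterns and the function name html_to_text are assumptions made for this sketch, not the contents of html2ascii:

```python
import re

# Each pair is (pattern, replacement), applied in order like a chain of
# sed 's/A/B/g' commands. This list is illustrative, not the rule set
# used by html2ascii.
SUBSTITUTIONS = [
    (r"<[^>]*>", ""),    # drop tags such as <p>
    (r"&auml;", "ä"),
    (r"&Auml;", "Ä"),
    (r"&ouml;", "ö"),
    (r"&Ouml;", "Ö"),
    (r"&aring;", "å"),
    (r"&Aring;", "Å"),
    (r"&eacute;", "é"),
    (r"&amp;", "&"),     # decoded last so "&amp;auml;" is not decoded twice
]


def html_to_text(html):
    """Apply every substitution in turn and return the converted text."""
    text = html
    for pattern, replacement in SUBSTITUTIONS:
        text = re.sub(pattern, replacement, text)
    return text
```

Calling html_to_text("<p>Uppsala &auml;r en stad.</p>") returns "Uppsala är en stad.". The order of the rules matters: each step sees the output of the previous one, just as in a chain of sed commands.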

The script is not perfect. You can test it by running it on the file Om Uppsala universitet (save it first as an html file, for example test.html, and run html2ascii test.html|more). Your assignment here is to change html2ascii in such a way that it is able to remove the extra html code from this html file as well. Note: you can get é in emacs by typing Control-q followed by 351 and É by typing Control-q followed by 311.
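The numbers in the note are character codes written in octal. A minimal Python sketch, assuming the ISO 8859-1 (Latin-1) encoding and using the hypothetical helper name latin1_char, shows which character each code stands for:

```python
def latin1_char(octal_code):
    """Return the Latin-1 character for a code given as an octal string."""
    return bytes([int(octal_code, 8)]).decode("latin-1")
```

With this helper, latin1_char("351") gives é and latin1_char("311") gives É.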

Isolated word check

checkWords will extract the words from a text and compare them with the words in a dictionary. This is equivalent to performing isolated-word spelling checks. Read the program and try to understand as many of its parts as possible. Then run the program on the output of html2ascii in the following way:

html2ascii YourFile.html | checkWords

You will find out that this program generates many false alarms. Try to find out the cause of as many of these as possible and write down these causes. If you think that you can solve a few of them by changing the checkWords program, then change the program. If you have problems with converting your ideas into program code, then ask the teacher for help.
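An isolated word check of this kind can be sketched in Python as follows. This is not the checkWords program: the function names and the definition of a word as a run of letters are assumptions made for illustration.

```python
import re


def extract_words(text):
    """Return the runs of letters in the text (å, ä, ö and é count as letters)."""
    return re.findall(r"[^\W\d_]+", text)


def unknown_words(text, dictionary):
    """Return the words missing from the dictionary, in order, without repeats."""
    seen = set()
    unknown = []
    for word in extract_words(text):
        if word not in dictionary and word not in seen:
            seen.add(word)
            unknown.append(word)
    return unknown
```

With the dictionary {"Uppsala", "är", "en", "stad"}, the text "Uppsala är en stad. En stad." yields ["En"]: the lookup is an exact string comparison, so a correctly spelled word is reported as soon as its form in the text differs from the form in the dictionary.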

Capital characters

One of the problems that you will see in the output of checkWords is that sentence-initial words start with a capital character, and words with a capital character generally cannot be found in the dictionary. There are several solutions for this:

  1. Convert both the dictionary and the words from the text to lower case characters. This solves the problem but it will prevent you from detecting spelling errors such as sverige instead of Sverige.
  2. Convert only the words at the beginning of a sentence to lower case characters. This solves the problem but if a sentence starts with a name (Sverige) then converting the first character to lower case will result in a false alarm.
  3. Accept that every lower case character in a dictionary word may be converted into a capital character in the text but add the constraint that a capital character in the dictionary may never be converted into a lower case character in the text. This will solve both problems.
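The matching rule of solution 3 can be sketched in Python. The function names and the lower-case index are assumptions made for this illustration and are not part of checkWords:

```python
def matches(text_word, dict_word):
    """True if text_word may be a capitalized variant of dict_word."""
    if len(text_word) != len(dict_word):
        return False
    for t, d in zip(text_word, dict_word):
        if d.isupper():
            # A capital in the dictionary must stay a capital in the text.
            if t != d:
                return False
        elif t.lower() != d:
            # A lower case dictionary character may appear in either case.
            return False
    return True


def build_index(words):
    """Group dictionary words by their lower case form for fast lookup."""
    index = {}
    for word in words:
        index.setdefault(word.lower(), []).append(word)
    return index


def accepted(text_word, index):
    """True if some dictionary word matches text_word under the rule."""
    candidates = index.get(text_word.lower(), [])
    return any(matches(text_word, d) for d in candidates)
```

With a dictionary containing hus and Sverige, this sketch accepts hus, Hus, Sverige and SVERIGE but rejects sverige.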

Which of these solutions do you prefer? Or can you think of another good solution for handling capitalized words? You don't need to implement this part of the spelling checker. Writing down your ideas is enough.

Compounding

In Swedish one has the possibility of combining words into a longer word and thus obtaining a compound. It is not possible to put all the compounds in the dictionary and therefore it is desirable to have a function that checks if a word is a compound. You can use the script compoundCheck for checking if a certain word is a compound or not. This script makes use of the Prolog program compound.p. Take a look at that Prolog program. Notice that it is very simple: it only checks for combinations of two words and it has a limited vocabulary. Yet it is very slow.
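The idea of a two-word compound check can be sketched in Python. This is an illustration of the technique, not a translation of compound.p; the function name and the sample vocabulary are assumptions:

```python
def is_two_word_compound(word, vocabulary):
    """True if word can be split into two parts that are both in vocabulary."""
    for i in range(1, len(word)):
        if word[:i] in vocabulary and word[i:] in vocabulary:
            return True
    return False
```

With the vocabulary {"språk", "teknologi", "forskning", "ingenjör"} the sketch accepts språkteknologi. It rejects forskningsingenjör, because neither forskning + singenjör nor forsknings + ingenjör consists of two vocabulary words.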

Add a few words to the Prolog program and test it for some compounds in the following way:

echo YourCompound | compoundCheck

If the word is echoed back by the program then it did not accept it. If the compound checker remains quiet then it has accepted the compound.

You may have noticed that the compounds made by the program contain no binding morpheme, such as the s in forskningsingenjör. Do you think that it is possible to make a general rule for Swedish that states when the binding morpheme is necessary and when it is not? Or should this information be encoded in the dictionary?

Assignment

Report

You will have to write a report of at least two pages (three if you work in a pair) about this assignment. Your report should contain at least the following parts:

  1. Introduction: a summary of the task(s) you have performed in this assignment.
  2. Answers to the questions in this exercise.
  3. What weak (or strong) points did you discover in this simple spelling software?
  4. What are your recommendations for changes or improvements in the described modules of this spelling checker?
  5. Do you think that the overall design of this spelling checker allows adding other modules? If your answer is yes, then describe what modules could be added. If your answer is no, then describe what should be changed in the general setup of the spelling checker to allow adding extra advanced modules.

You can write your report in English or in Swedish. Your report will be graded with a mark between 1 and 10 (inclusive). The deadline for handing in the reports is Thursday April 4, 1996. Reports handed in after that day will receive a 1-point penalty for each day they are late.


Last update: April 23, 1999. erikt@stp.ling.uu.se