Dokumenthantering VT98:09

These are the exercises for the third lab session of the course Dokumenthantering VT98. There are 9 exercises this week. Three of them are marked with a * as obligatory exercises, and you only have to do one of those three.

Write a report about the obligatory exercise you have chosen. The report should fulfill the same requirements as the report for the first lab. The deadline for handing in the report for this week's exercises is Wednesday February 18, 1998.


Exercises Lab 3

  1. In the directory /home/staff/web/priv/dh98/misc you will find a Frame file (000103sv.01), the corresponding MIF file (.mif) and the corresponding plain text file (.txt). Compare the Frame file (in FrameMaker: start with frame4 or frame5) with the plain text file. Do you see anything strange? The plain text file was created by saving the Frame document as plain text in FrameMaker.

    Now examine the MIF file. Find the codes for the lower case variants of the three Swedish vowels å, ä and ö. Make a rough estimate of what percentage of the file is format markup (style definitions?) and what percentage is text and content type markup. You can do this by finding out at what percentage of the file the actual text starts (use the more command).

    Examine the binary Frame file by loading it in emacs. You will see that the document contains a lot of zeroes and some recognizable text, part of which is markup code. In UNIX there is a general program for extracting printable strings from a binary file: strings. Apply the program to the binary file. It will show all sequences of four or more printable characters. Do you think strings is usable as a Frame to plain text converter?

  2. In the directory /home/staff/web/priv/dh98/misc you will find a Word file, the corresponding plain text file (.txt) and the corresponding RTF file (.rtf). Examine the Word file by loading it in emacs. Compare it with the text file. Notice that the Word file contains old versions of the text which do not appear in the text file. Examine the RTF file as well. You will see that the file contains only one version and little format information. Notice the size difference between the three documents.

  3. * The Unix program tr can replace and delete characters in a text. Write a Perl program that simulates the replacement task of tr. The program takes two strings as command line arguments and reads a text from standard input. In that text it replaces every character that occurs in the first string with the character at the same position in the second string. It may give an error message when the strings are not equally long, and it does not need to recognize command line options. [answer example]
    Note: If you choose to do exercise 3, you don't have to do exercises 4 and 5.

  4. * Create a Perl program that can sort a list of words according to one of the three sort algorithms. [answer example]
    Note: If you choose to do exercise 4, you don't have to do exercises 3 and 5.

  5. * Write a Perl program that divides a text into words. State your definition of a word in your report and mention any problems that the program may have. [answer example]
    Note: If you choose to do exercise 5, you don't have to do exercises 3 and 4.

  6. Save a Word file as an HTML file from within Word, so that Word itself performs the Word to HTML conversion. Is the result what you expected? Test the conversion on different text structures (headings, font styles, tables, ...).

  7. Write a toy spelling checker in Perl with a dictionary of no more than 100 words. It should be able to detect errors in a text of your own choice. It does not need to generate corrections for the misspelled words.

  8. Create a formal description of five grammatical error types in Swedish which you think could be detected by software.

  9. Write a trigram program which collects data from some corpus and, for any bigram the user enters, gives the most frequent successor of that bigram.
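
For exercise 1, it may help to see how little an extractor in the style of strings actually does. The following is a minimal sketch in Python, not the real strings program: the function name printable_strings and the choice of "printable" (tab plus ASCII space through tilde) are assumptions made for this illustration.

```python
import re

def printable_strings(data: bytes, min_len: int = 4) -> list:
    """Return every run of at least min_len printable bytes found in data.

    Assumption for this sketch: "printable" means tab or ASCII 0x20-0x7E.
    """
    # Build a pattern such as [\t\x20-\x7e]{4,} for the requested length.
    pattern = re.compile(rb"[\t\x20-\x7e]{%d,}" % min_len)
    return [match.group().decode("ascii") for match in pattern.finditer(data)]

# Made-up sample bytes: zero padding, a short run, and one non-ASCII byte.
sample = b"\x00\x00<ParaLine\x00\x01ab\x00Hello world\xe5\x00"
print(printable_strings(sample))
```

In this sketch any byte outside the ASCII range ends a run, so text containing non-ASCII letters is cut into pieces. That is worth keeping in mind when you judge the output for a Swedish document.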

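The replacement task of exercise 3 comes down to a lookup table from characters to characters. The sketch below shows that idea in Python only to illustrate the logic; the exercise itself asks for Perl, and the function name replace_chars is made up for this example. It is one possible approach, not the answer example.

```python
def replace_chars(text: str, source: str, target: str) -> str:
    """Replace each character of source found in text by the character
    at the same position in target. Other characters are left alone."""
    if len(source) != len(target):
        raise ValueError("the two strings must be equally long")
    # Map every source character to its counterpart in target.
    mapping = dict(zip(source, target))
    return "".join(mapping.get(ch, ch) for ch in text)

# Made-up example: replace l by 0 and o by 1.
print(replace_chars("hello world", "lo", "01"))
```

A complete program would read the two strings from the command line and the text from standard input. Both are left out here to keep the sketch short.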

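For exercise 9, one way to organize the data is a table that maps each bigram to the counts of the words that follow it. The sketch below is in Python and assumes word trigrams over an already tokenized corpus; the exercise does not fix these choices, and the function names are made up for this illustration.

```python
from collections import Counter, defaultdict

def build_trigram_table(words):
    """Map each bigram (w1, w2) to a Counter of the words that follow it."""
    table = defaultdict(Counter)
    for first, second, third in zip(words, words[1:], words[2:]):
        table[(first, second)][third] += 1
    return table

def most_frequent_successor(table, first, second):
    """Return the most frequent word after the bigram, or None if unseen."""
    successors = table.get((first, second))
    if not successors:
        return None
    return successors.most_common(1)[0][0]

# Made-up toy corpus, already divided into words.
corpus = "the cat sat on the cat sat by the cat ran".split()
table = build_trigram_table(corpus)
print(most_frequent_successor(table, "the", "cat"))
```

With a real corpus the quality of the result depends on how the text is divided into words, which is the topic of exercise 5.
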
References Week 3

http://www.cogs.susx.ac.uk/cgi-bin/texfaq2html?introduction=yes
TeX Frequently Asked Questions.

http://www.stat.wisc.edu/computing/latex.html#emtex
EmTeX: LaTeX for Windows.

ftp://ftp.primate.wisc.edu/pub/RTF/index.html
An ftp site with general information about RTF, specifications and conversion tools.

/home/staff/web/priv/dh98/misc/lang.html
Overview of the language environments on AIX.

Larry Wall and Randal L. Schwartz, Programming Perl, O'Reilly & Associates, Inc., 1992.
A Perl book which can be used both as a textbook and as a reference guide.

Bengt Dahlqvist. TSSA 2.0, A PC Program for Text Segmentation and Sorting, Department of Linguistics, Uppsala University, 1994.
A tokenizing program developed at the Department of Linguistics at Uppsala University.

http://spectra.eng.hawaii.edu/Courses/EE150/Book/chap10/chap10.html
A chapter on searching and sorting in the online book Programming in C by Bharat Kinariwala and Tep Dobry.

http://stp.ling.uu.se/~erikt/papers/1996b.html
Abstract of the paper "Converting the Scania FrameMaker Documents to TEI SGML" by Erik F. Tjong Kim Sang.

http://www.oac.uci.edu/indiv/ehood/mifmucker.doc.html
MifMucker, the FrameMaker to HTML converter by Ken Harward.

http://www.dina.kvl.dk/DinaUnix/Info/recode/recode_toc.html
GNU recode: a free character conversion program.

Theo Vosse, The Word Connection. Enschede Uitgeverij, The Netherlands. ISBN 90-75296-01-0.
A dissertation about spelling checking.


Last update: February 16, 1998. erikt@stp.ling.uu.se