Dokumenthantering VT98:09
These are the exercises for the third lab session of the course
Dokumenthantering VT98.
There are 9 exercises this week and 3 of them are obligatory.
The obligatory exercises have been marked with a *.
You only have to do one of them.
Write a report about the obligatory exercise you have chosen.
The report should fulfill the same requirements as the report for the
first lab.
The deadline for handing in the report for this week's exercises is
Wednesday February 18, 1998.
1.
In the directory
/home/staff/web/priv/dh98/misc
you will find a Frame file (000103sv.01), the corresponding MIF file
(.mif) and the corresponding plain text file (.txt).
Compare the Frame file (to view it, start FrameMaker with frame4 or
frame5) with the plain text file.
Do you notice anything strange?
The plain text file was created by saving the Frame document as a
plain text file in FrameMaker.
Now examine the MIF file.
Find the codes for the lower case variants of the three Swedish
vowels.
Make a rough estimate of what percentage of the file is format
markup (such as style definitions) and what percentage is text and
content type markup.
You can do this by finding out how far into the file, as a percentage,
the actual text starts (use the more command).
Examine the binary Frame file by loading it into emacs.
You will see that the document contains many zero bytes and some
recognizable text, some of which is markup code.
In UNIX there is a general program for extracting printable strings
from a binary file: strings.
Apply the program to the binary file.
It will show all sequences of four or more printable characters.
Do you think strings is usable as a Frame to plain text converter?
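The behaviour described for strings can be sketched as follows. This is a minimal illustration in Python, not the real program: the helper name printable_runs is hypothetical, and treating only the ASCII range 0x20 to 0x7E as printable is an assumption made for the sketch.

```python
import re

def printable_runs(data, min_len=4):
    """Return every run of at least min_len printable ASCII bytes in data."""
    # Assumption: "printable" means the ASCII range 0x20 (space) to 0x7E (~).
    pattern = re.compile(rb"[\x20-\x7e]{%d,}" % min_len)
    return [match.group().decode("ascii") for match in pattern.finditer(data)]

# A made-up byte sequence standing in for a binary document.
sample = b"\x00\x00<MakerFile\x00\x01ab\x00Hello world\x00\x00"
print(printable_runs(sample))  # ['<MakerFile', 'Hello world']
```

The run "ab" is dropped because it is shorter than four characters.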
2.
In the directory
/home/staff/web/priv/dh98/misc
you will find a Word file, the corresponding plain text file (.txt)
and the corresponding RTF file (.rtf).
Examine the Word file by loading it in emacs.
Compare it with the text file.
Notice that the Word file contains old versions of the text which do
not appear in the text file.
Examine the RTF file as well.
You will see that the file contains only one version and little format
information.
Notice the size difference between the three documents.
3. *
The Unix program tr can replace characters in a text and delete
characters from it.
Write a Perl program that simulates the replacement task of tr.
The program takes two strings as command line arguments and reads a
text from standard input.
In that text it replaces every occurrence of a character from the first
string with the character at the same position in the second string.
It may give an error message when the strings are not equally long, and
it does not need to recognize command line options.
[answer example]
Note: If you choose to do exercise 3, you do not have to do
exercises 4 and 5.
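The replacement rule of exercise 3 can be illustrated with a minimal sketch. It is written in Python rather than Perl, for illustration only. The function name translate is hypothetical, and the sketch works on a string instead of reading standard input.

```python
def translate(text, source, target):
    """Replace each character of source found in text with the
    character at the same position in target."""
    if len(source) != len(target):
        raise ValueError("the two strings must be equally long")
    # str.maketrans builds a character-to-character mapping table.
    table = str.maketrans(source, target)
    return text.translate(table)

print(translate("hello world", "lo", "01"))  # he001 w1r0d
```

A command line version would take the two strings from its arguments and apply the same mapping to each line of its input.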
4. *
Write a Perl program that sorts a list of words using one of
the three sort algorithms.
[answer example]
Note: If you choose to do exercise 4, you do not have to do
exercises 3 and 5.
5. *
Write a Perl program that divides a text into words.
State your definition of a word in your report and mention any problems
that the program may have.
[answer example]
Note: If you choose to do exercise 5, you do not have to do
exercises 3 and 4.
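How much depends on the word definition in exercise 5 can be seen in a minimal sketch, written in Python rather than Perl for illustration only. The definition used here is an assumption chosen for the example: a word is a run of letters, optionally continued by an apostrophe or hyphen followed by more letters. The names WORD and split_into_words are hypothetical.

```python
import re

# Assumed word definition: letters, optionally joined by ' or - to more letters.
# [^\W\d_] matches any letter, including Swedish å, ä and ö.
WORD = re.compile(r"[^\W\d_]+(?:['-][^\W\d_]+)*")

def split_into_words(text):
    """Return the words of text, in order, under the definition above."""
    return WORD.findall(text)

print(split_into_words("Erik's e-mail kom kl. 10.30, på måndag!"))
# ["Erik's", 'e-mail', 'kom', 'kl', 'på', 'måndag']
```

Under this definition the abbreviation "kl." loses its period and the time "10.30" disappears altogether, which is the kind of problem the report should mention.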
6.
Save a Word file as an HTML file from within the Word program.
The program will perform the Word to HTML conversion.
Is the result what you expected?
Test converting different text structures (headings, font style,
tables, ...).
7.
Write a toy spelling checker in Perl with a dictionary of no more than
100 words.
It should be able to detect errors in a text of your own choice.
It does not need to suggest corrections for the misspelled words.
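The detection step of exercise 7 amounts to a dictionary lookup, as in the minimal sketch below. It is written in Python rather than Perl, for illustration only. The tiny DICTIONARY, the function name unknown_words and the choice to lower-case the text and treat runs of letters as words are all assumptions made for the example.

```python
import re

# Hypothetical toy dictionary; the exercise allows up to 100 words.
DICTIONARY = {"the", "cat", "sat", "on", "mat", "a", "dog"}

def unknown_words(text, dictionary):
    """Return the words of text (lower-cased) that are not in dictionary."""
    words = re.findall(r"[^\W\d_]+", text.lower())
    return [word for word in words if word not in dictionary]

print(unknown_words("The cat szat on the matt.", DICTIONARY))  # ['szat', 'matt']
```

Every word missing from the dictionary is reported, so a correct word that is simply not among the 100 entries is flagged as well.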
8.
Create a formal description of five types of grammatical errors in
Swedish that you think could be detected by software.
9.
Write a trigram program that collects data from a corpus and reports
the most frequent successor of any bigram that the user asks about.
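The counting behind a trigram program can be sketched as follows, in Python for illustration only. The function names, the whitespace tokenization and the one-line corpus are assumptions made for the example and not part of the exercise.

```python
from collections import Counter, defaultdict

def build_trigram_table(words):
    """Map each bigram (w1, w2) to a Counter of the words that follow it."""
    table = defaultdict(Counter)
    for first, second, third in zip(words, words[1:], words[2:]):
        table[(first, second)][third] += 1
    return table

def most_frequent_successor(table, first, second):
    """Return the most frequent word after the bigram, or None if unseen."""
    successors = table.get((first, second))
    if not successors:
        return None
    return successors.most_common(1)[0][0]

# Made-up corpus, split on whitespace.
corpus = "the cat sat on the mat and the cat sat down while the cat slept".split()
table = build_trigram_table(corpus)
print(most_frequent_successor(table, "the", "cat"))  # sat
```

Here "the cat" is followed by "sat" twice and by "slept" once, so "sat" is reported. When several successors are equally frequent, one of them is returned.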
- http://www.cogs.susx.ac.uk/cgi-bin/texfaq2html?introduction=yes
  TeX Frequently Asked Questions.
- http://www.stat.wisc.edu/computing/latex.html#emtex
  EmTeX: LaTeX for Windows.
- ftp://ftp.primate.wisc.edu/pub/RTF/index.html
  An ftp site with general information about RTF, specifications and
  conversion tools.
- /home/staff/web/priv/dh98/misc/lang.html
  Overview of the language environments on AIX.
- Larry Wall and Randal L. Schwartz, Programming Perl,
  O'Reilly & Associates, Inc., 1992.
  A Perl book that can be used as a textbook and as a reference guide.
- Bengt Dahlqvist, TSSA 2.0, A PC Program for Text Segmentation and
  Sorting, Department of Linguistics, Uppsala University, 1994.
  A tokenizing program developed at the Department of Linguistics at
  Uppsala University.
- http://spectra.eng.hawaii.edu/Courses/EE150/Book/chap10/chap10.html
  A chapter on searching and sorting in the online book
  Programming in C by Bharat Kinariwala and Tep Dobry.
- http://stp.ling.uu.se/~erikt/papers/1996b.html
  Abstract of the paper "Converting the Scania FrameMaker Documents
  to TEI SGML" by Erik F. Tjong Kim Sang.
- http://www.oac.uci.edu/indiv/ehood/mifmucker.doc.html
  MifMucker, the FrameMaker to HTML converter by Ken Harward.
- http://www.dina.kvl.dk/DinaUnix/Info/recode/recode_toc.html
  GNU recode: a free character conversion program.
- Theo Vosse, The Word Connection. Enschede Uitgeverij, The
  Netherlands. ISBN 90-75296-01-0.
  A dissertation about spelling checking.
Last update: February 16, 1998.
erikt@stp.ling.uu.se