Dokumenthantering VT97:07
These are the exercises and references for the seventh class of
the course Dokumenthantering.
Exercises
The results of the exercises marked with * must be handed in.
For the exercises marked with ?, you only have to hand in the
results of one of them.
- *
Change your sentence/word tokenizer from class four into a program
that generates an inverted file for bare, case-insensitive words, with
pointers to sentence numbers or to the intervals between sentence numbers.
Bare words are words without punctuation marks or other non-word
characters.
- *
Apply the program from the first exercise to the text "Om Uppsala
universitet", which can be found in the file
/usr/users/staff/erikt/html/dh97/uppsala.txt
Estimate the size of this text if you store it using a minimal
number of bits (five or six per character and number).
Is the result smaller than the original file?
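As a starting point, the two exercises above can be sketched in Python as follows. This is only a rough sketch under stated assumptions, not a model solution: the sentence splitter is a naive stand-in for your own tokenizer from class four, all function names are made up for illustration, and the size estimate assumes that every word character and every pointer costs the same fixed number of bits.

```python
import re
from collections import defaultdict


def split_sentences(text):
    # Naive assumption: a sentence ends at ".", "!" or "?" followed by
    # whitespace. Replace this with your own tokenizer from class four.
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def bare_words(sentence):
    # Bare words: runs of letters only (no digits, punctuation or
    # underscores), folded to lower case for case insensitivity.
    return [w.lower() for w in re.findall(r"[^\W\d_]+", sentence)]


def build_inverted_file(text):
    # Map each bare word to the sorted list of sentence numbers
    # (starting at 1) in which it occurs.
    index = defaultdict(list)
    for number, sentence in enumerate(split_sentences(text), start=1):
        for word in set(bare_words(sentence)):
            index[word].append(number)
    return dict(sorted(index.items()))


def to_gaps(numbers):
    # Intervals between sentence numbers: the first number is kept,
    # every later entry is the difference to its predecessor.
    return [b - a for a, b in zip([0] + numbers, numbers)]


def estimate_bits(index, bits_per_symbol=6):
    # Assumption: each character of each word and each pointer costs
    # bits_per_symbol bits; separators are ignored. A pointer only fits
    # if its value is below 2 ** bits_per_symbol, which is one reason
    # to store intervals instead of absolute sentence numbers.
    symbols = sum(len(word) + len(postings) for word, postings in index.items())
    return symbols * bits_per_symbol


if __name__ == "__main__":
    sample = "The cat sat. The dog sat! A cat ran."
    index = build_inverted_file(sample)
    for word, postings in index.items():
        print(word, postings, to_gaps(postings))
    print("estimated bits:", estimate_bits(index))
```

To try this on the course text, read the file named in the second exercise and pass its contents to build_inverted_file. The character encoding of that file is not stated here, so you may have to pass an explicit encoding to open().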
References
- [WMB94] Ian H. Witten, Alistair Moffat and Timothy C. Bell. "Managing
Gigabytes: Compressing and Indexing Documents and Images", Van
Nostrand Reinhold, 1994.
- http://www.cs.waikato.ac.nz/~ihw/mg.html
Managing Gigabytes web site in New Zealand.
Last update: April 16, 1997.
erikt@stp.ling.uu.se