
Dokumenthantering VT97:10

These are the exercises and references for the tenth class of the course Dokumenthanteringen.


Exercises

The results of the exercises marked with * have to be handed in. The exercises marked with ? are obligatory as a group: you only have to hand in the results of one of them.

  1. The software that accompanies the book Managing Gigabytes consists of several programs. In this assignment you will test two of these programs. The first is mgbuild. Issue the command "mgbuild alice". It will build a compressed inverted version of the book "Alice in Wonderland" by Lewis Carroll. The book can be found in the file

    /home/staff/erikt/misc/mg/mg-1.2/SampleData/alice13a.txt.Z

    The inverted corpus will be stored in a directory which you have specified in the shell variable MGDATA. Issue the command "export MGDATA=someDirectory" before you start mgbuild. You can access the inverted corpus with the command mgquery. There are manuals available for mg, mgbuild and mgquery.

    If you want to apply the software to a larger corpus, you can try it on the Brown corpus. This corpus contains approximately one million words divided over 500 documents. You can find the corpus in the directory /corpora/ICAME/brown1. In the corpus each line has been marked with the code XNN MMMM, in which X is a capital character indicating the document type, NN is a document number for that type and MMMM is a line number. In order to make use of the corpus you need to convert it to a corpus with one document per file, in which the line markers have been removed.

    Note that if you want to apply the mg software to your own material (such as the Brown corpus), you need to make your own version of /usr/local/bin/mg_get and call mgbuild with the option -g yourMg_get.
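
The Brown corpus conversion described in this exercise could be sketched in Python as follows. This is only a minimal sketch under assumptions: it presumes that every corpus line starts with a marker consisting of a capital letter, two digits, whitespace and a run of digits, followed by the text. The function names split_documents and write_documents, the output file naming (document code plus ".txt") and the decision to skip unmarked lines are hypothetical choices made for illustration; they are not part of the mg software or of the corpus distribution.

```python
import re
from pathlib import Path

# Assumed marker layout: capital letter, two digits, whitespace, line number.
# Check this against the actual corpus files before relying on it.
MARKER = re.compile(r"^([A-Z]\d{2})\s+\d+\s*(.*)$")


def split_documents(lines):
    """Group marked lines by document code (XNN) and drop the markers."""
    docs = {}
    for line in lines:
        match = MARKER.match(line.rstrip("\n"))
        if match is None:
            continue  # assumption: lines without a marker carry no text
        doc_id, text = match.groups()
        docs.setdefault(doc_id, []).append(text)
    return docs


def write_documents(docs, out_dir):
    """Write each document to its own file, e.g. <out_dir>/A01.txt."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    for doc_id, text_lines in docs.items():
        (out / f"{doc_id}.txt").write_text("\n".join(text_lines) + "\n")
```

One way to use it would be to read the lines of each corpus file, pass them to split_documents, and hand the result to write_documents with a directory of your own choosing. The resulting one-document-per-file corpus is then the material that your own version of mg_get has to deliver to mgbuild.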


References


Last update: April 17, 1997. erikt@stp.ling.uu.se