Hints assignment 1.4

These are the hints for assignment 1.4. Not all of these hints are necessary to complete the assignment successfully.

You need a lot of text to get reasonable counts for interesting word pairs. The t-scores shown in class were based on a text of 44.3 million words. We don't have that much data available, but you can use all the files in /corpora/ICAME/brown1, which together give you 1.0 million words.
Examples of useful pairs of words you might want to search for are black-dark, close-near and different-various.
There is an example Perl program available which you can use as a starting point. The program contains many comments that explain the code, and you might want to read these carefully. The program prints counts for specific bigrams. Your task is to modify it so that it outputs the t-scores for these bigrams.
If you need more information about Perl, you can consult the manual pages (with the command man perl). These manual pages are also available on the web. There will also be a Perl book available in H327.
Use your notes to determine which version of the t-score formula you want to use. Think about which frequencies this formula needs and decide whether you have to add counting statements to the program.
(added 961011) If you work with the complete Brown corpus, running the Perl script will take more than one hour. The solution is to select the lines containing the interesting bigrams with grep first and to process only those with the Perl script.
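The preselection step could look roughly like the sketch below. The helper name preselect and the script name count_bigrams.pl are hypothetical, and the sketch assumes that the corpus files are plain text with one stretch of running text per line.

```shell
# preselect PATTERN FILE...
# Print only the lines that contain one of the target words.
#   -h  do not prefix matches with the file name
#   -i  ignore case ("Dark" matches "dark")
#   -w  match whole words only ("blackbird" does not match "black")
#   -E  extended regular expressions, so that a|b means "a or b"
preselect() {
    pattern=$1
    shift
    grep -h -i -w -E "$pattern" "$@"
}

# Intended use (not run here; count_bigrams.pl is a hypothetical name
# for your modified copy of the example program):
#   preselect 'black|dark' /corpora/ICAME/brown1/* | perl count_bigrams.pl
```

Note that the filtered text still contains every occurrence of the target words, but no longer the whole corpus. The total number of words, and the frequencies of the words that follow the target words, therefore have to be counted on the complete files if your formula needs them. A bigram that is split over two lines may also be missed.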
(added 961011) Processing a big corpus requires a lot of computer memory. You might want to do this on our largest machine: strindberg. You can access this machine by typing:

remsh strindberg

in a terminal window. After this you can execute commands on strindberg just like on your own workstation.

Last update: February 11, 1997. erikt@stp.ling.uu.se