Hints for assignment 1.4
These are the hints for assignment 1.4. Not all of these hints are
necessary for successfully completing the assignment.
- You need a lot of text to get reasonable counts for interesting
word pairs. The t-scores shown in class were based on a text of
44.3 million words. We don't have that amount of data available,
but you can use all the files in /corpora/ICAME/brown1,
which will give you 1.0 million words.
- Examples of useful pairs of words you might want to search for
are black-dark, close-near and different-various.
- There is an example Perl program available which you can use as
a starting point. The program contains many comments which explain
the code. You might want to read these carefully. The program
prints counts for specific bigrams. Your task is to modify it in
such a way that it outputs the t-scores for these bigrams.
- If you need more information about Perl you can consult the man
pages (with the command man perl). These manual pages are also
available on the web. There will also be a Perl book available in
H327.
- Use your notes to determine which version of the t-score
formula you want to use. Think about which frequencies you
need for this formula and decide whether you need to add extra
counting statements to the program.
- (added 961011) If you work with the complete Brown corpus,
running the Perl script will take more than one hour. The
solution for this is to select all the lines containing
interesting bigrams with grep and to process only those with
the Perl script.
- (added 961011) Processing a big corpus requires a lot
of computer memory. You might want to do this on our largest
machine: strindberg. You can access this machine by typing:
remsh strindberg
in a terminal window. After this you can execute commands
on strindberg just like on your own workstation.
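
As an illustration of the t-score hint above, here is a minimal Perl
sketch of two commonly used versions of the formula. It is not taken
from the example program. The sub names t_score_bigram and
t_score_contrast and all counts in the demo are invented for this
illustration, and the version used in class may differ, so check your
notes before relying on either one. The first version compares the
observed count of a bigram with the count expected if the two words
were independent. The second compares how often two similar words
(for example black and dark) occur next to the same neighbouring word.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Version 1 (assumed): t-score of a single bigram "w1 w2".
#   $f12 = count of the bigram
#   $f1  = count of w1, $f2 = count of w2
#   $n   = total number of words in the corpus
sub t_score_bigram {
    my ($f12, $f1, $f2, $n) = @_;
    return 0 if $f12 == 0;            # no observations: avoid division by zero
    my $expected = $f1 * $f2 / $n;    # expected bigram count under independence
    return ($f12 - $expected) / sqrt($f12);
}

# Version 2 (assumed): t-score contrasting two words a and b that
# occur next to the same neighbour word w.
#   $fa = count of the bigram "a w", $fb = count of the bigram "b w"
sub t_score_contrast {
    my ($fa, $fb) = @_;
    return 0 if $fa + $fb == 0;       # neither bigram was seen
    return ($fa - $fb) / sqrt($fa + $fb);
}

# Demo with invented counts (not real Brown corpus figures).
printf "bigram t-score:   %.3f\n", t_score_bigram(8, 20, 40, 1000);
printf "contrast t-score: %.3f\n", t_score_contrast(9, 7);
```

Version 1 needs the counts of the individual words and the corpus
size in addition to the bigram count, while version 2 only needs the
two bigram counts. This is the kind of difference to keep in mind
when deciding which extra counts the program has to collect.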
Last update: February 11, 1997.
erikt@stp.ling.uu.se