Hints assignment 1.2
These are the hints for assignment 1.2. Not all of these hints are
necessary for successfully completing the assignment.
- There should be a solution to this exercise in your lecture notes.
But you need to add some commands for cleaning up the input a
little bit before applying that solution.
- You can use the tokenization commands that you have created in
assignment 1.1 in this exercise as well.
- You can use the command paste to combine two files by pasting every
line x of FILE2 behind line x of FILE1, separated by a tab:
"paste FILE1 FILE2". You can also use paste with the standard
input: "|paste FILE1 -". This will paste the standard input behind
the lines of FILE1.
- The trick here is to get a file with all words of the corpus
(non-sorted!) but without the first word. You can delete lines
from the beginning of a corpus with the command "tail -n +10 FILE"
(written "tail +10 FILE" on older systems), which selects all
lines starting from line 10. An alternative is "sed '1,9d' FILE",
which deletes lines 1 to 9 from FILE and writes the result to
standard output. You can replace the numbers by appropriate values.
- If you want to know more about a command you can use the command
man, for example "man sort" to get more information about the sort
command. The manual pages are usually long and not everybody's
favorite reading, but it helps if you can quickly extract from
them the information that you need.
- If your set of commands grows too large then you can put them in a
script. This makes it easier to modify them. A script could look
like this:
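#!/bin/ksh
# this comment line should tell you what the script is doing:
# count the words ending in "ing" in brown1_a.txt, most frequent first
TMPFILE="/tmp/script.$$"
# tr reads only the standard input, so the corpus is redirected into it
tr -sc '[A-Za-z]' '[\012*]' < brown1_a.txt |\
grep 'ing$' | sort | uniq -c | sort -nr > "$TMPFILE"
more "$TMPFILE"
rm "$TMPFILE"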
Don't forget to change the permission bits of your script with
"chmod 755 SCRIPT"; otherwise you will get an error message when
you try to run it. If you create a temporary file in the script,
please remove it at the end of the script (rm command).
- Your unigram word list may start with an empty line and your bigram
word list will end with a line containing one word. Are you going
to regard these as unigrams, bigrams or nothing?
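
The paste and tail/sed hints above can be tried out on a tiny
corpus first. The following is only a sketch under assumptions: the
three-word corpus and the names WORKDIR, words.txt and rest.txt are
made up for illustration, and mktemp is assumed to be available. It
is not meant as the solution from the lecture notes.

```shell
#!/bin/sh
# Toy corpus, one word per line in text order (hypothetical file names).
WORKDIR=$(mktemp -d)
printf 'the\ncat\nsat\n' > "$WORKDIR/words.txt"

# All words except the first one: start the output at line 2.
tail -n +2 "$WORKDIR/words.txt" > "$WORKDIR/rest.txt"
REST=$(cat "$WORKDIR/rest.txt")

# Line x of rest.txt goes behind line x of words.txt, separated by a tab.
PAIRS=$(paste "$WORKDIR/words.txt" "$WORKDIR/rest.txt")

# The same pairing with the standard input taking the place of rest.txt.
PAIRS_STDIN=$(sed '1d' "$WORKDIR/words.txt" | paste "$WORKDIR/words.txt" -)

printf '%s\n' "$PAIRS"
rm -r "$WORKDIR"
```

The last line that is printed contains only the word "sat" followed
by a tab, because rest.txt is one line shorter than words.txt. This
is the one-word line mentioned in the last hint.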
Last update: September 26, 1996.
erikt@stp.ling.uu.se