Hints for assignment 1.2

These are the hints for assignment 1.2. You do not need all of them to complete the assignment successfully.

  1. There should be a solution to this exercise in your lecture notes, but you need to add some commands that clean up the input before you apply that solution.

  2. You can use the tokenization commands that you have created in assignment 1.1 in this exercise as well.

  3. You can use the command paste to combine two files line by line: "paste FILE1 FILE2" pastes every line x of FILE2 behind line x of FILE1, separated by a tab. You can also use paste with the standard input: "... | paste FILE1 -". This pastes the lines of the standard input behind the lines of FILE1.

  4. The trick here is to get a file with all words of the corpus (non-sorted!) but without the first word. You can delete lines from the beginning of a file with the command "tail -n +10 FILE", which selects all lines starting from line 10 (older systems also accept "tail +10 FILE"). An alternative is "sed '1,9d' FILE", which deletes lines 1 to 9 from FILE and writes the result to standard output. Replace the numbers with appropriate values.

  5. If you want to know more about a command, you can use the command man, for example "man sort" to get more information about the sort command. The manual pages are usually long and not everybody's favorite reading, but it is worth learning to find the information you need in them quickly.

  6. If your set of commands grows too large then you can put them in a script. This makes it easier to modify them. A script could look like this:
        #!/bin/ksh
        # list the words ending in "ing" in brown1_a.txt, most frequent first
        TMPFILE="/tmp/script.$$"
        # tr only reads standard input, so the corpus is redirected with <
        tr -sc 'A-Za-z' '[\012*]' < brown1_a.txt |\
           grep 'ing$' | sort | uniq -c | sort -nr > "$TMPFILE"
        more "$TMPFILE"
        rm "$TMPFILE"
        
    Don't forget to change the permission bits of your script with "chmod 755 SCRIPT"; otherwise you will get an error message when you try to run it. If your script creates a temporary file, remove it at the end of the script (rm command).

  7. Your unigram word list may start with an empty line and your bigram word list will end with a line containing one word. Are you going to regard these as unigrams, bigrams or nothing?
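
Hints 3 and 4 can be tried out on a toy word list before you run them on the corpus. The following is only a sketch under assumptions, not the solution from the lecture notes: the directory name, the file names words.txt and rest.txt and the three-word input are made up for illustration, and the input is assumed to contain one word per line already.

```shell
#!/bin/sh
# Hypothetical scratch directory and file names, used only for this demo.
DEMODIR="/tmp/bigram_demo.$$"
mkdir "$DEMODIR"

# A toy "corpus" that is already tokenized: one word per line.
printf 'the\ncat\nsat\n' > "$DEMODIR/words.txt"

# Hint 4: drop the first line, so that line x now holds word x+1.
tail -n +2 "$DEMODIR/words.txt" > "$DEMODIR/rest.txt"

# Hint 3: paste line x of rest.txt behind line x of words.txt.
BIGRAMS=$(paste "$DEMODIR/words.txt" "$DEMODIR/rest.txt")
printf '%s\n' "$BIGRAMS"

# The same result without the intermediate file, using "-" for standard input.
BIGRAMS_STDIN=$(sed '1d' "$DEMODIR/words.txt" | paste "$DEMODIR/words.txt" -)

# Clean up the temporary files, as hint 6 asks.
rm -r "$DEMODIR"
```

The output has the lines "the cat" and "cat sat" (tab-separated), followed by a last line that contains only "sat" and a trailing tab. That last line is the one-word line mentioned in hint 7.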


Last update: September 26, 1996. erikt@stp.ling.uu.se