
Språkstatistik HT97:15

Practical exercise 1

This is the first practical exercise in the course Språkstatistik HT97. It deals with the application of UNIX commands for the extraction of data from a corpus.


Assignments: 1.1 | 1.2 | 1.3 | 1.4

Deadline

The deadline for handing in reports for this exercise is Tuesday October 21, 1997. Late reports will receive one penalty point per extra day.

General

This first practical exercise contains four assignments. In each assignment you are required to generate some output. Write a report on these assignments and include the required output and the code that you used to generate it. The report should contain:

  1. A description of each assignment
  2. Any decisions you had to make for each assignment, for example the word boundaries you chose for the tokenization process.
  3. The results of the assignments.
  4. Remarks about the results if you have any.
  5. In an appendix: the software or the command sequences that you have used for generating the assignment results. If your task is to change part of a program, then list only the parts that you have changed. Do not include complete programs that you did not write yourself.

The report may be written in Swedish or in English. You may hand in the report as a group, provided that the group consists of two people.

Notes:

  1. There are three compulsory assignments and one optional one. People who do the first three assignments can get a mark between zero and eight, and people who do all four assignments can get a mark between zero and ten.
  2. There are no single correct answers for the assignments in this exercise. The results you achieve will depend on the commands you have used. Thus you and your neighbor may achieve different results, but both results can be right.


TIP: Create a special directory for storing all the files you work with in these lab sessions.

Assignment 1.1

Generate a list of the ten most common words in the Press 65 corpus (/corpora/Press65/UnixAscii) and count the number of words and the number of different words. Note that the corpus has been divided into several files.

Background information on assignment 1.1: Tokenizing

The first task you have to perform is to divide the corpus into words. This task can be performed by a tokenizer. If you have tokenizer software available you may use it. Otherwise you will have to use UNIX commands for tokenizing the files. Try, for example, copying and pasting the following command into a terminal window:

tr ' ' '\n' < /corpora/Press65/UnixAscii/p65.001 | more

The tr (translate) command replaces characters by other characters. This specific instance of the command replaces all spaces by newlines. These two characters have to be put between quotes, which is why you see four quotes after tr. The newline is a difficult character to specify and therefore it has a special name: \n (pronounce: backslash n).

Thus this tr command has two arguments (a space and a newline). It processes the file /corpora/Press65/UnixAscii/p65.001. This is specified by the < character followed by the file name. The < followed by a file name means "use this file as the input file". Alternatively one can use the > character followed by a file name, which means "use this file as the output file".
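As an illustration, the following command sends the tokenizer output to a file instead of to the screen (the file name p65.words is only an example; you may choose any name you like):

tr ' ' '\n' < /corpora/Press65/UnixAscii/p65.001 > p65.words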

UNIX is a large set of commands each of which only performs a single task. The power of UNIX comes from the fact that all these commands can be combined. The combining symbol is the | character (pronounce: pipe). The sequence "command1 | command2" means that command1 is executed first, after which the results of this command are sent to command2. Then command2 will process the results of command1. An example:

tr 'a' 'b' < /corpora/Press65/UnixAscii/p65.001 | tr 'b' 'z'

Here we have one tr command which replaces all a's by b's after which another tr command replaces all b's by z's. The total result will be that all a's (and b's) become z's.

If you try out this command you will see a lot of information running over your screen. If you want to prevent this from happening you have to send your results to the command more by adding the sequence "| more" at the end of this command sequence. The more command is a friendly command that shows its input on the screen and stops when the screen is full. You can make it progress to the next screen by pressing the space bar. You can exit more by pressing the q key. You can return to the top of the file by pressing the g key.

Now let's return to the first command sequence which could be used for tokenizing the corpus. This command will put every word of the input file on a different line. We are doing this because we want to process the word list with other UNIX commands. Most of these commands work on lines so it is good to have every word on a different line.

If you examine the result of the command you will see that it has done a reasonable job. However there are some things which could have been done better. One problem is that the results start with the three strings ***, NFOART0002 and another ***. These are tags used in the corpus for marking the beginning of a text and we may not want to include them in our word list. Another problem can be found with the word 'anmärkningsvärd. It contains an initial quote which will prevent the word from being recognized as anmärkningsvärd.

We will work with string processing commands which treat a word with a quote and the same word without a quote as different strings. Therefore we want to remove all the quotes from the words. Another problematic word is kraft'.+ which contains a quote, a punctuation mark and the Press 65 sentence boundary character +. All these characters should be removed in order to clean up the words.

There are several UNIX commands which can be used for cleaning up the words. One of these is the command sed (stream editor). This command can be used for deleting characters. An example: sed "s/[.+']//g" substitutes all characters in the set {.+'} with nothing and thus deletes them. The command contains one argument which specifies a task. In this case the task is to substitute (s) the pattern between the first two /'s (pronounce: slashes) with the pattern between the second and the third slash. Since there is nothing between the second and the third slash the first pattern will be deleted. The square brackets specify a character set and the g stands for global replacement: replace every occurrence of the pattern on the line rather than only the first one. Now try:

tr ' ' '\n' < /corpora/Press65/UnixAscii/p65.001 | sed "s/[.+']//g" | more

You will see that the problematic characters have been removed. On some UNIX systems you may have to add extra characters to the command. Contact your teacher if this command generates error messages.

You should be aware of the fact that this command will make some errors. It will delete the quote before the genitive 's in English, thus creating erroneous words. You should decide for yourself whether you want to use the command in this way or not.

Probably you can think of some other characters you want to delete. Add these to the command to create your own personal tokenizer. If you want to delete complete lines then use something like sed "/NFOART/d". This will delete all lines containing the string NFOART. You should take care not to delete lines that you want to keep. If you want to perform more complex tokenizing commands then please consult your teacher.
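Putting these pieces together, a personal tokenizer could look like the command sequence below. This is only a sketch: the character set in the first sed command and the line deletion pattern in the second are examples, and you should adapt them to your own tokenization decisions.

tr ' ' '\n' < /corpora/Press65/UnixAscii/p65.001 | sed "s/[.+',]//g" | sed "/NFOART/d" | more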

The commands shown in this section have been applied to a single file in the corpus. You should however work with all the files in the corpus and this can be accomplished by starting your commands as follows:

more /corpora/Press65/UnixAscii/p* | tr ' ' '\n' | ...

Here the more command is used for sending all corpus files to the tr command, after which further processing can take place. The corpus file names start with a p so they can be specified as p*. The * character stands for an arbitrary sequence of characters, so p* matches all corpus file names. It is equivalent to p65.001 p65.002 ... p65.022.

More information about UNIX commands can be found in the UNIX books on the local bookshelf. There is also online information available. You can access it by typing "man command" in some terminal window, for example "man sed" to get more information about sed. However this will give you more information than you want.

Background information on assignment 1.1: Counting

When you are satisfied with the output of your tokenizer you will have to retrieve the ten most frequent words. This can be done by appending the following command sequence to your tokenizing commands:

... | sort | uniq -c | sort -nr | more

This command sequence will count lines and list them in order of frequency with the most frequent one first. This is exactly what we need for retrieving the 10 most frequent words from a word list. The command sequence starts with sorting the words according to the alphabet (sort command). The next command uniq will remove all the duplicate occurrences of words.

The uniq command contains a special argument -c which starts with a hyphen. UNIX command arguments which start with a hyphen are called command line options. They change the behavior of the command. In this case the -c is telling uniq to put a counter before each word to state how many times the word was present in the input.

The output of uniq will be sent to another sort command. This command contains two command line options: -n and -r. These can also be written as -nr. The first option states that sort should assume that it sorts numbers. The second option states that the results should be displayed in reverse order. Normally a sorted list of numbers will start with the lowest number and have the highest number at the bottom. We are interested in the most frequent word and therefore we want the highest number to appear on top.

Finding out how many words the corpus contains is not difficult. You can count the words by using the command wc (word count). Send the output of your tokenizer to this command and use the option -w. This will make wc count words. It will respond with the number of words. If you want to know the number of different words in the corpus you can apply a combination of sort, uniq and wc. Sort the words, make uniq (without any option) remove the duplicates from the word list, and then count the remaining words with wc with the option -w.
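To summarize, with "..." standing for your own tokenizing commands, the three results of this assignment could be obtained with command sequences of the following form (a sketch only; adapt it to your own tokenizer):

... | sort | uniq -c | sort -nr | more
... | wc -w
... | sort | uniq | wc -w

The first sequence shows the word frequencies with the most frequent words on top, the second one counts the words and the third one counts the different words.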


Tip: Create a file in which you save your important commands. You can copy them from a terminal window to a file in emacs by using the mouse. Use the command history for getting a list of the commands you have performed.

Assignment 1.2

Make a list of the ten most frequent word bigrams from the first file in the corpus. Determine the number of bigrams in the file and also the number of different bigrams.

Note: You may also try to perform this assignment for the complete corpus but that will require a lot of computer resources. Furthermore you will have to change each invocation of the sort command into "TMPDIR=. sort".

Background information on assignment 1.2

Our goal will be to build a file which contains all the bigrams of the corpus with each bigram on a separate line. As soon as we achieve that we can use the commands of the previous assignment for generating the frequency information we are looking for. We will achieve this result by performing some extra operations on the output of the tokenizer applied to the first file of the corpus. Let's look at the following figure:

                               SHIFTED
 BIGRAM    =   TOKENIZER   +  TOKENIZER 
  FILE           OUTPUT         OUTPUT
 w1  w2            w1             w2
 w2  w3            w2             w3
 w3  w4            w3             w4
 w4  w5            w4             w5
 w5  w6            w5             w6

We can see in the figure that a file containing bigrams is nothing else than the tokenizer output with a shifted version of this output pasted to its right. We need two commands for achieving this result. First we need to create the shifted tokenizer output. This is the tokenizer output from the second word to the last word, that is, without the first word. We can use the command tail for this: tail +n generates the part of the file from line n to the end. Thus we perform the following command sequence:

...tokenizer applied to p65.001... | tail +2 > p65.shifted

Now we save the result of the command in a file. You can choose any file name you like. Be aware that you have created a temporary file of 6000000 bytes. Please delete it when you have finished this practical exercise! We also need the output of the tokenizer in a file so run the tokenizer again and save the complete results in some file, for example one with the name p65.tokens.

The bigram file can be created with the UNIX command paste. This command combines the lines of two files: it puts the lines of the first file on the left in the result with the corresponding lines of the second file on the right. So we perform the command:

paste p65.tokens p65.shifted | more

This will generate the bigram file. Please verify that. After that you can use the counting commands of the previous section for generating the frequency information you are looking for. Note: You need to use the option -l with wc to count lines instead of words.
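As a summary, a possible command sequence for the whole assignment could look as follows. This is only a sketch: "...tokenizer applied to p65.001..." stands for your own tokenizing commands and the file names are examples.

...tokenizer applied to p65.001... > p65.tokens
tail +2 p65.tokens > p65.shifted
paste p65.tokens p65.shifted | sort | uniq -c | sort -nr | more
paste p65.tokens p65.shifted | wc -l
paste p65.tokens p65.shifted | sort | uniq | wc -l

Note that the last line of the paste output combines the final word of the file with an empty string; you may want to remove this line before counting.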


Tip: If your commands take a lot of time then try to execute them on strindberg which is our fastest machine. By executing the command "remsh stp" in a terminal window you can start a shell on this machine and continue working there.

Assignment 1.3

Make a list of the ten most frequent word trigrams from the file used in assignment 1.2. Determine the number of trigrams in the text and also the number of different trigrams.

This is an extension of the previous assignment. You know all the necessary UNIX commands for performing this assignment.
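One possible approach, sketched here under the assumption that you still have the files p65.tokens and p65.shifted from assignment 1.2, is to create a second shifted file (the name p65.shifted2 is only an example) and to paste three files together:

tail +3 p65.tokens > p65.shifted2
paste p65.tokens p65.shifted p65.shifted2 | sort | uniq -c | sort -nr | more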


Tip: The command sequence that you work with can get quite long. You can collect the commands in a file which you edit with emacs and use this file as a program. Such a collection of commands is called a script. An example is sketched below.
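The sketch assumes that the script is saved in a file called count.sh (an arbitrary name) and is run with "sh count.sh"; the character set in the sed command is only an example.

# list the ten most frequent words in the first corpus file
tr ' ' '\n' < /corpora/Press65/UnixAscii/p65.001 |
sed "s/[.+']//g" |
sort | uniq -c | sort -nr | more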

Assignment 1.4

The fourth assignment is optional. You do not have to do it in order to pass this practical exercise.

Compute the mutual information of one Swedish word with five different words that follow it or precede it. Then compute the z-scores for a pair of Swedish words with a similar meaning for at least one word that follows or precedes both words. The frequency data for the words should be extracted from the Press 65 corpus or a part of this corpus.

Background information on assignment 1.4

In order to perform this assignment you need frequency data for words and bigrams. You have derived such information in the assignments 1.1 and 1.2. If you save the word list and the bigram list with frequency information in two files then you can search in these files for specific words. One way to do this is with the grep command:

grep "some word" file

This invocation of the grep command contains two arguments. The first is a string and the second is a file. The command will return all lines in the file that contain the string.
Note: The word list and the bigram list contain tabs, not spaces. It is important to know this when you are looking for complete words rather than strings. So: grep " på$" p65.bigram finds all the bigrams that have på as the final word (there is a tab character before the word, and the dollar after på means end of line).

When you inspect your bigram file you will discover that the interesting bigrams have a low frequency. For getting more interesting bigrams you may try to process the complete corpus with the following commands:

...bigram creation commands for whole corpus...|\
TMPDIR="." sort | uniq -c | TMPDIR="." sort -nr > p65.bigrams

The reason why we did not work with the complete corpus in the assignments 1.2 and 1.3 is that the sort command fails to sort the bigram and the trigram files. Here is a fix for this problem: define the directory in which sort stores its temporary files as the current directory. Then sort will be able to process the files. This command sequence will take a lot of time to complete: something like 15 minutes on strindberg and more if you work on another machine. Therefore you should save the results in a file so that you only have to run this command once.
Note: this is an optional step; you may also work with the bigram data that you obtained in assignment 1.2.

Possible interesting word bigrams are combinations of adjectives with nouns. Some frequent adjectives in Swedish are: stor, annan, ny, liten, själv, olik, god, gammal, hel, sådan, mången, egen, viss, hög, svensk, viktig, mycken, lång, flera, sen, ung, vanlig, bra, stark, svår, enda, låg, övrig, sist and politisk (source: part of the SUC corpus). For the z-score you may also use combinations of adjectives which do not have the same meaning but which can be used in similar situations, for example stor and liten. Remember that these adjectives can have different forms in Swedish: stor, stora and stort (in a regular expression: stora*t*).
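For example, assuming that the whole-corpus bigram counts have been saved in the file p65.bigrams as shown above, the following command lists all bigram lines containing stor, stora or stort (note that the pattern also matches longer words such as historia, so inspect the output):

grep "stora*t*" p65.bigrams | more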

The computations for mutual information and z-score can be performed with a calculator, for example with the xcalc command. In order to obtain mutual information scores you need the base-2 logarithm (2log) but your calculator will probably not contain a button for this function. You can compute the function by using the following formula:

   log2(x) = log10(x) / log10(2)

This means that for obtaining the 2log value for some value x you need to enter x's value in the calculator and press the 10log button (often just called log). Then you press divide, then 2, then the log button again and finally the is-equal-to sign (=). You should obtain values between -15 and +15.
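As a check, consider the value 8: log10(8) is approximately 0.903 and log10(2) is approximately 0.301, so log2(8) = 0.903 / 0.301 = 3, which is correct because 2 to the power 3 equals 8.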


Extra optional assignment for people that like programming: If you have experience with one of the programming languages PASCAL, C, C++ or PERL then you may try to adapt this PERL script in such a way that it can be used for extracting a complete list of mutual information scores and z-scores from the corpus file for some word pair.
Last update: October 31, 1997. erikt@stp.ling.uu.se