Språkstatistik HT96:15

In this practical exercise class you will apply simple statistical methods to text corpora. This first practical exercise contains four assignments, each of which requires you to generate some output. Write a report on these assignments that includes the required output and the code you used to generate it. You may work together, but everyone has to hand in their own report. The report should contain:

A description of each assignment.
The decisions you had to make for each assignment, if any. One example is the word boundaries you chose for tokenization.
The results of the assignments.
Remarks about the results if you have any.
In an appendix: the software or the command sequences that you have used for generating the assignment results.

Note: there are three compulsory assignments and one optional one. If you do the first three assignments you can get a mark between zero and eight; if you do all four you can get a mark between zero and ten.

Assignment 1.1

In the lectures you have seen a list containing the ten most frequent words in the file:

/corpora/ICAME/brown1/brown1_a.txt

Make a list of the words with rank 11-20 from this text. Determine the number of words in the text and also the number of different words. Note: the corpus contains some markup information, such as line numbers. If you do not want to include it in the counts, try to remove it. Alternatively, you can count it along with the words and argue that it does not influence the results significantly. You may choose which strategy to use.
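The counting steps can be sketched as follows. This is a minimal Python sketch, not a reference solution. The line-identifier pattern (`LINE_ID`) and the word definition (`WORD`) are assumptions: inspect the corpus file and adjust both, since these are exactly the decisions the report should document. The function names `tokenize` and `ranked` are illustrative.

```python
import re
from collections import Counter

# Assumption: each corpus line starts with an identifier such as "A01 0010 ".
# Check the actual file and adjust this pattern before relying on it.
LINE_ID = re.compile(r"^[A-Z]\d{2}\s+\d+\s+")

# Assumption: a word is a run of letters, optionally with internal apostrophes.
WORD = re.compile(r"[a-z]+(?:'[a-z]+)*")


def tokenize(lines):
    """Return a list of lower-cased word tokens from an iterable of lines."""
    tokens = []
    for line in lines:
        line = LINE_ID.sub("", line)  # drop the assumed line identifier
        tokens.extend(WORD.findall(line.lower()))
    return tokens


def ranked(counts, first, last):
    """Return the (word, count) pairs with ranks first..last (1-based).

    Ties are broken alphabetically so that the output is reproducible.
    """
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return ordered[first - 1:last]
```

With `tokens = tokenize(open(path, encoding="latin-1"))` and `counts = Counter(tokens)`, the number of words is `len(tokens)`, the number of different words is `len(counts)`, and `ranked(counts, 11, 20)` gives the requested ranks. The `latin-1` encoding is likewise an assumption about the file.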

Assignment 1.2

Make a list of the ten most frequent word bigrams in the text you used in the previous assignment. Determine the number of bigrams in the text and also the number of different bigrams.

Assignment 1.3

Make a list of the ten most frequent word trigrams in the text used in assignment 1.1. Determine the number of trigrams in the text and also the number of different trigrams.
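Bigram and trigram counts differ only in the window size, so one routine can serve both assignments. The Python sketch below assumes `tokens` is a list of word tokens from whatever tokenization you chose; the names `ngrams` and `summarize` are illustrative. Note that this version lets n-grams span sentence boundaries, which is itself a decision worth mentioning in the report.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all n-grams (as tuples) of adjacent tokens, in text order."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def summarize(tokens, n, top=10):
    """Return (total n-grams, distinct n-grams, the `top` most frequent)."""
    counts = Counter(ngrams(tokens, n))
    return sum(counts.values()), len(counts), counts.most_common(top)
```

Calling `summarize(tokens, 2)` covers the bigram case and `summarize(tokens, 3)` the trigram case. A text of N tokens yields N - n + 1 n-grams, which is a quick sanity check on the totals.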

Assignment 1.4

The fourth assignment is optional. You do not have to complete it in order to pass this practical exercise.

Choose a pair of words that have a similar meaning. The words may be chosen from any language for which we have corpora available. However, make sure that the corpora contain a reasonable number of examples of both words. You can use your bigram program to look for interesting words. Now write a Perl script that gives you the t-score for the words that follow your chosen words.
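The computation can be illustrated as follows. The sketch is in Python only to show the arithmetic; the assignment itself asks for Perl. It assumes that the intended t-score is the common approximation for comparing two near-synonyms w1 and w2, t = (c1 - c2) / sqrt(c1 + c2), where c1 and c2 are the counts of the bigrams (w1, w) and (w2, w) for a following word w. If the lectures define the t-score differently, use that definition instead. The names `followers` and `t_scores` are illustrative.

```python
import math
from collections import Counter


def followers(tokens, word):
    """Count the words that immediately follow `word` in the token list."""
    return Counter(nxt for cur, nxt in zip(tokens, tokens[1:]) if cur == word)


def t_scores(tokens, word1, word2):
    """Return {follower: t} with t = (c1 - c2) / sqrt(c1 + c2).

    c1 and c2 are the counts of the bigrams (word1, follower) and
    (word2, follower). This formula is an assumption of the sketch.
    """
    c1 = followers(tokens, word1)
    c2 = followers(tokens, word2)
    scores = {}
    for w in set(c1) | set(c2):
        a, b = c1[w], c2[w]  # a follower missing on one side counts as 0
        scores[w] = (a - b) / math.sqrt(a + b)  # a + b >= 1 by construction
    return scores
```

Large positive scores mark followers that prefer the first word and large negative scores mark followers that prefer the second. Because the formula uses raw counts, a big difference in the overall frequency of the two chosen words will skew the scores, which is one reason to pick a pair with comparable frequencies.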

Last update: September 30, 1996. erikt@stp.ling.uu.se