In this practical exercise class you will work with simple statistical methods applied to text corpora. This first practical exercise contains four assignments. In each assignment you are required to generate some output. Write a report on these assignments and include in it the required output and the code that you used to generate it. You are allowed to work together, but everyone has to send in their own report. The report should contain:
Note: there are three compulsory assignments and one optional one. People who do the first three assignments can get a mark between zero and eight, and people who do all four assignments can get a mark between zero and ten.
In the lectures you have seen a list containing the ten most frequent words in the file:
/corpora/ICAME/brown1/brown1_a.txt
Make a list of the words with rank 11-20 from this text. Determine the number of words in the text (tokens) and also the number of different words (types). Note: the corpus contains some markup information, such as line numbers. If you do not want to include that in the count, you should try to remove it. Alternatively, you can leave the markup in the count and argue that it does not influence the results significantly. You may choose which strategy to use.
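The counting steps above can be sketched as follows. This is a minimal illustration in Python, not the required solution. The markup pattern in `strip_markup` (a letter and two digits followed by a four-digit line number at the start of each line) is an assumption about the corpus format and should be checked against the actual file. The definition of a word (a lowercased run of letters, optionally with an apostrophe part) is also an assumption, and all function names are hypothetical.

```python
import re
from collections import Counter

def strip_markup(text):
    """Remove an ASSUMED line prefix such as 'A01 0010 ' from every line.
    Inspect the real corpus file and adapt this pattern before relying on it."""
    return re.sub(r"^[A-Z]\d{2}\s+\d{4}\s+", "", text, flags=re.MULTILINE)

def tokenize(text):
    """Lowercase the text and return runs of letters (assumed word definition)."""
    return re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())

def ranked_words(tokens):
    """Return (word, count) pairs, most frequent first; ties sorted alphabetically."""
    counts = Counter(tokens)
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))

# Small inline sample standing in for the corpus file. For the real file you
# would read it first, e.g. text = open(path, encoding="latin-1").read(),
# where the encoding is also an assumption.
sample = "A01 0010 The cat sat on the mat\nA01 0020 and the dog sat on the rug"
tokens = tokenize(strip_markup(sample))
ranking = ranked_words(tokens)

print("tokens:", len(tokens))           # number of words in the text
print("types:", len(ranking))           # number of different words
print("ranks 2-3:", ranking[1:3])       # for ranks 11-20 use ranking[10:20]
```

Lowercasing merges "The" and "the" into one type. Whether that is desirable is a design choice that you should state in your report, since it changes both the ranking and the number of different words.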
Make a list of the ten most frequent word bigrams from the text you used in the previous assignment. Determine the number of bigrams in the text and also the number of different bigrams.
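A bigram is a pair of adjacent words, so a token list can be paired with a copy of itself shifted by one position. The sketch below shows this in Python on a tiny sample, assuming the text has already been tokenized. The function name `bigrams` is hypothetical, and bigrams are counted across sentence boundaries here, which is a simplification you may want to revisit.

```python
from collections import Counter

def bigrams(tokens):
    """Return all pairs of adjacent tokens, in text order."""
    return list(zip(tokens, tokens[1:]))

# Tiny sample standing in for the tokenized corpus.
tokens = "the cat sat on the mat and the cat sat down".split()

pairs = bigrams(tokens)
counts = Counter(pairs)

# Most frequent first; ties are broken alphabetically to keep the output stable.
top_ten = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))[:10]

print("bigrams:", len(pairs))             # total number of bigrams
print("different bigrams:", len(counts))  # number of distinct bigrams
for (first, second), count in top_ten:
    print(count, first, second)
```

A text of N tokens contains N - 1 bigrams, which is a quick sanity check on the total.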
Make a list of the ten most frequent word trigrams from the text used in assignment 1.1. Determine the number of trigrams in the text and also the number of different trigrams.
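Trigrams can be collected with a sliding window of width three. The Python sketch below takes the window width as a parameter `n`, so the same hypothetical helper `ngrams` covers bigrams and trigrams alike. As before, the input is assumed to be an already tokenized text.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return every window of n consecutive tokens as a tuple.
    A text shorter than n tokens yields an empty list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Tiny sample standing in for the tokenized corpus.
tokens = "the cat sat on the mat and the cat sat down".split()

trigrams = ngrams(tokens, 3)
counts = Counter(trigrams)
top_ten = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))[:10]

print("trigrams:", len(trigrams))           # N - n + 1 windows for N tokens
print("different trigrams:", len(counts))
for gram, count in top_ten:
    print(count, " ".join(gram))
```

Longer n-grams repeat less often, so the share of different trigrams among all trigrams will normally be higher than the corresponding share for bigrams.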
The fourth assignment is optional. You do not have to do it in order to pass this practical exercise.
Choose a pair of words that have a similar meaning. The words may be chosen from any language for which we have corpora available. However, make sure that there is a reasonable number of examples of the two words in the corpora. You can use your bigram program to look for interesting words. Now write a Perl script that gives you the t-score for the words that follow your chosen words.
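The exact t-score definition is the one given in the lectures and is not repeated here. As an illustration only, the sketch below assumes the common approximation for comparing two words $v_1$ and $v_2$ by a following word $w$:

$$t \approx \frac{C(v_1\,w) - C(v_2\,w)}{\sqrt{C(v_1\,w) + C(v_2\,w)}}$$

where $C(v\,w)$ is the number of times $w$ directly follows $v$. The code is in Python rather than Perl, so it shows the logic and not the script you are asked to hand in. The function names and the sample sentence are hypothetical.

```python
import math
from collections import Counter

def follower_counts(tokens, word):
    """Count the words that directly follow each occurrence of `word`."""
    return Counter(nxt for cur, nxt in zip(tokens, tokens[1:]) if cur == word)

def difference_t_scores(tokens, word1, word2):
    """t-score per follower under the ASSUMED approximation
    (C(word1 w) - C(word2 w)) / sqrt(C(word1 w) + C(word2 w))."""
    c1 = follower_counts(tokens, word1)
    c2 = follower_counts(tokens, word2)
    scores = {}
    for follower in set(c1) | set(c2):
        a, b = c1[follower], c2[follower]
        # a + b > 0 because the follower occurs after at least one of the words
        scores[follower] = (a - b) / math.sqrt(a + b)
    return scores

# Tiny artificial sample; real counts should come from the corpus.
tokens = ("strong tea strong tea strong support "
          "powerful computer powerful computer powerful support").split()

scores = difference_t_scores(tokens, "strong", "powerful")
for follower, t in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{t:6.3f}  {follower}")
```

Under this assumption a large positive score marks a follower that is typical of the first word, a large negative score marks one typical of the second, and a score near zero marks a follower that does not distinguish the two. With counts this small the scores carry no statistical weight, which is why a reasonable number of examples of both words matters.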