
Statistical NLP: Exercise 1

This is the first in a series of exercises on statistical natural language processing. In this exercise you will learn to compute basic statistical features of texts, such as word and n-gram frequencies. To do so, you can use an online search program that processes the novel Dracula by Bram Stoker.

This exercise was created by Erik Tjong Kim Sang, University of Antwerp, Campus Drie Eiken, room J0.07, phone 03-8202793, e-mail erikt@uia.ua.ac.be.


Assignments

Use the online search program to complete the following assignments:

  1. Find 5 words with a frequency of 1000 or more.
  2. Find 3 word pairs with a frequency of 200 or more.
  3. Find 1 word trigram with a frequency of 10 or more.
  4. For one word w2 and three different words w1, compute the conditional probability P(w2|previous word is w1), that is, the frequency of the pair "w1 w2" divided by the frequency of w1. Choose the words so that the probability is greater than zero for at least two of the three pairs.
  5. Suppose we pick an arbitrary word from the corpus. What is the probability that the word is "godalming"? And what is the probability that the word is "godalming" given that the previous word is "lord"?
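
The quantities asked for in the assignments can also be computed offline. The following is a minimal Python sketch, not the online search program itself. The tokenisation rule (lowercase runs of letters and apostrophes) and the helper names are assumptions made for illustration, and the sample sentence is invented, not a quotation from Dracula. Counts from the online program may differ if it tokenises the text differently.

```python
import re
from collections import Counter


def tokenize(text):
    # Assumed tokenisation: lowercase, keep runs of letters and apostrophes.
    return re.findall(r"[a-z']+", text.lower())


def ngram_counts(tokens, n):
    # Count every run of n consecutive tokens (n=1: words, n=2: pairs, n=3: trigrams).
    return Counter(zip(*(tokens[i:] for i in range(n))))


def unigram_probability(tokens, word):
    # P(word) = frequency of word / total number of tokens.
    return tokens.count(word) / len(tokens)


def conditional_probability(tokens, w1, w2):
    # P(w2 | previous word is w1) = count("w1 w2") / count(w1 followed by any word).
    pair_count = ngram_counts(tokens, 2)[(w1, w2)]
    w1_as_previous = Counter(tokens[:-1])[w1]
    if w1_as_previous == 0:
        return 0.0
    return pair_count / w1_as_previous


# Invented sample text, used only to show the arithmetic.
sample = "Lord Godalming met the lord mayor and Lord Godalming left"
tokens = tokenize(sample)

print(ngram_counts(tokens, 1).most_common(2))
print(ngram_counts(tokens, 2)[("lord", "godalming")])
print(unigram_probability(tokens, "godalming"))
print(conditional_probability(tokens, "lord", "godalming"))
```

In the sample, "godalming" accounts for 2 of the 10 tokens, so its unconditional probability is 2/10. It follows "lord" in 2 of the 3 places where "lord" occurs, so the conditional probability is 2/3. To run the same functions on the novel, replace `sample` with the contents of a plain-text copy of the book.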


Last update: November 23, 2003. erikt@uia.ua.ac.be