Språkstatistik HT95:13

Class:   13
Date:    951010
Topic:   Practical Exercise 2

Practical Exercise 2

In this practical exercise you will apply the clustering algorithms mentioned in Stephen Finch's thesis to divide characters into different classes. You will have to write a short report about this practical exercise and send it by e-mail to erikt@strindberg.ling.uu.se. The deadline for the report is Monday, November 6. Reports sent in after that date will receive a penalty of half a point per extra day.

In this exercise we will use the book Alice's Adventures in Wonderland by Lewis Carroll (obtained from Project Runeberg). You can find the text and the programs I describe here in the directory /usr/users/staff/erikt/P/ss95/pex2

You can print this exercise by choosing the print option from your web browser while viewing this text.

Two programs are available for this exercise. The first is makeClusterData, a UNIX script. makeClusterData analyses the input data, splits it into separate characters and builds a contingency table for these characters. It prints the table with each table element on a separate line. If you want to examine the table, try:
./makeClusterData < alice30.ch2|./makeReadable
You might have to adjust your window size to be able to view the complete table. Try to read the makeClusterData program and check whether you understand the commands it uses.
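To make the idea of a contingency table concrete, here is a minimal Python sketch. It is not the makeClusterData script itself. It assumes that the table counts how often each letter is immediately followed by each other letter, and that punctuation, digits and spaces are dropped and case is ignored; the real script may make different choices. The names contingency_table and format_table are made up for this illustration.

```python
from collections import Counter


def contingency_table(text):
    """Count how often each letter is immediately followed by each letter.

    Assumption: only letters are kept and case is folded, so letters on
    either side of a space or punctuation mark count as neighbours.
    """
    chars = [c.lower() for c in text if c.isalpha()]
    return Counter(zip(chars, chars[1:]))


def format_table(pairs):
    """Return one 'left right count' string per table element."""
    return [f"{left} {right} {count}"
            for (left, right), count in sorted(pairs.items())]


if __name__ == "__main__":
    table = contingency_table("Alice was beginning to get very tired")
    for line in format_table(table)[:5]:
        print(line)
```

Each printed line corresponds to one table element, which is the same one-element-per-line layout described above.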
The second program available for this exercise is cluster, which is written in C. If you like reading C, you can read the program, but that is not really necessary. cluster clusters data that is presented in a contingency table. What is important about this program is that you can use it with four different options:

cluster -e = Use Euclidean metric
cluster -m = Use Manhattan metric
cluster -n = Normalize representations
cluster -s = Use Spearman Rank metric

You can use any combination of these four options. For example, if you want to use the Manhattan metric combined with representation normalization, you would use the command:
./makeClusterData < alice30.txt|./cluster -n -m
If you do not specify any options, the Euclidean metric without representation normalization will be used.
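The following Python sketch illustrates what the three metrics and the normalization step could look like when applied to two rows of a contingency table. It is only an illustration and not taken from the C source of cluster. Two points are assumptions: normalization is taken to mean scaling a row so that its entries sum to 1, and the Spearman Rank metric is taken to be 1 minus the Spearman rank correlation. The cluster program may define both differently.

```python
import math


def normalize(vec):
    """Scale a count vector so its entries sum to 1 (assumed meaning of -n)."""
    total = sum(vec)
    return [x / total for x in vec] if total else list(vec)


def euclidean(u, v):
    """Square root of the summed squared differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def manhattan(u, v):
    """Sum of the absolute differences."""
    return sum(abs(a - b) for a, b in zip(u, v))


def ranks(vec):
    """Rank entries from 1 (smallest); tied entries share their average rank."""
    order = sorted(range(len(vec)), key=lambda i: vec[i])
    result = [0.0] * len(vec)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and vec[order[j + 1]] == vec[order[i]]:
            j += 1
        average = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average
        i = j + 1
    return result


def spearman_distance(u, v):
    """1 minus the Spearman rank correlation (an assumed definition)."""
    ru, rv = ranks(u), ranks(v)
    n = len(ru)
    mu, mv = sum(ru) / n, sum(rv) / n
    cov = sum((a - mu) * (b - mv) for a, b in zip(ru, rv))
    su = math.sqrt(sum((a - mu) ** 2 for a in ru))
    sv = math.sqrt(sum((b - mv) ** 2 for b in rv))
    if su == 0 or sv == 0:
        return 1.0  # correlation undefined for a constant vector
    return 1.0 - cov / (su * sv)


if __name__ == "__main__":
    rare, frequent = [10, 0, 5], [20, 0, 10]
    print(manhattan(rare, frequent))
    print(manhattan(normalize(rare), normalize(frequent)))
    print(spearman_distance(rare, frequent))
```

The two example rows have the same shape but different totals. Without normalization the Euclidean and Manhattan metrics treat them as far apart, while after normalization they coincide, which is one reason to test the -n option.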
The assignment in this exercise is to determine which of the three metrics is best for clustering characters. The most important question is whether the clustering algorithms are able to recognize the difference between vowels and consonants. You should also check whether normalization of the representation vectors helps the clustering algorithms.
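One possible way to put a number on the vowel and consonant question is sketched below in Python. It assumes that you have read the two top-level clusters off the output of cluster by hand and typed them in as strings; the function vowel_separation is hypothetical and not part of the provided programs. It also assumes that the vowels are a, e, i, o and u, with y counted as a consonant.

```python
VOWELS = set("aeiou")  # assumption: 'y' is treated as a consonant


def vowel_separation(cluster_a, cluster_b):
    """Fraction of letters placed consistently with a vowel/consonant split.

    Both labelings are tried (vowels in cluster_a, or vowels in cluster_b)
    and the better of the two scores is returned.
    """
    total = len(cluster_a) + len(cluster_b)
    # Letters placed correctly if cluster_a is taken to be the vowel cluster.
    a_is_vowels = (sum(c in VOWELS for c in cluster_a)
                   + sum(c not in VOWELS for c in cluster_b))
    # Every letter is correct under exactly one of the two labelings.
    b_is_vowels = total - a_is_vowels
    return max(a_is_vowels, b_is_vowels) / total


if __name__ == "__main__":
    print(vowel_separation("aeiou", "bcdfg"))
    print(vowel_separation("aeb", "iocd"))
```

A score of 1.0 means the two clusters separate vowels from consonants perfectly; a score near 0.5 means the split says little about the distinction. Comparing this score across the metrics, with and without -n, gives one basis for the comparison in your report.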
If you have time left after doing this exercise, you can think about how makeClusterData should be changed to cluster words instead of characters. The cluster program is already able to do that, but we will have to find some way of making the contingency table. This assignment is optional. You don't have to do it to pass this exercise.

Send your reports to erikt@strindberg.ling.uu.se by Monday, November 6. If you have any questions, please ask me.

Last update: October 17, 1995. erikt@strindberg.ling.uu.se