If you are a course participant, please send the solutions to these exercises to erikt(at)science.uva.nl before or on Tuesday October 16, 2007. Do not forget to test your programs. Include the test results in your weekly reports.
All the programs you construct for these exercises must contain the line use strict; at the top.
Write a non-empty program that shows itself on the screen by reading the file that it is stored in and printing the lines of that file. Hint: in the main part of the program, the name of the file that the program is stored in can be found in the special variable $0.
[Solution] Only look at the solution when you have finished the exercise or when you are completely stuck.
Write a program that takes a number of file arguments and generates the term frequencies for each of these files. The term frequency of a word is the number of times the word occurs in the file divided by the total number of words in the file. For example, if the word seldom appears two times in a file with 29 words, then its term frequency is 2/29 = 0.069. Please ignore the punctuation marks in the files. As example text files you can use three to five files that each contain a single limerick. Example run:
$ perl -w 6.2.pl lim1.txt lim2.txt lim3.txt
lim1.txt: 0.069 the
lim1.txt: 0.069 seldom
lim1.txt: 0.069 ones
...
lim2.txt: 0.075 the
...
Compute at least one term frequency score by hand to make sure that your program works well. Include the manual computation in your exercise report.
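The term frequency arithmetic of exercise 6.2 can be checked on a short string before files come into play. The sketch below is not the course solution: the helper name term_frequencies, the lowercasing of the text and the rule that a word is a run of letters and apostrophes are all assumptions made for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of the exercise text): returns a hash
# that maps each word in $text to its term frequency.
sub term_frequencies {
    my ($text) = @_;
    # Assumption: lowercase the text and treat runs of letters and
    # apostrophes as words, so punctuation is ignored.
    my @words = lc($text) =~ /[a-z']+/g;
    my $total = scalar @words;
    my (%count, %tf);
    return %tf if $total == 0;          # no words, no frequencies
    $count{$_}++ for @words;            # raw count per word
    $tf{$_} = $count{$_} / $total for keys %count;
    return %tf;
}

my %tf = term_frequencies("The cat saw the dog. The dog ran!");
printf "%.3f %s\n", $tf{$_}, $_ for sort keys %tf;
```

In this sample text the word the occurs 3 times among 8 words, so its term frequency is 3/8 = 0.375.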
Expand the program of exercise 6.2 and make it generate tf-idf scores. Store the scores in separate files that have the same name as the input files plus .tfidf . Tf-idf scores are term frequency scores multiplied by inverse document frequency scores. Program 6.2 already computes term frequency scores. The inverse document frequency (idf) score of a term can be obtained by dividing the total number of documents by the number of documents that contain the term and computing the logarithm of the result. For example, if the word seldom appears in one document from a set of three, then the idf score of seldom is log(3/1) = 1.099. Note: Perl contains a built-in function log() (natural logarithm) which can be used in this exercise. Example run:
$ perl -w 6.3.pl lim1.txt lim2.txt lim3.txt
$ cat lim1.txt.tfidf
0.000 the
0.076 seldom
0.076 ones
...
$ cat lim2.txt.tfidf
0.000 the
...
Compute at least one tf-idf score by hand to make sure that your program works well. For this purpose, include a small file of fewer than five words in your test set. Include the manual computation in your exercise report.
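The numbers in the example of exercise 6.3 can be reproduced with the built-in log(). This is only a sketch of the arithmetic, assuming a hypothetical helper named idf that receives the two document counts directly; counting the documents and writing the .tfidf files is left to the exercise.

```perl
use strict;
use warnings;

# Hypothetical helper (the name is an assumption): inverse document
# frequency from the total number of documents and the number of
# documents that contain the term. Assumes $docs_with_term > 0.
sub idf {
    my ($num_docs, $docs_with_term) = @_;
    return log($num_docs / $docs_with_term);   # natural logarithm
}

# seldom: in 1 of 3 documents, as in the exercise text
printf "%.3f\n", idf(3, 1);                # 1.099
# a word that occurs in all 3 documents has idf log(1) = 0
printf "%.3f\n", idf(3, 3);                # 0.000
# tf-idf of seldom: tf 2/29 times idf log(3/1)
printf "%.3f\n", (2 / 29) * idf(3, 1);     # 0.076
```

The second line shows why a word such as the, when it occurs in every document, ends up with a tf-idf score of 0.000 regardless of its term frequency.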
Write a program that computes document similarity scores based on the terms that the documents contain. The program should take a number of file arguments from the command line and apply the program from exercise 6.3 to these files (with system) in order to create tf-idf files. Then the program should read the tf-idf files and compare their contents by computing the angle between the vectors stored in the files: the inverse cosine function applied to the inner product of the two vectors (the sum of the products of the matching vector values in the two tf-idf files) divided by the product of the lengths of the two vectors. Assume that words which are not in a file have tf-idf score 0.
Note: you need the Perl function acos() for this exercise. To get access to this function, you must have the line use Math::Trig; at the top of your program.
# program will compare first document with other documents
$ perl -w 6.4.pl lim1.txt lim2.txt lim3.txt
0.23 lim2.txt
0.14 lim3.txt
# so the similarity between lim1.txt and lim2.txt is 0.23
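The comparison step of exercise 6.4 can be tried on small hand-made vectors first. The sketch below assumes that each tf-idf file has already been read into a hash from word to score; the helper name angle, the clamping of rounding errors and the choice to return undef for an all-zero vector are assumptions, not requirements from the exercise.

```perl
use strict;
use warnings;
use Math::Trig;   # provides acos()

# Hypothetical helper: inverse cosine of the inner product of two
# vectors divided by the product of their lengths. Each vector is a
# hash reference that maps a word to its tf-idf score.
sub angle {
    my ($u, $v) = @_;
    my ($dot, $sq_u, $sq_v) = (0, 0, 0);
    # A word missing from one vector has score 0 there, so only
    # words present in both vectors contribute to the inner product.
    for my $word (keys %$u) {
        $dot += $u->{$word} * $v->{$word} if exists $v->{$word};
    }
    $sq_u += $_ ** 2 for values %$u;
    $sq_v += $_ ** 2 for values %$v;
    # Assumption: the angle is undefined when a vector has length 0.
    return undef if $sq_u == 0 || $sq_v == 0;
    my $cos = $dot / (sqrt($sq_u) * sqrt($sq_v));
    # Guard against rounding errors that push the value outside [-1, 1].
    $cos = 1  if $cos > 1;
    $cos = -1 if $cos < -1;
    return acos($cos);
}

printf "%.3f\n", angle({ cat => 1 }, { cat => 1 });            # 0.000
printf "%.3f\n", angle({ cat => 1 }, { dog => 1 });            # 1.571
printf "%.3f\n", angle({ cat => 1, dog => 1 }, { cat => 1 });  # 0.785
```

Two vectors that point in the same direction give 0, and two vectors without any shared words give pi/2.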
Note: this set of exercises only contains four exercises. Exercise 6.4 is worth two points.