This is one of the suggested topics for the course Language Technology Project 2005.
Cognates are words that are similar across languages, such as table (English) and tafel (Dutch). Many digital resources, such as parallel corpora, now make it possible to extract cognates automatically for different language pairs.
Design and evaluate different techniques for extracting cognates from parallel resources. The target cognates could be person names, locations, organizations, etc.
The proposed parallel resources are the OPUS corpus and the free online Wikipedia encyclopedia.
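As an illustration of one technique that could be compared, the sketch below scores word pairs from an aligned sentence pair by orthographic similarity, using the longest common subsequence ratio (LCSR). This is only one possible baseline, not a prescribed method. The function names, the 0.6 threshold and the minimum word length of 4 are assumptions chosen for the example, and the sentence pair is invented rather than taken from OPUS or Wikipedia.

```python
from itertools import product


def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence of a and b (dynamic programming)."""
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def lcsr(a: str, b: str) -> float:
    """LCS length divided by the length of the longer word, case-insensitive."""
    a, b = a.lower(), b.lower()
    if not a or not b:
        return 0.0
    return lcs_length(a, b) / max(len(a), len(b))


def cognate_candidates(src_tokens, tgt_tokens, threshold=0.6, min_len=4):
    """Return (source, target, score) triples whose LCSR reaches the threshold.

    threshold and min_len are illustrative values (assumptions), to be tuned
    on the development half of the annotated data.
    """
    pairs = []
    for s, t in product(src_tokens, tgt_tokens):
        if len(s) >= min_len and len(t) >= min_len:
            score = lcsr(s, t)
            if score >= threshold:
                pairs.append((s, t, score))
    # Highest-scoring candidates first.
    return sorted(pairs, key=lambda p: -p[2])


if __name__ == "__main__":
    english = ["The", "president", "visited", "Amsterdam"]
    dutch = ["De", "president", "bezocht", "Amsterdam"]
    for s, t, score in cognate_candidates(english, dutch):
        print(f"{s}\t{t}\t{score:.2f}")
```

With these settings the pair table/tafel from the definition above scores exactly 0.6 (a common subsequence of 3 letters out of 5), so it sits right on the threshold. That shows how sensitive such a baseline is to the chosen cut-off.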
This task will produce one module:
To evaluate the module, a small part of the corpus needs to be manually annotated. Half of the annotated corpus can be used during module development. The other half must be set aside until the final evaluation run.
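The evaluation setup described above could be sketched as follows. The text does not name an evaluation metric, so precision, recall and F1 over cognate pairs are assumed here as a common choice. The function names and the fixed random seed are also assumptions made for the example.

```python
import random


def split_annotations(items, seed=0):
    """Shuffle annotated items reproducibly and split them into two halves.

    Returns (development_half, held_out_half). The seed is fixed so that the
    held-out half stays the same across runs.
    """
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    mid = len(shuffled) // 2
    return shuffled[:mid], shuffled[mid:]


def precision_recall_f1(predicted, gold):
    """Compare predicted cognate pairs against manually annotated gold pairs."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


if __name__ == "__main__":
    # Hypothetical annotated sentence identifiers, for illustration only.
    development, held_out = split_annotations(range(10))
    print(len(development), len(held_out))

    gold = {("table", "tafel"), ("president", "president")}
    predicted = {("table", "tafel"), ("visited", "bezocht")}
    print(precision_recall_f1(predicted, gold))
```

Thresholds and other parameters would be tuned only on the development half, and the held-out half would be scored once, in the final run.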