Extraction of Cognates

This is one of the suggested topics for the course Language Technology Project 2005.

Introduction

Cognates are words in different languages that are similar in form and meaning, like table (English) and tafel (Dutch). Many digital resources, such as parallel corpora, are now available, which makes it possible to extract cognates automatically for different language pairs.

Task

Design and evaluate different techniques for extracting cognates from parallel resources. The target cognates could be person names, locations, organizations, etc.

The proposed parallel resources are the OPUS corpus and the free online Wikipedia encyclopedia.
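As an illustration of one possible technique, the sketch below scores word pairs from an aligned sentence pair with the longest common subsequence ratio (LCSR), a string similarity measure. The choice of LCSR, the function names, the threshold of 0.6 and the minimum word length of 4 are assumptions made for this example only; they are not prescribed by the task and would need tuning on the development data.

```python
def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence of a and b."""
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def lcsr(a: str, b: str) -> float:
    """LCS length divided by the length of the longer word (case-insensitive)."""
    a, b = a.lower(), b.lower()
    longest = max(len(a), len(b))
    return lcs_length(a, b) / longest if longest else 0.0


def extract_cognates(src_tokens, tgt_tokens, threshold=0.6, min_length=4):
    """Return (source, target, score) triples for similar word pairs.

    src_tokens and tgt_tokens are the tokens of one aligned sentence pair.
    threshold and min_length are illustrative values, not tuned settings.
    """
    pairs = []
    for s in src_tokens:
        for t in tgt_tokens:
            if len(s) >= min_length and len(t) >= min_length:
                score = lcsr(s, t)
                if score >= threshold:
                    pairs.append((s, t, score))
    return pairs


if __name__ == "__main__":
    english = ["the", "table", "Amsterdam"]
    dutch = ["de", "tafel", "Amsterdam"]
    for source, target, score in extract_cognates(english, dutch):
        print(source, target, round(score, 2))
```

Short words are skipped because they tend to match by accident. Other similarity measures, such as edit distance or shared character n-grams, could be plugged in at the same place and compared.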

Modules

This task will produce one module:

Cognate extraction module: uses different algorithms for extracting cognates from parallel corpora (two to three persons).
Relevant background knowledge: natural language processing and basic programming skills.

In order to evaluate the module, a small part of the corpus needs to be manually annotated. Half of the annotated corpus can be used in the module development phase. The other half needs to be put aside until the final evaluation run.
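The evaluation setup described above could be sketched as follows. The text does not name an evaluation metric, so the use of precision, recall and F1 is an assumption here, as are the function names and the fixed random seed.

```python
import random


def split_annotations(annotated_items, seed=0):
    """Split the annotated items into a development half and a held-out half.

    Sorting before shuffling with a fixed seed keeps the split reproducible.
    """
    items = sorted(annotated_items)
    random.Random(seed).shuffle(items)
    half = len(items) // 2
    return set(items[:half]), set(items[half:])


def precision_recall_f1(predicted_pairs, gold_pairs):
    """Compare extracted cognate pairs against manually annotated pairs."""
    predicted, gold = set(predicted_pairs), set(gold_pairs)
    correct = len(predicted & gold)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1
```

The development half can be used freely while choosing algorithms and thresholds. The held-out half should be scored only once, in the final evaluation run.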

Literature and tools

Joerg Tiedemann, Word to word alignment strategies. In: Proceedings of COLING 2004, Geneva, Switzerland, 2004.

Last update: January 04, 2005, erikt@science.uva.nl