Extraction of Cognates

This is one of the suggested topics for the course Language Technology Project 2005.

Introduction

Cognates are words that are similar in different languages, such as table (English) and tafel (Dutch). Many digital resources, such as parallel corpora, are now available and make it possible to extract cognates automatically for different language pairs.

Task

Design and evaluate different techniques for extracting cognates from parallel resources. The target cognates could be person names, locations, organizations, etc.

The proposed parallel resources are the OPUS corpus and the free online Wikipedia encyclopedia.
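
As one possible starting point for the task, the sketch below scores word pairs from aligned sentences by string similarity. It uses the longest common subsequence ratio (LCSR), which is only one of several measures that could be tried and is not prescribed by this topic description. The function names, the threshold of 0.5, the minimum word length of 4 and the assumption that the input is already sentence-aligned and tokenized are all illustrative choices.

```python
from typing import Iterable, List, Sequence, Tuple


def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence of a and b."""
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def lcsr(a: str, b: str) -> float:
    """Longest common subsequence ratio, case-insensitive, in [0, 1]."""
    a, b = a.lower(), b.lower()
    if not a or not b:
        return 0.0
    return lcs_length(a, b) / max(len(a), len(b))


def extract_cognates(
    aligned_pairs: Iterable[Tuple[Sequence[str], Sequence[str]]],
    threshold: float = 0.5,   # hypothetical cut-off, to be tuned on development data
    min_length: int = 4,      # hypothetical filter against short function words
) -> List[Tuple[str, str, float]]:
    """For every source word, keep its best-scoring target word in the
    aligned sentence if the score reaches the threshold."""
    found = []
    for source_tokens, target_tokens in aligned_pairs:
        candidates = [t for t in target_tokens if len(t) >= min_length]
        for src in source_tokens:
            if len(src) < min_length or not candidates:
                continue
            best = max(candidates, key=lambda tgt: lcsr(src, tgt))
            score = lcsr(src, best)
            if score >= threshold:
                found.append((src, best, score))
    return found


if __name__ == "__main__":
    pairs = [(
        ["The", "table", "is", "in", "Amsterdam"],
        ["De", "tafel", "staat", "in", "Amsterdam"],
    )]
    for src, tgt, score in extract_cognates(pairs):
        print(src, tgt, round(score, 2))
```

A pure similarity threshold will also accept false friends and chance resemblances, so the threshold and filters would need to be tuned against annotated data.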

Modules

This task will produce one module:

In order to evaluate the module, a small part of the corpus needs to be manually annotated. Half of the annotated corpus can be used in the module development phase. The other half needs to be put aside until the final evaluation run.
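
The split described above and a possible scoring step can be sketched as follows. The topic description does not name an evaluation metric; precision, recall and F1 over cognate pairs are assumed here as a common choice. The function names and the fixed random seed are hypothetical.

```python
import random
from typing import Hashable, Iterable, List, Tuple


def split_annotated(items: Iterable[Hashable], seed: int = 0) -> Tuple[List, List]:
    """Shuffle the annotated items and split them into a development half
    and a held-out half for the final evaluation run."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split reproducible
    mid = len(shuffled) // 2
    return shuffled[:mid], shuffled[mid:]


def precision_recall_f1(predicted: Iterable[Hashable],
                        gold: Iterable[Hashable]) -> Tuple[float, float, float]:
    """Compare predicted cognate pairs with manually annotated ones."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


if __name__ == "__main__":
    # Toy annotation: sentence identifiers standing in for annotated sentences.
    development, held_out = split_annotated(range(10))
    print(len(development), len(held_out))

    gold = {("table", "tafel"), ("Amsterdam", "Amsterdam"), ("Europe", "Europa")}
    predicted = {("table", "tafel"), ("Amsterdam", "Amsterdam"), ("state", "staat")}
    print(precision_recall_f1(predicted, gold))
```

Splitting once with a fixed seed and never inspecting the held-out half during development keeps the final evaluation figures unbiased.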

Literature and tools


Last update: January 04, 2005, erikt@science.uva.nl