CLIN2017 Shared Task: Translating Historical Text

Historical texts pose a challenge for automatic text processing tools because their words are spelled differently from their modern equivalents and because the spelling may be inconsistent. One way to address this problem is to translate the historical texts into current language and then apply the text processing tools to the modern version. This shared task focuses on that translation step, applied in particular to documents written in seventeenth-century Dutch.

News

20170213
Updated BLEU evaluation script (run as: bleu -s processed gold). The BLEU scores of runs Ljubljana-1, Utrecht-1 and Helsinki-4 changed slightly.
20170210
The results of the shared task are available as well as the overview talk slides presented at the conference CLIN2017.

Task description

Seventeenth-century Dutch is similar to modern Dutch but the differences between the two are large enough to cause problems for automatic text processing tools. For example, the sentence:

De honger nu wert swaer in dat lant
contains three old words which are currently spelled differently: wert, swaer and lant. The task is to translate such sentences to their modern equivalents, for example to:
De honger nu werd zwaar in dat land

You may use any method for performing the translation, provided that it is automatic and can process large texts in a reasonable time. For example, a translation lexicon would be very useful for this task. Such a lexicon exists (the Lexicon Service of the Instituut voor Nederlandse Lexicografie; select "Get lemma"), but it returns modern lemmas rather than modern word forms and leaves disambiguation to the user.
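A lexicon-based approach can be sketched as follows. This is only a minimal illustration, not the method prescribed by the task: the LEXICON dictionary and the translate function are hypothetical names, and the three entries are simply the word pairs from the example sentence above. Unknown words are copied unchanged, so the output always has the same number of words as the input.

```python
# Hypothetical toy lexicon: historical spelling -> modern word form.
# The entries are assumptions taken from the example sentence in this task description.
LEXICON = {"wert": "werd", "swaer": "zwaar", "lant": "land"}


def translate(sentence, lexicon=LEXICON):
    """Translate a sentence word by word, leaving unknown words unchanged."""
    translated = []
    for token in sentence.split():
        # Look up the lower-cased token; fall back to the token itself.
        modern = lexicon.get(token.lower(), token)
        # Restore an initial capital if the historical word had one.
        if token[:1].isupper():
            modern = modern[:1].upper() + modern[1:]
        translated.append(modern)
    return " ".join(translated)


if __name__ == "__main__":
    print(translate("De honger nu wert swaer in dat lant"))
```

A real system would need a far larger lexicon and a way to choose between competing modern forms, which this sketch does not attempt.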

The translations can be used for followup linguistic processing, for example for assigning part-of-speech tags to the words. For this purpose, it is important that it is clear to which word in the original text a translated word corresponds. This can be achieved with additional meta data, for example encoded in XML, or by keeping the same word order in the translation as in the original text, i.e. by performing a word-by-word translation.
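The XML option could look like the sketch below, which stores each translated word together with the original word it corresponds to. The element names (s, w) and attribute names (id, orig) are assumptions made for this illustration; the shared task does not define a particular XML format. The function assumes a word-by-word translation, so it rejects sentence pairs with different word counts.

```python
import xml.etree.ElementTree as ET


def align_to_xml(original, translation):
    """Encode a word-by-word translation as XML, keeping the original words.

    The tag and attribute names used here are hypothetical.
    """
    orig_tokens = original.split()
    trans_tokens = translation.split()
    if len(orig_tokens) != len(trans_tokens):
        raise ValueError("word-by-word alignment needs equal word counts")
    sentence = ET.Element("s")
    for position, (old, new) in enumerate(zip(orig_tokens, trans_tokens), start=1):
        # Each <w> element holds the modern word; 'orig' keeps the historical one.
        word = ET.SubElement(sentence, "w", id=str(position), orig=old)
        word.text = new
    return ET.tostring(sentence, encoding="unicode")


if __name__ == "__main__":
    print(align_to_xml("De honger nu wert swaer in dat lant",
                       "De honger nu werd zwaar in dat land"))
```

A part-of-speech tagger can then be run on the modern words, and its tags can be attached to the historical words through the shared position.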

Software and data

We provide software and access to data that can be used to get started with the shared task:

This includes the following software:

Note that the target language for the shared task is 21st-century Dutch. The shared task data contains a 19th-century text as target because that was the closest available text in a parallel pair. You are free to use other texts.

Participants can use the software and data as a base for developing their system. Additional test data sets will be released two weeks before the conference for testing the final version of the systems.

Schedule

September 2016
Call for participation
Monday 30 January 2017 12:00
Release of the test data sets
Friday 3 February 2017 23:59
Deadline for the submission of test results. Send them by email to Erik Tjong Kim Sang: erikt(at)xs4all.nl or erikt.tjong.kim.sang(at)meertens.knaw.nl
Friday 10 February 2017 10:30
Overview talk at CLIN2017. Poster presentations of participants.
May 2017
Submission of overview paper to CLIN Journal

Contact

Erik Tjong Kim Sang (Meertens Institute Amsterdam) erik.tjong.kim.sang(at)meertens.knaw.nl

References

Marcel Bollmann, Florian Petran and Stefanie Dipper. Rule-based normalization of historical texts. In: Proceedings of the International Workshop on Language Technologies for Digital Humanities and Cultural Heritage at RANLP 2011, Hissar, Bulgaria, 2011.

Eckhard Bick and Marcos Zampieri. Grammatical Annotation of Historical Portuguese: Generating a Corpus-Based Diachronic Dictionary. In: Text, Speech, and Dialogue, Lecture Notes in Computer Science, Volume 9924, pp. 3-11, 2016.

Hans van Halteren and Margit Rem. Dealing with orthographic variation in a tagger-lemmatizer for fourteenth century Dutch charters. In: Language Resources and Evaluation, Volume 47, Issue 4, pp. 1233-1259, December 2013.

Instituut voor Nederlandse Lexicografie, Lexicon Service. 2015.

Kishore Papineni, Salim Roukos, Todd Ward and Wei-Jing Zhu, BLEU: a method for Automatic Evaluation of Machine Translation. In: Proceedings of ACL 2002. Association for Computational Linguistics, Philadelphia PA, 2002, pp. 311-318.

Roland Meertens, Old Dutch spelling to new Dutch spelling. Blogpost at github.com, 11 January 2017.

Eva Pettersson, Beáta Megyesi and Jörg Tiedemann, An SMT Approach to Automatic Annotation of Historical Text. In: Proceedings of the workshop on computational historical linguistics at NODALIDA 2013, NEALT Proceedings Series 18 / Linköping Electronic Conference Proceedings 87, pp. 54-69, 2013. (PhD thesis)

Michael Piotrowski, Natural language processing for historical texts. Synthesis Lectures on Human Language Technologies 5.2, 2012.

Nicoline van der Sijs, Chronologisch woordenboek: De ouderdom en herkomst van onze woorden en betekenissen. Veen, Amsterdam/Antwerpen, 2001 (in Dutch; interesting word list with years in Woordregister).

Erik Tjong Kim Sang, Improving Part-of-Speech Tagging of Historical Text by First Translating to Modern Text. In: 2nd IFIP International Workshop on Computational History and Data-Driven Humanities, editors: Bozic, Mendel-Gleason, Debruyne and O'Sullivan, Springer Verlag, 2016.

Tessa Wijckmans and Wouter van Elburg, Adapting NLP-tools for Creating an Orthographic Layer for Early Modern Dutch Texts. In: Proceedings of DHBenelux 2016, Esch-sur-Alzette, Luxembourg, 2016.


Last update: 13 February 2017. erik.tjong.kim.sang(at)meertens.knaw.nl