CLIN2017 Shared Task: Translating Historical Text
Historical texts pose a challenge for automatic text processing tools because their words are spelled differently from their modern equivalents and because the spelling may be inconsistent. One method of solving this problem is to translate the historical texts to a version in the current language and then apply the text processing tools to this modern version. This shared task focuses on this translation task, in particular applied to documents written in seventeenth-century Dutch.
- Updated BLEU evaluation script (run as: bleu -s processed gold). BLEU scores of runs Ljubljana-1, Utrecht-1 and Helsinki-4 changed slightly
- The results of the shared task are available as well as the overview talk slides presented at the conference CLIN2017.
- Eight teams have taken part in the shared task. The results will be presented at the conference CLIN2017 in Leuven, Belgium on Friday 10 February.
- The test files for the CLIN2017 shared task have been made available on Monday 30 January. The deadline for submitting the results for the task is Friday 3 February 23:59 CET (21:59 GMT). Send your results with a 200-300 word method abstract text by mail to Erik Tjong Kim Sang: erikt(at)xs4all.nl or erikt.tjong.kim.sang(at)meertens.knaw.nl
- Updated file blankaart.parallel.txt (8 tagger classifications changed)
- Added file with annotation rules and updated file blankaartT.tok (see 2016.zip).
- Aligned the 1657 bible version with the other bible versions as additional training material. Created a copy of all shared task files on github.
- Updated crawler script bin/dbnl2txt to reflect change at website of 1637 bible.
- Updated tokenization of line 23 of the example file blankaartT.tok (added four underscores).
- Updated baseline scores in 000README of the main zipfile to 0.13464 and 0.50818.
- Changed script tokenize to handle all possible variants of 't in the same way (line 21).
- Released an example test file for translation of 17th century Dutch to 21st century Dutch: 2016.zip. This zipfile includes a README file.
- Updated script tokenize: it now considers 't as one token
- Added Bick & Zampieri (2016) to the References list.
- The evaluation script has been changed. Please update your version with the new script. You can also find the new version in the latest zipfile of the software below. Note that the baseline score has now moved up from 0.41427 to 0.50606.
- Added Pettersson et al. (2013) and Papineni et al. (2002) to the References list.
- The software was replaced with a version which does not require the package recode. As a result the baseline score dropped slightly, from 0.41492 to 0.41427.
- The required package recode may not be available as widely as we thought. We are looking into this. Thanks to Joachim Van den Bogaert for reporting this.
- Added Van Halteren & Rem (2013) to References list.
Seventeenth-century Dutch is similar to modern Dutch but the differences between the two are large enough to cause problems for automatic text processing tools. For example, the sentence:
De honger nu wert swaer in dat lant
contains three old words (wert, swaer and lant) which are currently spelled differently. The task is to translate such sentences to their modern equivalents, for example to:
De honger nu werd zwaar in dat land
You may use any method for performing the translation, provided that it is automatic and can process large texts in a reasonable time. For example, a translation lexicon would be very useful for this task. Such a lexicon exists (select "Get lemma") but it returns modern lemmas rather than modern word forms and leaves disambiguation to the user.
The translations can be used for follow-up linguistic processing, for example for assigning part-of-speech tags to the words. For this purpose, it is important that it is clear to which word in the original text a translated word corresponds. This can be achieved with additional metadata, for example encoded in XML, or by keeping the same word order in the translation as in the original text, i.e. by performing a word-by-word translation.
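A word-by-word translation that preserves alignment can be sketched as below. The lexicon entries here are a hypothetical toy example (taken from the sentence above); a real system would derive a much larger lexicon from parallel data such as the Statenvertaling bibles.

```python
# Minimal sketch of word-by-word translation with a lexicon.
# LEXICON is a hypothetical toy example; a real lexicon would be
# extracted from aligned historical/modern parallel text.
LEXICON = {"wert": "werd", "swaer": "zwaar", "lant": "land"}

def translate(sentence):
    """Translate token by token, leaving unknown tokens unchanged,
    so the word order (and thus the word alignment) is preserved."""
    return " ".join(LEXICON.get(token, token) for token in sentence.split())

print(translate("De honger nu wert swaer in dat lant"))
# -> De honger nu werd zwaar in dat land
```

Because every output token sits at the same position as its source token, no extra metadata is needed to recover the correspondence.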
We provide software and access to data which can be used as a starting point for the shared task. This includes the following software:
- Scripts for downloading two versions of the Dutch Statenvertaling bible (1637 and 1888). Note that these texts are not free of rights. You are not supposed to share them with others.
- Scripts for tokenizing the texts, aligning them, extracting a translation lexicon and translating the 1637 text to the language of 1888 with the lexicon.
- A script for evaluating the results.
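The lexicon-extraction step mentioned above can be sketched as follows: count word pairs from aligned text and keep the most frequent modern counterpart for each historical form. The input format here (a list of word pairs) is an assumption for illustration; the shared-task scripts derive such pairs from the 1637/1888 bible alignment.

```python
from collections import Counter, defaultdict

def extract_lexicon(aligned_pairs):
    """Build a translation lexicon from word-aligned (old, modern) pairs
    by keeping the most frequent modern form for each historical form.
    The aligned_pairs format is a hypothetical simplification."""
    counts = defaultdict(Counter)
    for old, modern in aligned_pairs:
        counts[old][modern] += 1
    return {old: c.most_common(1)[0][0] for old, c in counts.items()}

pairs = [("wert", "werd"), ("wert", "werd"), ("wert", "wert"),
         ("swaer", "zwaar")]
print(extract_lexicon(pairs))  # -> {'wert': 'werd', 'swaer': 'zwaar'}
```

Picking the single most frequent translation ignores context, which is one reason a plain lexicon baseline leaves room for improvement.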
Note that the target language for the shared task is 21st century Dutch. The shared task data contains a 19th century text as the target because that was the closest available parallel text. You are free to use other texts.
Participants can use the software and data as a base for developing their system. Additional test data sets will be released two weeks before the conference for testing the final version of the systems.
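The evaluation script scores system output with BLEU (Papineni et al., 2002). A simplified stand-alone version of the metric, not the shared task's actual script, can be sketched as:

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bleu(candidate, reference, max_n=4):
    """Simplified sentence-level BLEU: clipped n-gram precisions (n=1..4)
    combined by geometric mean, multiplied by a brevity penalty.
    A real evaluation uses corpus-level counts and smoothing."""
    cand, ref = candidate.split(), reference.split()
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts = Counter(ngrams(cand, n))
        ref_counts = Counter(ngrams(ref, n))
        # Clip each candidate n-gram count by its count in the reference.
        clipped = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        total = max(sum(cand_counts.values()), 1)
        log_precisions.append(math.log(max(clipped, 1e-9) / total))
    brevity_penalty = min(1.0, math.exp(1 - len(ref) / len(cand)))
    return brevity_penalty * math.exp(sum(log_precisions) / max_n)

reference = "De honger nu werd zwaar in dat land"
print(round(bleu(reference, reference), 5))  # -> 1.0
```

A perfect translation scores 1.0; leaving the historical spelling untouched scores much lower, which is roughly what the reported baseline scores measure.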
- September 2016
- Call for participation
- Monday 30 January 2017 12:00
- Release of the test data sets
- Friday 3 February 2017 23:59
- Deadline for the submission of test results. Send them by email to Erik Tjong Kim Sang: erikt(at)xs4all.nl or erikt.tjong.kim.sang(at)meertens.knaw.nl
- Friday 10 February 2017 10:30
- Overview talk at CLIN2017. Poster presentations of participants.
- May 2017
- Submission of overview paper to CLIN Journal
Erik Tjong Kim Sang (Meertens Institute Amsterdam) erik.tjong.kim.sang(at)meertens.knaw.nl
Marcel Bollmann, Florian Petran and Stefanie Dipper. Rule-based normalization of historical texts. In: Proceedings of the International Workshop on Language Technologies for Digital Humanities and Cultural Heritage at RANLP 2011, Hissar, Bulgaria, 2011.
Eckhard Bick and Marcos Zampieri. Grammatical Annotation of Historical Portuguese: Generating a Corpus-Based Diachronic Dictionary. In: Text, Speech, and Dialogue, Lecture Notes in Computer Science, Volume 9924, pp. 3-11, 2016.
Hans van Halteren and Margit Rem. Dealing with orthographic variation in a tagger-lemmatizer for fourteenth century Dutch charters. In: Language Resources and Evaluation, Volume 47, Issue 4, pp. 1233-1259, December 2013.
Instituut voor Nederlandse Lexicografie, Lexicon Service. 2015.
Kishore Papineni, Salim Roukos, Todd Ward and Wei-Jing Zhu. BLEU: a Method for Automatic Evaluation of Machine Translation. In: Proceedings of ACL 2002, Association for Computational Linguistics, Philadelphia PA, pp. 311-318, 2002.
Roland Meertens, Old Dutch spelling to new Dutch spelling. Blogpost at github.com, 11 January 2017.
Eva Pettersson, Beáta Megyesi and Jörg Tiedemann. An SMT Approach to Automatic Annotation of Historical Text. In: Proceedings of the workshop on computational historical linguistics at NODALIDA 2013, NEALT Proceedings Series 18 / Linköping Electronic Conference Proceedings 87, pp. 54-69, 2013.
Michael Piotrowski, Natural language processing for historical texts. Synthesis Lectures on Human Language Technologies 5.2, 2012.
Nicoline van der Sijs, Chronologisch woordenboek: De ouderdom en herkomst van onze woorden en betekenissen. Veen, Amsterdam/Antwerpen, 2001 (in Dutch; interesting word list with years in Woordregister).
Erik Tjong Kim Sang, Improving Part-of-Speech Tagging of Historical Text by First Translating to Modern Text. In: 2nd IFIP International Workshop on Computational History and Data-Driven Humanities, editors: Bozic, Mendel-Gleason, Debruyne and O'Sullivan, Springer Verlag, 2016.
Tessa Wijckmans and Wouter van Elburg. Adapting NLP-tools for Creating an Orthographic Layer for Early Modern Dutch Texts. In: Proceedings of DHBenelux 2016, Esch-sur-Alzette, Luxembourg, 2016.
Last update: 13 February 2017. erik.tjong.kim.sang(at)meertens.knaw.nl