CLIN2017 Shared Task: Results
Here are the results of the CLIN2017 Shared Task on Translating Historical Text. Participants built systems that translated seventeenth-century Dutch texts into modern Dutch. There were ten test files of about 1000 words, each taken from a different decade of the seventeenth century. System output was evaluated with BLEU scores (all ten files) and part-of-speech accuracy (one file). The results were presented at the CLIN2017 conference on 10 February 2017 [PDF slides].
Best run per participating team
Score computation is explained below.
All BLEU scores
Run | Average | 1607 | 1616 | 1626 | 1636 | 1646 | 1656 | 1668 | 1678 | 1686 | 1692 |
---|---|---|---|---|---|---|---|---|---|---|---|
Ljubljana-1 | 0.61029 | 0.67732 | 0.60926 | 0.57455 | 0.61950 | 0.58307 | 0.56037 | 0.61012 | 0.65322 | 0.55935 | 0.66524 |
Bochum-2 | 0.60665 | 0.64089 | 0.61309 | 0.56817 | 0.58585 | 0.44186 | 0.63273 | 0.67552 | 0.70411 | 0.58466 | 0.60917 |
Bochum-1 | 0.60234 | 0.65983 | 0.60315 | 0.55118 | 0.61470 | 0.45921 | 0.65072 | 0.65249 | 0.65674 | 0.56904 | 0.60538 |
Helsinki-5 * | 0.59896 | 0.61667 | 0.57046 | 0.58052 | 0.62991 | 0.54697 | 0.60257 | 0.57868 | 0.67123 | 0.54081 | 0.65617 |
Helsinki-2 | 0.58934 | 0.62086 | 0.58465 | 0.53229 | 0.57905 | 0.50739 | 0.56980 | 0.63734 | 0.67787 | 0.57585 | 0.60725 |
Helsinki-6 * | 0.58672 | 0.59119 | 0.57439 | 0.56342 | 0.60378 | 0.52120 | 0.60417 | 0.57923 | 0.67775 | 0.52111 | 0.63063 |
Helsinki-7 * | 0.57320 | 0.59093 | 0.53448 | 0.56572 | 0.62600 | 0.51191 | 0.58666 | 0.55174 | 0.65023 | 0.48441 | 0.62968 |
Amsterdam-2 | 0.56783 | 0.54464 | 0.52737 | 0.52678 | 0.54698 | 0.55550 | 0.51895 | 0.61512 | 0.62882 | 0.65683 | 0.54378 |
Amsterdam-1 | 0.56343 | 0.55414 | 0.54022 | 0.51897 | 0.55200 | 0.47520 | 0.56277 | 0.64077 | 0.61203 | 0.62591 | 0.53840 |
Leuven-1 | 0.53752 | 0.55011 | 0.47465 | 0.50595 | 0.55878 | 0.54798 | 0.51725 | 0.53486 | 0.59645 | 0.49087 | 0.60402 |
Helsinki-3 | 0.52014 | 0.54159 | 0.49049 | 0.50284 | 0.52435 | 0.43068 | 0.52654 | 0.57713 | 0.55754 | 0.50634 | 0.54218 |
Helsinki-1 | 0.49451 | 0.49072 | 0.43877 | 0.48839 | 0.55702 | 0.44870 | 0.53486 | 0.48633 | 0.56976 | 0.38622 | 0.54547 |
Helsinki-4 * | 0.48562 | 0.51338 | 0.46029 | 0.46493 | 0.52947 | 0.47724 | 0.50860 | 0.49286 | 0.52398 | 0.38617 | 0.50544 |
Utrecht-1 | 0.46867 | 0.49013 | 0.41701 | 0.45833 | 0.49936 | 0.44293 | 0.45677 | 0.43454 | 0.51246 | 0.42110 | 0.55846 |
Groningen-1 | 0.45061 | 0.45711 | 0.38521 | 0.40959 | 0.48100 | 0.38544 | 0.48637 | 0.49893 | 0.47769 | 0.41755 | 0.49965 |
Valencia-1 | 0.43011 | 0.45356 | 0.33863 | 0.41330 | 0.44843 | 0.33326 | 0.46862 | 0.53844 | 0.45571 | 0.42832 | 0.41513 |
baseline | 0.33097 | 0.30432 | 0.25332 | 0.24763 | 0.33541 | 0.39338 | 0.26192 | 0.37572 | 0.27629 | 0.48482 | 0.34949 |
Runs are ranked by average BLEU score. The average BLEU score is computed by concatenating all ten processed test texts and comparing the result with the concatenation of the ten gold-standard translations. The other ten numeric columns show the BLEU score for each individual text, indicated by the year in which it was written.
The baseline scores are the result of comparing the unmodified historical texts with the gold-standard texts. Runs marked with a star have not (yet) provided an alignment from the translated tokens to the original tokens, a feature which is useful for follow-up tasks such as part-of-speech tagging.
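The organisers' evaluation script is not reproduced here, but the averaging procedure described above can be approximated with an off-the-shelf BLEU implementation. The sketch below uses the sacrebleu package and hypothetical file names; the tokenisation and BLEU settings actually used in the shared task may differ.

```python
# Sketch of the averaging procedure described above: one corpus-level BLEU
# over the concatenation of all ten test files. File names are hypothetical.
import sacrebleu

YEARS = [1607, 1616, 1626, 1636, 1646, 1656, 1668, 1678, 1686, 1692]

def read_lines(path):
    with open(path, encoding="utf-8") as f:
        return [line.strip() for line in f]

system_lines, gold_lines = [], []
for year in YEARS:
    sys_part = read_lines(f"run/output-{year}.txt")    # system translation
    gold_part = read_lines(f"gold/modern-{year}.txt")  # gold-standard modern Dutch
    # Per-decade score, as in the ten year columns of the table.
    per_file = sacrebleu.corpus_bleu(sys_part, [gold_part])
    print(year, round(per_file.score / 100, 5))        # sacrebleu reports 0-100

    system_lines += sys_part
    gold_lines += gold_part

# "Average" column: a single BLEU over the concatenated texts,
# not the mean of the ten per-file scores.
overall = sacrebleu.corpus_bleu(system_lines, [gold_lines])
print("average", round(overall.score / 100, 5))
```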
All POS accuracies
Run | POS accuracy (1616) |
---|---|
ceiling | 0.875±0.026 |
Helsinki-2 | 0.856±0.029 |
Ljubljana-1 | 0.842±0.022 |
Amsterdam-2 | 0.836±0.027 |
Amsterdam-1 | 0.828±0.032 |
Helsinki-3 | 0.826±0.025 |
Leuven-1 | 0.825±0.025 |
Bochum-1 | 0.824±0.030 |
Helsinki-1 | 0.813±0.032 |
Bochum-2 | 0.812±0.027 |
Utrecht-1 | 0.812±0.030 |
Groningen-1 | 0.793±0.024 |
Valencia-1 | 0.759±0.030 |
Helsinki-4 * | - |
Helsinki-5 * | - |
Helsinki-6 * | - |
Helsinki-7 * | - |
baseline | 0.709±0.030 |
Part-of-speech accuracies were computed on one section of the test data (1616). This section was selected because, across runs, its BLEU scores showed the highest correlation with the average BLEU scores. For four runs no POS accuracy could be computed because the translated tokens had not been aligned with the original tokens. The values after the ± sign correspond to two standard deviations, estimated with bootstrap resampling. The baseline score is the accuracy achieved by tagging the untranslated text, while the ceiling is the accuracy achieved by tagging the gold-standard translation and comparing the results with the gold-standard POS tags.
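The exact bootstrap setup used for the ± values is not documented here, so the sketch below is only an illustration of how such a two-standard-deviation estimate can be obtained: token-level correctness is resampled with replacement and the spread of the resampled accuracies is reported. The function name, the token-level sample unit, and the number of iterations are assumptions.

```python
# Illustrative bootstrap estimate of POS tagging accuracy with a ±2*SD margin.
# The shared task's actual resampling setup may differ.
import random

def bootstrap_accuracy(predicted_tags, gold_tags, iterations=1000, seed=42):
    assert len(predicted_tags) == len(gold_tags)
    rng = random.Random(seed)
    n = len(gold_tags)
    correct = [p == g for p, g in zip(predicted_tags, gold_tags)]
    accuracy = sum(correct) / n

    # Resample the per-token correctness values with replacement and
    # recompute the accuracy for each bootstrap sample.
    samples = []
    for _ in range(iterations):
        resample = [correct[rng.randrange(n)] for _ in range(n)]
        samples.append(sum(resample) / n)

    mean = sum(samples) / len(samples)
    sd = (sum((s - mean) ** 2 for s in samples) / len(samples)) ** 0.5
    return accuracy, 2 * sd  # accuracy and two standard deviations

# Example usage with hypothetical tag sequences:
# accuracy, margin = bootstrap_accuracy(system_tags, gold_tags)
# print(f"{accuracy:.3f}±{margin:.3f}")
```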
Teams
Team | Members |
---|---|
Meertens Institute, Amsterdam | Erik Tjong Kim Sang |
Ruhr-Universität Bochum | Marcel Bollmann, Stefanie Dipper, Florian Petran |
University of Groningen | Remko Boschker, Rob van der Goot |
University of Helsinki | Robert Östling, Eva Pettersson, Jörg Tiedemann |
KU Leuven | Tom Vanallemeersch, Leen Sevens |
Jožef Stefan Institute, Ljubljana | Nikola Ljubešić, Yves Scherrer |
Utrecht University | Marijn Schraagen, Feike Dietz, Marjo van Koppen, Kalliopi Zervanou |
Universitat Politècnica de València | Miguel Domingo, Francisco Casacuberta |
Last update: 13 February 2017. erikt(at)xs4all.nl