CLIN2017 Shared Task: Results

Here are the results of the CLIN2017 Shared Task on Translating Historical Text. Participants built systems that translated seventeenth-century Dutch texts into modern Dutch. There were ten test files of about 1000 words each, taken from ten different decades of the seventeenth century. System output was evaluated with BLEU scores (all 10 files) and part-of-speech tagging accuracy (1 file). The results were presented at the CLIN2017 conference on 10 February 2017 [PDF slides].

Best run per participating team

   Team        BLEU     POS accuracy  Abstract
1. Ljubljana   0.61029  0.842±0.022   PDF abstract
2. Bochum      0.60665  0.824±0.030   PDF abstract
3. Helsinki    0.59896  0.856±0.029   PDF abstract
4. Amsterdam   0.56783  0.836±0.027   PDF abstract
5. Leuven      0.53752  0.825±0.025   PDF abstract
6. Utrecht     0.46867  0.812±0.030   PDF abstract
7. Groningen   0.45061  0.793±0.024   PDF abstract
8. Valencia    0.43011  0.759±0.030   PDF abstract
   baseline    0.33097  0.709±0.030

Score computation is explained below. Ljubljana achieved the highest BLEU score and Helsinki the highest POS accuracy.

All BLEU scores

Run           Average  1607     1616     1626     1636     1646     1656     1668     1678     1686     1692
Ljubljana-1   0.61029  0.67732  0.60926  0.57455  0.61950  0.58307  0.56037  0.61012  0.65322  0.55935  0.66524
Bochum-2      0.60665  0.64089  0.61309  0.56817  0.58585  0.44186  0.63273  0.67552  0.70411  0.58466  0.60917
Bochum-1      0.60234  0.65983  0.60315  0.55118  0.61470  0.45921  0.65072  0.65249  0.65674  0.56904  0.60538
Helsinki-5 *  0.59896  0.61667  0.57046  0.58052  0.62991  0.54697  0.60257  0.57868  0.67123  0.54081  0.65617
Helsinki-2    0.58934  0.62086  0.58465  0.53229  0.57905  0.50739  0.56980  0.63734  0.67787  0.57585  0.60725
Helsinki-6 *  0.58672  0.59119  0.57439  0.56342  0.60378  0.52120  0.60417  0.57923  0.67775  0.52111  0.63063
Helsinki-7 *  0.57320  0.59093  0.53448  0.56572  0.62600  0.51191  0.58666  0.55174  0.65023  0.48441  0.62968
Amsterdam-2   0.56783  0.54464  0.52737  0.52678  0.54698  0.55550  0.51895  0.61512  0.62882  0.65683  0.54378
Amsterdam-1   0.56343  0.55414  0.54022  0.51897  0.55200  0.47520  0.56277  0.64077  0.61203  0.62591  0.53840
Leuven-1      0.53752  0.55011  0.47465  0.50595  0.55878  0.54798  0.51725  0.53486  0.59645  0.49087  0.60402
Helsinki-3    0.52014  0.54159  0.49049  0.50284  0.52435  0.43068  0.52654  0.57713  0.55754  0.50634  0.54218
Helsinki-1    0.49451  0.49072  0.43877  0.48839  0.55702  0.44870  0.53486  0.48633  0.56976  0.38622  0.54547
Helsinki-4 *  0.48562  0.51338  0.46029  0.46493  0.52947  0.47724  0.50860  0.49286  0.52398  0.38617  0.50544
Utrecht-1     0.46867  0.49013  0.41701  0.45833  0.49936  0.44293  0.45677  0.43454  0.51246  0.42110  0.55846
Groningen-1   0.45061  0.45711  0.38521  0.40959  0.48100  0.38544  0.48637  0.49893  0.47769  0.41755  0.49965
Valencia-1    0.43011  0.45356  0.33863  0.41330  0.44843  0.33326  0.46862  0.53844  0.45571  0.42832  0.41513
baseline      0.33097  0.30432  0.25332  0.24763  0.33541  0.39338  0.26192  0.37572  0.27629  0.48482  0.34949

Runs are ranked by average BLEU score. The average BLEU score is computed by concatenating all ten processed test texts and comparing the result with the concatenation of the ten gold-standard translations. The other ten numeric columns show the BLEU score per text, identified by the year in which the text was written.

The baseline scores are obtained by comparing the unmodified historical texts with the gold-standard translations. Runs marked with a star have not (yet) provided an alignment from the translated tokens to the original tokens, which is needed when the translation is used for follow-up tasks such as part-of-speech tagging.
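
The pooled evaluation can be reproduced along the following lines. This is a minimal sketch, not the organizers' evaluation script: it assumes one whitespace-tokenized sentence per line, hypothetical file names (system/<year>.txt, gold/<year>.txt), and uses NLTK's corpus_bleu as a stand-in for whichever BLEU implementation was actually used.

    # Minimal sketch of the BLEU evaluation described above (assumptions:
    # one whitespace-tokenized sentence per line; hypothetical file layout;
    # NLTK's corpus_bleu as a stand-in for the actual BLEU implementation).
    from nltk.translate.bleu_score import corpus_bleu

    DECADES = ["1607", "1616", "1626", "1636", "1646",
               "1656", "1668", "1678", "1686", "1692"]

    def read_sentences(path):
        """Read one whitespace-tokenized sentence per line."""
        with open(path, encoding="utf-8") as f:
            return [line.split() for line in f if line.strip()]

    all_hyps, all_refs = [], []
    for decade in DECADES:
        hyps = read_sentences(f"system/{decade}.txt")  # system translations
        refs = read_sentences(f"gold/{decade}.txt")    # gold-standard translations
        assert len(hyps) == len(refs)
        # Per-decade BLEU with a single reference per sentence.
        print(decade, round(corpus_bleu([[r] for r in refs], hyps), 5))
        all_hyps.extend(hyps)
        all_refs.extend(refs)

    # "Average" BLEU: pool the n-gram statistics over the ten concatenated
    # texts rather than averaging the ten per-decade scores.
    print("average", round(corpus_bleu([[r] for r in all_refs], all_hyps), 5))

    # Baseline: run the same computation with the unmodified historical texts
    # in place of the system output.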

All POS accuracies

Run           1616
ceiling       0.875±0.026
Helsinki-2    0.856±0.029
Ljubljana-1   0.842±0.022
Amsterdam-2   0.836±0.027
Amsterdam-1   0.828±0.032
Helsinki-3    0.826±0.025
Leuven-1      0.825±0.025
Bochum-1      0.824±0.030
Helsinki-1    0.813±0.032
Bochum-2      0.812±0.027
Utrecht-1     0.812±0.030
Groningen-1   0.793±0.024
Valencia-1    0.759±0.030
Helsinki-4 *  -
Helsinki-5 *  -
Helsinki-6 *  -
Helsinki-7 *  -
baseline      0.709±0.030

Part-of-speech accuracies were computed on one section of the test data (1616), which was selected because its per-run BLEU scores showed the highest correlation with the average BLEU scores. For four runs no POS accuracy could be computed because the translated tokens had not been aligned with the original tokens. The values after the ± sign are twice the standard deviation, estimated with bootstrap resampling. The baseline score is the accuracy achieved by tagging the untranslated text, while the ceiling is the accuracy achieved by tagging the gold-standard translation and comparing the resulting tags with the gold-standard POS tags.
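
The ± values can be reproduced with a simple bootstrap over tokens. The sketch below is an assumption about the procedure, not the organizers' script: the per-token correctness flags, the number of resamples (1000) and the example data are all hypothetical.

    # Sketch of a bootstrap estimate for the ± values in the table above
    # (assumptions: per-token correctness flags are available; 1000 resamples
    # and the example data are hypothetical choices, not the organizers').
    import random

    def accuracy(flags):
        return sum(flags) / len(flags)

    def bootstrap_two_sigma(flags, resamples=1000, seed=1):
        """Twice the standard deviation of the accuracy over bootstrap
        resamples, drawn from the token-level correctness flags with
        replacement."""
        rng = random.Random(seed)
        n = len(flags)
        scores = []
        for _ in range(resamples):
            sample = [flags[rng.randrange(n)] for _ in range(n)]
            scores.append(accuracy(sample))
        mean = sum(scores) / resamples
        variance = sum((s - mean) ** 2 for s in scores) / resamples
        return 2.0 * variance ** 0.5

    # correct[i] is True when the tagger's tag for token i of the 1616 text
    # matches the gold-standard POS tag (hypothetical example data).
    correct = [True] * 830 + [False] * 170
    print(f"{accuracy(correct):.3f} ±{bootstrap_two_sigma(correct):.3f}")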

Teams

Team                                  Members
Meertens Institute, Amsterdam         Erik Tjong Kim Sang
Ruhr-Universität Bochum               Marcel Bollmann, Stefanie Dipper, Florian Petran
University of Groningen               Remko Boschker, Rob van der Goot
University of Helsinki                Robert Östling, Eva Pettersson, Jörg Tiedemann
KU Leuven                             Tom Vanallemeersch, Leen Sevens
Jožef Stefan Institute, Ljubljana     Nikola Ljubešić, Yves Scherrer
Utrecht University                    Marijn Schraagen, Feike Dietz, Marjo van Koppen, Kalliopi Zervanou
Universitat Politècnica de València   Miguel Domingo, Francisco Casacuberta

Last update: 13 February 2017. erikt(at)xs4all.nl