CLIN2017 Shared Task: Results

Here are the results of the CLIN2017 Shared Task on Translating Historical Text. Participants built systems that translated seventeenth-century Dutch texts into modern Dutch. There were ten test files of about 1000 words each, each taken from a different decade of the seventeenth century. System output was evaluated with BLEU scores (10 files) and part-of-speech accuracy (1 file). Results were presented at the CLIN2017 conference on 10 February 2017 [PDF slides].

Best run per participating team

	Team	BLEU	POS accuracy
1.	Ljubljana	0.61029	0.842±0.022
2.	Bochum	0.60665	0.824±0.030
3.	Helsinki	0.59896	0.856±0.029
4.	Amsterdam	0.56783	0.836±0.027
5.	Leuven	0.53752	0.825±0.025
6.	Utrecht	0.46867	0.812±0.030
7.	Groningen	0.45061	0.793±0.024
8.	Valencia	0.43011	0.759±0.030
	baseline	0.33097	0.709±0.030

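Score computation is explained below. The highest BLEU score in the table above is Ljubljana's (0.61029) and the highest POS accuracy is Helsinki's (0.856±0.029). Each column shows a team's best score for that measure, so the two values in a row can come from different runs (for example, Bochum-2 for BLEU and Bochum-1 for POS accuracy).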

All BLEU scores

Run	Average	1607	1616	1626	1636	1646	1656	1668	1678	1686	1692
Ljubljana-1	0.61029	0.67732	0.60926	0.57455	0.61950	0.58307	0.56037	0.61012	0.65322	0.55935	0.66524
Bochum-2	0.60665	0.64089	0.61309	0.56817	0.58585	0.44186	0.63273	0.67552	0.70411	0.58466	0.60917
Bochum-1	0.60234	0.65983	0.60315	0.55118	0.61470	0.45921	0.65072	0.65249	0.65674	0.56904	0.60538
Helsinki-5 *	0.59896	0.61667	0.57046	0.58052	0.62991	0.54697	0.60257	0.57868	0.67123	0.54081	0.65617
Helsinki-2	0.58934	0.62086	0.58465	0.53229	0.57905	0.50739	0.56980	0.63734	0.67787	0.57585	0.60725
Helsinki-6 *	0.58672	0.59119	0.57439	0.56342	0.60378	0.52120	0.60417	0.57923	0.67775	0.52111	0.63063
Helsinki-7 *	0.57320	0.59093	0.53448	0.56572	0.62600	0.51191	0.58666	0.55174	0.65023	0.48441	0.62968
Amsterdam-2	0.56783	0.54464	0.52737	0.52678	0.54698	0.55550	0.51895	0.61512	0.62882	0.65683	0.54378
Amsterdam-1	0.56343	0.55414	0.54022	0.51897	0.55200	0.47520	0.56277	0.64077	0.61203	0.62591	0.53840
Leuven-1	0.53752	0.55011	0.47465	0.50595	0.55878	0.54798	0.51725	0.53486	0.59645	0.49087	0.60402
Helsinki-3	0.52014	0.54159	0.49049	0.50284	0.52435	0.43068	0.52654	0.57713	0.55754	0.50634	0.54218
Helsinki-1	0.49451	0.49072	0.43877	0.48839	0.55702	0.44870	0.53486	0.48633	0.56976	0.38622	0.54547
Helsinki-4 *	0.48562	0.51338	0.46029	0.46493	0.52947	0.47724	0.50860	0.49286	0.52398	0.38617	0.50544
Utrecht-1	0.46867	0.49013	0.41701	0.45833	0.49936	0.44293	0.45677	0.43454	0.51246	0.42110	0.55846
Groningen-1	0.45061	0.45711	0.38521	0.40959	0.48100	0.38544	0.48637	0.49893	0.47769	0.41755	0.49965
Valencia-1	0.43011	0.45356	0.33863	0.41330	0.44843	0.33326	0.46862	0.53844	0.45571	0.42832	0.41513
baseline	0.33097	0.30432	0.25332	0.24763	0.33541	0.39338	0.26192	0.37572	0.27629	0.48482	0.34949

Runs are ranked by average BLEU score. The average BLEU score is computed by concatenating all ten processed test texts and comparing the result with the concatenation of the ten gold standard translations. The other ten numeric columns show the BLEU score for the text from each decade, labelled by the year in which the text was written.
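
The concatenation step can be illustrated with a minimal Python sketch. It assumes a plain single-reference BLEU with n-grams up to length 4, uniform weights, a brevity penalty and no smoothing, and it treats each text as one token sequence. The scoring tool, tokenization and segmentation actually used for the shared task are not described here, so the function names bleu and concatenated_bleu are illustrative only.

```python
import math
from collections import Counter


def ngram_counts(tokens, n):
    """Count all n-grams of order n in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu(candidate, reference, max_n=4):
    """Unsmoothed single-reference BLEU over two token lists (assumed variant)."""
    log_precisions = []
    for n in range(1, max_n + 1):
        cand = ngram_counts(candidate, n)
        ref = ngram_counts(reference, n)
        total = sum(cand.values())
        # Clip each candidate n-gram count by its count in the reference.
        matched = sum(min(count, ref[gram]) for gram, count in cand.items())
        if total == 0 or matched == 0:
            return 0.0
        log_precisions.append(math.log(matched / total))
    # Brevity penalty: penalize candidates shorter than the reference.
    if len(candidate) >= len(reference):
        brevity = 1.0
    else:
        brevity = math.exp(1 - len(reference) / len(candidate))
    return brevity * math.exp(sum(log_precisions) / max_n)


def concatenated_bleu(candidates, references):
    """Score all texts as one long text instead of averaging per-text scores."""
    cand_all = [tok for text in candidates for tok in text]
    ref_all = [tok for text in references for tok in text]
    return bleu(cand_all, ref_all)
```

Because the n-gram counts are pooled before the score is computed, the concatenated score is generally not the arithmetic mean of the per-text scores. For example, the ten per-text scores of Ljubljana-1 in the table average 0.61120, while the reported Average column gives 0.61029.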

The baseline scores are the result of comparing the unmodified historical texts with the gold standard texts. Runs marked with a star (*) have not (yet) provided an alignment from the translated tokens to the original tokens. Such an alignment is useful when the translation is used for follow-up tasks like part-of-speech tagging.
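
One way such an alignment can support part-of-speech tagging is by carrying tags assigned to the translated tokens back to the original tokens. The sketch below is an illustration only: the alignment format (a list of index pairs), the function name project_tags and the choice to join multiple tags with "+" are all assumptions, not the format or procedure used in the shared task.

```python
def project_tags(n_original, translated_tags, alignment):
    """Project tags from translated tokens back onto original tokens.

    n_original: number of tokens in the original (historical) text.
    translated_tags: one tag per translated token.
    alignment: hypothetical list of (original_index, translated_index) pairs.
    """
    collected = [[] for _ in range(n_original)]
    for orig_i, trans_i in alignment:
        collected[orig_i].append(translated_tags[trans_i])
    # Original tokens without any aligned translation receive None.
    return ["+".join(tags) if tags else None for tags in collected]
```

With this assumed format, an original token that is translated as two modern tokens receives a combined tag, and a run without an alignment gives no way to attach tags to the original tokens at all.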

All POS accuracies

Run	POS accuracy (1616)
ceiling	0.875±0.026
Helsinki-2	0.856±0.029
Ljubljana-1	0.842±0.022
Amsterdam-2	0.836±0.027
Amsterdam-1	0.828±0.032
Helsinki-3	0.826±0.025
Leuven-1	0.825±0.025
Bochum-1	0.824±0.030
Helsinki-1	0.813±0.032
Bochum-2	0.812±0.027
Utrecht-1	0.812±0.030
Groningen-1	0.793±0.024
Valencia-1	0.759±0.030
Helsinki-4 *	-
Helsinki-5 *	-
Helsinki-6 *	-
Helsinki-7 *	-
baseline	0.709±0.030

Part-of-speech accuracies were computed on one section of the test data (1616). This section was selected because its BLEU scores correlated most strongly with the average BLEU scores. For four runs no POS accuracy could be computed because the translated tokens had not been aligned with the original tokens. The values after the ± sign are two standard deviations, estimated with bootstrap resampling. The baseline score is the accuracy achieved by tagging the untranslated text, while the ceiling is the accuracy achieved by tagging the gold standard translation; in both cases the resulting tags are compared with the gold standard POS tags.
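
The ± values can be illustrated with a short Python sketch of bootstrap resampling. It assumes that the resampling unit is the individual token and uses 1000 resamples; the unit and the number of resamples used for the reported figures are not stated, and the function name bootstrap_accuracy is illustrative.

```python
import random
import statistics


def bootstrap_accuracy(correct, samples=1000, seed=0):
    """Return (accuracy, two standard deviations) for a list of 0/1 flags.

    correct: one flag per token, 1 if the predicted tag matches the gold tag.
    samples: number of bootstrap resamples (assumed value).
    """
    rng = random.Random(seed)
    n = len(correct)
    accuracy = sum(correct) / n
    resampled = []
    for _ in range(samples):
        # Draw n tokens with replacement and recompute the accuracy.
        draw = rng.choices(correct, k=n)
        resampled.append(sum(draw) / n)
    return accuracy, 2 * statistics.stdev(resampled)
```

The returned pair corresponds to the two numbers in an entry such as 0.842±0.022: the accuracy on the full section and twice the standard deviation of the accuracies over the resamples.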

Teams

Team	Members
Meertens Institute, Amsterdam	Erik Tjong Kim Sang
Ruhr-Universität Bochum	Marcel Bollmann, Stefanie Dipper, Florian Petran
University of Groningen	Remko Boschker, Rob van der Goot
University of Helsinki	Robert Östling, Eva Pettersson, Jörg Tiedemann
KU Leuven	Tom Vanallemeersch, Leen Sevens
Jožef Stefan Institute, Ljubljana	Nikola Ljubešić, Yves Scherrer
Utrecht University	Marijn Schraagen, Feike Dietz, Marjo van Koppen, Kalliopi Zervanou
Universitat Politècnica de València	Miguel Domingo, Francisco Casacuberta

Last update: 13 February 2017. erikt(at)xs4all.nl