Erik Tjong Kim Sang
The Syntactic Atlas of the Dutch Dialects (SAND) is an online database of syntactic characteristics of Dutch dialects (Barbiers et al., 2008). The database has been used for different purposes, for example for measuring the syntactic distance between dialects (Spruit, 2008). In this paper, we apply statistical techniques to the data to find transition zones: areas between dialect regions which display a mix of characteristics of several neighboring regions rather than fitting neatly into one of them.
We build a statistical model of the available data, anchored by a definition of dialect regions based on phonetic differences (Daan and Blok, 1969). We restrict the granularity of this definition to five main regions (Flemish, Frisian, Hollandic, Limburgish and Low Saxon) rather than using all 24 dialects, because only a limited amount of data is available for the classification.
For the statistical model, we chose the Naive Bayes algorithm (Mitchell, 1997). More advanced machine learning methods exist, but Naive Bayes is easy to implement and can deal with missing values, which appear frequently in our data (more than 50% of the data points are undefined).
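The way Naive Bayes copes with undefined values can be illustrated with a minimal sketch. It assumes binary features encoded as 0 or 1 with `None` for a missing value, and it simply skips missing values both when counting and when scoring. The function names, the Laplace smoothing parameter `alpha` and the toy data are illustrative assumptions, not the implementation used for the experiment.

```python
import math


def train_nb(rows, labels, alpha=1.0):
    """Fit a Bernoulli Naive Bayes model; None marks a missing feature value.

    alpha is a Laplace smoothing constant (an assumption of this sketch).
    """
    classes = sorted(set(labels))
    n_features = len(rows[0])
    class_count = {c: 0 for c in classes}
    # ones[c][j]: number of 1s observed for feature j in class c
    # seen[c][j]: number of non-missing values for feature j in class c
    ones = {c: [0] * n_features for c in classes}
    seen = {c: [0] * n_features for c in classes}
    for row, c in zip(rows, labels):
        class_count[c] += 1
        for j, value in enumerate(row):
            if value is None:
                continue  # a missing value contributes nothing to the counts
            seen[c][j] += 1
            ones[c][j] += value
    prior = {c: class_count[c] / len(labels) for c in classes}
    p_one = {
        c: [(ones[c][j] + alpha) / (seen[c][j] + 2 * alpha)
            for j in range(n_features)]
        for c in classes
    }
    return prior, p_one


def classify_nb(model, row):
    """Return the class with the highest log posterior, ignoring missing values."""
    prior, p_one = model
    scores = {}
    for c in prior:
        score = math.log(prior[c])
        for j, value in enumerate(row):
            if value is None:
                continue  # a missing value is left out of the product
            p = p_one[c][j]
            score += math.log(p if value == 1 else 1.0 - p)
        scores[c] = score
    return max(scores, key=scores.get)


# Toy data (invented for illustration): 4 locations, 3 binary features.
rows = [[1, 1, None], [1, None, 0], [0, 0, 1], [None, 0, 1]]
labels = ["Flemish", "Flemish", "Frisian", "Frisian"]
model = train_nb(rows, labels)
print(classify_nb(model, [1, None, None]))  # Flemish
print(classify_nb(model, [0, 0, None]))     # Frisian
```

Because each feature contributes an independent factor, dropping the factor of an undefined feature leaves a valid score over the remaining features, so no imputation step is needed.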
We use only a subset of the SAND data for this experiment: 22 syntactic variables with binary values. SAND is much larger, but we have not yet found a good way to measure the distances between multi-valued features (see also Tjong Kim Sang, 2014).
Figure 1: Dutch dialect definition based on phonetic data analysed by Daan and Blok (left) and map with differences based on syntactic features generated by Naive Bayes (right). Colors different from the main color on the right map indicate areas in which the statistical model suggested a different region than the phonetic data.
We converted the SAND data selection into a training file with 22 columns (syntactic features) and 267 rows (locations) and used Naive Bayes to build a model that links the feature values to the region classes derived from Daan and Blok (1969). We then applied the model to classify the same 267 locations based on their feature values. The model put 214 locations (80%) in the same region as Daan and Blok, but for the other 53 locations it suggested a different region. A map displaying these regions can be found in Figure 1.
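The comparison between the model output and the reference regions can be sketched as follows. The sketch assumes three parallel lists (location names, reference regions and predicted regions); the function name `compare_regions` and the toy locations are hypothetical and serve only to show how the agreement rate and the list of candidate transition-zone locations could be derived.

```python
from collections import Counter


def compare_regions(names, reference, predicted):
    """Compare predicted regions with reference regions per location.

    Returns the agreement rate, the list of disagreeing locations as
    (name, reference region, predicted region) tuples, and a count of
    each (reference, predicted) pair among the disagreements.
    """
    disagreements = [
        (name, ref, pred)
        for name, ref, pred in zip(names, reference, predicted)
        if ref != pred
    ]
    matches = len(names) - len(disagreements)
    agreement = matches / len(names)
    shifts = Counter((ref, pred) for _, ref, pred in disagreements)
    return agreement, disagreements, shifts


# Toy data (invented for illustration, not taken from SAND).
names = ["loc_a", "loc_b", "loc_c", "loc_d", "loc_e"]
reference = ["Flemish", "Flemish", "Hollandic", "Limburgish", "Low Saxon"]
predicted = ["Flemish", "Hollandic", "Hollandic", "Limburgish", "Hollandic"]

agreement, disagreements, shifts = compare_regions(names, reference, predicted)
print(agreement)       # 0.6
print(disagreements)   # loc_b and loc_e are flagged
```

With the figures reported above, the same computation gives 214 / 267 ≈ 0.80, and the 53 disagreeing locations are the ones shown in a deviating color in Figure 1.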
Several explanations can be given for the classification differences between the statistical model and the standard map by Daan and Blok. The locations could share characteristics of different regions. It is also possible that the phonetic boundaries between Dutch dialects lie in different positions than the syntactic boundaries. Either of these explanations suggests that the locations are part of transition zones. It is interesting to see that most of the marked locations are at the boundary of two regions or at the very edge of the map, supporting the proposal that interesting linguistic processes can be observed there.
We have applied a supervised machine learning technique, Naive Bayes, to build a model of dialect regions. For part of the locations, the generated model disagreed with the input region definition. These locations could be part of transition zones. Further study of their linguistic characteristics is required to support this claim.
Sjef Barbiers, Johan van der Auwera, Hans Bennis, Eefje Boef, Gunther De Vogelaer, and Margreet van der Ham. Syntactic Atlas of the Dutch Dialects. Amsterdam University Press, 2008.
Jo Daan and D.P. Blok. Van randstad tot landrand. Noord-Hollandsche Uitgevers Maatschappij, Amsterdam, The Netherlands, 1969.
Tom Mitchell. Machine Learning. McGraw-Hill, 1997.
Marco René Spruit. Quantitative perspectives on syntactic variation in Dutch dialects. LOT, Utrecht, The Netherlands, 2008.
Erik Tjong Kim Sang. SAND: Relation between the Database and Printed Maps. Technical report, Meertens Institute, Amsterdam, The Netherlands, June 2014.