Machine Learning of Phonotactics

This location contains information related to the thesis Machine Learning of Phonotactics by Erik F. Tjong Kim Sang, published at the University of Groningen, The Netherlands in 1998.

Thesis files

The thesis is available as a collection of Postscript files and a collection of PDF files.

The thesis defense took place on October 19, 1998 in Groningen, The Netherlands.


In this thesis, three different learning algorithms are applied to one language problem: the recognition of the phonotactic structure of monosyllabic Dutch words. The learning methods used are Hidden Markov Models (a statistical method), Simple Recurrent Networks (a connectionist method) and Inductive Logic Programming (a rule-based method). The thesis project aimed at answering three research questions.

The learning algorithms were supplied with the training data in two representation formats: orthographic and phonetic. They received the data as a sequence of character pairs (bigrams). Two versions of each experiment were performed: one in which the algorithms were supplied with some basic phonotactic knowledge and one without such initial knowledge. The algorithms were tested with unseen positive test data, which they should accept, and negative test data, which they should reject.
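The bigram encoding described above can be sketched as follows. This is a minimal illustration, not the thesis code; the use of '#' as a word-boundary marker is an assumption made for the example.

```python
def to_bigrams(word):
    """Split a word into a sequence of character pairs (bigrams).

    The word is padded with '#' boundary markers (an assumed convention)
    so that word-initial and word-final pairs are represented as well.
    """
    padded = "#" + word + "#"
    return [padded[i:i + 2] for i in range(len(padded) - 1)]

print(to_bigrams("brook"))  # → ['#b', 'br', 'ro', 'oo', 'ok', 'k#']
```

Each monosyllabic word thus becomes a short sequence of overlapping symbol pairs, which is the input format the three learning methods received.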

Hidden Markov Models and Inductive Logic Programming performed well on this problem; Simple Recurrent Networks performed poorly. The best scores for orthographic data were obtained by Inductive Logic Programming with linguistic initialization: it accepted 97.8% of the positive orthographic data and rejected 97.7% of the negative orthographic data [97.8%,97.7%]. The best scores for phonetic data were obtained by Hidden Markov Models with linguistic initialization: [99.1%,99.1%].

The complexity of the data was later measured with a baseline method which accepts every word that consists of bigrams that appear in the training data. This method obtained [99.2%,60.2%] for orthographic data and [99.2%,71.6%] for phonetic data. The two best learning methods performed slightly worse on positive data but much better on negative data.
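The baseline method can be sketched in a few lines: collect every bigram seen in training, then accept a test word only if all of its bigrams were seen. This is an illustrative reconstruction under the same boundary-marker assumption as above, not the original implementation.

```python
def to_bigrams(word):
    """Split a word into character pairs, with '#' boundary markers (assumed)."""
    padded = "#" + word + "#"
    return [padded[i:i + 2] for i in range(len(padded) - 1)]

def train_baseline(words):
    """Collect the set of all bigrams occurring in the training words."""
    seen = set()
    for word in words:
        seen.update(to_bigrams(word))
    return seen

def accepts(seen, word):
    """Accept a word iff every one of its bigrams appeared in training."""
    return all(bigram in seen for bigram in to_bigrams(word))

seen = train_baseline(["brook", "bring"])
print(accepts(seen, "bring"))  # → True  (all bigrams were seen)
print(accepts(seen, "brok"))   # → True  (bigram-wise legal, so accepted)
print(accepts(seen, "xax"))    # → False (contains unseen bigrams)
```

The second example shows why the baseline scores well on positive data but poorly on negative data: many ill-formed strings are still composed entirely of attested bigrams.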

The algorithms produced better models for phonetic data than for orthographic data. Although the complexity of the phonetic data is larger than that of the orthographic data, phonetic data is closer to the speech signal, and it is therefore easier to recognize regularities in this type of data.

The algorithms which were supplied with initial knowledge outperformed the ones without such knowledge. This result gives some empirical support for child language acquisition theories which assume that children rely on innate knowledge for learning languages.

The data files used in these experiments are available upon request. For more information, send e-mail to erikt(at)

The image on this page was drawn by Bill Watterson for the book Scientific Progress Goes "Boink" in the series Calvin and Hobbes (Andrews and McMeel, Kansas, 1991).
Last update: March 31, 2016. erikt(at)