BASIC DUTCH NAMED ENTITY RECOGNIZER

2006-01-16


INTRODUCTION: NAMES AND TYPES

This basic named entity recognizer identifies names and classifies
them according to the CoNLL named entity tagging scheme. This means
that four types of entities can be detected: 

1. tag=PER: persons and personified things like animals
2. tag=ORG: organizations
3. tag=LOC: locations
4. tag=MISC: all other names or words (adjectives/compounds) derived from names

See http://www.cnts.ua.ac.be/conll2003/ner/annotation.txt for a more
elaborate overview of what entities fit in which classes.

The software places individual words in the entity classes by
appending to them a slash and a tag. For example, the output for the
sentence "George bezocht New York en sprak de VN toe" could be:

   George/PER bezocht/O New/LOC York/LOC en/O sprak/O de/O VN/ORG toe/O

In the output of the tagger, the tag O is assigned to words that do
not belong to a named entity. Adjacent words that receive the same
named entity tag are assumed to form a single entity. Unlike the
CoNLL shared task, the tagging scheme used here cannot mark two
adjacent entities of the same type as separate entities. Note: the
tagger uses two slashes rather than one for unknown words: zxcvb//O
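
The output format described above can be read back with a few lines of
code. The following Python sketch is not part of the package; the
function names parse_token and group_entities are made up for this
illustration. It assumes that every token carries a tag and that the
word itself does not end in a slash.

```python
def parse_token(token):
    """Split one output token into (word, tag, known)."""
    # The tag follows the last slash in the token.
    word, _, tag = token.rpartition("/")
    known = True
    # Unknown words are written with two slashes, e.g. zxcvb//O.
    if word.endswith("/"):
        word = word[:-1]
        known = False
    return word, tag, known


def group_entities(line):
    """Join adjacent words that share a non-O tag into one entity."""
    entities = []
    prev_tag = "O"
    for token in line.split():
        word, tag, _ = parse_token(token)
        if tag != "O":
            if tag == prev_tag:
                # Same tag as the previous word: extend that entity.
                entities[-1] = (entities[-1][0] + " " + word, tag)
            else:
                entities.append((word, tag))
        prev_tag = tag
    return entities


line = "George/PER bezocht/O New/LOC York/LOC en/O sprak/O de/O VN/ORG toe/O"
print(group_entities(line))
# [('George', 'PER'), ('New York', 'LOC'), ('VN', 'ORG')]
```

Because adjacent words with the same tag are always merged, two
neighboring entities of the same type come out as one entity, as
noted above.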


PREPROCESSING: TOKENIZATION

The tagger expects the input to be tokenized. This means that
punctuation marks should be separated from words and from each
other, as in this example:

   " Hurray ! " , said the happy fellow .

This format makes it easier for the tagger to identify entities
because it works at the word level. If a name were the last word
of a sentence and still had a period attached to it, the tagger
would probably not find it in its lexicon.

The software package contains a program for this task (tokenize) 
which can be called like this:

   bin/tokenize < infile > outfile

It reads its input from standard input and prints the output to
standard output.
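
To show what tokenized input looks like, here is a minimal Python
sketch that separates punctuation from words with a regular
expression. It is only an approximation and does not reproduce the
behavior of bin/tokenize: it also splits abbreviations, hyphenated
words and numbers with decimal points, which a real tokenizer would
probably keep together. The function name simple_tokenize is made up
for this illustration.

```python
import re


def simple_tokenize(text):
    """Put spaces between words and punctuation marks."""
    # A token is either a run of word characters or one single
    # character that is neither a word character nor whitespace.
    tokens = re.findall(r"\w+|[^\w\s]", text)
    return " ".join(tokens)


print(simple_tokenize('"Hurray!", said the happy fellow.'))
# " Hurray ! " , said the happy fellow .
```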


RUNNING THE NAMED ENTITY TAGGER

The tagger can be called in the same way as the tokenizer:

   bin/ner < infile > outfile

In case the text in the input file needs to be tokenized:

   bin/tokenize < infile | bin/ner > outfile

Please note that neither of the programs can handle HTML or XML. If
you want to process HTML or XML files you will need to remove the 
<> tags before applying the programs to the files.
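
One possible way to remove the tags is sketched below in Python, using
the HTMLParser class from the standard library. This script is not
part of the package, and the names TextExtractor and strip_markup are
made up for this illustration. It keeps all text between the tags,
including the contents of script and style elements, so real web pages
may need extra filtering.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect the text between tags and drop the tags themselves."""

    def __init__(self):
        # convert_charrefs=True turns references like &amp; into text.
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_markup(markup):
    """Return the text content of an HTML or XML string."""
    parser = TextExtractor()
    parser.feed(markup)
    parser.close()  # flush any text that is still buffered
    return "".join(parser.parts)


print(strip_markup("<p>George bezocht <b>New York</b>.</p>"))
# George bezocht New York.
```

The resulting text can then be passed to bin/tokenize and bin/ner as
shown above.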

Internally, four different programs make up the named entity tagger:

1. bin/recap
   changes the case of the first word in each sentence (and of every
   word in an all-caps sentence) back to its most commonly used form.
   For example, "In" becomes "in" and "wto" is changed to "WTO". Having
   the text in this format usually makes the task of finding names
   easier. The most common capitalization of each word is derived from
   a large text corpus (see the word list in etc/recap.clef). The
   program's word list contains some word bigrams for identifying
   common cases that rely on context (like the first word of
   "New York").

2. bin/disguisSlashes
   Since a slash (/) is a special token for the tagger, this program
   changes all slashes in the input to &slash; .

3. bin/tntwrapper
   This program calls the tagger (tnt) while specifying the directory
   with the training data. The tagger uses a lexicon (tnt/train.lex)
   for identifying known named entities and a training corpus (tnt/train)
   for unknown entities that require identification based on features
   of the words in the near context. The file tnt/000README contains
   instructions on how to improve the tagger by updating the lexicon
   and the training data.

4. bin/nerCleanup
   This program corrects some common errors of the tagger, for
   example by removing unknown lowercase words from the end of
   identified entities.
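
Steps 2 and 4 above can be illustrated with a short Python sketch. It
is not the code of the programs in bin/ and only shows one possible
reading of their descriptions. The function names and the
(word, tag, known) token representation are assumptions made for this
illustration; in particular, "removing" a word from an entity is
interpreted here as changing its tag to O.

```python
def disguise_slashes(text):
    """Step 2: replace every slash so it cannot be read as a separator."""
    return text.replace("/", "&slash;")


def trim_entity_tails(tokens):
    """Step 4 (one reading): retag unknown lowercase words that end an entity.

    tokens is a list of (word, tag, known) triples; a new list is returned.
    """
    result = list(tokens)
    # Walk from right to left so that several trailing words of the
    # same entity can be retagged one after the other.
    for i in range(len(result) - 1, -1, -1):
        word, tag, known = result[i]
        next_tag = result[i + 1][1] if i + 1 < len(result) else "O"
        at_entity_end = tag != "O" and next_tag != tag
        if at_entity_end and not known and word.islower():
            result[i] = (word, "O", known)
    return result


print(disguise_slashes("km/u"))
# km&slash;u

tagged = [("Jan", "PER", True), ("zxcvb", "PER", False), ("loopt", "O", True)]
print(trim_entity_tails(tagged))
# [('Jan', 'PER', True), ('zxcvb', 'O', False), ('loopt', 'O', True)]
```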


ADDITIONAL SOFTWARE

This software package contains a program for converting the output of
the named entity tagger to colorful HTML:

   bin/tokenize < infile | bin/ner | bin/ner2html > outfile.html


REFERENCES

CoNLL Named Entity Tagging
   http://www.cnts.ua.ac.be/conll2003/ner/
TnT Tagger
   http://www.coli.uni-saarland.de/~thorsten/tnt/


SEND COMMENTS TO

Erik Tjong Kim Sang
erikt@science.uva.nl
