Twitter language field

In TwiNL on 09/05/2014 by admin

Today we learned about the existence of a Twitter metadata field, “lang”, that contains the code of the language a tweet is written in. The content of the field is estimated by Twitter. Since language identification is important for our website (we want to collect all tweets written in Dutch), we immediately compared our own language identification software with the contents of this new data field. As a test, we took the first 1000 tweets of our collection from Thursday 8 May 2014 and checked the languages that were assigned to them. With respect to the distinction between Dutch and non-Dutch, the two systems reached the same conclusion for 914 tweets and disagreed on the remaining 86. We then manually inspected these 86 tweets, with the following results:

52 times Twitter assigned the correct language and our software did not
14 times our software assigned the correct language and Twitter did not
20 times the language of the tweet could not be decided
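
The counts above can be turned into rates with a little arithmetic. The sketch below is written in Python rather than the crawler's Perl, uses only the numbers reported in this post, and the variable names are chosen here for illustration.

```python
# Counts reported in this post for the first 1000 tweets of 8 May 2014.
total = 1000
agree = 914          # both systems made the same Dutch / non-Dutch call
twitter_right = 52   # Twitter correct, our software wrong
ours_right = 14      # our software correct, Twitter wrong
undecided = 20       # language could not be decided by hand

disagree = total - agree
# The three manual categories should account for every disagreement.
assert disagree == twitter_right + ours_right + undecided

agreement_rate = agree / total
decidable = twitter_right + ours_right
twitter_share = twitter_right / decidable

print(f"agreement: {agreement_rate:.1%}")
print(f"Twitter correct in {twitter_share:.1%} of {decidable} decidable disagreements")
```

Leaving out the 20 undecidable tweets, Twitter was right in 52 of the 66 remaining disagreements, which is roughly four out of five.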

From this small test we concluded that Twitter's language identification performs better than our own software at identifying Dutch tweets, and that it would not hurt to add a language filter to our tweet crawler. Unfortunately, it turned out that it is not possible to simply ask the Twitter API for all Dutch tweets: a language filter can only be used in combination with another filter such as track or follow.

Two steps were required to add the language filter to our standard tweet crawling script. First, the new filter was added on line 110 to the data variable used by the curl call:

my $data = 'language=nl&track='.uri_escape($status);
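
For readers who do not use Perl, the same request body can be sketched in Python. This is a minimal illustration, not the crawler's code: the function name `build_stream_body` and the track terms are hypothetical, and `quote(..., safe="")` is assumed to be a reasonable stand-in for `uri_escape`.

```python
from urllib.parse import quote


def build_stream_body(track: str, language: str = "nl") -> str:
    """Return a form-encoded body combining the language and track filters."""
    # safe="" percent-encodes every character except the unreserved ones,
    # so the commas between track terms become %2C.
    return "language=" + quote(language, safe="") + "&track=" + quote(track, safe="")


# Hypothetical track terms; the real crawler passes its own list in $status.
body = build_stream_body("de,het,een")
print(body)
```

The language filter appears next to track here because, as noted above, it cannot be sent on its own.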

Next a new block of code was added after line 58 to add the language filter to the signature computation process:

$signature_base .= uri_escape("language") .
        '=' .       
        uri_escape("nl") .
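
The snippet above is only a fragment, so here is a hedged sketch of how an OAuth 1.0a signature base string with the extra language parameter can be put together. It is written in Python rather than Perl and is not the crawler's actual code. The endpoint URL, the placeholder credentials, the track terms and the helper names `enc`, `signature_base` and `sign` are all assumptions made for illustration.

```python
import base64
import hashlib
import hmac
from urllib.parse import quote


def enc(value: str) -> str:
    """Percent-encode a value, leaving only unreserved characters as-is."""
    return quote(value, safe="")


def signature_base(method: str, url: str, params: dict) -> str:
    """Build the OAuth 1.0a signature base string from all request parameters."""
    # Parameters are encoded, sorted by name, and joined as name=value pairs.
    pairs = sorted((enc(k), enc(v)) for k, v in params.items())
    param_string = "&".join(f"{k}={v}" for k, v in pairs)
    return "&".join([method.upper(), enc(url), enc(param_string)])


def sign(base: str, consumer_secret: str, token_secret: str) -> str:
    """Sign the base string with HMAC-SHA1 and return the base64 signature."""
    key = f"{enc(consumer_secret)}&{enc(token_secret)}".encode()
    digest = hmac.new(key, base.encode(), hashlib.sha1).digest()
    return base64.b64encode(digest).decode()


# All values below are placeholders, not real credentials.
params = {
    "language": "nl",
    "track": "de,het",
    "oauth_consumer_key": "CONSUMER_KEY",
    "oauth_nonce": "NONCE",
    "oauth_signature_method": "HMAC-SHA1",
    "oauth_timestamp": "1399593600",
    "oauth_token": "TOKEN",
    "oauth_version": "1.0",
}
url = "https://stream.twitter.com/1.1/statuses/filter.json"  # assumed endpoint

base = signature_base("POST", url, params)
signature = sign(base, "CONSUMER_SECRET", "TOKEN_SECRET")
print(base)
print(signature)
```

Because the parameters are sorted by name, language ends up before the oauth_* parameters and track after them. A script that builds the base string by hand has to add each new parameter at the matching position, otherwise the signature will not verify.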

As a result, our tweet crawler now only returns tweets that are written in Dutch according to Twitter. We will miss all Dutch tweets that Twitter does not recognize as Dutch, but we will gain the tweets that our language identification software did not identify as Dutch. The biggest bonus of the new filter is that we download fewer tweets from Twitter: about 50% fewer. The crawler is therefore less likely to hit the download limits set by Twitter, which should mean that we can collect more tweets during peak hours.

Thanks Peter Kleiweg for the language field tip!

2 Responses to “Twitter language field”

  1. If I remember correctly, the lang field is not meant as a guess as to what language the tweet is in, but rather the language the Twitter user has set in their profile.

  2. Yes, there are two “lang” fields in the tweets: one for the guessed language of the text and another for the language of the user's interface. If a tweet is encoded like {"lang":"en","user":{"lang":"fr"}}, then the tweet text is in English according to Twitter and the user has set their interface to French.

    This blog post is about the “lang” field used for the tweet text.
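
    The difference between the two fields can be illustrated with a short Python sketch. It assumes the simplified tweet structure from the example above, and the variable names are made up for illustration.

```python
import json

# Simplified tweet from the example above; real tweets have many more fields.
raw = '{"lang": "en", "user": {"lang": "fr"}}'
tweet = json.loads(raw)

# Top-level "lang": Twitter's guess for the language of the tweet text.
text_lang = tweet.get("lang")
# Nested "lang": the interface language the user has selected.
ui_lang = tweet.get("user", {}).get("lang")

# A Dutch filter should look at the top-level field only.
is_dutch = text_lang == "nl"
print(text_lang, ui_lang, is_dutch)
```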
