New and faster version of

In TwiNL on 11/07/2016 by admin

In the past months, Laura Leistikow and Jeroen Schot of SURFsara have worked hard on a new version of the website, which will eventually replace the current one. The most important feature of the new website is that it stores the tweets in an index. This means that it can return results for queries in seconds, rather than in minutes like the current system.

You are welcome to test the new website at:

Feedback on the website can be added to the comments of this post or mailed to erikt(at)


Empty search cache

In TwiNL on 26/03/2015 by admin

On Monday 2 March 2015, the files of the website were moved to a new Hadoop server. The website saves the results of queries in several files in a cache directory. In the past two years, this cache directory had grown to 48 million files, causing problems for the system. Therefore, the cache directory has not been copied.

For users, this means that they may need to rerun queries in order to access old query results. This may require logging in and pressing the green repeat button (herhaal) in the overview of past queries. Please contact Erik if you need any assistance with this.


API for

In TwiNL on 24/06/2014 by admin

An API (Application Programming Interface) is an interface that computer programs can use to communicate with other programs. There are several ways for computer programs to start search queries on the website and retrieve the results. In this blog post, we give an example that uses the command curl, which is available on Macs and on computers running Linux.

Suppose you would like to know which words were used in Dutch tweets that contained the word chili (Dutch for the country Chile) and were written on Monday 23 June 2014, the day that the Dutch soccer team played against Chile in the World Cup. On the website you can obtain a word cloud by setting the correct search options and pressing the search button. However, you want a program to start the search process and retrieve the results. The following command will achieve this:

   curl -o output.txt '

(The command should be written on one line.) The command stores its output in the file output.txt. It accesses the search program on the web server with four parameters written in capital letters. SEARCH holds the query, in this case the word chili. DATE states the time frame to search: from 0:00 until 23:59 on 23 June 2014. It contains the start date and time and the end date and time, both in the format year, month, day and hour. SHOWCLOUD and DOWNLOAD specify that the word cloud results are requested and that we want them to be downloaded.
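As a rough illustration of how the four parameters fit together, the following Python sketch assembles a query string. The base address, the hyphen between the start and end time, and the value 1 for SHOWCLOUD and DOWNLOAD are all assumptions made for this example; the post does not state them.

```python
from datetime import datetime
from urllib.parse import urlencode

# Hypothetical placeholder: the real address of the search program is not given here.
BASE_URL = "http://example.org/cgi-bin/search"


def date_range(start: datetime, end: datetime) -> str:
    """Format start and end as year, month, day and hour (YYYYMMDDHH).

    Joining the two values with a hyphen is an assumption of this sketch.
    """
    fmt = "%Y%m%d%H"
    return f"{start.strftime(fmt)}-{end.strftime(fmt)}"


def build_query_url(word: str, start: datetime, end: datetime,
                    base_url: str = BASE_URL) -> str:
    """Build a search URL with the four capitalised parameters."""
    params = {
        "SEARCH": word,
        "DATE": date_range(start, end),
        "SHOWCLOUD": "1",  # assumed flag value
        "DOWNLOAD": "1",   # assumed flag value
    }
    return base_url + "?" + urlencode(params)


if __name__ == "__main__":
    # The whole of 23 June 2014: hour 00 up to and including hour 23.
    print(build_query_url("chili",
                          datetime(2014, 6, 23, 0),
                          datetime(2014, 6, 23, 23)))
```

Using urlencode rather than string concatenation keeps queries with spaces or special characters valid.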

When you run this command, it will finish within a few seconds but will probably not give you any results. The reason is that results can only be presented after the search process on the web server has finished, which might take a few minutes. The first time you run the command, the web server will start the search process but no results will be returned. You may repeat the command as often as you want. No results will be returned until they are available.

This means that you will need to run the same command several times until it generates output. If the output file does not exist already, it will only be created when the search process is finished. If a search process has no results, the output file will just contain the heading for results of these queries: #token #t-score #frequency-token-in-selection #tokens-in-selection #types-in-selection #frequency-token-in-day #tokens-in-day #types-in-day.
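The retry behaviour described above could be automated along these lines. This is a minimal Python sketch, not part of the website: it assumes that an unfinished search returns an empty body, and the function names, the number of attempts and the delay are hypothetical choices. The fetch argument stands for whatever performs the actual request.

```python
import time
from typing import Callable, Optional


def has_results(body: str) -> bool:
    """Return True if the body holds at least one data line besides the '#' heading."""
    lines = [line for line in body.splitlines() if line.strip()]
    return any(not line.startswith("#") for line in lines)


def poll(fetch: Callable[[], str],
         attempts: int = 20,
         delay: float = 30.0,
         sleep: Callable[[float], None] = time.sleep) -> Optional[str]:
    """Call fetch() repeatedly until it returns a non-empty body.

    Returns the body, or None if every attempt came back empty.
    Assumption: an empty body means the search is still running.
    """
    for attempt in range(attempts):
        body = fetch()
        if body.strip():
            return body
        if attempt < attempts - 1:
            sleep(delay)  # wait before repeating the same request
    return None
```

A caller would pass a function that requests the query URL and returns the response text, and then use has_results to distinguish a finished search without matches (heading only) from one with data lines.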

People with a registered account on the website might want to access it automatically with the same access level as when they are logged in. To achieve this, supply a cookie with the command, using the curl option --cookie 'cookie=ReplaceWithCookieValue'. A valid cookie value can be obtained by logging into the website as usual and copying the cookie value from the browser, for example in Firefox: Preferences -> Privacy -> Show Cookies -> drop down menu -> cookie -> Content. Cookies are associated with the client machine they were generated for, so you cannot use them on other machines. The website's cookies do not expire, so you can use the same cookie forever.
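For programs that do not call curl, the same cookie can be sent as a request header. The sketch below uses Python's standard urllib; the function name is hypothetical, and the cookie name cookie is taken from the curl option above.

```python
import urllib.request


def authenticated_request(url: str, cookie_value: str) -> urllib.request.Request:
    """Build a request that carries the login cookie.

    Mirrors: curl --cookie 'cookie=ReplaceWithCookieValue' ...
    """
    return urllib.request.Request(url, headers={"Cookie": f"cookie={cookie_value}"})
```

The returned object can be passed to urllib.request.urlopen to perform the request.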

(Note: the IP address mentioned in this post may change in the future.)


Twiqs in Het Laatste Nieuws

In TwiNL on 21/05/2014 by admin

On Wednesday 21 May 2014, Flemish newspapers Het Laatste Nieuws and De Morgen printed a summary of search results from Twiqs related to the upcoming federal elections in Belgium of 25 May 2014. These include counts of mentions of politicians and parties, and an analysis of the sentiment of the tweets that mention them. The study covers the time frame 1 May 2014 to 16 May 2014, and the full results can be found on this website (in Dutch).

We searched for full names (John Smith) in combination with Twitter handles (@johnsmith), which are also frequently used as unambiguous references to people. For one popular Flemish politician without a Twitter handle, an abbreviation of his name (BDW) is used as a reference, so we searched for this abbreviation as well. We found the highest number of references to the mayor of Antwerp, Bart De Wever (7556), with Flemish prime minister Kris Peeters as runner-up (6576) and former national prime minister Guy Verhofstadt (3443) in third position. Both De Wever and Peeters attracted considerable attention on Twitter during their televised debate of 11 May. There was even more attention for the debate between De Wever and his rival Paul Magnette on 13 May, as can be seen in the attention graphs for De Wever.

The demographics of the people behind the tweets were interesting. Most tweets mentioning politicians were written by men (84%) and by people aged 26 and older (63%). Few of the politicians were mentioned often enough to say something about the geographic spread of their fans. De Wever seems to attract attention from the whole of Flanders, with the exception perhaps of North Limburg, which was more involved in discussing the local politician Wouter Beke.

We also estimated the sentiment of the tweets mentioning politicians. Liesbeth Homans collected the most positive references on Twitter, primarily because of her habit of thanking her followers (gratitude is counted as a positive sentiment by our software). Johan Vande Lanotte appeared at the bottom of this list because he was included in one or two popular retweets which were interpreted as negative by our software.

We also counted surnames, which have the disadvantage that they can refer to many people other than the target politicians. Yet the same names appeared in the top three: De Wever, Verhofstadt and Peeters. For Verhofstadt it is worth mentioning that his name appears in tweets from outside Belgium, perhaps not surprisingly for someone who is running for the position of President of the European Commission.

In the party counts, the most mentioned names were PVDA, N-VA and Groen. However, PVDA is also the name of a Dutch political party and Groen means green in Dutch, so many of the tweets we found here will not refer to the Belgian political parties. The map for PVDA clearly displays this problem: many of the tweets with the party name originate from the Netherlands. Without further analysis it will be difficult to determine accurate counts for these two parties.


Twitter language field

In TwiNL on 09/05/2014 by admin

Today we learned about the existence of a Twitter metadata field “lang” that contains the code of the language that a tweet is written in. The content of the field is estimated by Twitter. Since language identification is important for our website (we want to collect all tweets written in Dutch), we immediately compared our own language identification software with the contents of this new data field. As a test we took the first 1000 tweets of our collection from Thursday 8 May 2014 and checked the languages that were assigned to the tweets. With respect to the distinction between Dutch and non-Dutch, the two systems reached the same conclusion for 914 tweets. For 86 tweets they did not agree. Next we manually inspected these 86 tweets, with the following results:

52 times Twitter assigned the correct language and our software did not
14 times our software assigned the correct language and Twitter did not
20 times the language of the tweet could not be decided
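The counts above can be tallied in a few lines of Python. This is only a worked check of the arithmetic, with hypothetical names; it is not the comparison script that was used for the test.

```python
def summarize(total: int, agree: int, twitter_right: int,
              own_right: int, undecided: int) -> dict:
    """Derive agreement figures from the manual inspection counts."""
    disagree = twitter_right + own_right + undecided
    if agree + disagree != total:
        raise ValueError("counts do not add up to the sample size")
    decidable = twitter_right + own_right  # disagreements with a clear answer
    return {
        "agreement_rate": agree / total,
        "disagreements": disagree,
        "decidable": decidable,
        "twitter_share": twitter_right / decidable,
    }


if __name__ == "__main__":
    stats = summarize(total=1000, agree=914,
                      twitter_right=52, own_right=14, undecided=20)
    for name, value in stats.items():
        print(name, value)
```

The two systems agree on 91.4% of the sample, and among the 66 disagreements where the language could be decided, Twitter was right in 52 cases, roughly 79%.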

From this small test we concluded that Twitter's language identification performs better than our own at identifying Dutch tweets, and that it would not hurt to add a language filter to our tweet crawler. Unfortunately, it turned out that it is not possible to simply ask the Twitter API for all Dutch tweets. A language filter can only be added to another filter like track or follow.

Two steps were required to add the language filter to our standard tweet crawling script. First, the new filter was added on line 110 to the data variable used by the curl call:

my $data = 'language=nl&track='.uri_escape($status);

Next a new block of code was added after line 58 to add the language filter to the signature computation process:

$signature_base .= uri_escape("language") .
        '=' .       
        uri_escape("nl") .

Because of this, our tweet crawler now only returns tweets which are written in Dutch according to Twitter. We will miss all Dutch tweets that Twitter does not recognize as Dutch, but we will gain the tweets that our language identification software did not identify as Dutch. The biggest bonus of this new filter is that we download fewer tweets from Twitter: about 50% fewer. Because of this, the crawler is less likely to hit the maximum download numbers set by Twitter, which should mean that we can collect more tweets during peak hours.
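The reason the signature computation had to change as well is that OAuth 1.0 signs every request parameter: each name and value is percent-encoded, the pairs are sorted, and the result is folded into the signature base string. The Python sketch below illustrates that general rule. It is not the crawler's Perl code, and the function names and the endpoint address in the usage example are placeholders.

```python
from urllib.parse import quote


def percent_encode(value: str) -> str:
    """Percent-encode everything except RFC 3986 unreserved characters."""
    return quote(value, safe="")


def signature_base(method: str, url: str, params: dict) -> str:
    """Build an OAuth 1.0 signature base string.

    Parameters are encoded, sorted by name, joined as name=value pairs,
    and the joined string is encoded once more.
    """
    pairs = sorted((percent_encode(k), percent_encode(v)) for k, v in params.items())
    normalized = "&".join(f"{name}={value}" for name, value in pairs)
    return "&".join([method.upper(), percent_encode(url), percent_encode(normalized)])


if __name__ == "__main__":
    # Placeholder endpoint and nonce, for illustration only.
    print(signature_base("POST", "https://stream.example.org/filter.json",
                         {"track": "de wever", "language": "nl", "oauth_nonce": "abc"}))
```

Because the pairs are sorted, language lands before the oauth_ parameters and track, so a request that sends language=nl without including it in the base string would produce a signature the server rejects.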

Thanks Peter Kleiweg for the language field tip!