Urban Tweeters

This is a project based on Quil and Unfolding for Processing for visualizing urban multilingualism through Twitter data. The main goal of this project is to collect a Twitter corpus that provides detailed georeference and language information for tweets located in urban scenarios. Parallel to the dataset, we are releasing an application that allows users to visualize the data on a map in different ways. The novelty of this dataset lies in the fact that data collection was restricted to four selected urban scenarios. These scenarios are, in alphabetical order, Amsterdam, Antwerp, Berlin and Brussels.

Data

General dataset

The general dataset consists of a collection of tweets retrieved directly from the Twitter Streaming API since December 2014. Given the nature of the dataset, only tweets with geolocation information were collected. According to Leetaru et al. 2012, only 1.6% of the Twitter stream actually ships with geolocation information. This is a heavy constraint on top of the general Twitter streaming limit of ~1%. Currently, the dataset has grown above 2 million tweets, distributed approximately as follows (7/6/2015).

| City      | Number of tweets |
|-----------|------------------|
| Amsterdam | 679205           |
| Antwerp   | 415813           |
| Berlin    | 691998           |
| Brussels  | 497667           |

Berlin dataset

For the Berlin dataset, non-exhaustive bot detection was semi-manually performed with the aid of Bot or Not? and a set of heuristics based on profile information. A preselection of candidates was done by sorting ids by (i) total number of tweets in the database and (ii) total number of statuses. The rationale behind this strategy is twofold:

  • First, bots are known to have a more productive tweeting behaviour than humans (Chu et al. 2010).
  • Secondly, bots are known to have a more evenly distributed tweeting behaviour across time than humans. That means that in periods of the week of less human tweeting activity (night and weekends), proportionally more bot-authored tweets will be captured by the stream.
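The preselection step described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the record keys `user_id` and `statuses_count` and the function name `preselect_candidates` are assumptions made for the example.

```python
from collections import Counter

def preselect_candidates(tweets, top_n=3):
    """Rank user ids for manual bot review.

    `tweets` is an iterable of dicts with the hypothetical keys
    'user_id' and 'statuses_count' (the profile's total number of statuses).
    """
    # (i) number of tweets per user found in the local database
    in_db = Counter(t["user_id"] for t in tweets)
    # (ii) largest profile-level status count seen for each user
    statuses = {}
    for t in tweets:
        uid = t["user_id"]
        statuses[uid] = max(statuses.get(uid, 0), t["statuses_count"])
    # sort by (i) first and break ties with (ii), most active users first
    ranked = sorted(in_db, key=lambda uid: (in_db[uid], statuses[uid]), reverse=True)
    return ranked[:top_n]
```

The ids returned this way are only candidates; they would still need to be checked by hand or with a bot-detection service.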

Dataset expansion

Once a sufficient number of known users had been collected, a parallel tweet collection method was applied. This consisted of selectively retrieving tweets for the known ids, using the RESTful API for mining user timelines.

Release

Currently there are datasets available for Berlin, Brussels, Antwerp and Amsterdam. Due to the Twitter API Terms of Service, only the tweet ids and the output of the language identification postprocessing are released. To fetch the data, run this on your command line:

git clone https://github.com/emanjavacas/urban_dataset

You can use the REST API of Twitter to retrieve the actual tweets and locations. We plan to publish a script that does precisely this for you.
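Until such a script is available, retrieving the tweets usually starts by splitting the released ids into batches, since lookup endpoints accept a limited number of ids per request. The sketch below only does the batching and makes no network calls. The name `batch_ids` is hypothetical, and the default of 100 ids per request is an assumption that should be checked against the current Twitter API documentation.

```python
def batch_ids(tweet_ids, batch_size=100):
    """Yield consecutive lists of at most `batch_size` tweet ids.

    `batch_size` defaults to 100, an assumed per-request limit;
    adjust it to whatever the lookup endpoint you use allows.
    """
    for start in range(0, len(tweet_ids), batch_size):
        yield tweet_ids[start:start + batch_size]
```

Each batch can then be passed to the lookup call of whichever Twitter client library you prefer.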

Visualization tool

./img/antwerp.png

The visualization tool consists of a menu frame and the actual visualization frame. The menu frame allows selecting the specific settings in which the visualization will take place.

Settings

Settings are best navigated using the panel on the left-hand side of the menu.

./img/init_menu.png

Grid files

The visualization doesn’t rely on the actual tweet data (which would be quite computationally heavy) but rather on a so-called grid file. This file is an aggregation of datapoints into cells of a given size. As a result, each cell contains a series of numbers indicating the number of tweets written in each language. A number of precompiled grid files are shipped with the program, and a future version will provide functionality for creating such grid files out of raw Twitter JSON data.
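The aggregation idea can be illustrated with a short sketch. It assumes square cells measured in degrees and input points given as `(lat, lon, lang)` tuples; neither the function name `build_grid` nor this layout is taken from the project, and the shipped grid files may use a different format.

```python
import math
from collections import Counter, defaultdict

def build_grid(points, cell_size):
    """Aggregate (lat, lon, lang) points into cells of `cell_size` degrees.

    Returns a mapping from a (row, col) cell index to a Counter
    of tweets per language in that cell.
    """
    grid = defaultdict(Counter)
    for lat, lon, lang in points:
        # integer cell index: which cell_size-wide band the point falls into
        cell = (math.floor(lat / cell_size), math.floor(lon / cell_size))
        grid[cell][lang] += 1
    return grid
```

With a grid like this, the visualization only has to draw one shape per cell instead of one per tweet.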

A description of each grid file is shown inline at the bottom of the menu frame.

Screen settings

You can also select the width and height of the visualization frame. The two remaining options, loc? and filter, are somewhat more specific. If loc? is activated, the exact coordinates where the mouse is pointing will be shown on the map. filter influences the number of languages that the user can choose from at run time. A higher value will prevent rare languages from appearing in the dropdown list. It doesn’t affect the application behaviour when running in monolingual mode.
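One plausible reading of the filter option is a minimum tweet count per language. The sketch below follows that reading; it is an assumption, not the documented behaviour, and the names `selectable_languages` and `min_tweets` are hypothetical. The grid is taken to be a mapping from cell to per-language counts.

```python
from collections import Counter

def selectable_languages(grid, min_tweets):
    """Return the languages whose total count across all cells reaches `min_tweets`."""
    totals = Counter()
    for counts in grid.values():
        # add this cell's per-language counts to the running totals
        totals.update(counts)
    return sorted(lang for lang, n in totals.items() if n >= min_tweets)
```

Raising `min_tweets` shortens the list, which matches the described effect of hiding rare languages from the dropdown.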

Visualization mode

./img/berlin.png

There are three types of visualization.

Monolingual

Visualization is carried out in heat-map fashion. Color is mapped to the total number of tweets written in a given language: the color ranges from lighter to darker as the number of tweets increases. The ALPHA slider controls the transparency. The RED slider controls the amount of red that is plotted and can be used to shift the color range of the heat map. A third slider, BETA, can be used to highlight and enhance the differences across cells. See section Sigmoid for an explanation (to be done). Additionally, a dropdown list allows the user to select the current language.
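Since the Sigmoid section is not written yet, the following is only a guess at how a slider like BETA could enhance differences across cells: a logistic function applied to the normalized tweet count, with BETA as its steepness. The function names and the centering at 0.5 are assumptions made for illustration.

```python
import math

def sigmoid_contrast(x, beta):
    """Map a normalized value x in [0, 1] through a logistic curve centered at 0.5.

    Larger `beta` pushes values away from 0.5, increasing contrast.
    """
    return 1.0 / (1.0 + math.exp(-beta * (x - 0.5)))

def cell_intensity(count, max_count, beta):
    """Normalize a cell count by the largest cell count, then apply the contrast curve."""
    x = count / max_count if max_count else 0.0
    return sigmoid_contrast(x, beta)
```

Under this assumption, a small BETA keeps cell intensities close together, while a large BETA separates sparse cells from dense ones more sharply.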

Bilingual

The purpose of the bilingual visualization mode is to gain insight into the proportion of one language relative to a second one. Two dropdown lists allow the selection of language one and language two. A set of sliders, similar to those in the monolingual settings, is available. Language one will be mapped to the lighter colour, whereas language two will be displayed darker.
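A relative proportion of this kind can be computed per cell as the share of language one among the tweets in either language. The sketch below assumes that each cell is a mapping from language code to tweet count; the function name `language_share` is hypothetical and the application may compute the value differently.

```python
def language_share(cell_counts, lang_one, lang_two):
    """Share of `lang_one` among tweets in either language, or None for an empty cell."""
    one = cell_counts.get(lang_one, 0)
    two = cell_counts.get(lang_two, 0)
    total = one + two
    # an empty cell has no meaningful proportion
    return one / total if total else None
```

A share near 1 would then correspond to the lighter colour of language one, and a share near 0 to the darker colour of language two.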

Multilingual

In the multilingual setting, a lighter colour is mapped to higher cell values. The meaning of each cell value can be tuned with the option mode, which is available both in the menu frame and at run time in the form of a dropdown list.

Init menu

Once all settings are selected the application can be run by clicking on the init button.

Language detection

Language detection was carried out following [Lui & Baldwin 2014]. They found that a majority-vote approach using langid.py, cld2 and LangDetect consistently outperformed every individual system they considered (see the paper for more information on this).

| Package    | Coverage       | Other            |
|------------|----------------|------------------|
| LDIG       | 17 languages   | Twitter-specific |
| langid.py  | 97 languages   |                  |
| CLD2       | > 80 languages |                  |
| LangDetect | 53 languages   |                  |
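The majority approach mentioned above can be sketched as a simple vote over the labels returned by the individual detectors. The sketch takes the labels as plain strings and does not call any detector. The function name, the strict-majority rule and the `"unknown"` fallback are assumptions for illustration; the exact voting and tie-breaking used in this project are not documented here.

```python
from collections import Counter

def majority_language(predictions, fallback="unknown"):
    """Return the label that a strict majority of detectors agree on, else `fallback`."""
    if not predictions:
        return fallback
    label, votes = Counter(predictions).most_common(1)[0]
    # require more than half of the detectors to agree
    return label if votes > len(predictions) / 2 else fallback
```

For example, with three detectors, two matching labels are enough to decide, and three different labels lead to the fallback value.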

Dependencies

Several libraries were employed. All of them are part of the JVM ecosystem and were assembled into uniform Clojure code by taking advantage of the Java-interop facilities that Clojure offers.

Running the application

*Download the app*

The easiest way to run the application is to download the executable jar. Make sure that you have at least version 7 of the JDK installed by entering this on your command line:

java -version
javac -version

Double-clicking the downloaded file should work; otherwise, try it from the command line:

java -jar path/to/urban-tweeters..jar

If you want to build the app yourself, you are going to need a couple of things:

  • A Clojure installation.
  • The easiest way of running Clojure code is using Leiningen.
  • Unfortunately, some of the dependencies are not available from Clojars and won’t be automatically pulled by Leiningen. The workaround is to use the lein-localrepo plugin.
  • Download the jars for unfolding, controlp5, log4j, json4proc and glgraphics and install them locally following the lein-localrepo instructions.

The application has been reported to run on the vast majority of Mac OS versions and on Windows. More concretely, it has been tested on the following operating systems:

| OS            | Processor             | Memory |
|---------------|-----------------------|--------|
| OS X Yosemite | 2.7 GHz Intel Core i5 | 8 GB   |
| Ubuntu 14.04  | 3.1 GHz Intel Core i5 | 8 GB   |
| Windows 7     | 2.6 GHz Intel Core i5 | 8 GB   |

If you have any trouble running the application, I’d be happy to hear about it through an issue.

Bugs

There is a known bug that affects (at least some) computers running Ubuntu 15.04. The application starts, but any attempt to close the visualization frame results in a core dump, meaning that the frame won’t close. In any case, check that you have a JDK version not older than 7.

Literature

License

Distributed under the GPL.

Copyright © 2015 Enrique Manjavacas

About

A Clojure project for visualising the language of tweets in cities
