Stuff on extracting Twitter corpora and working with them.
Python script for extracting all tweets (really, lines) containing emoji and sorting them into files by emoji.
get-emoji-tweets-fromtext.py [-h] indir outdir
indir
= input directory of files with tweet texts (one tweet per line)
The script creates output files in outdir
, one file per emoji found in the data (and named with that emoji). Each file contains all the tweets from the input which contain that emoji. Note that a tweet with several different emoji will appear in each of the output files.
List of useful German stop words for filtering tweets. Language identification must be used afterwards. Format: ['word', # of all tweets containing the word in sample, # of German tweets containing word in sample, German/all ratio]. Thanks to Nikolas Zoeller, FH Potsdam.
The Twitter corpus can be found in its own repository: https://github.com/TScheffler/GermanTwitterApril2013 (ids only)