A Python tool, developed at the Digital Geography Lab, for downloading tweets from Twitter's API v2 with Academic Research credentials.
tweetsearcher is a Python tool designed for downloading tweets from Twitter with Academic Research credentials. It downloads and automatically parses the JSON response from Twitter's API v2, saving it to a pickled dataframe. If you are so inclined, there is another script included which turns the pickled dataframes into GeoPackage files, a format commonly used in GIS.
This work is based on Seija Sirkiä's (Seija's GitHub) work on a collector tool for Twitter's Premium API, but it has been considerably rewritten to work with the current version of Twitter's searchtweets-v2 Python library. It was made partly to promote open tools and open science, but there also seemed to be a need for a ready-made and (somewhat) easy-to-use tool for collecting Twitter data for academic research.
- Python 3
- Willingness to use the command line
- Any terminal application for Linux/MacOS
- PowerShell or Command prompt for Windows
You need to have Python 3 installed, preferably 3.9 or newer. The Anaconda/Miniconda distribution of Python is recommended if you want to use the environment `.yml` file.
Clone this repository with `git clone https://github.com/DigitalGeographyLab/tweetsearcher.git` or download the zip file. When that's ready, we recommend creating a virtual environment from the provided environment file with `conda env create -f tweetsearcher_env.yml`.
Alternatively, set up the environment manually:
- Create an environment with `conda create --name tweetsearcher`
- Activate the environment with `conda activate tweetsearcher`
- Install the packages with `conda install -c conda-forge geopandas` and `pip install searchtweets-v2==1.0.7`
Create a credentials file `.twitter_keys.yaml` with a text editor of your choice, copy-paste the template below, replace the consumer key, secret, and bearer token with your own, and place the file in the repository directory:
```yaml
search_tweets_v2:
  endpoint: https://api.twitter.com/2/tweets/search/all
  consumer_key: <CONSUMER_KEY>
  consumer_secret: <CONSUMER_SECRET>
  bearer_token: <BEARER_TOKEN>
```
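If you want to quickly check that the credentials file is readable, the snippet below (a minimal sketch using the `searchtweets` library that this tool builds on) loads it; it assumes you run it in the repository directory:

```python
# Minimal sketch: verify that searchtweets-v2 can read the credentials file.
# Assumes .twitter_keys.yaml is in the current working directory.
from searchtweets import load_credentials

search_args = load_credentials(filename="./.twitter_keys.yaml",
                               yaml_key="search_tweets_v2",
                               env_overwrite=False)
print(search_args.keys())  # should include e.g. 'endpoint' and 'bearer_token'
```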
Then create the search parameters file `search_config.yaml` with a text editor of your choice, replace the values with the ones you want, and place it in the repository directory. For more specific instructions on how to build a search query, see Twitter's documentation: https://developer.twitter.com/en/docs/twitter-api/tweets/search/integrate/build-a-query
```yaml
search_rules:
  query: (snow OR rain) place_country:FI -is:retweet
  tag: my_weather_test

search_params:
  results_per_call: 500
  max_tweets: 1000000

output_params:
  filename_prefix: my_weather_search
  results_per_file: 1000000
```
For example, the above search config file would search for tweets mentioning snow or rain that have been geotagged in Finland and that are NOT retweets. The time window from which these tweets are collected is defined when giving the command (see below). The parameters would return a maximum of 500 results per call and a maximum of 1 000 000 tweets. The resulting file would be saved with the prefix `my_weather_search`, and one file would contain a maximum of 1 000 000 tweets. If you want to set up a daily collection, remove `start_time` and `end_time` from the config; the script will then collect tweets from yesterday (i.e. the day before the current day).
Then just navigate to the cloned repository directory on your local machine and type:
```bash
python v2_tweets_to_file.py -sd 2020-04-28 -ed 2020-05-29 -o pkl -w 45 -s iterative
```
After a while you should start accumulating pickled dataframes (`.pkl` files), one per date, so if you request a full year you will get 365 files. The `-w` flag sets the wait time in seconds used when the rate limit is reached. The `iterative` style (set with the `-s` flag) is good for queries that return large amounts of tweets per day (e.g. all geotagged tweets within Finland); the resulting files can be combined into one file with the `combine_tweets.py` script. For queries that return only small amounts of tweets per day, use the `bulk` style by typing:
```bash
python v2_tweets_to_file.py -sd 2020-05-27 -ed 2020-05-29 -o pkl -s bulk
```
and you will get just one `.pkl` file. Please note that the `bulk` option is suitable only for queries where there may be very few tweets per day, such as a very specific topic or a few specific accounts. Using the `bulk` option on a larger dataset will quickly hit Twitter's rate limit and you won't get your data.
Output files are by default pickled pandas dataframes (`.pkl`). They can be read into Python with the pandas library for further processing. Saving to `.csv` files is also supported, but fields containing data types like `list` and `dict` objects will be converted to plain text. The flags stand for `-sd` = start date, `-ed` = end date, `-o` = output file format, `-w` = wait time in seconds (only for the `iterative` style), and `-s` = style. The wait time is there to be used if you think you're going to hit Twitter's rate limits when downloading tweets with the `iterative` style, for example when downloading a full year of geotagged tweets from Finland. Please note that the end date itself IS NOT collected: the collection stops at 23:59:59 on the previous date, in the example above on the 28th of May at 23:59:59.
Use the following command to collect all tweets posted by specific users during a specific time period. This requires a csv file with all the user ids in a column named `usr_id`. The chunk flag (`-c`) sets how many users' tweets are saved in one `.csv` or `.pkl` file; the default is 20.
```bash
python timeline_tweets_to_file.py -ul /path/to/list.csv -sd YEAR-MO-DA -ed YEAR-MO-DA -o pkl -op ~/path/to/folder/ -c 50
```
Please note that this uses the `all` endpoint of the Twitter API v2 and not the user `timeline` endpoint, which only allows collecting the 3200 most recent tweets.
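For example, a suitable user id list could be created with pandas like this (a sketch; the user ids are made up):

```python
# Sketch: create the user id list expected by timeline_tweets_to_file.py.
# The user ids below are made up.
import pandas as pd

user_ids = pd.DataFrame({"usr_id": [123456789, 987654321]})
user_ids.to_csv("list.csv", index=False)
```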
This collection method requires a bounding box GeoPackage file generated with the `mmqgis` plugin in QGIS. Please note that the bounding box cannot be larger than 25 miles by 25 miles. Run this collection method with the following command:
```bash
python bbox_tweets_to_file.py -sd YEAR-MO-DA -ed YEAR-MO-DA -w 15 -in 20 -b /path/to/bbox.gpkg -o path/to/results/
```
If you are collecting a longer time period from a popular place (like NYC, London, Sydney etc.), use a larger interval number (`-in`). This makes the collection run faster, hit rate limits less often, and reduces the chance of running out of memory. For instance, a 25 by 25 mile box over a popular place during Twitter's heyday (2014-2017) can easily return more than 150 000 tweets per month.
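If you prefer not to use the `mmqgis` plugin, a simple single-feature bounding box GeoPackage can also be written with geopandas; the sketch below uses placeholder coordinates, and the attribute schema expected by `bbox_tweets_to_file.py` is an assumption, so check the script first:

```python
# Sketch: write a single-feature bounding box GeoPackage with geopandas as an
# alternative to the mmqgis plugin. Coordinates are placeholders and the
# attribute schema expected by bbox_tweets_to_file.py is an assumption.
import geopandas as gpd
from shapely.geometry import box

# Bounding box as (min_lon, min_lat, max_lon, max_lat) in WGS-84
bbox = box(24.80, 60.10, 25.10, 60.30)
gdf = gpd.GeoDataFrame({"name": ["my_area"]}, geometry=[bbox], crs="EPSG:4326")
gdf.to_file("bbox.gpkg", driver="GPKG")
```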
If you downloaded with the `iterative` style, you might want to combine the pickled dataframes into one big file. You can do this with `combine_tweets.py`. It supports saving to a GeoPackage file (a common spatial file format, like a shapefile), a pickled pandas dataframe, and a plain csv file. Combining tweets from `.csv` files hasn't been implemented yet, as csv files do not retain data types. To combine tweets, run the following command in the directory where you have the `.pkl` files:
```bash
python combine_tweets.py -f gpkg -o my_tweets.gpkg
```
The above command outputs a GeoPackage file `my_tweets.gpkg` in the WGS-84 coordinate reference system (if it contains geotagged tweets), which you can open in QGIS and other GIS software such as ArcGIS. Other supported outputs are `.pkl` and `.csv` files. Combining tweets works only from `.pkl` files.
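The combined GeoPackage can be inspected in Python with geopandas, for example (a minimal sketch using the output file name from the command above):

```python
# Minimal sketch: inspect the combined GeoPackage produced above.
import geopandas as gpd

tweets = gpd.read_file("my_tweets.gpkg")
print(tweets.crs)     # should report EPSG:4326 (WGS-84)
print(tweets.head())
```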
The script does some reshuffling and renaming of the "raw" output JSON, mostly out of necessity (duplicate field names etc.) but partly for convenience (similar fields are placed next to each other). In the output file, individual tweets are joined with the requested expansions (like place, media etc.), unlike in the raw output where the expansions form a separate JSON object. However, for referenced tweets only the referenced tweet id and author id are returned. If there are geotags, the output file indicates whether they are based on GPS coordinates or a bounding box centroid; if both are present, the GPS coordinates are preferred. Please note that the timestamp in the `created_at` field is a UTC timestamp and you may want to convert it to a local time zone if you're doing temporal analysis.
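For example, the conversion can be done with pandas like this (a sketch; the file name and the Europe/Helsinki time zone are just examples):

```python
# Sketch: convert the UTC timestamps to a local time zone for temporal analysis.
# The file name and the Europe/Helsinki time zone are example values.
import pandas as pd

tweets = pd.read_pickle("my_weather_search_2020-04-28.pkl")
tweets["created_at"] = pd.to_datetime(tweets["created_at"], utc=True)
tweets["created_at_local"] = tweets["created_at"].dt.tz_convert("Europe/Helsinki")
```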
The GeoPackage export script will drop some columns containing unparsed `dict` and `list` data types, because they are not supported by the file format.
The csv files use a semicolon (;) as the separator and UTF-8 encoding.
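For example, they can be read back with pandas like this (a sketch; the file name is a made-up example):

```python
# Minimal sketch: read an exported csv file. The file name is a made-up example;
# fields that held list or dict objects are stored as plain text in csv output.
import pandas as pd

tweets = pd.read_csv("my_weather_search.csv", sep=";", encoding="utf-8")
```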
If you're not interested in what bots have to say, you have to do the cleaning up yourself. Checking the `source` field of the tweet and removing all posts from sources that seem bot-like or automated is a simple first step; there are plenty of bots posting weather data, satellite positions, every dictionary word in a given language, and so on. After this initial cleaning, you can use Botometer to do account-specific checking for the rest (the free option has a daily quota of 500 accounts).
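A first-pass filter on the `source` field could look like this (a sketch; the source values are hypothetical examples and the exact column name in your output may differ, so inspect your data first):

```python
# Sketch: a simple first-pass bot filter based on the source field.
# The source values are hypothetical examples; inspect your own data to decide
# what to drop, and check the exact column name in your output files.
import pandas as pd

tweets = pd.read_pickle("my_weather_search_2020-04-28.pkl")
bot_like_sources = ["Example Weather Bot", "Example Satellite Tracker"]
tweets_clean = tweets[~tweets["source"].isin(bot_like_sources)]
```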
This tool is in a very early stage of development and issues can arise when downloading very small datasets; use the `bulk` option for small datasets. The tool has been tested on Linux (specifically Ubuntu 18.04, 20.04, 22.04, and Manjaro 21.0.3) and is confirmed to work on macOS Catalina and newer and on Windows 10.
Please report further issues and/or submit pull requests with fixes.
If you use this script in your research or project, or develop it further, please refer back to us using the following citation:
Väisänen, T., S. Sirkiä, T. Hiippala, O. Järv & T. Toivonen (2021) tweetsearcher: A Python tool for downloading tweets for academic research. DOI: 10.5281/zenodo.4767170
Also in BibTeX:
```bibtex
@software{Vaisanen_etal_2021,
  author    = {Tuomas Väisänen and
               Seija Sirkiä and
               Tuomo Hiippala and
               Olle Järv and
               Tuuli Toivonen},
  title     = {{DigitalGeographyLab/tweetsearcher: A Python tool
               for downloading Tweets using Academic Research
               credentials}},
  year      = 2021,
  publisher = {Zenodo},
  doi       = {10.5281/zenodo.4767170},
  url       = {https://doi.org/10.5281/zenodo.4767170}
}
```
If this tool isn't quite what you're looking for, consider taking a look at this nice tool for collecting Twitter data using Academic Research credentials by Christoph Fink: https://gitlab.com/christoph.fink/twitterhistory