tweetvac is a Python package for sucking down tweets from Twitter. It implements Twitter's guidelines for working with timelines so that you don't have to.
tweetvac supports retrospective pulling of tweets from Twitter. For example, it can pull down a large number of tweets by a specific user, or all the tweets from a geographic area that mention a search term. It automatically generates the requests needed to work backward along the timeline.
Install tweetvac using pip:
$ pip install tweetvac
If you clone this repository instead, you need to install twython and its dependencies yourself.
Twitter requires OAuth. tweetvac can store a user's authentication information in a configuration file for reuse.
- Log into Twitter and open https://dev.twitter.com/apps.
- Create a new application. The name needs to be unique across all Twitter apps. A callback is not needed.
- Create an OAuth access token on your application web page.
- Create a file called tweetvac.cfg and format it as follows:
[Auth]
consumer_key = Gx33LSA3IICoqqPoJOp9Q
consumer_secret = 1qkKAljfpQMH9EqDZ8t50hK1HbahYXAUEi2p505umY0
oauth_token = 14574199-4iHhtyGRAeCvVzGpPNz0GLwfYC54ba3sK5uBl4hPe
oauth_token_secret = K80YytdT9FRXEoADlVzJ64HDQEaUMwb37N9NBykCNw5gw
Alternatively, you can pass those four parameters as a tuple in the above order into the TweetVac constructor rather than storing them in a configuration file.
import tweetvac
You can pass the OAuth parameters as a tuple:
vac = tweetvac.TweetVac((consumer_key, consumer_secret, oauth_token, oauth_token_secret))
or use the configuration object:
config = tweetvac.AuthConfig()
vac = tweetvac.TweetVac(config)
tweetvac expects a Twitter endpoint and a dictionary of parameters for that endpoint. Read the Twitter documentation for the list of endpoints and their parameters. It is recommended that you set the count parameter in the params dict to the largest value supported by that endpoint so each request retrieves as many tweets as possible.
params = {'screen_name': 'struckDC', 'count': 200}
data = vac.suck('statuses/user_timeline', params)
The data returned is a list of dicts. The fields in the dict are listed in the Twitter API documentation on the Tweet object.
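For example, you can iterate over the results directly. This sketch assumes only the standard created_at and text fields of the Tweet object:

# Print when each tweet was posted and what it said.
for tweet in data:
    print(tweet['created_at'], tweet['text'])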
The data can be converted back to JSON and written to a file like this:

import json

with open('data.json', 'w') as outfile:
    json.dump(data, outfile)
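To load the saved tweets back later, reuse the json module:

with open('data.json') as infile:
    data = json.load(infile)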
Twitter provides several parameters on each endpoint for selecting what tweets you want to retrieve. Additional culling is available by passing a list of filter functions.
def remove_mention_tweets(tweet):
    # Keep only tweets that do not mention another user.
    return '@' not in tweet['text']
data = vac.suck('statuses/user_timeline', params, filters=[remove_mention_tweets])
Return False from your function to remove the tweet from the list.
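Multiple filters can be applied in a single pass. As a sketch, here is a second filter that drops retweets, relying on the retweeted_status field that Twitter sets on retweets:

def remove_retweets(tweet):
    # Retweets carry a 'retweeted_status' field; keep only original tweets.
    return 'retweeted_status' not in tweet

data = vac.suck('statuses/user_timeline', params,
                filters=[remove_mention_tweets, remove_retweets])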
tweetvac will suck down tweets until you hit your rate limit or consume all the available tweets. To stop sooner, you can pass a cutoff function that returns True when tweetvac should stop.
import time

def stop(tweet):
    # Stop once tweets are older than January 1, 2014.
    cutoff_date = time.strptime("Wed Jan 01 00:00:00 +0000 2014", '%a %b %d %H:%M:%S +0000 %Y')
    tweet_date = time.strptime(tweet['created_at'], '%a %b %d %H:%M:%S +0000 %Y')
    return tweet_date < cutoff_date
data = vac.suck('statuses/user_timeline', params, cutoff=stop)
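A cutoff can key off any field of the tweet. For example, a sketch that stops at a known tweet id, using the standard id field (the id value here is a hypothetical placeholder):

def stop_at_id(tweet):
    # Stop once we reach a tweet at or below a known id boundary.
    return tweet['id'] <= 417000000000000000

data = vac.suck('statuses/user_timeline', params, cutoff=stop_at_id)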
You can also pass a hard limit on the number of requests to stop tweetvac early:
data = vac.suck('statuses/user_timeline', params, max_requests=10)
- statuses/user_timeline - tweets by the specified user.
- statuses/home_timeline - tweets by those followed by the authenticating user.
- statuses/mentions_timeline - tweets mentioning the authenticating user.
- statuses/retweets_of_me - tweets that are retweets of the authenticating user.
- search/tweets - tweets matching a search query.
The endpoints have different request rate limits, count limits per request, and total tweet count limits.
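For example, here is a sketch of a geographic search using the q and geocode parameters documented for search/tweets (the query and coordinates are placeholders):

params = {'q': 'cherry blossoms', 'geocode': '38.889,-77.035,5mi', 'count': 100}
data = vac.suck('search/tweets', params, max_requests=5)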