- We have transitioned to using Airtable to hold Twitter collections rather than Notion:
- As part of Arcadia's move to PubPub Platform, much of each pub's metadata will be stored in Airtable to make sure that our internal systems are not duplicating information on PubPub, reducing errors. Generating and storing Twitter collections in Airtable allows them to be pulled into PubPub automatically.
- Airtable's structure is simpler than Notion's, allowing some streamlining of the scripts in this repo.
- `create_collections.yml` can now be triggered as a `repository_dispatch` event, so we don't need to run a cron every five minutes.
- We split the single cron job into multiple cron jobs to better handle Twitter's different rate limits: collection creation (which happens outside the Twitter context) runs every 5 minutes, fetching tweets per collection runs every 2 hours, and fetching quote tweets per tweet runs every 24 hours.
- We have fully moved to a Python-based solution to leverage Tweepy, which is better supported and maintained.
- Going forward, the cron job runs via GitHub actions.
- Twitter decided to significantly change its API functionality and introduced a very steep pricing plan for its usage.
- As of now, Filtered Streams are prohibitively expensive to use (available at the $5K/month plan) and Twitter collections are not available at all.
- Because of this, we removed the reliance on Filtered Streams and rely solely on the search API. Quote tweets are also not used, because of their strict API rate limits and the inability to filter quote tweets with `since_id` (see the sketch after this list).
- Twitter collections are replaced by a comma-separated string field on the Notion database. We use Notion here for its simplicity and its easy-to-edit UI, but it can easily be replaced with your favorite database technology.
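For context on the `since_id` limitation above, here is a minimal sketch of the incremental fetch pattern that the search API supports and the quote tweets endpoint does not. The client setup, query, and ID values are illustrative assumptions; only `since_id` and the comma-separated tweet ID field come from this changelog.

```python
import tweepy

client = tweepy.Client(bearer_token="...")  # assumed setup; see installation below

# The database row stores previously seen tweet IDs as a comma-separated string.
stored = "1234567890123456789,1234567890123456790"  # illustrative values
tweet_ids = stored.split(",")

# since_id tells the search API to return only tweets newer than the most
# recent one we already have; the quote tweets endpoint has no such filter.
newest_id = max(tweet_ids, key=int)
response = client.search_recent_tweets(query="your search terms", since_id=newest_id)

new_ids = [str(tweet.id) for tweet in (response.data or [])]
updated = ",".join(tweet_ids + new_ids)  # written back to the database field
```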
- A lot of discourse around scientific publications happens on Twitter.
- For a given publication, Arcadia Science would like to collect relevant tweets (based on DOI number or URL) and display these on the publication page for broader visibility and engagement.
- Currently, there isn’t an easy way to build a collection of tweets based on a search query (i.e., a DOI number or URL), because the Twitter API was modified in 2018 to deprecate this functionality.
- We'd like to have a collection of tweets specific to each publication embeddable on our publishing platform, PubPub.
- Easily create a new Twitter collection per pub
- Specify what search terms should be used for populating the tweets for the collection
- Automate the process of populating the tweets for the collection using Twitter’s API
- Easily embed the created collection on our publication platform
To fulfill these goals, this repo implements two services:
- Three periodic jobs that set up the Twitter collections and backfill their tweets using the Twitter API. The jobs run via GitHub Actions on the schedules described above.
- A no-code configuration layer that specifies which collections to create and which search terms to use, managed in Airtable.
If you have any questions about the implementation please reach out to mertcelebi.
This repository uses Airtable as the database. The scripts expect the `AIRTABLE_TABLE_ID` environment variable to point to an Airtable table with the following schema:
- `Name`: str, the text that appears at the top of the embeddable collection
- `Description`: str, the name of the pub. Does not support rich text
- `Search`: str, the Twitter search. This consists of the PubPub URL and the DOI URL, separated by a comma
- `URL`: str, the URL of the publishing-tools collection that will be embedded
- `Tweets`: long text, a comma-separated list of tweet IDs
- `ID`: str, a collection ID automatically generated by the `create_collections.py` script. Note that this is distinct from Airtable's internal record ID
- `CREATED_DATE`: str (ISO 8601 formatted date), the date the collection was created
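As a concrete illustration, here is a minimal sketch of reading and updating a table with this schema using pyAirtable, the client library this repo uses (see below). The `AIRTABLE_API_KEY` and `AIRTABLE_BASE_ID` variable names and the pyAirtable >= 2 `Api` interface are assumptions; only `AIRTABLE_TABLE_ID` and the field names are documented above.

```python
import os

from pyairtable import Api  # assumes pyAirtable >= 2

api = Api(os.environ["AIRTABLE_API_KEY"])  # assumed variable name
table = api.table(os.environ["AIRTABLE_BASE_ID"], os.environ["AIRTABLE_TABLE_ID"])

for record in table.all():
    fields = record["fields"]  # follows the schema above
    tweet_ids = [t for t in fields.get("Tweets", "").split(",") if t]
    print(fields.get("Name"), "-", len(tweet_ids), "tweets")

    # Newly found tweet IDs would be appended to the comma-separated
    # "Tweets" field, e.g.:
    # table.update(record["id"], {"Tweets": ",".join(tweet_ids + new_ids)})
```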
Copy the `.env.copy` file to `.env` and fill in the relevant environment variables.
You can get a Twitter bearer token through the Twitter developer dashboard. Note that this repo uses Twitter API v2.
To get the Airtable environment variables, use the web API documentation. Select your base, and navigate to the table where your Twitter collections are stored to gather the table IDs and field IDs.
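For example, the scripts can load these variables with python-dotenv and construct a Tweepy client from the bearer token. This is a sketch: `TWITTER_BEARER_TOKEN` is an assumed variable name, and python-dotenv is an assumed dependency.

```python
import os

import tweepy
from dotenv import load_dotenv  # python-dotenv, assumed dependency

load_dotenv()  # reads the .env file created above

# tweepy.Client talks to Twitter API v2, which this repo uses.
client = tweepy.Client(bearer_token=os.environ["TWITTER_BEARER_TOKEN"])
```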
This repository uses conda to manage the base development software environment.
You can find operating system-specific instructions for installing miniconda here. After installing conda and mamba, run the following commands to create the pipeline run environment. This will create a minimal virtual environment.
```bash
mamba env create -n twitter-collections --file dev.yml
mamba activate twitter-collections
```
Install packages using pip. We use pip in addition to conda so that package management is not duplicated between local development and GitHub Actions.
```bash
pip install -r requirements.txt
```
Once the installation is complete, you can run:
- The collection creation job with `python src/create_collections.py`.
- The tweet fetching job with `python src/fetch_tweets.py`.
- The quote tweet fetching job with `python src/fetch_quote_tweets.py`.
The publishing team uses Airtable to store metadata about pubs and orchestrate the publishing process. Airtable is also used as the source of truth for PubPub Platform. As such, we use Airtable to house the links for standardized embeddable iframes for pubs, such as the Typeform feedback form that appears on all pubs and the Twitter collection that this repository generates and maintains.
We used pyAirtable for simplicity.
If you're curious about the Twitter API module used in this repository, check out Tweepy.
Twitter unfortunately removed the collections functionality in their API v2. So, if you need to use collections, you need access to API v1.1. For this you need "Elevated" API access on Twitter which requires a simple application. Our application took less than 24 hours to get approved.
You can use the search functionality present in API v1.1, but we decided to use API v2 for search since it's more robust and allows more flexibility (direct search vs. Filtered Streams).
The first time a collection is created, this project uses the v2 recent search endpoint to backfill relevant tweets and then, for each tweet, fetches all of its quote tweets. This is suboptimal and inelegant, but sadly the recent search endpoint does not match quote tweets based on the search query. A sketch of this backfill is below.
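This sketch shows the backfill flow with Tweepy under stated assumptions: the bearer token setup, the example search string, and the query construction are illustrative; only the use of the v2 recent search and quote tweets endpoints comes from the text above.

```python
import tweepy

client = tweepy.Client(bearer_token="...")  # assumed setup; see installation above

# The Search field holds the PubPub URL and the DOI URL, comma-separated.
search = "https://doi.org/10.1234/example,https://example.pubpub.org/pub/example"  # illustrative
query = " OR ".join(term.strip() for term in search.split(","))

# Backfill via v2 recent search (covers roughly the last 7 days).
tweets = client.search_recent_tweets(query=query, max_results=100)
tweet_ids = [tweet.id for tweet in (tweets.data or [])]

# Quote tweets must be fetched one tweet at a time, since recent search
# does not match them against the query. This is the slow, rate-limited part.
for tweet_id in list(tweet_ids):
    quotes = client.get_quote_tweets(tweet_id, max_results=100)
    tweet_ids.extend(quote.id for quote in (quotes.data or []))
```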