This project compiles a list of major news domains along with their associated X (formerly Twitter) accounts. The repository includes an auto-refreshing job that fetches up-to-date statistics for these X accounts, such as follower count, tweet activity, and engagement metrics. You can find the dataset in news-domains-x.csv.
- The top 100 accounts (by follower count) are updated daily.
- The remaining records are updated in rotation, 300 per day.
The current dataset contains around 4000 accounts collected from multiple sources and will be continuously enriched and updated.
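As a rough illustration of what these figures imply, the sketch below estimates how many days one full pass over the non-priority records takes. The function name is hypothetical, and the input of 4000 is the approximate dataset size quoted above, so the result is an estimate rather than a guarantee.

```python
import math


def days_per_full_cycle(total_accounts: int, priority: int, daily_batch: int) -> int:
    """Days needed to refresh every non-priority record once."""
    # The top `priority` accounts are refreshed daily, so they are excluded here.
    remaining = max(total_accounts - priority, 0)
    return math.ceil(remaining / daily_batch)


# Approximate figures from this README: ~4000 accounts, top 100, 300 per day.
print(days_per_full_cycle(4000, 100, 300))  # 13
```

With these numbers, a record outside the top 100 is refreshed about once every 13 days; adding API tokens would shorten the cycle proportionally.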
This project uses multiple free-tier X API accounts to implement its refresh strategy. Each API account can retrieve data for up to 100 X accounts per day, a limit imposed by the X API.
The project leverages GitHub Actions to automatically update the statistics for tracked X accounts:
- Job 1 (Real-time priority refresh):
  - Updates the top 100 most-followed accounts daily.
- Job 2 & Job 3 & Job 4 (Incremental updates):
  - These jobs run in parallel to process accounts in batches of 100. With 3 tokens currently available, records are updated daily in batches of 300.
  - The progress is tracked using a JSON file (state/progress.json) to ensure no accounts are skipped.
- Reordering and Cleaning:
  - Once the entire list has been processed, it is re-sorted based on the number of followers.
  - Inactive or suspended accounts are removed automatically.
- Commit to GitHub:
  - The updated data is committed back to the repository, ensuring the latest statistics are always available.
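The incremental and cleanup steps above could be sketched roughly as follows. This is a minimal illustration, not the repository's actual code: the `{"offset": ...}` layout of the progress file, the `followers` and `status` field names, and all function names are assumptions made for the example.

```python
import json
from pathlib import Path

BATCH_SIZE = 100  # per-token daily limit described in this README


def load_offset(path: Path) -> int:
    """Return the saved position, or 0 if no progress file exists yet."""
    if path.exists():
        return json.loads(path.read_text())["offset"]  # assumed schema
    return 0


def save_offset(path: Path, offset: int) -> None:
    """Persist the position so the next run resumes where this one stopped."""
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps({"offset": offset}))


def next_batches(records: list, offset: int, tokens: int,
                 batch_size: int = BATCH_SIZE):
    """Split the next tokens * batch_size records into one batch per token.

    Returns (batches, next_offset, cycle_done). The slice stops at the end of
    the list, so the last run of a cycle may yield fewer or smaller batches.
    """
    window = records[offset:offset + tokens * batch_size]
    batches = [window[i:i + batch_size] for i in range(0, len(window), batch_size)]
    reached = offset + len(window)
    cycle_done = reached >= len(records)
    # Reset to 0 once the whole list has been covered.
    return batches, (0 if cycle_done else reached), cycle_done


def resort_and_clean(records: list[dict]) -> list[dict]:
    """Drop accounts not marked active, then sort by followers, highest first."""
    active = [r for r in records if r.get("status") == "active"]  # assumed field
    return sorted(active, key=lambda r: r["followers"], reverse=True)
```

A daily run would call `load_offset`, hand each batch from `next_batches` to one token's job, call `save_offset` with the returned position, and run `resort_and_clean` only when `cycle_done` is true. Storing a single offset keeps the state file small, at the cost of assuming the record order does not change in the middle of a cycle.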
The data currently used in this project has been sourced from the following repositories:
More sources will be added over time.
We welcome contributions to expand the dataset and improve automation workflows. Feel free to submit issues and pull requests.