We prepared a dataset from the GH Archive that contains all the events in all GitHub repositories since 2011 in a structured format. We loaded the dataset into ClickHouse, where it contains 3.1 billion records. We redistribute it for research purposes, and it can be downloaded at this direct link. This dataset can help answer almost any question about GitHub that you can imagine.
- Counting stars
- Top repositories by stars
- Distribution of repositories by star count
- The total number of repositories on GitHub
- How has the list of top repositories changed over the years?
- How has the total number of stars changed over time?
- Who are all those people giving stars?
- Repository affinity list
- Finding friends through counting stars
- Affinity by issues and PRs
- Repositories with the most stars over one day
- Repositories with the highest growth YoY
- Repositories with the worst stagnation
- Repositories with the most steady growth over time
- What is the best day of the week to catch a star?
- The total number of users on GitHub
- Stars from heavy GitHub users
- Repositories with the maximum number of pull requests
- Repositories with the maximum number of issues
- Repositories with the most people who have push access
- Repositories with the maximum number of accepted invitations
- Most forked repositories
- Proportions between stars and forks
- Issues with the most comments
- Top commented issues for each of the top repositories
- Commits with the most comments
- The toughest code reviews
- Authors with the most pushes
- Organizations by the number of stars
- Organizations by the number of repositories
- Organizations by the size of community
- Repositories by the amount of modified code
- Repositories by the number of pushes
- Authors with the most code reviews
- Top labels
- The longest repository names
- The shortest repository names
- Repositories with ClickHouse-related comments
- Most popular comments on GitHub
- GitHub roulette