How often is the dataset updated? #6

Open
ivbeg opened this issue Dec 16, 2021 · 7 comments
Comments

@ivbeg

ivbeg commented Dec 16, 2021

Hi!

The last publication about GitHub Explorer was in December 2020, and I am curious: has the dataset been updated since then?
I regularly research government open-source code. Right now I use the GitHub API directly, but it would be great to use gh-api instead.

Best Regards,
Ivan

@alexey-milovidov
Member

The interactive dataset is updated every 10 minutes.
The data dumps (.xz archives) are not updated.

@alexey-milovidov
Member

The article itself is not updated; maybe we can do an annual update.

@ivbeg
Author

ivbeg commented Dec 17, 2021

Thanks a lot, sounds great! I think it would help other users too if the update schedule were mentioned in the GitHub README.md or on the https://ghe.clickhouse.tech/ website.

As for how I use the dataset: I am working on a relaunch of the Open source government observatory project, https://data.world/ibegtin/open-source-government-project. It is a rating of government openness based on the open-source activity of government agencies. It uses the government.github.com list of government orgs on GitHub, with some additions, and calculates country-level and country-group-level statistics on the amount of published code, activity, forks, stars, active developers, etc.

Since orgs are mapped to countries manually, it has the limitation that some orgs related to a country are not found, but it is still quite useful.

I will try the GitHub Explorer API then and come back with feedback. I will be happy to cooperate if some calculations could be made directly over the GE database.

@alexey-milovidov
Member

I've uploaded the latest dump: https://datasets.clickhouse.com/github_events/tsv/github_events_v3.tsv.xz
It contains the data up to today - now 5.4 billion events, 200 GB.

@ivbeg
Author

ivbeg commented Dec 20, 2022

Great! Thanks a lot!

@craigbox

Hi Alexey,

Is the pipeline still working?
This query stops at Jan 5.

@alexey-milovidov
Member

@craigbox, Hi!

I decided to rewrite the update script in pure SQL for the sake of testing, and the updates were paused for a few weeks.
See #20

I've deployed the new script, so it should continue being updated.
