
PyPi downloads badge showing "no longer available" #1671

Closed
niccokunzmann opened this issue May 3, 2018 · 13 comments · Fixed by #2131
Labels
service-badge New or updated service badge

Comments

@niccokunzmann
Contributor

Currently, the PyPI downloads badge looks like this:
https://img.shields.io/badge/downloads-no%20longer%20available-lightgray.svg
But I would like to see the actual download counts in some way.

Examples of the affected PyPI badges:

  • https://img.shields.io/pypi/dm/Django.svg
  • https://img.shields.io/pypi/dw/Django.svg
  • https://img.shields.io/pypi/dd/Django.svg

But these badges work:

  • PyPI - License https://img.shields.io/pypi/l/Django.svg

Source code:

@niccokunzmann niccokunzmann changed the title Pypi Downloads not available PyPi Downloads not available May 3, 2018
@niccokunzmann
Contributor Author

niccokunzmann commented May 3, 2018

Searching for the figures in https://pypi.org/pypi/crc8/json, I see no way to get them: they are all set to -1 ("downloads": -1).

It looks like this is due to the new PyPI, which does not serve download counts yet.
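
For reference, a minimal check against that endpoint (a sketch assuming the `requests` library; where exactly the counters sit in the payload is my reading of the JSON quoted above):

```python
# Minimal sketch: inspect the download counters the PyPI JSON API returns.
# Assumes the `requests` library; the "info.downloads" location is an
# assumption based on the payload described in this comment.
import requests

data = requests.get("https://pypi.org/pypi/crc8/json", timeout=10).json()
print(data["info"]["downloads"])
# At the time of this issue this printed something like:
# {'last_day': -1, 'last_month': -1, 'last_week': -1}
```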

@niccokunzmann
Contributor Author

It seems like this issue can be taken on once this issue is resolved: pypi/warehouse#699

@niccokunzmann niccokunzmann changed the title PyPi Downloads not available PyPi downloads badge showing "no longer available" May 3, 2018
@ale5000-git

Download statistics on PyPI should now be retrieved using Google's BigQuery.

Is it possible to add them?

@chris48s
Member

chris48s commented May 4, 2018

PyPI currently doesn't serve download stats via its API. See #716

It is unclear if they'll ever add it back, but as noted in the linked issue, it isn't something the core Warehouse team is working on.

I think @espadrine has already looked at the possibility of using BigQuery and decided against it. I'd assume the same limitation applies now.

@chris48s chris48s added the service-badge New or updated service badge label May 4, 2018
@paulmelnikow
Member

I'd be happy to incubate a new project in the badges org, hosted separately, that would run a nightly job to fetch the BigQuery data, perhaps push it all to S3 or Google Cloud Storage, and if necessary provide an API. Something like that would be a great service to the community.

@chris48s
Member

If you wanted to take that on as a 'microservice' (i.e. outside the shields codebase and hence not constrained to using JavaScript), I think the most developed wrapper for working with that data is https://github.com/ofek/pypinfo

Maybe a small JSON API which queries on demand but caches the result on S3 (which can handle the invalidation itself with expiration dates/rules) might be a good approach. Caching on demand might be easier than trying to process the entire Python package registry every day. It would save fetching a bunch of stuff you don't need.

It's all fun and games until loads of people start using it, though. There is a reason PyPA don't host this themselves anymore.
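
To make that concrete, here is a rough sketch of the on-demand approach (in Python rather than JavaScript, since the whole point is that it would live outside the shields codebase). The BigQuery dataset name, the S3 bucket, and the key layout are illustrative assumptions, not an actual shields setup:

```python
# Sketch: query BigQuery for one package's 30-day download count and cache the
# result on S3, letting an S3 lifecycle/expiration rule handle invalidation.
# Dataset, bucket and key names below are assumptions for illustration only.
import json

import boto3
from google.cloud import bigquery

BUCKET = "pypi-download-counts"  # hypothetical bucket

def monthly_downloads(package: str) -> int:
    client = bigquery.Client()
    query = """
        SELECT COUNT(*) AS downloads
        FROM `bigquery-public-data.pypi.file_downloads`
        WHERE file.project = @package
          AND DATE(timestamp) >= DATE_SUB(CURRENT_DATE(), INTERVAL 30 DAY)
    """
    job_config = bigquery.QueryJobConfig(
        query_parameters=[bigquery.ScalarQueryParameter("package", "STRING", package)]
    )
    rows = client.query(query, job_config=job_config).result()
    return next(iter(rows)).downloads

def cached_monthly_downloads(package: str) -> int:
    s3 = boto3.client("s3")
    key = f"dm/{package}.json"
    try:
        # Serve from the cache while the object has not yet been expired by S3.
        body = s3.get_object(Bucket=BUCKET, Key=key)["Body"].read()
        return json.loads(body)["downloads"]
    except s3.exceptions.NoSuchKey:
        count = monthly_downloads(package)
        s3.put_object(Bucket=BUCKET, Key=key, Body=json.dumps({"downloads": count}))
        return count
```

An S3 lifecycle rule on the `dm/` prefix (say, expire after one day) would then bound both staleness and BigQuery spend.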

@paulmelnikow
Member

Hmm, yea, these BigQuery charges seem like they would add up quickly.

It's a good point that the vast majority of the data wouldn't be used.

I wonder what a minimal version of this would be. Do people use the monthly, weekly, yearly, or total badges the most? Or which did they use, back when the badges worked? I wish we had per-badge stats.

Based on what pypinfo is outputting, it seems like fetching a single package's data for one day costs the same as all the packages' data for that day. (Which is roughly $.01.) It makes sense not to put resources into processing counts for packages nobody is interested in, though I don't think hitting BigQuery on demand is going to work…

Fetching all the data needed to support yearly, monthly, and daily queries costs about $.50 and is about 22 MB. Those could be refreshed once a day and stored on S3 or Google Cloud ($15/month). The application could snag those files when they change, and put the results into an in-memory database. Seems like something like the Zeit Now OSS plan could handle the load on the order of Shields' requests.
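
On the consumer side, "snag those files and put the results into an in-memory database" could be as small as this sketch (hypothetical bucket, key, and CSV layout of package,downloads pairs):

```python
# Sketch: load a hypothetical daily CSV dump of per-package monthly totals
# from S3 into an in-memory dict at startup. Bucket/key names are made up.
import csv
import io

import boto3

def load_download_counts(bucket: str = "pypi-download-counts",
                         key: str = "monthly-totals.csv") -> dict:
    s3 = boto3.client("s3")
    body = s3.get_object(Bucket=bucket, Key=key)["Body"].read().decode("utf-8")
    return {package: int(downloads) for package, downloads in csv.reader(io.StringIO(body))}

counts = load_download_counts()
print(counts.get("django", 0))
```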

@espadrine
Member

> is about 22 MB. Those could be refreshed once a day and stored on S3 or Google Cloud

22 MB? That's smaller than I thought.

With the <$4/mo VPS we use, we have 2GB of RAM and 20 GB of SSD, so we'd have plenty of space to keep that data in RAM and dump it to SSD for reboot persistence.

So we could have that service for $20/mo.

I expect most people not to care about the per-day figure, so we can probably stay at roughly $5/mo. Can we afford that?

> I wish we had per-badge stats.

(We have an awkward per-badge statistic through the rate limit monitoring API: https://github.com/badges/shields/blob/master/lib/sys/monitor.js#L68.)


We could also have fun with memcached-like systems :) To be honest, I started working on jsonsync because I wanted to synchronize the https://img.shields.io/$analytics/v1 endpoint.

@paulmelnikow
Member

It's nice to talk with you about this stuff!


An aside about RAM: if we do have RAM to spare, I would like to use it to bump up the request-handler cache. That's one of the low-hanging fruits when it comes to performance boosting. We've observed this based on the home page badges and the frequency at which their corresponding API calls are triggered. Since these badges are rendered all the time, they should stay in the cache until expiration, yet they seem to be evicted after minutes.

https://github.com/badges/shields/blob/master/lib/request-handler.js#L28

If I recall, you had reduced this to avoid OOM conditions, though it would be great to crank it up by 5–10x.

For that matter, since you mention the SSD: a persistent backing seems like a great candidate for this kind of caching! The entries are precious, and we don't care about the difference between microseconds and milliseconds. I was just looking at some already-tested key–value stores that can be backed by disk or other cloud storage, for another project.
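
As a shape for that idea, here is a minimal disk-backed key-value cache sketch using only Python's standard-library sqlite3; it illustrates the get/set-with-persistence pattern, not the store shields would actually adopt (RocksDB comes up below):

```python
# Minimal sketch of a disk-backed key-value cache (standard-library sqlite3).
# Illustrates the pattern only; not a claim about shields' actual cache.
import sqlite3

class DiskCache:
    def __init__(self, path: str = "badge-cache.sqlite3"):
        self.conn = sqlite3.connect(path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS cache (key TEXT PRIMARY KEY, value TEXT)"
        )

    def get(self, key: str):
        row = self.conn.execute(
            "SELECT value FROM cache WHERE key = ?", (key,)
        ).fetchone()
        return row[0] if row else None

    def set(self, key: str, value: str) -> None:
        self.conn.execute(
            "INSERT OR REPLACE INTO cache (key, value) VALUES (?, ?)", (key, value)
        )
        self.conn.commit()
```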

Preferably we'd sync this across machines, which would give us another 3x boost. Would rsync be a candidate for that?

Aha! So let's hold off on the RAM change and switch that over to a persistent on-disk cache, instead. I'll open a new issue. Curious your thoughts on syncing…


Okay, back to Python. If it seems safe to keep it in memory, let's do that! If setting up rsync is easy, let's use SSD as the backing. Otherwise I would rather use cloud storage, which is still fast enough to load on startup, and would trivially sync across machines. It also means the BigQuery refresh job could trivially run on another cloud provider, which means more people can have access.

Yea, we can afford $5–20. We've gotten some good-sized contributions, and have an expectation of cash flow and also some runway.

I don't think we've ever had daily stats. Weekly on up. I think daily refresh would be nice for reducing latency in the weekly totals and being more transparent. If we update at midnight UTC, our numbers would be easy for anyone to match using their own query.

@paulmelnikow
Member

A CSV with the project name and the monthly download total is 2.2 MB. That's an admittedly dense format, though. Probably it would take up more space as an ES6 Map. Any idea how to measure that?

@espadrine
Member

> If I recall, you had reduced this to avoid OOM conditions, though it would be great to crank it up by 5–10x.

/me shivers

> For that matter, since you mention the SSD: a persistent backing seems like a great candidate for this kind of caching!

Sounds good! In this area, and for SSDs, Facebook's RocksDB is the go-to choice. Facebook uses it as its MySQL backing store, but so do Ceph, CockroachDB, TiDB…

> Preferably we'd sync this across machines, which would give us another 3x boost. Would rsync be a candidate for that?

If I understand rsync correctly, its rolling checksum will flag all the file blocks as changed and cause a whole-file transmission, so it will be equivalent to an rcp (except in the rare case where nothing changed).

Which is fine. We can have a leader that downloads the updates on a clock and batches the writes to its local key-value store. It then sends the data to its followers, which do the same. Each server already knows everyone else's IP address (they are part of secrets.json), and the first server is a natural leader (it is s0, a Canadian server).

@paternal

paternal commented Sep 7, 2018

pypistats.org provides such information (daily, weekly and monthly downloads) via a JSON API (no need to fetch data from Google BigQuery).
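
For example (the /recent endpoint and response shape here are what pypistats.org documents; treat the details as an assumption rather than shields' final integration):

```python
# Quick check of the pypistats.org JSON API mentioned above.
import requests

resp = requests.get("https://pypistats.org/api/packages/django/recent", timeout=10)
print(resp.json()["data"])
# Expected shape: {'last_day': ..., 'last_week': ..., 'last_month': ...}
```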

@chris48s
Member

chris48s commented Sep 7, 2018

Thanks for posting that @paternal - looks like exactly what we need :)
