PyPi downloads badge showing "no longer available" #1671
Comments
Searching for the results in https://pypi.org/pypi/crc8/json, I see no way to get them. They are all set to -1 ("downloads": -1). It looks like this is due to the new PyPI, which does not serve download counts yet.
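To illustrate what the comment above describes, here is a minimal sketch (assuming Node 18+ for the global fetch): the JSON API still responds, but every download counter comes back as -1.

```js
// Sketch: query the PyPI JSON API and print the legacy download counters.
// At the time of this issue they were all hard-coded to -1.
async function checkDownloads(pkg) {
  const resp = await fetch(`https://pypi.org/pypi/${pkg}/json`)
  const { info } = await resp.json()
  console.log(info.downloads) // e.g. { last_day: -1, last_week: -1, last_month: -1 }
}

checkDownloads('crc8')
```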
It seems like this issue can be taken on once pypi/warehouse#699 is resolved.
Download statistics on PyPI should now be retrieved using Google's BigQuery. Is it possible to add them?
PyPI currently doesn't serve download stats via the API; see #716. It is unclear if they'll ever add it back, but as noted in the linked issue, it isn't something the core warehouse team is working on. I think @espadrine has already looked at the possibility of using BigQuery and decided against it. I'd assume the same limitation applies now.
Would be happy to incubate a new project in the badges org, which could be hosted separately, that would run a nightly job that fetches the BigQuery data, perhaps pushes it all to S3 or Google Cloud Storage, and if necessary provides an API. Something like that would be a great service to the community.
If you wanted to take that on as a 'microservice' (i.e. outside the shields codebase and hence not constrained to having to use JavaScript), I think the most developed wrapper for working with that data is https://github.com/ofek/pypinfo. Maybe a small JSON API which can query on demand but cache the result on S3 (which can handle the invalidation itself with expiration dates/rules) might be a good approach. Caching on demand might be easier than trying to process the entire Python package registry every day, and it would save fetching a bunch of stuff you don't need. It's all fun and games until loads of people start using it, though. There is a reason PyPA don't host this themselves anymore.
Hmm, yea, these BigQuery charges seem like they would add up quickly. It's a good point that the vast majority of the data wouldn't be used. I wonder what a minimal version of this would be. Do people use the monthly, weekly, yearly, or total badges the most? Or did they, back when the badges worked? I wish we had per-badge stats. Fetching all the data needed to support yearly, monthly, and daily queries costs about $0.50 and is about 22 MB. Those could be refreshed once a day and stored on S3 or Google Cloud Storage ($15/month). The application could snag those files when they change and put the results into an in-memory database. It seems like something like the Zeit Now OSS plan could handle load on the order of Shields' requests.
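For reference, a rough sketch of the kind of nightly BigQuery job being discussed, using the @google-cloud/bigquery Node client. The bigquery-public-data.pypi.file_downloads dataset and its schema postdate parts of this thread, so treat the table and column names as assumptions rather than what was actually priced above.

```js
// Sketch of a daily job: total downloads per project over the last 30 days.
// Dataset/column names (bigquery-public-data.pypi.file_downloads, file.project,
// timestamp) are assumptions about the public PyPI dataset.
const { BigQuery } = require('@google-cloud/bigquery')

async function monthlyDownloads() {
  const bigquery = new BigQuery() // uses application default credentials
  const query = `
    SELECT file.project AS project, COUNT(*) AS downloads
    FROM \`bigquery-public-data.pypi.file_downloads\`
    WHERE DATE(timestamp) >= DATE_SUB(CURRENT_DATE(), INTERVAL 30 DAY)
    GROUP BY project
  `
  const [rows] = await bigquery.query({ query })
  return rows // e.g. [{ project: 'django', downloads: 1234567 }, ...]
}
```

The output of a job like this is what would be refreshed once a day and pushed to S3 or Cloud Storage.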
22 MB? That's smaller than I thought. With the <$4/mo VPS we use, we have 2 GB of RAM and 20 GB of SSD, so we'd have plenty of space to keep that data in RAM and dump it to SSD for reboot persistence. So we could have that service for $20/mo. I expect most people not to care about the per-day figure, so we can probably stay at roughly $5/mo. Can we afford that?
(We have an awkward per-badge statistic through the rate limit monitoring API: https://github.com/badges/shields/blob/master/lib/sys/monitor.js#L68.) We could also have fun with memcached-like systems :) To be honest, I started working on jsonsync because I wanted to synchronize the https://img.shields.io/$analytics/v1 endpoint.
It's nice to talk with you about this stuff! An aside about RAM: if we do have RAM to spare, I would like to use it to bump up the request-handler cache. That's one of the low-hanging fruits when it comes to performance boosting. We've observed this based on the home page badges and the frequency at which their corresponding API calls are triggered. Since these badges are rendered all the time, they should stay in the cache until expiration, yet they seem to be evicted after minutes. https://github.com/badges/shields/blob/master/lib/request-handler.js#L28 If I recall, you had reduced this to avoid OOM conditions, though it would be great to crank it up by 5–10x.

For that matter, since you mention the SSD: a persistent backing seems like a great candidate for this kind of caching! The entries are precious, and we don't care about the difference between microseconds and milliseconds. I was just looking at some already-tested key–value stores that can be backed by disk or other cloud storage, for another project. Preferably we'd sync this across machines, which would give us another 3x boost. Would rsync be a candidate for that?

Aha! So let's hold off on the RAM change and switch that over to a persistent on-disk cache instead. I'll open a new issue. Curious about your thoughts on syncing…

Okay, back to Python. If it seems safe to keep it in memory, let's do that! If setting up rsync is easy, let's use SSD as the backing. Otherwise I would rather use cloud storage, which is still fast enough to load on startup, and would trivially sync across machines. It also means the BigQuery refresh job could trivially run on another cloud provider, which means more people can have access. Yea, we can afford $5–20. We've gotten some good-sized contributions, and have an expectation of cash flow and also some runway. I don't think we've ever had daily stats. Weekly on up. I think a daily refresh would be nice for reducing latency in the weekly totals and being more transparent. If we update at midnight UTC, our numbers would be easy for anyone to match using their own query.
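As a sketch of the "keep it in RAM, dump to SSD for reboot persistence" idea discussed above (the file path and JSON serialization are illustrative assumptions, not anything in the shields codebase):

```js
// Minimal sketch: hold the download counts in a Map, persist to disk, reload on boot.
const fs = require('fs')

const CACHE_FILE = '/tmp/download-counts.json' // hypothetical location

function loadCache() {
  try {
    // The dump is an array of [project, downloads] pairs, which Map accepts directly.
    return new Map(JSON.parse(fs.readFileSync(CACHE_FILE, 'utf8')))
  } catch (e) {
    return new Map() // first boot, or no dump written yet
  }
}

function dumpCache(cache) {
  // Map -> array of [key, value] pairs survives a JSON round trip.
  fs.writeFileSync(CACHE_FILE, JSON.stringify([...cache]))
}
```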
A CSV with the project name and the monthly download total is 2.2 MB. That's an admittedly dense format, though. Probably it would take up more space as an ES6 Map. Any idea how to measure that?
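One rough way to answer the "how much RAM as an ES6 Map" question is to compare heapUsed before and after building the Map. The numbers are approximate because of GC timing; it works best when run with node --expose-gc.

```js
// Approximate the heap cost of holding the download counts as a Map.
// Run with: node --expose-gc measure.js   (file name is illustrative)
function mapOverhead(entries) {
  if (global.gc) global.gc()
  const before = process.memoryUsage().heapUsed
  const map = new Map(entries) // entries: e.g. [['django', 1234567], ...]
  if (global.gc) global.gc()
  const after = process.memoryUsage().heapUsed
  return { approxBytes: after - before, size: map.size }
}
```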
/me shivers
Sounds good! In this area, and for SSDs, Facebook's RocksDB is the go-to reflex. Facebook uses it as its MySQL backing store, but so do Ceph, CockroachDB, TiDB…
If I understand rsync correctly, its rolling checksum will flag all the file blocks as changed and cause a whole-file transmission, so it will be equivalent to an rcp (except in the rare case where nothing changed). Which is fine. We can have a leader that downloads the updates on a clock and batches the writes to its local key–value store. It sends the data to its followers, which do the same. Each server already knows everyone else's IP address (they are part of secrets.json), and the first server is a natural leader (it is s0, a Canadian server).
pypistats.org provides such information (daily, weekly, and monthly downloads) via a JSON API (no need to fetch data from Google BigQuery).
Thanks for posting that, @paternal. Looks like exactly what we need :)
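For what it's worth, a minimal sketch against the pypistats.org API mentioned above. The /api/packages/<project>/recent path reflects my reading of its documentation and may differ; fetch again assumes Node 18+.

```js
// Sketch: fetch recent download counts for a project from pypistats.org.
async function recentDownloads(pkg) {
  const resp = await fetch(`https://pypistats.org/api/packages/${pkg}/recent`)
  const { data } = await resp.json()
  return data // expected shape: { last_day, last_week, last_month }
}

recentDownloads('django').then(console.log)
```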
Currently, the PyPI downloads badges look like this:
But I would like to see the download numbers in some way.
Examples (PyPI):
https://img.shields.io/pypi/dm/Django.svg
https://img.shields.io/pypi/dw/Django.svg
https://img.shields.io/pypi/dd/Django.svg
But these badges work:
https://img.shields.io/pypi/l/Django.svg
Source code: shields/server.js, line 2250 (commit b126b4e)