Evaluate if the Quota System is providing meaningful value #11902
I tried to summarize the state of the world fairly objectively above, so here are some of my personal thoughts. I haven't been involved in the quota requests, so I may lack context on some of this! However, I tried to find as much information as possible to fill in the gaps. What I do know is that several people have made reasonable points about how the current system isn't serving them well as users, and in my digging it appeared that it wasn't serving us well either.

I think there is some kind of value in requiring maintainers of large projects to explicitly request larger quotas. One of the things that struck me is that a non-zero number of people in the issues I reviewed were able to reduce their file size simply by being asked to, or by being presented with additional options that they hadn't thought about. However, I worry that our current implementation of both the mechanisms and the process for handling these requests is creating friction with users and forcing PyPI's volunteers to spend extra effort on tedium, when that time could likely be better spent elsewhere.

In Stop Allowing Deleting Things from PyPI? a very reasonable discussion was had about the role that PyPI plays in the ecosystem. In that thread, it seemed to me like most people were in favor of PyPI offering some kind of restriction on deletions, though the devil is in the details 1, and I believe (though I could be wrong) that all of us who work on PyPI generally think the ideal case is that files are left on PyPI indefinitely to act as some kind of archive. However, one of the common suggestions offered to (and a common strategy employed by) projects consuming a large amount of space is to go back and delete previous releases to free up extra space inside their quota, something which flies in the face of both what we (I believe) think is the ideal and the general tone of that deletion thread. There is a balancing act here: deleting pre-releases feels relatively fine to me, but deleting actual releases feels kind of bad. It also feels like a solution that doesn't scale; if we want PyPI to act, in part, as an archive, then by its nature it is going to continue to grow with each new release of any project hosted on it, even if the community itself doesn't grow 2.

One of the major improvements to Python packaging over the years has been the addition of binary wheels and what that has meant for end users, who can install things without having to spend long periods of time setting up compiler toolchains and compiling software. Unfortunately, binaries are often larger than source releases, and the more platforms a project tries to provide wheels for, the quicker it uses up its quota. We know that this is already causing some projects to opt not to ship wheels for some platforms, to avoid that problem.

My memory could be wrong, but as I recall one of (if not the) major drivers for putting quotas in place was what it means for mirrors like bandersnatch that want to mirror a complete copy of PyPI. Unfortunately, even with the quota system, the total size of PyPI has grown to 12TB, pushing a full mirror into consuming most of even the largest drives available to consumers. With the growth of the community and the way the ecosystem is shifting, it doesn't feel sustainable to me to treat a full mirror on a single, reasonably sized drive as a reasonable goal anymore.
Maybe the right solution is less about trying to constrain PyPI's storage and more about, at least as far as bandersnatch is concerned, providing defaults that limit the mirror to the set of packages that people actually use 3.

I do think it would be worthwhile to figure out changes we can make to Warehouse and/or the process to reduce this friction and reduce the amount of time our volunteers spend dealing with these issues. A half-baked idea of what that could look like:
Ultimately, I think the current system appears to be wasting a non-trivial amount of everyone's time, given how manual the process is and how many of the limit requests appear to be relatively simple rubber stamps; it's only in the egregious edge cases that we really need to get involved. I think the largest benefit of the current system is that projects can't just accidentally take up tons of room: whenever they start to take up a lot, there is a built-in amount of push back to guide them towards reducing their file size, when without it they might not even think about it or notice at all. I also think that while it's good for us to push people to remove old pre-releases to take up less space, if people feel the need to delete old actual releases, that's a sign that our policies and tooling are possibly not calibrated correctly for the modern landscape, and we should figure out how to adapt to the way the ecosystem has changed. The half-baked proposal above seems pretty reasonable to me; it:
It would be interesting to hear what people think!

Footnotes
Thanks for the detailed write-up @dstufft!
It's worth expanding a bit on what a "pre-release" means. Alpha/beta/RC releases are one thing: they belong on PyPI for wider user testing, and can be cleaned up. However, nightlies are a completely different thing to me. Given that we (scientific Python projects) are aware of the PyPI space constraints and want to be sensitive to those and not take up a ton of space, we are putting nightlies in a separate wheelhouse that we maintain ourselves and clean up regularly: https://anaconda.org/scipy-wheels-nightly/

On the other hand, the top four space consumers on https://pypi.org/stats/ are all packages that push nightlies to PyPI. That has always seemed to me to be a bit of an abuse of the service PyPI offers. And in case nightlies are okay to put on PyPI, I suspect a better interface for cleaning them up would be quite useful. Manually clicking a button and typing a name to verify you're deleting the right thing does not exactly encourage one to clean up.
Agreed in principle. Please do keep in mind that there's no strict corporate vs. community project division. As an example: CuPy is one of the larger consumers of space; it's a project driven by Preferred Networks (a small company), but mostly as a service to the community (it's "NumPy on a GPU") and they're very community-oriented (and may transfer the project to NumFOCUS at some point in the near future).
I do have to say that this has been useful to me at least once; a co-maintainer asked for a size increase for SciPy, and that alerted me to the fact that we were bundling unwanted content into wheels for a particular release.
That all does sound very reasonable and useful to me.
Something that was surfaced in the discussion around deletions was a concern that the quota system on PyPI, as it is currently implemented, is causing a less than ideal experience for both authors and users of PyPI. I've also gone back and read previous discussions or posts like What to do about GPUs? (and the built distributions that support them).
The problems from the maintainer side that I have seen surfaced:
Just to make sure that everyone is on the same page, the background of how file hosting/quotas has evolved on PyPI is roughly:
- To be installable via `pip install ...` by default, projects are required to upload to PyPI unless they want to require their users to configure an additional repository.
- A `/stats/` route was added that showed the top N packages and how much storage they consume.

That brings us to where we are today.
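For anyone who wants to poke at the current numbers, the `/stats/` route mentioned above can also be fetched programmatically. Here's a minimal sketch, assuming the route returns JSON when asked for it and that the response contains `total_packages_size` and `top_packages` fields (those field names are my assumption, not something stated in this issue):

```python
# Sketch: query PyPI's /stats/ route for the total size and the largest projects.
# Assumes the route serves JSON when requested with an Accept header, and that
# the payload has "total_packages_size" and "top_packages" fields.
import json
import urllib.request

req = urllib.request.Request(
    "https://pypi.org/stats/",
    headers={"Accept": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    stats = json.load(resp)

total_tb = stats["total_packages_size"] / 1024**4
print(f"Total size of all packages: {total_tb:.1f} TiB")

# Largest projects by storage used, biggest first.
top = sorted(
    stats["top_packages"].items(),
    key=lambda item: item[1]["size"],
    reverse=True,
)
for name, info in top[:10]:
    print(f"{name}: {info['size'] / 1024**3:.1f} GiB")
```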
I don't have really good information about how large PyPI has grown over time, other than that we're currently at 12TB and in 2018 we were at "> 2TB"; the per-project quotas were implemented in 2020. It was mentioned in a comment on Nov 13, 2019 that PyPI was at 6.5TB at that point.
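For a rough sense of what those data points imply, here is a back-of-the-envelope growth calculation. The elapsed-time figures are my assumptions (roughly 1.5 years between the 2018 and Nov 2019 numbers, and roughly 2.5 years between Nov 2019 and "currently"), so treat the output as illustrative only:

```python
# Back-of-the-envelope: implied compound annual growth of PyPI's total size.
# The size figures come from the text above; the elapsed times are assumptions.
def annual_growth(start_tb: float, end_tb: float, years: float) -> float:
    """Compound annual growth rate between two size measurements."""
    return (end_tb / start_tb) ** (1 / years) - 1

# "> 2TB" in 2018 -> 6.5TB in Nov 2019: assume ~1.5 years elapsed.
print(f"2018 -> 2019: ~{annual_growth(2.0, 6.5, 1.5):.0%} per year")

# 6.5TB in Nov 2019 -> 12TB "currently": assume ~2.5 years elapsed.
rate = annual_growth(6.5, 12.0, 2.5)
print(f"2019 -> now:  ~{rate:.0%} per year")

# If the recent rate held, a full mirror would need roughly this much space
# in another three years.
print(f"Projected size in 3 years: ~{12.0 * (1 + rate) ** 3:.0f} TB")
```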
Picking 10GB as our default project quota in PyPI was done with this comment:
At the time, there were 73 total projects at >= 10GB that were grandfathered in.
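For a sense of how far the 10GB default goes for a project that ships binary wheels for many platforms (a point raised elsewhere in this thread), here is a purely illustrative calculation; every number in it is invented for the example rather than taken from any real project:

```python
# Rough illustration: how quickly a wheel-heavy project exhausts a 10GB quota.
# All numbers below are invented for the example.
wheel_size_mb = 40       # one binary wheel
python_versions = 4      # e.g. four supported CPython minor versions
platforms = 5            # e.g. two Linux arches, two macOS arches, Windows
sdist_mb = 10            # source distribution

per_release_mb = wheel_size_mb * python_versions * platforms + sdist_mb
quota_gb = 10
releases_until_full = (quota_gb * 1024) // per_release_mb

print(f"Upload per release: ~{per_release_mb} MB")
print(f"Releases before hitting the {quota_gb}GB default quota: ~{releases_until_full}")
```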
Currently our process for people to ask for increased limits is to have them post a ticket on https://github.com/pypa/pypi-support, and one of the PyPI team will come around and look into it.
I went ahead and did some looking at those requests, and what I found was:
(the `download()` method).

That's a lot of information there, but ultimately the questions for this issue are:
Footnotes
This kind of flies in the face of how we typically expect PyPI to be used, as a stable archive of artifacts with deletions being rare. ↩
This directly hurts the consumers of Python packages, as they lose out on the ability to install from wheels on those platforms. ↩
Obviously this is due to the fact PyPI has no staff available to process these requests, relying on when volunteers are able/willing to do pretty tedious work going through issues. ↩
This was ultimately reverted, then reworked, then had more changes to it over the years, but this was the initial PR to add it. ↩
Since per project limits weren't added until 2020, that should mean that all of our project quota requests ended up here. ↩
Categorizing this was kind of lossy, I had to go through all of those issues manually and skim through them, so there very well might have been some miscategorizations in my tally. ↩
This feels kind of like approving the limit in spirit? If a project wants a single 20GB limit, that doesn't feel materially different to me than splitting the project into two, with two 10GB limits. ↩