metadata resolve workstream #12921
Comments
I personally am excited to see metadata being used (without downloading the full dist) to optimise the […]. When it comes to testing, I wanted to make you aware of […]. Back on the topic of metadata resolve optimisation, the big one which would speed up the no-cache resolve is parallel requests. Is this something that you have explored as a possibility? (FWIW, there is an issue for parallel downloads in #825, which mostly pre-dates PEP 658 metadata.)
@pelson responding to the optimization part first as that's what I have more paged in right now:
Do you have a benchmark of sorts to demonstrate this? Parallel downloads of dists are actually something we have already established a foundation for; see the use of […] in pip/src/pip/_internal/operations/prepare.py (lines 465 to 484 at 858a515).
I avoided addressing that because it would require figuring out how to represent download progress for parallel downloads, which I wasn't sure about, and I also didn't see an incredible performance improvement when I tested it a few years ago in #7819 (but I'm not sure I was doing the right thing). I was hesitant at first, but now that I recall we already have a basis for parallel downloads, it would probably make perfect sense to investigate it. Thanks so much for mentioning it! I'll add it to this issue.
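For context, here is a minimal sketch of the general batching idea, not pip's actual `BatchDownloader` interface (which is used in `pip/src/pip/_internal/operations/prepare.py`); the helper names and worker count are illustrative assumptions. The point is simply that a bounded thread pool lets network latency overlap instead of accumulating per dist.

```python
# Hypothetical sketch only; pip's real BatchDownloader has a different interface.
from concurrent.futures import ThreadPoolExecutor, as_completed
from pathlib import Path
from urllib.parse import urlparse
from urllib.request import urlopen


def download_one(url: str, dest_dir: Path) -> Path:
    """Stream a single archive to disk and return its local path."""
    target = dest_dir / Path(urlparse(url).path).name
    with urlopen(url) as resp, open(target, "wb") as out:
        while chunk := resp.read(64 * 1024):
            out.write(chunk)
    return target


def download_batch(urls: list[str], dest_dir: Path, max_workers: int = 8) -> list[Path]:
    """Download all URLs concurrently; completion order is not guaranteed."""
    dest_dir.mkdir(parents=True, exist_ok=True)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(download_one, url, dest_dir) for url in urls]
        return [future.result() for future in as_completed(futures)]
```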
Thanks so much for linking this!! ^_^ So for e.g. #12208 and #12256, a lot of the testing we're doing here is to cover the variety of ways an index may respond to pip requests, so instead of the conformant behavior, it looks like […] (lines 1175 to 1305 at b06d73b).
However, it was absolutely appropriate for you to mention this, especially in response to e.g. this bit of the issue: […]
I was speaking from my knowledge of Twitter from 2017-2020, which initially served a […]
@pelson thanks so much for mentioning parallel downloads! It ended up being much less effort than I expected: see #12923. It's still a draft because I whipped it up very quickly, but please take a look at that and ask others to review! One part I'm not sure how to implement is the parallel progress bar, especially because I believe we may not know […]
This was kind of my point ☝️ about using a normalized interface for package repositories within […]
Sadly not. Empirically, I expect the depth of a dependency tree to be much smaller than the total number of dependencies, suggesting there is a benefit from parallelism (the same being true for dists, so I don't see why there wasn't a good improvement experienced in the past...). From […]
Wow! Cool! If I understand correctly, this isn't for fetching metadata in parallel though, right? Still, I would expect the parallel download of dists to be a major win overall! 🚀
Very long post working out the logic in my head. Basically, I think metadata caching is a strictly superior approach to parallelism within the metadata resolve process (and I list the assumptions I'm leaning on to justify that), but I do think the 2020 resolver has space for introducing parallelism in its […]. Please feel free not to read; I really appreciated your post here, as it was quite thought-provoking, and it gave me an opportunity to justify this approach further.
That's correct--the current resolver algorithm is backtracking, but still serial. However, with the metadata caching from #12256 and #12257 especially, I would be incredibly surprised if any parallel algorithm could achieve a significant performance improvement. I believe parallel metadata fetching would also require modifying resolution semantics in a backwards-incompatible way, which is something non-pip resolvers are very glad to do but which I would like to avoid if at all possible. Using metadata caching lets us avoid making requests in the first place, which is significantly less complex and performance-intensive than modifying resolution logic to enable parallelism. See #12258, where sufficient caching lets us get down to 3.9 seconds to resolve a very large dependency tree. I would be very interested in prior art that demonstrates a process to fetch metadata in parallel, and the 2020 resolver via […]. A few underlying assumptions of the problem space here: […]
(1) means we can cache the result of extracting metadata persistently and avoid pinging the server ever again for that metadata (this is also why […]). My impression is that parallelism in the resolution process would allow us to overcome slow network i/o with multiple concurrent requests, the way the batched download parallelism in #12923 does. My thought process is that rather than overwhelming PyPI with even more parallel requests, we could instead focus on avoiding those requests in the first place. While it's true that these techniques are orthogonal--we could perform parallel resolution along with caching--caching does not require us to modify our resolver logic (which may e.g. introduce nondeterministic errors), and it also achieves a secondary goal: not pinging PyPI so much! See #12256 for a discussion of PyPI bandwidth costs. Does that argument make sense? I'm also thinking about your framing here: […]
The caching work I'm doing here is all about avoiding those metadata requests in the first place, which I believe is safe and correct given the problem-space assumptions I listed above. While the […]
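To make the caching argument concrete, here is a minimal sketch of an idempotent, hash-keyed metadata cache; the cache directory, key format, and function names are assumptions for illustration, not pip's actual cache layout. Because the same wheel hash always maps to the same metadata, entries never need to expire or be revalidated against the server.

```python
# Sketch under assumed names and layout; not pip's real cache implementation.
import json
from pathlib import Path

CACHE_DIR = Path.home() / ".cache" / "example-metadata-cache"  # hypothetical location


def get_metadata(wheel_sha256: str, extract_metadata) -> dict:
    """Return cached metadata for a wheel, extracting and persisting it only once."""
    entry = CACHE_DIR / f"{wheel_sha256}.json"
    if entry.exists():
        # Cache hit: no network request, no wheel unpacking.
        return json.loads(entry.read_text())
    metadata = extract_metadata()  # expensive: range requests or a full wheel download
    entry.parent.mkdir(parents=True, exist_ok=True)
    entry.write_text(json.dumps(metadata))
    return metadata
```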
@pelson regarding your other point:
I'm sorry! It looks like I misunderstood the purpose of […]. Since pip doesn't need to vendor things it uses for testing, I can see […]. One thing I would definitely love to collaborate on is how these caching changes might interact with […]
The additional caching from #12258 is independent of the server response. Was this response useful to you? I am looking to understand further how we can work together to move toward more efficient and effective standardized package resolution ^_^! Please let me know if I'm missing something.
Also cc @alanbato, who I believe has contributed to PyPI; most of the performance gains from these caching techniques are related to pip's internal bookkeeping as opposed to its interaction with PyPI, but please do let me know if you have any comments on the effectiveness of e.g. #12256 and #12257 in reducing PyPI bandwidth costs as well. @pelson: one way we may be able to use […]
I've converted all open PRs to drafts except for the ones that are ready for review and have no dependencies. If reviewers are interested: […]
I've taken a look at this one, although I had enough questions before I got to the main code change in […]
I apologize--#12208 is actually a truly massive PR, but its modifications are not coupled to other PRs (as it's still behind an experimental flag), and it had undergone multiple rounds of review, so I expected that its usefulness, at least, would not be controversial (whereas I acknowledge the others are much more experimental). I'm quite sorry about that--I really appreciate your input on that one (it addresses a pip performance goal I've had for years, since I was employed at Twitter), but by no means is it an easy PR to review. I meant only that I expect it to eventually be accepted, whereas I'm not sure I've made a fully convincing argument for most of the other PRs in this workstream yet.
Anything I could contribute here to push things forward? We use Dependabot to automate dependency updates, and we also use pip-compile. Since backtracking became the default, figuring out updates takes ages; for our projects it takes multiple hours (up to 10+ hours for some) to figure out the latest version for every dependency. And this is in a case where everything is cached: we use a persistent cache between checks. With the old resolver, it would be done very quickly. I made a ticket at pip-tools: jazzband/pip-tools#2129, but it seems like this has to be improved in pip itself. Especially […]
Yes, if you can report a reproducible example of a set of requirements that take excessively long, as a new issue, I will take a look and profile, and see if upcoming resolution improvements will improve the situation. |
@notatallshaw I don't think it's a single requirement that takes a long time; I think it's a combination of how pip currently caches things and how Dependabot checks for updates. So what Dependabot does is: […]
Because pip currently does not cache the metadata result, it still has to copy the wheel from the cache and extract it, which doesn't take too much time, but it has to do that multiple times (every time it checks a specific dependency). So let's say that one of my dependencies is scikit-learn; Dependabot would run:

pip-compile --build-isolation --allow-unsafe --output-file=requirements.txt -P scikit-learn requirements.in

This takes almost 7 minutes for me. Then Dependabot would execute:

pip-compile --build-isolation --allow-unsafe --output-file=requirements.txt requirements.in

This takes about 8.5 minutes. I'm checking with Dependabot whether we could potentially skip that second call if no updates were made; it would still take a long time of course, but it could cut off a big chunk of the runtime. And it would do all this for every dependency in the requirements.txt file. This project has a total dependency count of 138, so you can imagine how long it would take.
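As I understand the description above, the per-dependency loop amounts to something like the following sketch (assumed behaviour for illustration, not Dependabot's actual code), which is why the total runtime is roughly two full pip-compile runs multiplied by the number of dependencies:

```python
# Illustrative reconstruction of the loop described above; the flags are taken
# from the comment, everything else (function and variable names) is assumed.
import subprocess

COMMON_ARGS = [
    "--build-isolation",
    "--allow-unsafe",
    "--output-file=requirements.txt",
    "requirements.in",
]


def check_all_updates(dependencies: list[str]) -> None:
    for dep in dependencies:
        # Try to bump just this one dependency (~7 minutes in the example above)...
        subprocess.run(["pip-compile", "-P", dep, *COMMON_ARGS], check=True)
        # ...then regenerate the lock file without -P (~8.5 minutes), the second
        # call the commenter hopes Dependabot could skip when nothing changed.
        subprocess.run(["pip-compile", *COMMON_ARGS], check=True)
```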
I don't understand the value in looping over the dependencies one at a time; I think there is a better way: upgrade all the requirements at once and use the collected dependencies and transitive dependencies as a constraints file, changing […]
Can you give an example of the […]?
Dependabot should probably just switch to uv; it will likely be 100x faster to run this process, as well as having fairly good resolution. uv's caching will massively help when doing the loop (although again, even with uv, getting rid of the loop will speed things up).
That's how Dependabot delivers updates (one MR per dependency upgrade), which also makes it easier to review updates. I don't think I can get them to change that; that's how it works for any platform/package manager.
Yeah, I will try to make one. This one contains some internal packages, so you won't be able to test with it; I will try to make one by hand that has the same dependencies!
Thanks for the suggestion, I will look into that!
I have composed this requirements file: […]
I have to say that locally it's a lot faster than on the Dependabot runner, but it's still about 60 seconds (not sure if the server is just busy, or maybe there is some sort of throttling by pypi.org; it could also be disk-related), so about 2 minutes per dependency, which still adds up to over 4 hours. It also looks like having extra indexes adds quite a bit of extra time here. I tried it with uv and that's indeed much faster (it got it done in about 6 seconds); I will check if that's something Dependabot would want to look at. Edit: Dependabot support for uv is on its way, but it's not finished yet.
What's the problem this feature will solve?
The 2020 resolver separated the resolution logic from the rest of pip, and made the resolver much easier to read and extend. With `--use-feature=fast-deps`, we began to investigate improved pip performance by avoiding the download of dependencies until we reach a complete resolution. With `install --report`, we enabled pip users to read the output of the resolver without needing to download dependencies. However, there remain a few blockers to achieving performance improvements, for multiple use cases:

Uncached resolves
When pip is executed entirely from scratch (without an existing `~/.cache/pip` directory), as is often the case in CI, we are unlikely to get too much faster than we are now (and notably, it's extremely unlikely a non-pip tool could go faster in this case without relying on some sort of remote resolve index). However, there are a couple of improvements we can still make here:

- make `--use-feature=fast-deps` the default, to cover for wheels without PEP 658 metadata backfilled yet.
- […] `--find-links` repo instead of serving the PyPI simple repository API.
- `fast-deps` is not as fast as it could be, which will be addressed further below.

Partially-cached resolves with downloading
When pip is executed with a persistent `~/.cache/pip` directory, we can take advantage of much more caching, and this is the bulk of the work here. In e.g. #11111 and other work, we (mostly) separated metadata resolution from downloading, and this has allowed us to consider how to cache not just downloaded artifacts, but parts of the resolution process itself. This is directly enabled by the clean separation and design of the 2020 resolver. We can cache the following:

- `fast-deps` metadata for a particular wheel (this saves us a few HTTP range requests).

These alone may not seem like much, but over the course of an entire resolve, not having to make potentially multiple network requests per dependency and staying within our in-memory pip resolve logic adds up and produces a very significant performance improvement. These also reduce the number of requests we make against PyPI.
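As a rough illustration of the range requests that `fast-deps`-style metadata fetching relies on: the idea is to fetch only the tail of a remote wheel (where the ZIP central directory lives) rather than the whole file. The sketch below only shows the request pattern under assumed parameters; a real implementation, like pip's lazy wheel, has to parse the end-of-central-directory record and may need several ranges.

```python
# Request-pattern sketch only; the byte count and error handling are assumptions.
from urllib.request import Request, urlopen


def fetch_wheel_tail(url: str, num_bytes: int = 64 * 1024) -> bytes:
    """Fetch the last `num_bytes` of a remote wheel via an HTTP Range request."""
    req = Request(url, headers={"Range": f"bytes=-{num_bytes}"})
    with urlopen(req) as resp:
        if resp.status != 206:  # 206 Partial Content means the host honors Range
            raise RuntimeError("server ignored the Range header; fall back to a full download")
        return resp.read()
```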
But wait! We can go even faster! Because in addition to the metadata cache (which is idempotent and not time-varying--the same wheel hash always maps to the same metadata), we can also cache the result of querying the simple repository API for the list of dists available for a given dependency name! This additional caching requires messing around with HTTP caching headers to see if a given page has changed, but it lets us cache:

- `Link`s, if unchanged (this saves us an HTML parser invocation).
- `Link`s filtered by interpreter compatibility, if unchanged (this saves us having to calculate interpreter compatibility using the tags logic).

Resolves without downloading
With `install --report out.json --dry-run` and the metadata resolve+caching discussed above, we should be able to avoid downloading the files associated with the resolved dists, enabling users to download those dependencies in a later phase (as I achieved at Twitter with pantsbuild/pants#8793). However, we currently don't do so (see #11512), because of a mistake I made in previous implementation work (sorry!). So for this, we just need: […] (`install --dry-run`).

Describe the solution you'd like
I have created several PRs which achieve all of the above:
Batch downloading [0/2]
For batch downloading of metadata-only dists, we have two phases:

- have `BatchDownloader` download and prepare metadata-only dists in parallel. This produces a drastic performance improvement.
- […]

`fast-deps` fixes [0/1]

- fix the `fast-deps` implementation to achieve excellent performance against the current iteration of PyPI behind Fastly, as well as any other HTTP host supporting range requests.

Formalize "concrete" vs metadata-only dists [0/3]
To avoid downloading dists for metadata-only commands, we have several phases:

- add `.is_concrete` to our `Distribution` wrappers to codify the concept of "metadata-only" dists.
- […] `RequirementPreparer` logic.
- ensure `install --dry-run` doesn't download any dists.

Metadata caching [0/1]

- […]
Caching index pages [0/2]
To optimize the process of obtaining `Link`s to resolve against, we have at least two phases:

- use `CacheControl` to implicitly retrieve the cached response after a very fast `304 Not Modified` from PyPI.
- […] `Link` parsing and interpreter compatibility filtering from the metadata cache.

Each of these PRs demonstrates some nontrivial performance improvement in its description. All together, the result is quite significant, and never produces a slowdown.
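For reference, a small sketch of the conditional-request pattern behind the index-page caching above, using the publicly available `CacheControl` and `requests` packages (pip's own wiring of its vendored copies differs, and the cache directory and project queried here are arbitrary examples): the first request stores the page, and later requests revalidate with the index and are served locally after a cheap `304 Not Modified`.

```python
# Standalone sketch; illustrates HTTP caching of a simple repository API page.
import requests
from cachecontrol import CacheControl
from cachecontrol.caches.file_cache import FileCache

session = CacheControl(requests.Session(), cache=FileCache("./.index-page-cache"))

# Repeat calls send If-None-Match / If-Modified-Since; on a 304 the cached body
# is returned without re-downloading the (potentially large) project page.
resp = session.get(
    "https://pypi.org/simple/requests/",
    headers={"Accept": "application/vnd.pypi.simple.v1+json"},
)
print(resp.status_code, len(resp.content))
```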
Alternative Solutions
- […] `~/.cache/pip`, which can be tracked separately from the wheel cache to speed up resolution without ballooning in size.
- I expect the `Link` parsing and interpreter compatibility caching in #12258 (persistent cache for link parsing and interpreter compatibility) to involve more discussion, as they take up more cache space than the idempotent metadata cache, produce less of a performance improvement, and are more complex to implement. However, nothing else depends on them to work, and they can safely be discussed later after the preceding caching work is done.

Additional context
In writing this, I realized we may be able to modify the approach of #12257 to work with `--find-links` repos as well. Those are expected to change much more frequently than index pages using the simple repository API, but may be worth considering after the other work is done.