cache metadata lookups for sdists and lazy wheels #12256
@ewdurbin: regarding #12257 (comment), I don't think I'd tagged you in this change that will actually reduce requests against files.pythonhosted.org as well!
@cosmicexplorer unfortunately I don't have enough time available at the moment to delve into your PRs, which all sound very interesting. This one caught my attention, though, as it is something I also contemplated doing previously. So I gave it a quick try. And I have questions :)
@sbidoul: thank you so much for your thoughtful feedback!! ^_^
This is fixed in 55f185a, which avoids creating cache entries for git or file urls.
I am looking into this now. I agree that the http cache should retain the
So you're absolutely right about this, but I was under the impression that I had solved this by directly making use of your work in
Please no rush to respond, I really appreciate your feedback.
After a lengthy investigation, I've discovered that the http cache from In #12257 we were able to make some GET requests much faster by adding HTTP caching headers to make pypi return an empty
I think this PR is the cleanest way to avoid making those unused GET requests against files.pythonhosted.org by making use of the existing "metadata-only fetch" concept, without having to mess around with our |
Is this by design, or is it a bug? If the latter, I think we’d be better just getting the bug fixed…
I perceive several aspects to this metadata caching topic:
Easier said than done, I know :) And caching is hard. But I'd suggest addressing each aspect independently, as these are very different and independent problems.
Ok, the above has been implemented, which means we only cache metadata for cacheable sdists and lazy wheel dists, and otherwise rely on the CacheControl HTTP cache for PEP 658 metadata. I also created a new exception class so any errors with this new caching are made extremely obvious (see pip/src/pip/_internal/exceptions.py, lines 256 to 257 in 227d8e8).
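To make the split described in this comment easier to follow, here is a minimal sketch of the decision it implies. It is not the code from the PR: `Candidate`, `MetadataSource`, and `choose_metadata_source` are hypothetical names, and the assumption that PEP 658 metadata takes priority when it is available is mine.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class MetadataSource(Enum):
    # PEP 658 .metadata file, left to the HTTP-layer cache.
    HTTP_CACHE = "http-cache"
    # Entry stored in the separate metadata cache.
    METADATA_CACHE = "metadata-cache"


@dataclass(frozen=True)
class Candidate:
    """Hypothetical summary of what is known about one distribution link."""

    has_pep658_metadata: bool
    is_lazy_wheel: bool
    is_cacheable_sdist: bool


def choose_metadata_source(candidate: Candidate) -> Optional[MetadataSource]:
    """Pick where cached metadata for this candidate would live, if anywhere."""
    if candidate.has_pep658_metadata:
        # Assumption: PEP 658 metadata is already covered by the HTTP cache.
        return MetadataSource.HTTP_CACHE
    if candidate.is_lazy_wheel or candidate.is_cacheable_sdist:
        return MetadataSource.METADATA_CACHE
    # Everything else is not cached by this scheme.
    return None
```

For example, a lazy wheel without PEP 658 metadata maps to `MetadataSource.METADATA_CACHE`, while a non-cacheable sdist maps to `None`.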
@sbidoul: before I ask you to review this in depth, does @uranusjr's proposal (which I've implemented here) to use this metadata cache for both lazy wheels and cacheable sdists (sdists which return |
@cosmicexplorer I still have some open questions after a bit of further thinking. My first one is whether there is still a plan for PyPI to backfill PEP 658 metadata? Assuming that is the case, will Assuming that we conclude that fast-deps is going to be abandoned, then there are only
[edit] if we have |
I believe so, yes.
Backfilling is being tracked at pypi/warehouse#8254.
@cosmicexplorer I don't suppose you saw this question / have a chance to take a look at it? :-)
Please see #12208, which lays out the case for the continued use of the In a followup PR, I plan to simply remove the |
When performing `install --dry-run` and PEP 658 .metadata files are available to guide the resolve, do not download the associated wheels. Rather, use the distribution information directly from the .metadata files when reporting the results on the CLI and in the --report file.
- describe the new --dry-run behavior
- finalize linked requirements immediately after resolve
- introduce is_concrete
- funnel InstalledDistribution through _get_prepared_distribution() too
- add test for new install --dry-run functionality (no downloading)

- catch an exception when parsing metadata which only occurs in CI
- handle --no-cache-dir
- call os.makedirs() before writing to cache too
- catch InvalidSchema when attempting git urls with BatchDownloader
- fix other test failures
- reuse should_cache(req) logic
- gzip compress link metadata for a slight reduction in disk space
- only cache built sdists
- don't check should_cache() when fetching
- cache lazy wheel dists
- add news
- turn debug logs in fetching from cache into exceptions
- use scandir over listdir when searching normal wheel cache
- handle metadata email parsing errors
- correctly handle mutable cached requirement
- use bz2 over gzip for an extremely slight improvement in disk usage
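Several of the commit notes above (creating the directory before writing, compressing entries with bz2, and raising instead of logging when a cached entry cannot be read) can be pictured with a small sketch. This is an illustration under assumptions, not the PR's implementation: `write_entry`, `read_entry`, `CacheEntryError`, the `.json.bz2` suffix, and the JSON serialization are all hypothetical choices.

```python
import bz2
import json
import os
from typing import Any, Dict, Optional


class CacheEntryError(Exception):
    """Hypothetical error type: an unreadable cache entry fails loudly."""


def _entry_path(cache_dir: str, key: str) -> str:
    return os.path.join(cache_dir, key + ".json.bz2")


def write_entry(cache_dir: str, key: str, metadata: Dict[str, Any]) -> str:
    """Serialize metadata as JSON, compress it with bz2, and write it to disk."""
    # Create the directory first: it may not exist yet on a fresh cache.
    os.makedirs(cache_dir, exist_ok=True)
    path = _entry_path(cache_dir, key)
    payload = json.dumps(metadata, sort_keys=True).encode("utf-8")
    with open(path, "wb") as f:
        f.write(bz2.compress(payload))
    return path


def read_entry(cache_dir: str, key: str) -> Optional[Dict[str, Any]]:
    """Return the cached metadata, None on a miss, or raise on a bad entry."""
    path = _entry_path(cache_dir, key)
    if not os.path.exists(path):
        return None  # plain cache miss
    try:
        with open(path, "rb") as f:
            raw = bz2.decompress(f.read())
        return json.loads(raw.decode("utf-8"))
    except (OSError, ValueError) as e:
        # Invalid bz2 data, truncated streams, and invalid JSON all land here.
        raise CacheEntryError(f"unreadable metadata cache entry: {path}") from e
```

Raising on an unreadable entry rather than silently falling back keeps cache bugs visible during development, at the cost of a hard failure if an entry is ever corrupted on disk.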
This PR is on top of #12186; see the +392/-154 diff against it at https://github.com/cosmicexplorer/pip/compare/metadata_only_resolve_no_whl...cosmicexplorer:pip:link-metadata-cache?expand=1.

Background: Pip Bandwidth Usage Improvements
In 2016, @dstufft tweeted:
Today, @sethmlarson tweeted:
Since PEP 658 from @uranusjr was implemented via #11111 and pypi/warehouse#13649, pip has been making use of shallow metadata files to resolve against, meaning that instead of pulling down several very large wheels over the course of a single `pip install` invocation, only to throw away most of them, we now only download and prepare a single version of each requirement at the conclusion of each pip subcommand. With #12186, we can even avoid downloading any dists at all if we only want to generate an install report with `pip install --report --dry-run`. For find-links or other non-pypi indices which haven't implemented PEP 658, #12208 improves the `fast-deps` feature to enable metadata-only resolves against those too.

Proposal: Caching Metadata to Reduce Number of Requests
The above is all great, but despite reducing bandwidth usage, we haven't yet reduced the total number of requests. As described in #12184, I believe we can safely cache metadata lookup to both drastically improve pip's runtime as well as significantly reduce the number of requests against pypi.
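As a rough picture of what a cached metadata lookup could look like, here is a minimal in-memory sketch. Everything in it is an assumption made for illustration: `metadata_cache_key`, `MetadataCache`, the `interpreter_tag` parameter, and the example URL are hypothetical and do not reflect pip's actual `Cache` API. It skips git and file URLs, which an earlier comment in this thread says should not get cache entries.

```python
import hashlib
import json
from typing import Any, Callable, Dict, Optional
from urllib.parse import urlsplit


def metadata_cache_key(url: str, interpreter_tag: str) -> Optional[str]:
    """Return a stable key for a link, or None if it should not be cached."""
    scheme = urlsplit(url).scheme
    # Git ("git+https", ...) and local file URLs can change under the same URL.
    if scheme == "file" or scheme.startswith("git"):
        return None
    # Assumption: the interpreter is part of the key, since it can affect
    # whether a prepared distribution is compatible.
    parts = {"url": url, "interpreter": interpreter_tag}
    blob = json.dumps(parts, sort_keys=True).encode("utf-8")
    return hashlib.sha256(blob).hexdigest()


class MetadataCache:
    """Wrap a fetch function so repeated lookups do not repeat the request."""

    def __init__(self, fetch: Callable[[str], Dict[str, Any]]) -> None:
        self._fetch = fetch
        self._entries: Dict[str, Dict[str, Any]] = {}
        self.requests = 0  # number of times the fetch function was called

    def lookup(self, url: str, interpreter_tag: str) -> Dict[str, Any]:
        key = metadata_cache_key(url, interpreter_tag)
        if key is not None and key in self._entries:
            return self._entries[key]
        self.requests += 1
        metadata = self._fetch(url)
        if key is not None:
            self._entries[key] = metadata
        return metadata


# Usage: the second lookup for the same link and interpreter makes no request.
cache = MetadataCache(lambda url: {"source": url})
example_url = "https://files.pythonhosted.org/packages/example-1.0.tar.gz"
cache.lookup(example_url, "cp311")
cache.lookup(example_url, "cp311")
```

After the two lookups above, `cache.requests` is 1; a lookup for the same URL under a different interpreter tag would trigger a second fetch.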
This caching is made safe by extending @sbidoul's prior work (see e.g. #7333, #7296) to ensure `pip._internal.cache.Cache` incorporates all the information that may change the compatibility of a downloaded/prepared distribution resolved from a `Link`.

Result: 6.5x Speedup, Shaving Off 30 Seconds of Runtime
This change produces a 6.5x speedup against the example tested below, reducing the runtime of this `pip install --report` command from over 36 seconds down to just 5.7 seconds:

But possibly more importantly, it also drastically reduces the number of requests made against pypi (this is reflected in the much lower `sys` time in the second command output above), without introducing significant disk space usage (only ~60KB, because we only cache metadata for each dist, and we compress the serialized cache entries):

TODO