
cache metadata lookups for sdists and lazy wheels #12256

Draft: wants to merge 4 commits into base: main from the link-metadata-cache branch
Conversation

cosmicexplorer
Contributor

cosmicexplorer commented Sep 3, 2023

This PR is on top of #12186; see the +392/-154 diff against it at https://github.com/cosmicexplorer/pip/compare/metadata_only_resolve_no_whl...cosmicexplorer:pip:link-metadata-cache?expand=1.

Background: Pip Bandwidth Usage Improvements

In 2016, @dstufft tweeted:

In the last 24 hours PyPI has served 120 million requests and used 13.7 TB of bandwidth.

Today, @sethmlarson tweeted:

urllib3 recently reached 10 billion total downloads!! 🎉🔟🥳

Since PEP 658 from @uranusjr was implemented via #11111 and pypi/warehouse#13649, pip has been making use of shallow metadata files to resolve against, meaning that instead of pulling down several very large wheels over the course of a single pip install invocation, only to throw away most of them, we now only download and prepare a single version of each requirement at the conclusion of each pip subcommand. With #12186, we can even avoid downloading any dists at all if we only want to generate an install report with pip install --report --dry-run. For find-links or other non-pypi indices which haven't implemented PEP 658, #12208 improves the fast-deps feature to enable metadata-only resolves against those too.
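A PEP 658 .metadata file is just the RFC 822-style METADATA document from inside the wheel, served standalone at the distribution file's URL with ".metadata" appended. A minimal sketch of what a resolver gets from one, using only the stdlib (the sample content is abridged from the h5py cache entry shown further down; this is illustrative, not pip's actual code):

```python
# Minimal sketch (not pip's actual code): a PEP 658 ".metadata" file is the
# wheel's METADATA document served standalone, so a resolver can read name,
# version, and Requires-Dist without downloading the wheel itself.
from email.parser import Parser

# Abridged sample; a real file comes from "<wheel URL>.metadata".
sample = """\
Metadata-Version: 2.1
Name: h5py
Version: 3.1.0
Requires-Python: >=3.6
Requires-Dist: numpy (>=1.19.3) ; python_version >= "3.9"
"""

md = Parser().parsestr(sample)
print(md["Name"], md["Version"])     # what the resolver keys on
print(md.get_all("Requires-Dist"))   # dependencies, with no wheel download
```

A few KB of headers like these are all the resolver needs, which is why metadata-only resolves cut bandwidth so sharply.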

Proposal: Caching Metadata to Reduce Number of Requests

The above is all great, but despite reducing bandwidth usage, we haven't yet reduced the total number of requests. As described in #12184, I believe we can safely cache metadata lookups to both drastically improve pip's runtime and significantly reduce the number of requests against pypi.

This caching is made safe by extending @sbidoul's prior work (see e.g. #7333, #7296) to ensure pip._internal.cache.Cache incorporates all the information that may change the compatibility of a downloaded/prepared distribution resolved from a Link.

Result: 6.5x Speedup, Shaving Off 30 Seconds of Runtime

This change produces a 6.5x speedup against the example tested below, reducing the runtime of this pip install --report command from over 36 seconds down to just 5.7 seconds:

# this branch is #12186, which this PR is based off of
> git checkout metadata_only_resolve_no_whl
> python3.8 -m pip install --dry-run --ignore-installed --report test.json --use-feature=fast-deps 'numpy>=1.19.5' 'keras==2.4.3' 'mtcnn' 'pillow>=7.0.0' 'bleach>=2.1.0' 'tensorflow-gpu==2.5.3'
...
real    0m36.410s
user    0m15.706s
sys     0m13.377s
# switch to this PR
> git checkout link-metadata-cache
# enable --use-feature=metadata-cache
> python3.8 -m pip install --use-feature=metadata-cache --dry-run --ignore-installed --report test.json --use-feature=fast-deps 'numpy>=1.19.5' 'keras==2.4.3' 'mtcnn' 'pillow>=7.0.0' 'bleach>=2.1.0' 'tensorflow-gpu==2.5.3'
...
real    0m5.671s
user    0m4.429s
sys     0m0.123s

But possibly more importantly, it also drastically reduces the number of requests made against pypi (this is reflected in the much lower sys time in the second command output above), without introducing significant disk space usage (only ~60KB, because we only cache metadata for each dist, and we compress the serialized cache entries):

> du -b -s ~/.cache/pip/link-metadata/
60238   /home/cosmicexplorer/.cache/pip/link-metadata/
> find ~/.cache/pip/link-metadata/ -type f | parallel bzcat | jq
...
{
  "metadata": "Metadata-Version: 2.1\nName: h5py\nVersion: 3.1.0\nSummary: Read and write HDF5 files from Python\nHome-page: http://www.h5py.org\nAuthor: Andrew Collette\nAuthor-email: andrew.collette@gmail.com\nMaintainer: Andrew Collette\nMaintainer-email: andrew.collette@gmail.com\nLicense: BSD\nDownload-URL: https://pypi.python.org/pypi/h5py\nPlatform: UNKNOWN\nClassifier: Development Status :: 5 - Production/Stable\nClassifier: Intended Audience :: Developers\nClassifier: Intended Audience :: Information Technology\nClassifier: Intended Audience :: Science/Research\nClassifier: License :: OSI Approved :: BSD License\nClassifier: Programming Language :: Cython\nClassifier: Programming Language :: Python\nClassifier: Programming Language :: Python :: 3\nClassifier: Programming Language :: Python :: 3.6\nClassifier: Programming Language :: Python :: 3.7\nClassifier: Programming Language :: Python :: Implementation :: CPython\nClassifier: Topic :: Scientific/Engineering\nClassifier: Topic :: Database\nClassifier: Topic :: Software Development :: Libraries :: Python Modules\nClassifier: Operating System :: Unix\nClassifier: Operating System :: POSIX :: Linux\nClassifier: Operating System :: MacOS :: MacOS X\nClassifier: Operating System :: Microsoft :: Windows\nRequires-Python: >=3.6\nRequires-Dist: cached-property ; python_version < \"3.8\"\nRequires-Dist: numpy (>=1.12) ; python_version == \"3.6\"\nRequires-Dist: numpy (>=1.14.5) ; python_version == \"3.7\"\nRequires-Dist: numpy (>=1.17.5) ; python_version == \"3.8\"\nRequires-Dist: numpy (>=1.19.3) ; python_version >= \"3.9\"\n\n\nThe h5py package provides both a high- and low-level interface to the HDF5\nlibrary from Python. 
The low-level interface is intended to be a complete\nwrapping of the HDF5 API, while the high-level component supports  access to\nHDF5 files, datasets and groups using established Python and NumPy concepts.\n\nA strong emphasis on automatic conversion between Python (Numpy) datatypes and\ndata structures and their HDF5 equivalents vastly simplifies the process of\nreading and writing data from Python.\n\nSupports HDF5 versions 1.8.4 and higher.  On Windows, HDF5 is included with\nthe installer.\n\n\n",
  "filename": "h5py-3.1.0-cp38-cp38-manylinux1_x86_64.whl",
  "canonical_name": "h5py"
}
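The shell pipeline above can be mirrored in Python. This is a hedged sketch assuming the on-disk layout described in this PR (bz2-compressed JSON entries under ~/.cache/pip/link-metadata/), which may well change before merge:

```python
# Sketch of inspecting the link-metadata cache from Python, mirroring the
# `bzcat | jq` pipeline above. The layout (bz2-compressed JSON objects with
# "metadata", "filename", and "canonical_name" keys) is assumed from this
# PR's description and could change before merge.
import bz2
import json
from pathlib import Path

def iter_cache_entries(cache_dir: Path):
    """Yield each decoded cache entry as a dict."""
    if not cache_dir.is_dir():
        return
    for entry_path in sorted(cache_dir.rglob("*")):
        if entry_path.is_file():
            yield json.loads(bz2.decompress(entry_path.read_bytes()))

for entry in iter_cache_entries(Path.home() / ".cache" / "pip" / "link-metadata"):
    print(entry["canonical_name"], entry["filename"])
```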

TODO

@cosmicexplorer force-pushed the link-metadata-cache branch 2 times, most recently from f072af8 to 07a0d2f on September 3, 2023 05:33
@cosmicexplorer force-pushed the link-metadata-cache branch 4 times, most recently from 1b26b58 to 7a9db70 on September 3, 2023 10:07
@cosmicexplorer changed the title from "make LinkMetadataCache to cache metadata lookups" to "introduce --use-feature=metadata-cache to cache metadata lookups" on Sep 3, 2023
@cosmicexplorer force-pushed the link-metadata-cache branch 3 times, most recently from bc2eb29 to 3a25be5 on September 3, 2023 11:48
@cosmicexplorer
Contributor Author

@ewdurbin: regarding #12257 (comment), I don't think I'd tagged you in this change, which will actually reduce requests against files.pythonhosted.org as well!

@sbidoul
Member

sbidoul commented Sep 3, 2023

@cosmicexplorer unfortunately I don't have enough time available at the moment to delve into your PRs which all sound very interesting.

This one caught my attention, though, as it is something I also contemplated to do previously. So I gave it a quick try. And I have questions :)

  • When running pip install --cache-dir=/tmp/pipcache --no-binary=:all: --use-feature=metadata-cache git+https://github.com/pypa/pip-test-package it creates a link-metadata cache entry. Actually, it should not, since this link is not cacheable: the head of the branch can change. I tend to think a metadata caching mechanism should somehow re-use the same _should_cache() logic we have to decide whether locally built wheels should be cached or not.
  • When running pip install --cache-dir=/tmp/pipcache --use-feature=metadata-cache lxml, which downloads and installs a wheel, it creates a link-metadata cache entry. Is that intended? Is it not sufficient to rely on the http cache to obtain the .metadata from the index, or, if not available, from the wheel?
  • When running pip install --cache-dir=/tmp/pipcache --use-feature=metadata-cache --no-binary=:all: lxml, which downloads and builds an sdist, it also creates a metadata cache entry. One potential issue is that the cache entry file name seems to be created from a hash of the link only (IIUC). However, the metadata can be different depending on the platform on which it was generated. My intuition is that the compatibility tags we have in the wheels could play a role here, although I don't know how to obtain them without doing a full wheel build.

@cosmicexplorer
Contributor Author

cosmicexplorer commented Sep 3, 2023

@sbidoul: thank you so much for your thoughtful feedback!! ^_^

When running pip install --cache-dir=/tmp/pipcache --no-binary=:all: --use-feature=metadata-cache git+https://github.com/pypa/pip-test-package it creates a link-metadata cache entry

This is fixed in 55f185a, which avoids creating cache entries for git or file urls.

Is it not sufficient to rely on the http cache to obtain the .metadata from the index, or, if not available, from the wheel?

I am looking into this now. I agree that the http cache should retain the .metadata from the index, so I will make sure that is working. I also agree that it should be able to pull from the cached wheel, if available.

One potential issue is that the cache entry file name seems to be created from a hash of the link only (IIUC). However, the metadata can be different depending on the platform on which it has been generated. My intuition is that the compatibility tags we have in the wheels could play a role here, although I don't know how to obtain them without doing a full wheel build.

So you're absolutely right about this, but I was under the impression that I had solved this by directly making use of your work in pip._internal.cache.Cache._get_cache_path_parts() from ab05936 to solve #7296, which incorporates the current interpreter name and version. As you said:

I tend to think a metadata caching mechanism should somehow re-use the same _should_cache() logic we have to decide whether locally built wheels should be cached or not.

Please no rush to respond, I really appreciate your feedback.
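To make the mechanism being discussed concrete: the idea behind _get_cache_path_parts() is to hash the link URL together with the interpreter identity, so one interpreter's cache entries can never be picked up by an incompatible one. The helper below is a simplified stand-in sketch, not pip's actual implementation:

```python
# Simplified stand-in for pip._internal.cache.Cache._get_cache_path_parts():
# the cache key hashes the link URL *plus* the interpreter identity (added in
# ab05936 for #7296), so metadata built under one Python is never served to
# an incompatible interpreter or version.
import hashlib
import json
import sys

def cache_path_parts(link_url: str) -> list:
    key_parts = {
        "url": link_url,
        "interpreter_name": sys.implementation.name,
        "interpreter_version": "{}.{}".format(*sys.version_info[:2]),
    }
    digest = hashlib.sha224(
        json.dumps(key_parts, sort_keys=True, separators=(",", ":")).encode()
    ).hexdigest()
    # Shard the digest into nested directories so no single directory
    # accumulates thousands of entries.
    return [digest[:2], digest[2:4], digest[4:6], digest[6:]]
```

The remaining question in this thread is whether interpreter name and version are enough for sdist-built metadata, or whether full wheel compatibility tags would be needed.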

@cosmicexplorer force-pushed the link-metadata-cache branch 3 times, most recently from 351aa56 to 635539b on September 3, 2023 21:09
@cosmicexplorer
Contributor Author

cosmicexplorer commented Sep 3, 2023

Is it not sufficient to rely on the http cache to obtain the .metadata from the index, or, if not available, from the wheel?

After a lengthy investigation, I've discovered that the http cache from CacheControlAdapter will still make GET requests against files.pythonhosted.org, even when the file is already in the http cache. The response from the server for such requests is ignored. So while debug output with -vvv shows e.g. The response is "fresh", returning cached response, we will still make a request against pypi for each .metadata or .whl. It is nontrivial to work around this.
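For context, "fresh" here is the RFC 9111 notion: a cached response still within its freshness lifetime may be served with zero network traffic. The sketch below (illustrative headers, not a real capture, and a deliberately simplified freshness check) shows why no request should be necessary at all for long-lived immutable wheel files:

```python
# Simplified sketch of the freshness check an HTTP cache is supposed to apply
# (RFC 9111): a response whose age is within its max-age can be served with no
# request at all. The report above is that pip's CacheControlAdapter logged the
# response as "fresh" yet still issued a GET whose body it then ignored.
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime

def is_fresh(response_headers: dict, now: datetime) -> bool:
    max_age = 0
    for directive in response_headers.get("Cache-Control", "").split(","):
        directive = directive.strip()
        if directive.startswith("max-age="):
            max_age = int(directive.split("=", 1)[1])
    date = parsedate_to_datetime(response_headers["Date"])
    return (now - date).total_seconds() < max_age

headers = {
    # Illustrative values: package files are served as immutable, long-lived.
    "Cache-Control": "max-age=365000000, immutable",
    "Date": "Sun, 03 Sep 2023 00:00:00 GMT",
}
print(is_fresh(headers, datetime(2023, 9, 4, tzinfo=timezone.utc)))  # True
```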

In #12257 we were able to make some GET requests much faster by adding HTTP caching headers to make pypi return an empty 304 Not Modified, but there are two issues with applying that here:

  1. files.pythonhosted.org doesn't return a 0-length 304 Not Modified, it always returns a full-length 200 OK.
  2. unlike #12257 (send HTTP caching headers for index pages to further reduce bandwidth usage), we are dealing with immutable links. We don't need to check whether the result has changed.

I think this PR is the cleanest way to avoid making those unused GET requests against files.pythonhosted.org by making use of the existing "metadata-only fetch" concept, without having to mess around with our CacheControlAdapter. What do you think @sbidoul?

@cosmicexplorer force-pushed the link-metadata-cache branch 2 times, most recently from 2089c05 to 5a42ae4 on September 3, 2023 23:43
@pfmoore
Member

pfmoore commented Sep 4, 2023

After a lengthy investigation, I've discovered that the http cache from CacheControlAdapter will still make GET requests against files.pythonhosted.org, even when the file is already in the http cache.

Is this by design, or is it a bug? If the latter, I think we'd be better off just getting the bug fixed…

@sbidoul
Member

sbidoul commented Sep 4, 2023

I think this PR is the cleanest way to avoid making those unused GET requests against files.pythonhosted.org by making use of the existing "metadata-only fetch" concept, without having to mess around with our CacheControlAdapter.

I perceive several aspects to this metadata caching topic:

  1. Caching metadata we computed locally is something I consider useful (Cache prepared metadata, #11165), because there are several scenarios where pip downloads an sdist or a VCS repo to compute the metadata and then discards it, if a wheel is not built for any reason (already installed, rejected version, ...). The decision to cache the metadata or not, and the way to structure the cache, should likely be as similar as possible to the cache of locally built wheels, for least-surprise behaviour.

  2. Caching .metadata files from indexes. I can't say I fully understand the subtleties of the interactions between the pip http cache and PyPI, but my intuition says these should be cached at the http level (like wheels), and that any potential improvements should be investigated in detail in the CacheControl area first, before leaking that caching complexity elsewhere in the code base.

  3. Caching metadata obtained from downloaded or cached wheels. For this I'd say we should investigate whether there is a tangible performance benefit, because I'm under the impression that getting the metadata from inside the locally cached wheel is fast enough.

What do you think @sbidoul?

Easier said than done, I know :) And caching is hard. But I'd suggest addressing each aspect independently, as these are very different and independent problems.

@cosmicexplorer
Contributor Author

Ok, the above has been implemented: we now only cache metadata for cacheable sdists and lazy wheel dists, and otherwise rely on the CacheControl HTTP cache for PEP 658 metadata. I also created a new exception class so that any errors with this new caching are made extremely obvious:

class CacheMetadataError(PipError):
    """Raised when de/serializing a requirement into the metadata cache."""
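A hedged sketch of how such a wrapper exception keeps cache bugs loud: any failure while (de)serializing an entry surfaces as CacheMetadataError rather than a stray json or bz2 error. PipError and the helper below are simplified stand-ins for pip's internals, not the PR's actual code:

```python
# Stand-in sketch: wrap any (de)serialization failure so the traceback names
# the metadata cache, not an incidental json/bz2 error. PipError and
# deserialize_entry are simplified stand-ins for pip's internals.
import bz2
import json

class PipError(Exception):
    """Stand-in for pip._internal.exceptions.PipError."""

class CacheMetadataError(PipError):
    """Raised when de/serializing a requirement into the metadata cache."""

def deserialize_entry(raw: bytes) -> dict:
    """Decode one cache entry, converting any failure into CacheMetadataError."""
    try:
        return json.loads(bz2.decompress(raw))
    except Exception as exc:
        raise CacheMetadataError(f"failed to read metadata cache entry: {exc}") from exc
```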

@cosmicexplorer
Contributor Author

cosmicexplorer commented Sep 18, 2023

@sbidoul: before I ask you to review this in depth, does @uranusjr's proposal (which I've implemented here) to use this metadata cache for both lazy wheels and cacheable sdists (sdists which return True from should_cache()) seem like an appropriate compromise? Otherwise the PEP 658 metadata is indeed already efficiently retrieved from the CacheControl HTTP cache, just as you suspected.

@cosmicexplorer mentioned this pull request on Sep 18, 2023
@sbidoul
Member

sbidoul commented Sep 20, 2023

@cosmicexplorer I still have some open questions after a bit of further thinking.

My first one is: is there still a plan for PyPI to backfill PEP 658 metadata?

Assuming that is the case, will fast-deps be something that we still need? I've not followed all the discussions about it, but I kind of remember talk of dropping it. Was it because PEP 658 makes it unnecessary?

Assuming that we conclude that fast-deps is going to be abandoned, then there are only ~~3~~ 2 cases left:

  • local metadata preparations for source trees and sdists: a metadata cache is useful for that, similar to the cache of locally built wheels
  • .metadata cache: the HTTP cache is adequate as you concluded
  • metadata from downloaded wheels: I wonder if optimizations to the http cache such as #11143 (New HTTP cache with lower memory usage) would help

[edit] if we have .metadata for all wheels in the index, item 3 does not apply anymore

@dstufft
Member

dstufft commented Sep 21, 2023

My first one is: is there still a plan for PyPI to backfill PEP 658 metadata?

I believe so yes.

@edmorley
Contributor

edmorley commented Dec 17, 2023

@cosmicexplorer I still have some open questions after a bit of further thinking.

My first one is: is there still a plan for PyPI to backfill PEP 658 metadata?

Backfilling is being tracked at pypi/warehouse#8254.

Assuming that is the case, will fast-deps be something that we still need? I've not followed all the discussions about it, but I kind of remember talk of dropping it. Was it because PEP 658 makes it unnecessary?

@cosmicexplorer I don't suppose you saw this question / have a chance to take a look at it? :-)

@cosmicexplorer
Contributor Author

cosmicexplorer commented Jan 11, 2024

@edmorley @sbidoul:

Assuming that is the case, will fast-deps be something that we still need? I've not followed all the discussions about it, but I kind of remember talk of dropping it. Was it because PEP 658 makes it unnecessary?

Please see #12208, which lays out the case for the continued use of the fast-deps feature (and also dramatically improves the implementation). Note that @radoering has also recently reviewed that code and even adapted it for poetry already (see his many very helpful comments in that PR), which demonstrates another vote of confidence for the fast-deps feature as perfected in #12208. The fast-deps implementation complexity is strictly limited to lazy_wheel.py with extremely thorough documentation in order to minimize the maintenance cost of this functionality. Also note that pypi itself does not need to do anything at all to support fast-deps, as this implementation is extremely robust and handles a variety of behaviors.

In a followup PR, I plan to remove the fast-deps feature flag and simply turn the functionality on by default, as one among multiple metadata fetching methods, alongside PEP 658 and cached sdists.

When performing `install --dry-run` and PEP 658 .metadata files are
available to guide the resolve, do not download the associated wheels.

Rather use the distribution information directly from the .metadata
files when reporting the results on the CLI and in the --report file.

- describe the new --dry-run behavior
- finalize linked requirements immediately after resolve
- introduce is_concrete
- funnel InstalledDistribution through _get_prepared_distribution() too
- add test for new install --dry-run functionality (no downloading)
- catch an exception when parsing metadata which only occurs in CI
- handle --no-cache-dir
- call os.makedirs() before writing to cache too
- catch InvalidSchema when attempting git urls with BatchDownloader
- fix other test failures
- reuse should_cache(req) logic
- gzip compress link metadata for a slight reduction in disk space
- only cache built sdists
- don't check should_cache() when fetching
- cache lazy wheel dists
- add news
- turn debug logs in fetching from cache into exceptions
- use scandir over listdir when searching normal wheel cache
- handle metadata email parsing errors
- correctly handle mutable cached requirement
- use bz2 over gzip for an extremely slight improvement in disk usage
Successfully merging this pull request may close these issues.

Cache prepared metadata
7 participants