
Mockup partial wheel download during resolvelib resolution #8442

Closed
wants to merge 1 commit

Conversation

McSinyx
Contributor

@McSinyx McSinyx commented Jun 15, 2020

This is open to facilitate discussions relevant to GH-7819. I believe that a dry run could be done by injecting a print-and-exit after the resolution is complete, so I'll focus on the matter of real installation:

  1. For pure-Python wheels this (the idea) is kinda nice. For projects released under the GPL the metadata is somewhere in the last 32 kB; it's 8 kB for others (the GPL is long 😄).
  2. The trick doesn't seem to work with extensions, or at least not with manylinux, although IIRC one rarely has enough extension requirements at the same time for parallelism to make sense. Edit: I'm going to investigate this tomorrow, but maybe it has something to do with the wheel repair step (auditwheel)?
  3. If hashing is required, of course this doesn't work.

After resolution, the not-yet-downloaded wheels can be fetched in parallel, which warrants a higher number (> 5 vs. < 5, I guess) of packages downloading at the same time; the sketch below illustrates the idea.
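For illustration only, a minimal sketch of that post-resolution parallel download step, assuming plain `requests` and a thread pool — the helper names and the worker count are made up, not pip's actual code:

```python
from concurrent.futures import ThreadPoolExecutor

import requests


def download(url, dest):
    # Stream one wheel to disk in 64 KiB chunks.
    with requests.get(url, stream=True) as resp:
        resp.raise_for_status()
        with open(dest, "wb") as f:
            for chunk in resp.iter_content(chunk_size=1 << 16):
                f.write(chunk)


def download_remaining(urls_and_dests, workers=5):
    # Fetch every wheel the resolver pinned but never fully downloaded.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        list(pool.map(lambda pair: download(*pair), urls_and_dests))
```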

For local test runs, I used oslo-utils==1.4.0 for pure Python with heavy backtracking and axuy for something depending on extension modules (numpy in this case).

cc @pradyunsg, @cosmicexplorer, @ofek and @dholth specifically for opinions on the approach.

@dholth
Member

dholth commented Jun 15, 2020

You'll find that the last 8 kB will usually get you both the zip manifest and the metadata. It's probably a good optimization to avoid extra round trips by downloading a reasonable amount of data on the first request.

The last few bytes of the file contain a pointer to the start of the zip manifest. The zip manifest tells you the location of each file in the archive.
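To make that concrete, here is a rough sketch of reading that pointer, assuming the fetched tail already contains the End of Central Directory record, and ignoring ZIP64 and archive comments for brevity:

```python
import struct

EOCD_SIGNATURE = b"PK\x05\x06"  # End of Central Directory record


def central_directory_location(tail):
    """Return (offset, size) of the central directory ("zip manifest").

    `tail` is the last chunk of the archive, fetched via a range request.
    """
    pos = tail.rfind(EOCD_SIGNATURE)
    if pos == -1:
        raise ValueError("EOCD not in this chunk; fetch a larger tail")
    # Bytes 12-19 of the EOCD record hold the directory's size and offset.
    size, offset = struct.unpack("<II", tail[pos + 12:pos + 20])
    return offset, size
```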

So a robust implementation needs to open the partial download with zipfile.ZipFile and be prepared to download additional ranges if necessary. That is why it is useful to have a seekable file-like object that runs over HTTP.
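As a sketch of such a seekable file-like object — not pip's actual code; the class name is invented, and it assumes the server honors Range requests and reports Content-Length:

```python
import io

import requests


class LazyHTTPFile(io.RawIOBase):
    """Read-only, seekable file over HTTP range requests."""

    def __init__(self, url, session=None):
        self._url = url
        self._session = session or requests.Session()
        head = self._session.head(url, allow_redirects=True)
        head.raise_for_status()
        self._length = int(head.headers["Content-Length"])
        self._pos = 0

    def seekable(self):
        return True

    def readable(self):
        return True

    def seek(self, offset, whence=io.SEEK_SET):
        if whence == io.SEEK_SET:
            self._pos = offset
        elif whence == io.SEEK_CUR:
            self._pos += offset
        elif whence == io.SEEK_END:
            self._pos = self._length + offset
        return self._pos

    def tell(self):
        return self._pos

    def read(self, size=-1):
        if size == -1:
            size = self._length - self._pos
        if size <= 0 or self._pos >= self._length:
            return b""
        end = min(self._pos + size, self._length) - 1
        # One range request per read; ZipFile only asks for what it needs.
        resp = self._session.get(
            self._url, headers={"Range": f"bytes={self._pos}-{end}"}
        )
        resp.raise_for_status()
        data = resp.content
        self._pos += len(data)
        return data


if __name__ == "__main__":
    import sys
    from zipfile import ZipFile

    # Pass any wheel URL, e.g. one copied from PyPI's "Download files" page.
    with ZipFile(LazyHTTPFile(sys.argv[1])) as zf:
        meta = next(n for n in zf.namelist()
                    if n.endswith(".dist-info/METADATA"))
        print(zf.read(meta).decode())
```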

If hashing is required you would need to delay it until after the entire archive was downloaded and be prepared to roll back the entire transaction. It seems like the hashing would need to happen after the resolve step when you have decided to install the whole wheel, anyway?

@cosmicexplorer
Contributor

@dholth has some great ideas above!!

So a robust implementation needs to open the partial download with zipfile.ZipFile and be prepared to download additional ranges if necessary. That is why it is useful to have a seekable file-like object that runs over HTTP.

I recall you mentioned this here too: #7819 (comment). I think it might be really good for me to focus on canonicalizing the _hacky_extract_sub_reqs() method (currently in the branch from #7819) that does this wheel shallow metadata download process into your httpfile library.

I think that if I were to extract that process, it would likely be something we could drop directly into this PR by then depending on httpfile.

@McSinyx what do you think about that separation of concerns? Thanks so much for reaching out and creating this PR by the way!!!

@McSinyx
Contributor Author

McSinyx commented Jun 16, 2020

It's probably a good optimization to avoid extra round trips by downloading a reasonable amount of data on the first request.

I agree, @dholth, although more experiments must be carried out to determine the reasonable amount of data. E.g. 32 kB for GPL distributions can cost me a second or two (Vietnamese ISPs hate PyPI; I rarely have a megabit connection to the server), so 8 kB might be more reasonable, depending on the popularity of packages.

So a robust implementation needs to open the partial download with zipfile.ZipFile and be prepared to download additional ranges if necessary. That is why it is useful to have a seekable file-like object that runs over HTTP.

From reading the previous discussion, I had somehow planned to prepend a constant-size chunk to existing files 😄 If we feed ZipFile a seekable HTTP file of sufficient size (so that it's recognized as a ZIP file), would it (metadata requests from pip's pkg_resources on the wheel wrapper) magically work, or do we still need to determine the size ourselves? I couldn't reach a conclusion after trying to read the ZIP spec and CPython's implementation. I'll try to do the homework harder, but it'd be nice if you could tell me directly if you already know.

It seems like the hashing would need to happen after the resolve step when you have decided to install the whole wheel, anyway?

Currently it's not (any longer? I haven't looked into the history)--it's checked right after download--and I think that's by design, to ensure the pinned candidates satisfy the requirements. As said by @pradyunsg,

I think it's OK to disable the partial download logic if we're in --require-hashes mode. Those users have basically provided exact pins (via the hashes). It's not like we'd be making "false" downloads, so deferring to later as a speed boost isn't really a big deal.

We'd be able to make things better once we have a lockfile, but we don't.

although there's another use case where packages also try to have strict dependencies with hashes provided. I think we can figure this out later, since we can likely just parallel-download all the wheels with checksums.

I think it might be really good for me to focus on canonicalizing the _hacky_extract_sub_reqs() method (currently in the branch from #7819) that does this wheel shallow metadata download process into [@dholth's] httpfile library.

@cosmicexplorer, I see you're scraping metadata manually. I found pip._internal.utils.wheel.pkg_resources_distribution_for_wheel, which can do it for us (there are a few similar routines within pip's codebase doing the same thing; I'm not sure why 😄), and I think you might want to use it.
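For reference, a sketch of how that helper is called, assuming the pip 20.x internal API as I understand it — the wheel filename and project name are illustrative:

```python
from zipfile import ZipFile

from pip._internal.utils.wheel import pkg_resources_distribution_for_wheel

path = "six-1.15.0-py2.py3-none-any.whl"  # any locally available wheel
with ZipFile(path) as wheel_zip:
    dist = pkg_resources_distribution_for_wheel(wheel_zip, "six", path)
    print(dist.requires())  # dependencies parsed from the wheel's METADATA
```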

I'm excited to see httpfile entering pip. I think it'd fit within _internal modules (networking I think) rather than trying to finalize the API to make it a standalone to-be-vendored module. While I'd love to see it as a separate module by itself, I'm more eager to see what we can do with the use case here where we only download from bottom up.

About this PR in particular, I didn't intend it to be merged (nor intend it not to be merged; I'm just trying to see if the direction we're going in is good enough), so please don't pay too much attention to the patch here as you're finalizing _hacky_extract_sub_reqs.

Thank you both for your prompt feedback!

@cosmicexplorer
Contributor

I found pip._internal.utils.wheel.pkg_resources_distribution_for_wheel which can do it for us

Thank you so much for directing me to the canonical method here! I will use that for when the wheel file is already downloaded locally! _hacky_extract_sub_reqs() is for the case when the wheel file download is not cached (so we do not have the wheel file locally yet), and we want to make progress on the resolution without having to wait for the entire wheel to download. This is especially useful for very large wheels such as tensorflow, which can be hundreds of megabytes, allowing us to download just a few kilobytes of the wheel instead.

I'm not sure if you already understood that part and I misinterpreted your message :)

I'm excited to see httpfile entering pip. I think it'd fit within _internal modules (networking I think) rather than trying to finalize the API to make it a standalone to-be-vendored module. While I'd love to see it as a separate module by itself, I'm more eager to see what we can do with the use case here where we only download from bottom up.

Before reading this message, I tried to spend a small amount of time making it into its own library, and that process has led me to agree with you! See https://github.com/cosmicexplorer/httpfile -- it's just the three source files httpfile.py, zipfile.py and wheel.py in src/httpfile that really matter; I don't think it clearly deserves its own library right now.

So, I'll be abandoning https://github.com/cosmicexplorer/httpfile and creating a branch of pip with that httpfile code inside pip._internal.networking instead, as you suggest here. I'll post on #7819 when I've done that.

@pradyunsg
Member

It seems like the hashing would need to happen after the resolve step when you have decided to install the whole wheel, anyway?

No, it needs to happen prior to further resolution. Otherwise, it'd be possible for an attacker to craft a file that results in pip downloading from an arbitrary URL (à la PEP 508), with dependencies crafted carefully to reject that specific candidate later, so that the end result is the same as otherwise, except that pip hit an extra URL and possibly executed code from there.

That completely defeats the purpose of hash checking, which is to verify that the assets we're using are only the ones that match the hashes.

@uranusjr
Member

I would suggest disabling partial download (i.e. always eagerly downloading the whole artifact) when hash-checking mode is enabled. This would be much easier to implement anyway.

@cosmicexplorer
Contributor

I would suggest disabling partial download (i.e. always eagerly downloading the whole artifact) when hash-checking mode is enabled. This would be much easier to implement anyway.

I agree! I was trying to think of alternatives but couldn't figure anything out!

However, I think that if both hash checking and partial downloading are desired, we could consider incorporating two separate checksums into the link for a wheel, one for the wheel's entire contents, and the other for the contents of the METADATA file (which is all we care about anyway in the case of partial downloading).
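As a purely hypothetical sketch of how that second checksum could be consumed — nothing like this exists in the index API today, and the function and parameter names are invented:

```python
import hashlib
from zipfile import ZipFile


def check_metadata_hash(partial_wheel, expected_sha256):
    # `partial_wheel` is a seekable file-like object over the shallow
    # download; only the METADATA member is hashed and verified.
    with ZipFile(partial_wheel) as zf:
        name = next(n for n in zf.namelist()
                    if n.endswith(".dist-info/METADATA"))
        digest = hashlib.sha256(zf.read(name)).hexdigest()
    if digest != expected_sha256:
        raise ValueError("METADATA hash mismatch for partial download")
```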

@uranusjr
Member

we could consider incorporating two separate checksums into the link for a wheel, one for the wheel's entire contents, and the other for the contents of the METADATA file (which is all we care about anyway in the case of partial downloading).

I agree. This would likely require some standardisation work to define how things would work, though, since it is quite a bit harder for an index to provide such information than the file hash. Another thing to go into the "we're going to get there eventually" category.

@cosmicexplorer
Contributor

I was going to finish the last message with "but that sounds like something to leave for follow-up work"! It sounds like we are on the same page! Thank you for clarifying!

@McSinyx
Contributor Author

McSinyx commented Jun 16, 2020

I will use that for when the wheel file is already downloaded locally! _hacky_extract_sub_reqs() is for the case when the wheel file download is not cached (so we do not have the wheel file locally yet), and we want to make progress on the resolution without having to wait for the entire wheel to download.

If I understand @dholth correctly, then we can have a magical file-like object reading from range requests' results that we can feed to ZipFile, which can then be fed to pkg_resources_distribution_for_wheel. This PR is already doing that (although the shallow wheel is written to disk for absolutely no reason 😄). I'll try to keep up with GH-8448 and see if we can reuse something to make this smaller for a quicker review process.

@cosmicexplorer
Contributor

Note that #8448 intentionally went out of its way to mock out some of the structure that is defined in this PR with the --shallow-wheels CLI argument, because I was trying to make that PR more "standalone". But I actually think because the added code in #8448 is already unit-tested, I probably shouldn't try to add extra complexity like a command-line option. I will edit #8448 to remove all changes except to pip._internal.network.shallow and testing -- that should hopefully allow us to focus on your integration work in this PR!

@cosmicexplorer
Contributor

Then we can have a magical file-like object reading from range requests' results that we can feed to ZipFile, which can then be fed to pkg_resources_distribution_for_wheel

Ah, ok, this was a misunderstanding on my part! Sorry about that! (cc @dholth):

The changes in #8448 do read from HTTP range requests, but they do not create a "magical file-like object" -- instead, the code in #8448 only allows you to download a single file at a time from a remote zip archive (but downloading that file is extremely fast). It doesn't quite achieve what we're thinking about here -- but it's close.

The other thing #8448 needs to work around is that it doesn't fulfill the contract of the pip download command: it doesn't actually download the wheels at the end, which makes it hard to drop into other pip uses right now. So if it were changed to actually be a magical file-like object, it would probably be easier to merge.

I will make the following changes to #8448:

  1. Remove the --shallow-wheels cli argument and any modifications to pip download.
  2. Make #8448 able to actually download the wheels somehow, so that it can be dropped in without having to add any command-line arguments!

@pradyunsg
Member

pradyunsg commented Jun 16, 2020

I have yet to take a closer look at both this PR and @cosmicexplorer's #8448 (yay, easy-to-remember numbers!).

My view on this entire situation (right now) is that we seem to want to get something working w.r.t. the partial-download-based resolution logic. FWIW, I think even if we end up finishing/merging these PRs now, we'll wait until after the new resolver rollout before making it easier/possible for end users to enable this.

I'm 100% on board for having in-development functionality hidden behind the --unstable-feature flag. That's the mechanism to have code for in-development stuff in pip.

@cosmicexplorer
Contributor

I'm 100% on board for having in-development functionality hidden behind the --unstable-feature flag. That's the mechanism to have code for in-development stuff in pip.

That's extremely helpful!!! I think I can then keep #8448 as is, and then just focus on making it download the wheels at the end of the run to keep the contract of pip download, to make it a self-contained PR.

@McSinyx
Contributor Author

McSinyx commented Jun 24, 2020

This has served us as a point of discussion and given me a better understanding. I guess it's time to let it rest in peace.

@McSinyx McSinyx closed this Jun 24, 2020
@McSinyx McSinyx deleted the rubber branch June 24, 2020 01:34