Possibly add resiliency towards conda zip archives that use data descriptors. #794

Closed
jpcorreia99 opened this issue Jul 25, 2024 · 11 comments · Fixed by #797

jpcorreia99 (Contributor) commented Jul 25, 2024

.conda files are an alias for .zip archives, which are formed by a sequence of local file headers followed by data entries (reference).

These headers usually contain bytes indicating the CRC-32 and the length of the subsequent data entry. However, sometimes this information is not available at the moment the header is written (mostly when compression happens on the fly). In that case, the local header sets a data descriptor bit, indicating that a data descriptor containing this information will follow the data entry.
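
For concreteness, here is a minimal sketch of how that flag shows up on the wire; this is illustrative, not rattler or zip2 code. The data descriptor flag is bit 3 of the general-purpose bit flags, which sit at byte offset 6 of the local file header:

```rust
// Illustrative sketch: detect whether a ZIP local file header defers the
// CRC-32 and sizes to a trailing data descriptor. `header` is assumed to
// start at the local file header signature (0x04034b50) and be at least
// 8 bytes long.
fn uses_data_descriptor(header: &[u8]) -> bool {
    // Bytes 0..4: signature, 4..6: version needed to extract,
    // 6..8: general-purpose bit flags (little-endian).
    let flags = u16::from_le_bytes([header[6], header[7]]);
    flags & 0x0008 != 0 // bit 3 set => a data descriptor follows the data
}
```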

Rattler decompresses conda packages over an HTTP stream (https://github.com/mamba-org/rattler/blob/main/crates/rattler_package_streaming/src/read.rs#L58). However, streaming is not supported for zip archives that use data descriptors (the reader would need to be seekable, which an HTTP stream is not); read more from the maintainers of zip2 here.

When this is the case, we end up failing the download, because we hit this bit of code in the zip2 crate: https://github.com/zip-rs/zip2/blob/a7c1230dfa42e69539fe40e4bde7344973a4faa7/src/types.rs#L693-L697.

One possible solution would be to fall back to fully downloading the stream and doing normal, seekable zip decompression when the data descriptor error is detected.
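
As a rough, hypothetical sketch of that fallback (names and structure are illustrative, not rattler's actual code): buffer the whole body, then use the zip crate's seekable ZipArchive, which can read entry metadata from the central directory instead of relying on the local headers:

```rust
use std::io::{Cursor, Read};

// Hypothetical fallback sketch (not rattler's actual API): when streaming
// extraction fails on a data descriptor, buffer the entire body and extract
// via the seekable ZipArchive path instead.
fn extract_seekable(mut body: impl Read) -> zip::result::ZipResult<()> {
    let mut bytes = Vec::new();
    body.read_to_end(&mut bytes)?; // give up on streaming, buffer everything
    let mut archive = zip::ZipArchive::new(Cursor::new(bytes))?; // Cursor is Seek
    for i in 0..archive.len() {
        let entry = archive.by_index(i)?;
        // ... unpack `entry` to disk as the streaming path would have ...
        let _ = entry.name();
    }
    Ok(())
}
```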

I'd like to hear your thoughts on whether this would be a valuable contribution to rattler, and to discuss the performance implications of doing so 😄. I'm working on a hacky solution to check that it fixes our issue, and would be glad to submit it as a starting point for this feature if it's deemed an acceptable approach!

jpcorreia99 (Contributor, Author) commented Jul 25, 2024

@baszalmstra / @wolfv this is a much deeper dive into the issue I was facing when I opened #772.

@nichmor I remember that you were looking into the zip library as well so you may have the most familiarity with the situation above

(sorry for the excessive tagging)

wolfv (Contributor) commented Jul 25, 2024

Hi @jpcorreia99 - I am curious whether this is also a concern when the outer zip file is not compressed (which should generally be the case for .conda files).

jpcorreia99 (Contributor, Author) commented:

@wolfv I believe that when it is not compressed it should be okay.

In the specific case I'm seeing, the conda packages have actually been recompressed at a higher compression level, and that's when the issue arises. It seems the recompression is done on the fly, which generates archives with data descriptors.

wolfv (Contributor) commented Jul 25, 2024

Usually the compression should be applied to the inner zstd-encoded archives. Compressing the outer (zip) layer is probably not a good idea, and will probably not yield much benefit either.

Do you have any control over these packages?

jpcorreia99 (Contributor, Author) commented:

Unfortunately not, otherwise I would have prioritized fixing them rather than suggesting this. We just let consumers plug into whatever conda repositories they own.

I noticed mamba/conda are resilient to this case (my guess is that they do not attempt streaming decompression), but I understand if that's not an approach you'd want to take here, since this is quite an edge case.

wolfv (Contributor) commented Jul 25, 2024

Yes, mamba and conda both first download the archive, and then decompress from disk, indeed.

I would be fine with a workaround/fallback mode, ideally one that doesn't impact the performance, of course :) Maybe we could write the archive to a tempdir first?

I was also wondering: does the seek have to jump forward or backward? Because backward seeks we could maybe handle with some clever buffering? But it might be hacky as well.

jpcorreia99 (Contributor, Author) commented Jul 25, 2024

> I was also wondering: does the seek have to jump forward or backward? Because backward seeks we could maybe handle with some clever buffering? But it might be hacky as well.

From reading the zip file specification, the data descriptor seems to come after the data entry (ex), meaning the seek would need to go forward.

> I would be fine with a workaround/fallback mode, ideally one that doesn't impact the performance, of course :) Maybe we could write the archive to a tempdir first?

I was thinking of tee-streaming it: if we detect the error, we already have a fallback buffer holding the data read so far, append the remaining unread data to it, and then do non-streaming decompression (rough sketch below). Your tempdir approach also sounds interesting, but with it we'd lose the performance benefits of streaming the package.
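
A purely illustrative sketch of the tee idea: a reader wrapper that mirrors every byte it yields into a side buffer, so the bytes already consumed by a failed streaming attempt remain available for a seekable retry:

```rust
use std::io::{self, Read};

// Illustrative tee reader: every byte handed to the streaming extractor is
// also copied into `buffer`, so a data-descriptor failure can fall back to
// buffered, seekable extraction without restarting the download.
struct TeeReader<R: Read> {
    inner: R,
    buffer: Vec<u8>,
}

impl<R: Read> Read for TeeReader<R> {
    fn read(&mut self, buf: &mut [u8]) -> io::Result<usize> {
        let n = self.inner.read(buf)?;
        self.buffer.extend_from_slice(&buf[..n]); // keep a copy of what was consumed
        Ok(n)
    }
}
```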

wolfv (Contributor) commented Jul 25, 2024

But would we always drop the buffer only after extraction, or is there some condition (e.g. a header successfully read) after which we can drop it earlier?

Because otherwise it sounds like this will be quite memory-intensive, for example when downloading large packages such as pytorch.

jpcorreia99 (Contributor, Author) commented:

Yeah, that's a great point. It does seem to me like storing it in a tempfile may be less memory-intensive, at the cost of giving up streaming.

Would you prefer this to be opt-in behaviour, keeping the current package streaming as the default and only going for the tempfile approach when enabled?
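
For reference, a minimal sketch of the tempfile variant, assuming the tempfile crate (function name is illustrative):

```rust
use std::io::{self, Read, Seek, SeekFrom};

// Illustrative sketch: spool the HTTP body to an anonymous temporary file,
// then hand the seekable file to the zip reader. Memory use stays flat, but
// extraction only starts once the download has finished.
fn spool_to_tempfile(mut body: impl Read) -> io::Result<std::fs::File> {
    let mut file = tempfile::tempfile()?; // removed automatically on drop
    io::copy(&mut body, &mut file)?;      // full download, no streaming extraction
    file.seek(SeekFrom::Start(0))?;       // rewind so ZipArchive can read it
    Ok(file)
}
```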

nichmor (Contributor) commented Jul 25, 2024

Hey @jpcorreia99! Would you happen to have a small example that reproduces this issue?

wolfv (Contributor) commented Jul 25, 2024

It could be a:

  • runtime option
  • compile-time option

Or we could implement it as fallback behavior: rattler would restart the transfer, but this time writing to a file.

Either of these options would work for me, I think.
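
Purely illustrative, not an existing rattler API: a runtime option for this could look something like the following.

```rust
/// Hypothetical knob for the fallback behavior discussed above; the name
/// and variants are illustrative, not part of rattler today.
#[derive(Clone, Copy, Default)]
pub enum DataDescriptorFallback {
    /// Keep today's behavior: fail when the archive needs seeking.
    #[default]
    Disabled,
    /// Restart the transfer, writing to a file, and extract from disk.
    RestartToFile,
}
```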
