Possibly add resiliency towards conda zip archives that use data descriptors. #794

Closed
jpcorreia99 opened this issue Jul 25, 2024 · 11 comments · Fixed by #797

jpcorreia99 (Contributor) commented Jul 25, 2024

.conda files are an alias for .zip archives, which are formed by a sequence of local file headers followed by data entries (reference).

These headers usually contain bytes indicating the CRC-32 and the length of the subsequent data entry. However, sometimes this information is not available at the moment the header is written (mostly when compression happens on the fly). In that case, the local header sets a data descriptor bit, indicating that a data descriptor containing this information will follow the data entry.
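
For concreteness, here is a minimal sketch of how that flag shows up on the wire; this is illustrative, not rattler or zip2 code. The data descriptor flag is bit 3 of the general-purpose bit flags, which sit at byte offset 6 of the local file header:

```rust
// Illustrative sketch: detect whether a ZIP local file header defers the
// CRC-32 and sizes to a trailing data descriptor. `header` is assumed to
// start at the local file header signature (0x04034b50) and be at least
// 8 bytes long.
fn uses_data_descriptor(header: &[u8]) -> bool {
    // Bytes 0..4: signature, 4..6: version needed to extract,
    // 6..8: general-purpose bit flags (little-endian).
    let flags = u16::from_le_bytes([header[6], header[7]]);
    flags & 0x0008 != 0 // bit 3 set => a data descriptor follows the data
}
```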

Rattler decompresses conda packages over an HTTP stream (https://github.com/mamba-org/rattler/blob/main/crates/rattler_package_streaming/src/read.rs#L58). However, streaming is not supported for zip archives that use data descriptors (the reader would need to be seekable, which an HTTP stream is not); read more from the maintainers of zip2 here.

When this is the case, we end up failing the download, because we hit this bit of code in the zip2 crate: https://github.com/zip-rs/zip2/blob/a7c1230dfa42e69539fe40e4bde7344973a4faa7/src/types.rs#L693-L697.

One possible solution would be to fall back to fully downloading the stream and doing normal, seekable zip decompression when the data descriptor error is detected.
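
As a rough, hypothetical sketch of that fallback (names and structure are illustrative, not rattler's actual code): buffer the whole body, then use the zip crate's seekable ZipArchive, which can read entry metadata from the central directory instead of relying on the local headers:

```rust
use std::io::{Cursor, Read};

// Hypothetical fallback sketch (not rattler's actual API): when streaming
// extraction fails on a data descriptor, buffer the entire body and extract
// via the seekable ZipArchive path instead.
fn extract_seekable(mut body: impl Read) -> zip::result::ZipResult<()> {
    let mut bytes = Vec::new();
    body.read_to_end(&mut bytes)?; // give up on streaming, buffer everything
    let mut archive = zip::ZipArchive::new(Cursor::new(bytes))?; // Cursor is Seek
    for i in 0..archive.len() {
        let entry = archive.by_index(i)?;
        // ... unpack `entry` to disk as the streaming path would have ...
        let _ = entry.name();
    }
    Ok(())
}
```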

I'd like to hear your thoughts on whether this would be a valuable contribution to rattler, and to discuss the performance implications of doing so 😄. I'm working on a hacky solution to check that it fixes our issue, and would be glad to submit it as a starting point for this feature if it's deemed an acceptable approach!

jpcorreia99 (Contributor, Author) commented Jul 25, 2024

@baszalmstra / @wolfv this is a much deeper dive into the issue I was facing when I opened #772.

@nichmor I remember that you were looking into the zip library as well so you may have the most familiarity with the situation above

(sorry for the excessive tagging)

wolfv (Contributor) commented Jul 25, 2024

Hi @jpcorreia99 - I am curious whether this is also a concern when the outer zip file is not compressed (which should generally be the case for .conda files).

jpcorreia99 (Contributor, Author) commented:

@wolfv I believe that when it is not compressed it should be okay.

In the specific case I'm seeing, the conda packages have actually been recompressed at a higher compression level, and that's when the issue arises. It seems the recompression is done on the fly, which generates archives with data descriptors.

wolfv (Contributor) commented Jul 25, 2024

Usually the compression should be applied to the inner zstd-encoded archives. Compressing the outer (zip) layer is probably not a good idea, and will probably not yield much benefit either.

Do you have any control over these packages?

jpcorreia99 (Contributor, Author) commented:

Unfortunately not, otherwise I would have prioritized fixing them rather than suggesting this. We just let consumers plug into whatever conda repositories they own.

I noticed mamba/conda are resilient to this case (my guess is that they do not attempt streaming decompression), but I understand if that's not an approach you'd want to take here, since this is quite an edge case.

wolfv (Contributor) commented Jul 25, 2024

Yes, mamba and conda both first download the archive, and then decompress from disk, indeed.

I would be fine with a workaround/fallback mode, ideally one that doesn't impact the performance, of course :) Maybe we could write the archive to a tempdir first?

I was also wondering: does the seek have to jump forward or backward? Because backward seeks we could maybe handle with some clever buffering? But it might be hacky as well.

jpcorreia99 (Contributor, Author) commented Jul 25, 2024

> I was also wondering: does the seek have to jump forward or backward? Because backward seeks we could maybe handle with some clever buffering? But it might be hacky as well.

From reading the zip file specification, the data descriptor seems to come after the data entry (ex), meaning the seek would need to go forward.

> I would be fine with a workaround/fallback mode, ideally one that doesn't impact the performance, of course :) Maybe we could write the archive to a tempdir first?

I was thinking of tee-streaming it: if we detect the error, we already have a fallback buffer holding the data read so far, append the remaining unread data to it, and then do non-streaming decompression (rough sketch below). Your tempdir approach also sounds interesting, but with it we'd lose the performance benefits of streaming the package.
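
A purely illustrative sketch of the tee idea: a reader wrapper that mirrors every byte it yields into a side buffer, so the bytes already consumed by a failed streaming attempt remain available for a seekable retry:

```rust
use std::io::{self, Read};

// Illustrative tee reader: every byte handed to the streaming extractor is
// also copied into `buffer`, so a data-descriptor failure can fall back to
// buffered, seekable extraction without restarting the download.
struct TeeReader<R: Read> {
    inner: R,
    buffer: Vec<u8>,
}

impl<R: Read> Read for TeeReader<R> {
    fn read(&mut self, buf: &mut [u8]) -> io::Result<usize> {
        let n = self.inner.read(buf)?;
        self.buffer.extend_from_slice(&buf[..n]); // keep a copy of what was consumed
        Ok(n)
    }
}
```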

wolfv (Contributor) commented Jul 25, 2024

But would we always drop the buffer only after extraction, or is there some condition (e.g. a header successfully read) after which we can drop it earlier?

Because otherwise it sounds like this will be quite memory-intensive, for example when downloading large packages such as pytorch.

jpcorreia99 (Contributor, Author) commented:

Yeah, that's a great point. It does seem to me like storing it in a tempfile may be less memory-intensive, at the cost of giving up streaming.

Would you prefer this to be opt-in behaviour, keeping the current package streaming as the default and only going for the tempfile approach when enabled?
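
For reference, a minimal sketch of the tempfile variant, assuming the tempfile crate (function name is illustrative):

```rust
use std::io::{self, Read, Seek, SeekFrom};

// Illustrative sketch: spool the HTTP body to an anonymous temporary file,
// then hand the seekable file to the zip reader. Memory use stays flat, but
// extraction only starts once the download has finished.
fn spool_to_tempfile(mut body: impl Read) -> io::Result<std::fs::File> {
    let mut file = tempfile::tempfile()?; // removed automatically on drop
    io::copy(&mut body, &mut file)?;      // full download, no streaming extraction
    file.seek(SeekFrom::Start(0))?;       // rewind so ZipArchive can read it
    Ok(file)
}
```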

nichmor (Contributor) commented Jul 25, 2024

Hey @jpcorreia99! Would you happen to have a small example that reproduces this issue?

wolfv (Contributor) commented Jul 25, 2024

It could be a:

  • runtime option
  • compile-time option

Or we could implement it as fallback behavior: rattler would restart the transfer, but this time writing to a file.

Either of these options would work for me, I think.
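
Purely illustrative, not an existing rattler API: a runtime option for this could look something like the following.

```rust
/// Hypothetical knob for the fallback behavior discussed above; the name
/// and variants are illustrative, not part of rattler today.
#[derive(Clone, Copy, Default)]
pub enum DataDescriptorFallback {
    /// Keep today's behavior: fail when the archive needs seeking.
    #[default]
    Disabled,
    /// Restart the transfer, writing to a file, and extract from disk.
    RestartToFile,
}
```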
