Possibly add resiliency towards conda zip archives that use data descriptors. #794
`.zip` archives (of which `.conda` is an alias), when compressed, are formed by a sequence of local headers + data entries (reference). These headers usually contain bytes indicating the `CRC32` and the length of the subsequent data entry. However, sometimes this information cannot be gathered at the moment the header is being written (mostly when compression happens on-the-fly). In that case, the local header contains a data descriptor bit, indicating that a data descriptor with this information will follow the data entry.

Rattler decompresses conda packages over an HTTP stream (https://github.com/mamba-org/rattler/blob/main/crates/rattler_package_streaming/src/read.rs#L58). However, streaming is not supported for zip archives that use data descriptors, because the data buffer has to be seekable and the HTTP stream is not; read more from the zip2 maintainers here.

When this is the case, we end up failing the download because we hit this bit of code in the zip2 crate: https://github.com/zip-rs/zip2/blob/a7c1230dfa42e69539fe40e4bde7344973a4faa7/src/types.rs#L693-L697

One possible solution would be to fall back to fully downloading the stream and doing normal, seekable zip decompression when the data-descriptor error is detected.

I'd like to hear your thoughts on whether this would be a valuable contribution to rattler, and to discuss the performance implications of doing so 😄. I'm working on a hacky solution to see if it fixes our issue and would be glad to submit it as a starting point for this feature, if deemed an acceptable approach!

@baszalmstra / @wolfv this comes as a much deeper dive into the issue I was facing when I opened #772. @nichmor I remember that you were looking into the zip library as well, so you may have the most familiarity with the situation above (sorry for the excessive tagging).
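For concreteness, here is a rough sketch of the fallback I have in mind. It is illustrative only: the `extract_with_fallback`/`reopen` names are made up, and the assumption that the data-descriptor case surfaces as `ZipError::UnsupportedArchive` would need to be verified against zip2.

```rust
use std::io::{Cursor, Read};

use zip::result::ZipError;

/// Try streaming extraction first; if we hit the data-descriptor limitation,
/// re-acquire the stream (e.g. restart the HTTP transfer) and fall back to
/// seekable, fully-buffered extraction.
fn extract_with_fallback<R, F>(mut stream: R, reopen: F) -> Result<(), ZipError>
where
    R: Read,
    F: FnOnce() -> std::io::Result<R>,
{
    loop {
        match zip::read::read_zipfile_from_stream(&mut stream) {
            Ok(Some(mut entry)) => {
                // In rattler this would unpack the entry to disk; here we just drain it.
                std::io::copy(&mut entry, &mut std::io::sink())?;
            }
            Ok(None) => return Ok(()), // reached the end of the archive
            // Assumption: the data-descriptor case surfaces as `UnsupportedArchive`.
            Err(ZipError::UnsupportedArchive(_)) => break,
            Err(e) => return Err(e),
        }
    }

    // Fallback: restart the transfer and buffer the whole archive so it is
    // seekable. (A tempfile would work just as well and use less memory.)
    let mut buf = Vec::new();
    reopen()?.read_to_end(&mut buf)?;
    let mut archive = zip::ZipArchive::new(Cursor::new(buf))?;
    for i in 0..archive.len() {
        let mut entry = archive.by_index(i)?;
        std::io::copy(&mut entry, &mut std::io::sink())?;
    }
    Ok(())
}
```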
Comments

Hi @jpcorreia99 - I am curious if this is also a concern when the outer `.conda` (zip) archive is not compressed?
@wolfv I believe that when it is not compressed it should be okay. In the specific case I'm seeing, these conda packages have actually been recompressed at a higher compression level, and that's when the issue arises: it seems the recompression is being done on-the-fly, generating archives with data descriptors.
Usually the compression should be applied to the inner `.tar.zst` archives, not the outer zip. Do you have any control over these packages?
Unfortunately not, otherwise I would have prioritized fixing them rather than suggesting this. We just let consumers plug in whatever conda repositories they own. I noticed mamba/conda are resilient to this case (my guess is that they do not attempt to stream-decompress), but I understand if that's not an approach you'd want to take here, since this is quite an edge case.
Yes, mamba and conda both first download the archive and then decompress from disk, indeed. I would be fine with a workaround/fallback mode, ideally one that doesn't impact performance, of course :) Maybe we could write the archive to a tempdir first? I was also wondering: does the seek have to jump forwards or backwards? Because backwards we could maybe handle with some clever buffering, though that might be hacky as well.
From reading the zip file specification, the data descriptor comes after the data entry (see the example in the spec), meaning the seek would need to go forwards.
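For context, whether an entry defers its metadata to a data descriptor is signalled by bit 3 of the local header's general purpose bit flag; a tiny hypothetical helper to illustrate (not zip2's actual API):

```rust
/// Bit 3 of a zip local header's "general purpose bit flag" means the entry's
/// CRC-32 and sizes are deferred to a data descriptor placed *after* the data,
/// which is what forces a forward seek (or buffering) on a non-seekable stream.
fn uses_data_descriptor(general_purpose_flag: u16) -> bool {
    general_purpose_flag & (1 << 3) != 0
}
```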
I was thinking of tee-streaming it, so that if we detect the error, we have a fallback buffer holding the data read so far, store the remaining unread data into it, and then do non-streaming decompression; something like the sketch below. Your tempdir approach also sounds interesting, but with it we'd lose the performance benefits of streaming the package.
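A hypothetical sketch of that tee idea (illustrative only, not rattler code):

```rust
use std::io::{self, Read};

/// A reader that keeps a copy of every byte it hands out, so that after a
/// mid-stream failure we can replay "bytes already consumed + rest of stream".
struct TeeReader<R: Read> {
    inner: R,
    buffer: Vec<u8>,
}

impl<R: Read> Read for TeeReader<R> {
    fn read(&mut self, out: &mut [u8]) -> io::Result<usize> {
        let n = self.inner.read(out)?;
        // Remember everything served to the streaming decompressor.
        self.buffer.extend_from_slice(&out[..n]);
        Ok(n)
    }
}

impl<R: Read> TeeReader<R> {
    /// On fallback, expose the whole archive again: first the buffered bytes,
    /// then whatever is still unread on the wire.
    fn into_full_stream(self) -> impl Read {
        io::Cursor::new(self.buffer).chain(self.inner)
    }
}
```

Note that the resulting reader is still not seekable, so the fallback would have to spool it to memory or disk before handing it to a seekable zip reader, which is the memory trade-off raised below.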
But would we drop the buffer only after extraction, or is there some condition (e.g. header successfully read) after which we can drop it? Because otherwise this sounds like it will be quite memory-intensive, for example when downloading large packages such as pytorch.
Yeah, that's a great point. It does seem to me like storing it in a tempfile may be less memory-intensive, at the cost of giving up streaming. Would you prefer this to be an optional, opt-in behaviour, keeping the current package streaming as the default and only going for the tempfile approach when enabled?
Hey @jpcorreia99! Would you happen to have a small example to reproduce this issue?
It could be an opt-in option, yes. Or we could implement it as a fallback behavior, where rattler re-starts the transfer but this time writes to a file. Either one of these options would work for me, I think.
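That restart-and-write-to-disk fallback could look roughly like this (a sketch assuming the `tempfile` crate; the function name is made up):

```rust
use std::io::{self, Read, Seek, SeekFrom};

/// Fallback sketch: spool the re-started transfer to an anonymous temp file,
/// then run ordinary seekable zip extraction from disk.
fn extract_via_tempfile<R: Read>(mut transfer: R) -> zip::result::ZipResult<()> {
    let mut spool = tempfile::tempfile()?; // removed automatically when dropped
    io::copy(&mut transfer, &mut spool)?;
    spool.seek(SeekFrom::Start(0))?;

    let mut archive = zip::ZipArchive::new(spool)?;
    for i in 0..archive.len() {
        let mut entry = archive.by_index(i)?;
        // In rattler this would unpack to the destination prefix; here we drain.
        io::copy(&mut entry, &mut io::sink())?;
    }
    Ok(())
}
```

Compared to buffering in memory, this keeps peak memory flat even for large packages like pytorch, at the cost of extra disk I/O.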