GzipDecoder read_lines terminated early, but fixes on deleting an empty line?! #153

Open
mfajer opened this issue Jul 14, 2022 · 2 comments

Comments

@mfajer

mfajer commented Jul 14, 2022

I am seeing a strange interaction between async_compression::tokio::bufread::GzipDecoder and tokio::io::AsyncBufReadExt. When I use AsyncBufReadExt to count the lines in an uncompressed file I get 413484 lines, but when I wrap a BufReader around a GzipDecoder reading a gzipped version of the same file I get only 65654 lines. I can work around the problem by removing an empty line somewhere before the divergence point, after which both files report 413483 lines. This makes me think some edge case in the various buffers causes the GzipDecoder line reading to terminate early, and that any small change (such as removing that one empty line) gets things working again. I can't share the files, but I would be happy to diagnose further if anyone has suggestions.

EDIT: This error does not occur if I use the synchronous flate2 decompression by the way, so it is something specific to the tokio/async_compression interactions.

@Nemo157
Member

Nemo157 commented Jul 14, 2022

First thing I would try is to read_to_end and check that the lengths match. It seems unlikely that it's an interaction with the outer BufReader, more likely to be the gzip decoder getting an early EOF.

One possibility is that the compressed file consists of multiple concatenated sections (gzip members). Some decompressors will automatically read these sections and concatenate their output, but for async-compression you must call multiple_members on the decoder to enable this behaviour. (I'm not sure if there's an easy way to check whether a file has multiple sections; the gzip CLI doesn't seem to have any way to see them.)
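On the question of checking for multiple sections, one rough approach is to scan the raw file for byte sequences that look like a gzip member header: the magic bytes 0x1f 0x8b, the deflate method byte 0x08, and a flags byte whose three reserved high bits are zero. The sketch below is only a heuristic, not a parser: compressed data can contain the same bytes by chance, so extra hits are candidates rather than proof. The function name `candidate_member_offsets` is made up for this illustration and is not part of any library.

```rust
/// Returns every offset at which a gzip member header *might* begin.
///
/// Heuristic only: it matches the magic bytes (0x1f, 0x8b), the deflate
/// compression method (0x08), and a FLG byte with its reserved bits clear.
/// The same pattern can occur by chance inside compressed data, so treat
/// offsets other than 0 as candidates that still need confirmation.
fn candidate_member_offsets(data: &[u8]) -> Vec<usize> {
    data.windows(4)
        .enumerate()
        .filter(|(_, w)| {
            w[0] == 0x1f && w[1] == 0x8b && w[2] == 0x08 && (w[3] & 0xe0) == 0
        })
        .map(|(offset, _)| offset)
        .collect()
}
```

Feeding it the bytes from `std::fs::read` on the compressed file and getting back more than one offset would be a hint (not a guarantee) that the file is a concatenation of several members.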

@mfajer
Author

mfajer commented Jul 15, 2022

You were exactly right! Using read_to_end on the gzipped file read only about a quarter of the expected bytes. Turning on multiple_members resolved both the read_to_end and the read_line discrepancies. Is it worth considering having this enabled by default, since it seems to be the default for other decompressors? Or perhaps making the option more visible in the docs somewhere? If you would prefer the second, I can make a merge request. Thanks again for your incisive and prompt assistance!
