I am having a weird interaction with async_compression::tokio::bufread::GzipDecoder and tokio::io::AsyncBufReadExt. This example shows that when I use AsyncBufReadExt to count the lines in an uncompressed file I get 413484 lines, but if I use a BufReader wrapped around a GzipDecoder on a gzipped version of the same file I only get 65654 lines. I can work around this by removing an empty line somewhere before the divergence point, at which point both files report 413483 lines. This makes me think there is some edge case in the various buffers that causes the GzipDecoder read_lines to terminate early, and that any small change (such as removing that one empty line) manages to get things working again. I can't share the files but would be happy to diagnose further if anyone has suggestions.
EDIT: This error does not occur if I use the synchronous flate2 decompression by the way, so it is something specific to the tokio/async_compression interactions.
First thing I would try is to read_to_end and check that the lengths match. It seems unlikely that it's an interaction with the outer BufReader, more likely to be the gzip decoder getting an early EOF.
One possibility is that the compressed file consists of multiple concatenated sections (gzip members). Some decompressors will automatically read these sections and concatenate their output, but for async-compression you must use multiple_members to enable this behaviour. (I'm not sure if there's an easy way to check whether a file has multiple sections or not; the gzip CLI doesn't seem to have any way to see them.)
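As a rough way to check for multiple sections, one option is to scan the raw file for byte patterns that look like a gzip member header. The sketch below uses only the standard library and is a heuristic, not a parser: the function name `gzip_header_candidates` is made up for illustration, and the same bytes can occur by chance inside compressed data, so extra hits only suggest that the file might be multi-member.

```rust
use std::{env, fs};

/// Returns every offset where the bytes look like the start of a gzip member:
/// magic 0x1f 0x8b, compression method 8 (deflate), and a flags byte whose
/// reserved upper three bits are zero (per RFC 1952).
/// Heuristic only: compressed payload bytes can match this pattern by chance.
fn gzip_header_candidates(data: &[u8]) -> Vec<usize> {
    data.windows(4)
        .enumerate()
        .filter(|(_, w)| w[0] == 0x1f && w[1] == 0x8b && w[2] == 0x08 && (w[3] & 0xe0) == 0)
        .map(|(offset, _)| offset)
        .collect()
}

fn main() {
    // The path comes from the command line; with no argument, print usage only.
    let Some(path) = env::args().nth(1) else {
        eprintln!("usage: pass the path of a .gz file as the first argument");
        return;
    };
    match fs::read(&path) {
        Ok(bytes) => {
            let offsets = gzip_header_candidates(&bytes);
            println!("{} candidate header(s) at offsets {:?}", offsets.len(), offsets);
        }
        Err(err) => eprintln!("could not read {path}: {err}"),
    }
}
```

A single candidate at offset 0 is consistent with a single-member file. Several candidates are worth following up by decoding with multiple_members enabled and comparing the output length against the uncompressed original.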
You were exactly right! Using read_to_end on the gzipped file read only about a quarter of the expected bytes. Turning on multiple_members resolved both the read_to_end and read_line discrepancies. Is it worth considering having this enabled by default, if that seems to be the default for other decompressors? Or perhaps increasing the visibility of the option in the docs somewhere? If you would prefer the second, I can make a merge request. Thanks again for your incisive and prompt assistance!