Add GZip codec #87
Conversation
numcodecs/gzip.py
Outdated
```python
buf = buffer_tobytes(buf)

# do compression
compress = _zlib.compressobj(self.level, wbits=16 + 15)
```
Looks like Python 2's `zlib.compressobj` doesn't support keyword arguments. Fortunately it does support `wbits`; it just comes as the third positional argument. Something like `_zlib.compressobj(self.level, _zlib.DEFLATED, 16 + 15)` should do the trick and work on both Python 2 and Python 3.
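A minimal standalone sketch of that suggestion (the names here are illustrative, and `zlib` stands in for the module imported as `_zlib` in the codec):

```python
import zlib

level = 1

# Pass method and wbits positionally so the call works on both Python 2
# and Python 3 (Python 2's zlib.compressobj rejects keyword arguments).
# wbits = 16 + 15 selects a gzip container with a 32 KiB window.
compress = zlib.compressobj(level, zlib.DEFLATED, 16 + 15)
gzip_bytes = compress.compress(b"some bytes to compress") + compress.flush()
```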
Thanks for working on this, Jan. Needs a few tweaks for CI, but otherwise seems fine to me. Let's see if @alimanfoo has anything to add. Note: There appears to be a …
Yes, help with the CI would be very appreciated. I couldn't get …
Thank you @funkey, this is much appreciated. Implementation looks good to me, I'm just wrapping up some other work but will try and find time to look into CI issues.
Python 3 supports keyword arguments to `zlib.compressobj`. However, it seems Python 2 does not. Fortunately, Python 2 does support the arguments that were supplied here, just as positional arguments. This converts the call to use positional arguments, which should smooth over the Python 2/3 differences.
As this PR includes a fix for `gzip` handling (namely that it differs from `zlib` by a header), this test is no longer accurate, so we update it to use the correct ID.
Added PR ( funkey#1 ) against this PR, which should fix the test issues.
Do we need a fixture for this as well? Is there anything special we need to do to generate it?
The first run of `test_backwards_compatibility()` should generate a fixture. That fixture (the files inside the fixture directory) should be included in this PR. Subsequent runs of the test will then compare against the fixture files to ensure nothing has changed.
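For readers unfamiliar with the pattern, here is a minimal sketch of how such a fixture-based test can work; the helper name and file layout are illustrative, not numcodecs' actual test code:

```python
import os

import numpy as np

def check_fixture(codec, arr, fixture_path):
    # Illustrative helper, not numcodecs' actual test code.
    if not os.path.exists(fixture_path):
        # First run: record the encoded bytes as the fixture.
        with open(fixture_path, 'wb') as f:
            f.write(codec.encode(arr))
    else:
        # Later runs: the committed fixture must still decode to the
        # same data, so previously written archives remain readable.
        with open(fixture_path, 'rb') as f:
            enc = f.read()
        dec = np.frombuffer(codec.decode(enc), dtype=arr.dtype)
        assert np.array_equal(arr, dec)
```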
Does one need specific versions of dependencies in their environment to do this correctly?
Test fixes for PR 87
numcodecs/tests/test_gzip.py
Outdated
```python
def test_alias():
    config = dict(id='gzip', level=1)
    codec = get_codec(config)
    assert GZip(1) == codec
```
Guess we should drop this one as well.
xref: funkey#1 (review)
These made sense when `gzip` was treated as an alias of `zlib`. However, as that was incorrect and this PR fixes the issue, there is no alias and it does not make sense to test for one. Hence these tests are dropped.
Drop the extra blank line at the end of the file.
Submitted PR ( funkey#2 ) against this PR, which drops the alias tests, commits the …
Test fixes for PR 87 (pt. 2)
Looks good. Just needs API docs and release notes.
Drop unused imports to fix flake8 errors
Looks like CI is now green. 😄 The API docs can probably just copy `docs/zlib.rst` and rename some things. The release notes are in `docs/release.rst`, which could use a quick sentence on this (unless you want to write more).
Just wondering, do we want to be pushing users towards gzip rather than zlib? Is one more broadly compatible than the other?
To handle compression and decompression with the GZip format in a Python 2/3 compatible way, use a `GzipFile` instance backed by an in-memory `BytesIO` buffer for reading and writing. This offloads the manual header and footer handling to the Python builtin `gzip` module, simplifying the code a bit.
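A minimal sketch of the approach this commit describes (function names are illustrative, not the codec's actual API):

```python
import gzip
import io

def gzip_encode(data, level=1):
    # Compress through an in-memory buffer; GzipFile writes the gzip
    # header and the CRC32/size footer for us.
    buf = io.BytesIO()
    with gzip.GzipFile(fileobj=buf, mode='wb', compresslevel=level) as f:
        f.write(data)
    return buf.getvalue()

def gzip_decode(data):
    # GzipFile parses the header and verifies the CRC32 footer on read.
    with gzip.GzipFile(fileobj=io.BytesIO(data), mode='rb') as f:
        return f.read()

assert gzip_decode(gzip_encode(b"hello" * 100)) == b"hello" * 100
```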
As the compression/decompression strategy has changed to handle headers and footers, update the fixtures accordingly.
Just to clarify, the Python `GzipFile` object always uses CRC32 from Zlib. AFAICT there is no option to disable this. My reading of the Gzip spec (and please feel free to correct me if I'm wrong) is that including the CRC32 is a required part of the footer. Not sure how tools handle cases where this is not provided or the value is incorrect, but guessing it varies from erroring out (ideally nicely) to gracefully handling this somehow (maybe a warning). Python specifically raises an `OSError`.

There are some other things that the Gzip spec requires, like the decompressed size (modulo 2^32) in the footer. Some optional fields are available as well (file name, file comment, etc.), which we probably don't care about using ourselves, but it would be good to handle them correctly (in the event we or others opt to use them for some reason).

So my take on this is the choice is not whether to include CRC32 in the Gzip encoding, but whether we let Python do the heavy lifting for us or whether we do it manually (personally, I would prefer the former :) ).
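As a quick standalone demo of that behavior (not part of the PR): corrupting the CRC32 field in the footer makes Python's reader raise.

```python
import gzip

blob = bytearray(gzip.compress(b"some data worth protecting"))

# The gzip footer is 8 bytes: CRC32 (little-endian), then the
# decompressed size modulo 2**32. Flip bits in the first CRC byte.
blob[-8] ^= 0xFF

try:
    gzip.decompress(bytes(blob))
except OSError as e:  # BadGzipFile on newer Pythons is an OSError subclass
    print("decompression failed:", e)  # e.g. "CRC check failed ..."
```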
So admittedly I'm not an expert on this, but after a bit of reading, here's what I have come up with.

CRCs are the go-to error detection code. While there are others (e.g. Fletcher and Adler), their ability to catch errors pales in comparison to CRCs. The CRC algorithm is based on polynomial division, which makes it effective at catching errors, but also makes it slow compared to the alternatives. Research has been done to find a few good polynomials performance-wise, and pretty much all CRC implementations use one of them. There are various implementations (many using precomputed tables, as Zlib appears to), and some chips (Intel with SSE4.2 support) have intrinsics that utilize a hardware implementation of the algorithm. Performance varies depending on the size of the data, whether it is a hardware or software implementation, whether a table is used, how large a chunk of data is processed, which polynomial is used, and compiler optimizations. Here's a benchmark of common CRC implementations.

It's worth noting that it is mainly networking types who are interested in high performance here, as they are running CRC all the time (and they have a lower tolerance for latency than most). It's unclear to me whether or not it would be that much of a concern for our use cases. However, this could be a deciding factor between Gzip and Zlib encodings for our users (or perhaps motivation to pick up a Zlib patched for better performance, like Intel's or zlib-ng, if using CRC is important).
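For anyone who wants a rough local number, here is a tiny timing sketch of zlib's table-based CRC32; absolute figures will vary by platform and zlib build:

```python
import timeit
import zlib

data = bytes(bytearray(range(256))) * 4096  # 1 MiB of sample data

# Average time for one CRC32 pass over 1 MiB.
t = timeit.timeit(lambda: zlib.crc32(data), number=100) / 100
print("crc32 over %d bytes: %.3f ms" % (len(data), t * 1e3))
```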
Sorry for lack of clarity, I was meaning that when we decompress we have a choice about whether to check the CRC. IIUC the current implementation in this PR does *not* check the CRC. If we switched to using `gzip.decompress()` (which uses `GzipFile.read()`) then the CRC would *always* get checked.

Computing the CRC to check against the footer when decompressing could have some noticeable performance overhead, or might be negligible, I don't know. If there was some overhead, a user might want to turn it off, i.e., don't bother to compute and check the CRC. If not, and performance was equal between the different possible implementations under discussion, then we might prefer to use the `gzip` module because it includes a CRC check and that doesn't hurt.

I'm happy to defer to you on this, just trying to identify things to consider.
Thanks for the clarification. Also thanks for the feedback.

Given the performance benefit of skipping CRC on reads is unclear and the maintenance burden is noticeable but maybe not significant (as there are still some fixes needed for the current implementation), I would lean towards waiting for user feedback that CRC is causing a performance problem. In the interest of completeness, though, and also in the interest of compatibility (as this is being proposed to more closely align N5 and Zarr), it seems worthwhile to look at how N5 deals with this issue for comparison.

FWICT N5 is [directly](https://github.com/saalfeldlab/n5/blob/master/src/main/java/org/janelia/saalfeldlab/n5/GzipCompression.java#L64) using the [Apache Commons Compress library](https://github.com/saalfeldlab/n5/blob/2.0.2/pom.xml#L81-L83), which also [computes CRC during the read](https://github.com/apache/commons-compress/blob/rel/1.14/src/main/java/org/apache/commons/compress/compressors/gzip/GzipCompressorInputStream.java#L286) and [raises if there is a mismatch](https://github.com/apache/commons-compress/blob/rel/1.14/src/main/java/org/apache/commons/compress/compressors/gzip/GzipCompressorInputStream.java#L312-L315). In OpenJDK (other JDKs may differ), this is just [binding](https://github.com/openjdk-mirror/jdk7u-jdk/blob/f4d80957e89a19a29bb9f9807d2a28351ed7f7df/src/share/native/java/util/zip/CRC32.c#L51) to a [vendored Zlib](https://github.com/openjdk-mirror/jdk7u-jdk/tree/f4d80957e89a19a29bb9f9807d2a28351ed7f7df/src/share/native/java/util/zip/zlib-1.2.3) (newer OpenJDKs look similar). Given both Java and Python are using zlib, performance should be equivalent for both.

In short, it looks like N5 users are OK leaving CRC on by default. Zarr users still have an easy path out of using CRC, should it cause issues, by just using the existing Zlib codec instead. Sticking with Python's default GZip behavior keeps the maintenance effort pretty light for us. If we find in the future that users need some combination of GZip with disabled CRC on read for performance reasons, it's certainly easy to revisit, and we have gained some useful insights already. Though it doesn't appear to be a common problem thus far.
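To make that escape hatch concrete, a hedged sketch (assuming the `GZip` codec lands as proposed in this PR): both codecs wrap the same DEFLATE stream, so switching between them is a one-line change.

```python
import numpy as np

from numcodecs import GZip, Zlib

arr = np.arange(10000, dtype='i4')

# Same DEFLATE payload, different container: gzip carries a CRC32 footer,
# zlib a (cheaper) Adler-32 checksum.
for codec in (GZip(level=5), Zlib(level=5)):
    enc = codec.encode(arr)
    dec = np.frombuffer(codec.decode(enc), dtype=arr.dtype)
    assert np.array_equal(arr, dec)
```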
SGTM
Use gzip's GzipFile to manage compression/decompression
@jakirkham: Thanks a lot for your investigation. Using CRC checks during reading as a default sounds good to me, too.
Thanks all. Seems like CIs are mostly happy, with the exception of Python 3.4. Think we are hitting some edge case of that older Python version that the other Pythons either fixed or didn't have in the first place. Did a little digging, but wasn't able to find the exact issue that was fixed in Python. Was debating remedying it; however, we dropped Python 3.4 from Zarr just recently, so wondering if we should just do the same here. Proposing dropping Python 3.4 in PR ( #89 ).
Just checking, was the fixture data regenerated after switching to the `GzipFile` implementation? Looks like it was, but I'm on mobile and wanted to double check.
Yep, they were regenerated. Had to explicitly delete the existing gzip fixtures (added earlier in this PR) and rerun the tests to regenerate them. Not sure if that is expected. In any event, seemed to work otherwise.
Also just ran one chunk of encoded data through the …
Great, yes that is the process.
Fix conflicts with upstream `master`
Any other thoughts on this, or are we ready to merge?
Looks ready to merge I think, just one suggestion for the release notes.
Thanks @jakirkham, merge at will.
Btw, coverage is slightly down because of changes that came in from #93. I should have checked that; it just needs a `# pragma: no cover` in the except block. Can include it here or deal with it separately, I don't mind.
Thanks for pointing that out. Missed that. Decided to break out the coverage fix into PR ( #96 ) just to keep things easy to follow. HTH
Thanks all. In it goes. 😄
This adds a dedicated gzip codec, `GZip`, with ID `gzip`. This replaces the previous alias `zlib` for `gzip`, which did not take into account that the compression headers between zlib and gzip differ. In particular, `ZLib` could not be used to decode data compressed using gzip by a third party (which would contain a gzip header).

For backwards compatibility with archives that have been compressed with `gzip` (and therefore `ZLib`), `GZip` supports both header variants for decoding.
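As a sketch of how dual-header decoding can work (not necessarily the exact implementation in this PR), zlib itself can auto-detect either container when `wbits = 32 + 15`:

```python
import gzip
import zlib

def decode_either(data):
    # wbits = 32 + 15 tells zlib to auto-detect a zlib or gzip header
    # and use a 32 KiB window.
    return zlib.decompress(data, 32 + 15)

payload = b"x" * 1000
assert decode_either(zlib.compress(payload)) == payload  # zlib header
assert decode_either(gzip.compress(payload)) == payload  # gzip header
```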
TODO:

- [ ] `tox -e py36` passes locally
- [ ] `tox -e py27` passes locally
- [ ] `tox -e docs` passes locally