Add GZip codec #87
Conversation
numcodecs/gzip.py
Outdated
```python
buf = buffer_tobytes(buf)

# do compression
compress = _zlib.compressobj(self.level, wbits=16 + 15)
```
Looks like Python 2's `zlib.compressobj` doesn't support keyword arguments. Fortunately it does support `wbits`; it just comes as the third positional argument. Something like `_zlib.compressobj(self.level, _zlib.DEFLATED, 16 + 15)` should do the trick and work on both Python 2 and Python 3.
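A minimal standalone sketch of that suggestion (the names here are illustrative, and `zlib` stands in for the module imported as `_zlib` in the codec):

```python
import zlib

level = 1

# Pass method and wbits positionally so the call works on both Python 2
# and Python 3 (Python 2's zlib.compressobj rejects keyword arguments).
# wbits = 16 + 15 selects a gzip container with a 32 KiB window.
compress = zlib.compressobj(level, zlib.DEFLATED, 16 + 15)
gzip_bytes = compress.compress(b"some bytes to compress") + compress.flush()
```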
Thanks for working on this, Jan. Needs a few tweaks for CI, but otherwise seems fine to me. Let's see if @alimanfoo has anything to add. Note: There appears to be a …
Yes, help with the CI would be very appreciated. I couldn't get …
Thank you @funkey, this is much appreciated. Implementation looks good to me, I'm just wrapping up some other work but will try and find time to look into CI issues.
Python 3 supports keyword arguments to `zlib.compressobj`. However, it seems Python 2 does not. Fortunately, Python 2 does support the arguments that were supplied here, just as positional arguments. This converts the call to use positional arguments, which should smooth over the Python 2/3 differences.
As this PR includes a fix for `gzip` handling (namely that it differs from `zlib` by a header), this test is no longer accurate, so we update it to use the correct ID.
Added PR ( funkey#1 ) against this PR, which should fix the test issues.
Do we need a fixture for this as well? Is there anything special we need to do to generate it?
The first run of `test_backwards_compatibility()` should generate a fixture. That fixture (the files inside the fixture directory) should be included in this PR. Subsequent runs of the test will then compare against the fixture files to ensure nothing has changed.
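For readers unfamiliar with the pattern, here is a minimal sketch of how such a fixture-based test can work; the helper name and file layout are illustrative, not numcodecs' actual test code:

```python
import os

import numpy as np

def check_fixture(codec, arr, fixture_path):
    # Illustrative helper, not numcodecs' actual test code.
    if not os.path.exists(fixture_path):
        # First run: record the encoded bytes as the fixture.
        with open(fixture_path, 'wb') as f:
            f.write(codec.encode(arr))
    else:
        # Later runs: the committed fixture must still decode to the
        # same data, so previously written archives remain readable.
        with open(fixture_path, 'rb') as f:
            enc = f.read()
        dec = np.frombuffer(codec.decode(enc), dtype=arr.dtype)
        assert np.array_equal(arr, dec)
```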
Does one need specific versions of dependencies in their environment to do this correctly?
Test fixes for PR 87
numcodecs/tests/test_gzip.py
Outdated
```python
def test_alias():
    config = dict(id='gzip', level=1)
    codec = get_codec(config)
    assert GZip(1) == codec
```
Guess we should drop this one as well.
xref: funkey#1 (review)
These made sense when `gzip` was treated as an alias of `zlib`. However, as that was incorrect and this PR fixes the issue, there is no alias and it does not make sense to test for one. Hence these tests are dropped.
Drop the extra blank line at the end of the file.
Submitted PR ( funkey#2 ) against this PR, which drops the alias tests, commits the …
Test fixes for PR 87 (pt. 2)
Looks good. Just needs API docs and release notes.
Drop unused imports to fix flake8 errors
Looks like CI is now green. 😄 The API docs can probably just copy `docs/zlib.rst` and rename some things. The release notes are in `docs/release.rst`, which could use a quick sentence on this (unless you want to write more).
Just wondering, do we want to be pushing users towards gzip rather than zlib? Is one more broadly compatible than the other?
To handle compression and decompression with the GZip format in a Python 2/3 compatible way, use a `GzipFile` instance backed by an in-memory `BytesIO` buffer for reading and writing. This offloads the manual header and footer handling to the Python builtin `gzip` module, simplifying the code a bit.
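A minimal sketch of the approach this commit describes (function names are illustrative, not the codec's actual API):

```python
import gzip
import io

def gzip_encode(data, level=1):
    # Compress through an in-memory buffer; GzipFile writes the gzip
    # header and the CRC32/size footer for us.
    buf = io.BytesIO()
    with gzip.GzipFile(fileobj=buf, mode='wb', compresslevel=level) as f:
        f.write(data)
    return buf.getvalue()

def gzip_decode(data):
    # GzipFile parses the header and verifies the CRC32 footer on read.
    with gzip.GzipFile(fileobj=io.BytesIO(data), mode='rb') as f:
        return f.read()

assert gzip_decode(gzip_encode(b"hello" * 100)) == b"hello" * 100
```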
As the compression/decompression strategy has changed to handle headers and footers, update the fixtures accordingly.
Just to clarify, the Python `GzipFile` object always uses CRC32 from Zlib. AFAICT there is no option to disable this. My reading of the Gzip spec (and please feel free to correct me if I'm wrong) is that including the CRC32 is a required part of the footer. Not sure how tools handle cases where this is not provided or the value is incorrect, but guessing it varies from erroring out (ideally nicely) to gracefully handling this somehow (maybe a warning). Python specifically raises an `OSError`.

There are some other things that the Gzip spec requires, like the decompressed size (modulo 2^32) in the footer. Some optional fields are available as well (file name, file comment, etc.), which we probably don't care about using ourselves, but it would be good to handle them correctly (in the event we or others opt to use them for some reason).

So my take on this is the choice is not whether to include CRC32 in the Gzip encoding, but whether we let Python do the heavy lifting for us or whether we do it manually (personally, I would prefer the former :) ).
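As a quick standalone demo of that behavior (not part of the PR): corrupting the CRC32 field in the footer makes Python's reader raise.

```python
import gzip

blob = bytearray(gzip.compress(b"some data worth protecting"))

# The gzip footer is 8 bytes: CRC32 (little-endian), then the
# decompressed size modulo 2**32. Flip bits in the first CRC byte.
blob[-8] ^= 0xFF

try:
    gzip.decompress(bytes(blob))
except OSError as e:  # BadGzipFile on newer Pythons is an OSError subclass
    print("decompression failed:", e)  # e.g. "CRC check failed ..."
```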
So admittedly I'm not an expert on this, but after a bit of reading, here's what I have come up with.

CRCs are the go-to error detection code. While there are others (e.g. Fletcher and Adler), their ability to catch errors pales in comparison to CRCs. The CRC algorithm is based on polynomial division, which makes it effective at catching errors, but also makes it slow compared to the alternatives. Research has been done to find a few good polynomials performance-wise, and pretty much all CRC implementations use one of them. There are various implementations (many using precomputed tables, as Zlib appears to), and some chips (Intel with SSE4.2 support) have intrinsics that utilize a hardware implementation of the algorithm. Performance varies depending on the size of the data, whether it is a hardware or software implementation, whether a table is used, how large a chunk of data is processed, which polynomial is used, and compiler optimizations. Here's a benchmark of common CRC implementations.

It's worth noting that it is mainly networking types who are interested in high performance here, as they are running CRC all the time (and they have a lower tolerance for latency than most). It's unclear to me whether or not it would be that much of a concern for our use cases. However, this could be a deciding factor between Gzip and Zlib encodings for our users (or perhaps motivation to pick up a Zlib patched for better performance, like Intel's or zlib-ng, if using CRC is important).
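For anyone who wants a rough local number, here is a tiny timing sketch of zlib's table-based CRC32; absolute figures will vary by platform and zlib build:

```python
import timeit
import zlib

data = bytes(bytearray(range(256))) * 4096  # 1 MiB of sample data

# Average time for one CRC32 pass over 1 MiB.
t = timeit.timeit(lambda: zlib.crc32(data), number=100) / 100
print("crc32 over %d bytes: %.3f ms" % (len(data), t * 1e3))
```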
Sorry for lack of clarity, I was meaning that when we decompress we have a choice about whether to check the CRC. IIUC the current implementation in this PR does *not* check the CRC. If we switched to using `gzip.decompress()` (which uses `GzipFile.read()`) then the CRC would *always* get checked.

Computing the CRC to check against the footer when decompressing could have some noticeable performance overhead, or might be negligible, I don't know. If there was some overhead, a user might want to turn it off, i.e., don't bother to compute and check the CRC. If not, and performance was equal between the different possible implementations under discussion, then we might prefer to use the `gzip` module because it includes a CRC check and that doesn't hurt.

I'm happy to defer to you on this, just trying to identify things to consider.
Thanks for the clarification. Also thanks for the feedback.

Given the performance benefit of skipping CRC on reads is unclear and the maintenance burden is noticeable but maybe not significant (as there are still some fixes needed for the current implementation), I would lean towards waiting for user feedback that CRC is causing a performance problem. In the interest of completeness, though, and also in the interest of compatibility (as this is being proposed to more closely align N5 and Zarr), it seems worthwhile to look at how N5 deals with this issue for comparison.

FWICT N5 is [directly](https://github.com/saalfeldlab/n5/blob/master/src/main/java/org/janelia/saalfeldlab/n5/GzipCompression.java#L64) using the [Apache Commons Compress library](https://github.com/saalfeldlab/n5/blob/2.0.2/pom.xml#L81-L83), which also [computes CRC during the read](https://github.com/apache/commons-compress/blob/rel/1.14/src/main/java/org/apache/commons/compress/compressors/gzip/GzipCompressorInputStream.java#L286) and [raises if there is a mismatch](https://github.com/apache/commons-compress/blob/rel/1.14/src/main/java/org/apache/commons/compress/compressors/gzip/GzipCompressorInputStream.java#L312-L315). In OpenJDK (other JDKs may differ), this is just [binding](https://github.com/openjdk-mirror/jdk7u-jdk/blob/f4d80957e89a19a29bb9f9807d2a28351ed7f7df/src/share/native/java/util/zip/CRC32.c#L51) to a [vendored Zlib](https://github.com/openjdk-mirror/jdk7u-jdk/tree/f4d80957e89a19a29bb9f9807d2a28351ed7f7df/src/share/native/java/util/zip/zlib-1.2.3) (newer OpenJDKs look similar). Given both Java and Python are using zlib, performance should be equivalent for both.

In short, it looks like N5 users are OK leaving CRC on by default. Zarr users still have an easy path out of using CRC, should it cause issues, by just using the existing Zlib codec instead. Sticking with Python's default GZip behavior keeps the maintenance effort pretty light for us. If we find in the future that users need some combination of GZip with disabled CRC on read for performance reasons, it's certainly easy to revisit, and we have gained some useful insights already. Though it doesn't appear to be a common problem thus far.
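To make that escape hatch concrete, a hedged sketch (assuming the `GZip` codec lands as proposed in this PR): both codecs wrap the same DEFLATE stream, so switching between them is a one-line change.

```python
import numpy as np

from numcodecs import GZip, Zlib

arr = np.arange(10000, dtype='i4')

# Same DEFLATE payload, different container: gzip carries a CRC32 footer,
# zlib a (cheaper) Adler-32 checksum.
for codec in (GZip(level=5), Zlib(level=5)):
    enc = codec.encode(arr)
    dec = np.frombuffer(codec.decode(enc), dtype=arr.dtype)
    assert np.array_equal(arr, dec)
```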
SGTM
Use gzip's GzipFile to manage compression/decompression
@jakirkham: Thanks a lot for your investigation. Using CRC checks during reading as a default sounds good to me, too.
Thanks all. Seems like CIs are mostly happy, with the exception of Python 3.4. Think we are hitting some edge case of that older Python version that the other Pythons either fixed or didn't have in the first place. Did a little digging, but wasn't able to find the exact issue that was fixed in Python. Was debating remedying it; however, we dropped Python 3.4 from Zarr just recently, so wondering if we should just do the same here. Proposing dropping Python 3.4 in PR ( #89 ).
Just checking, was the fixture data regenerated after switching to the `GzipFile` implementation? Looks like it was, but I'm on mobile and wanted to double check.
Yep, they were regenerated. Had to explicitly delete the existing gzip fixtures (added earlier in this PR) and rerun the tests to regenerate them. Not sure if that is expected. In any event, seemed to work otherwise.
Also just ran one chunk of encoded data through the …
Great, yes that is the process.
Fix conflicts with upstream `master`
Any other thoughts on this, or are we ready to merge?
Looks ready to merge I think, just one suggestion for the release notes.
Thanks @jakirkham, merge at will.
Btw, coverage is slightly down because of changes that came in from #93. I should have checked that; it just needs a `# pragma: no cover` in the except block. Can include it here or deal with it separately, I don't mind.
Thanks for pointing that out. Missed that. Decided to break out the coverage fix into PR ( #96 ) just to keep things easy to follow. HTH
Thanks all. In it goes. 😄
This adds a dedicated gzip codec, `GZip`, with ID `gzip`. This replaces the previous alias `zlib` for `gzip`, which did not take into account that the compression headers between zlib and gzip differ. In particular, `ZLib` could not be used to decode data compressed using gzip by a third party (which would contain a gzip header).

For backwards compatibility with archives that have been compressed with `gzip` (and therefore `ZLib`), `GZip` supports both header variants for decoding.
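As a sketch of how dual-header decoding can work (not necessarily the exact implementation in this PR), zlib itself can auto-detect either container when `wbits = 32 + 15`:

```python
import gzip
import zlib

def decode_either(data):
    # wbits = 32 + 15 tells zlib to auto-detect a zlib or gzip header
    # and use a 32 KiB window.
    return zlib.decompress(data, 32 + 15)

payload = b"x" * 1000
assert decode_either(zlib.compress(payload)) == payload  # zlib header
assert decode_either(gzip.compress(payload)) == payload  # gzip header
```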
TODO:

- [ ] `tox -e py36` passes locally
- [ ] `tox -e py27` passes locally
- [ ] `tox -e docs` passes locally