
Add GZip codec #87

Merged: @jakirkham merged 23 commits into zarr-developers:master on Nov 6, 2018

Conversation

@funkey (Contributor) commented Oct 18, 2018

This adds a dedicated gzip codec, GZip, with ID gzip. It replaces the previous treatment of gzip as an alias for zlib, which did not take into account that the zlib and gzip formats use different headers. In particular, ZLib could not be used to decode data compressed with gzip by a third party (such data carries a gzip header).

For backwards compatibility with archives that were written under the gzip ID (and therefore actually compressed by ZLib), GZip supports both header variants when decoding.
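
For reference, the header difference is visible with nothing more than the standard-library zlib module: the wbits parameter selects the container format, and 32 + 15 auto-detects either variant on decompression. A minimal sketch (illustrative only, not the codec's implementation):

import zlib

data = b"some bytes to compress" * 100

# zlib container: 2-byte zlib header, Adler-32 trailer
zlib_blob = zlib.compress(data)

# gzip container: 10-byte gzip header, CRC32 + size trailer
c = zlib.compressobj(5, zlib.DEFLATED, 16 + 15)
gzip_blob = c.compress(data) + c.flush()

# wbits = 32 + 15 auto-detects the header, so either blob decodes
for blob in (zlib_blob, gzip_blob):
    d = zlib.decompressobj(32 + 15)
    assert d.decompress(blob) + d.flush() == data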

TODO:

  • Unit tests and/or doctests in docstrings
  • tox -e py36 passes locally
  • tox -e py27 passes locally
  • Docstrings and API docs for any new/modified user-facing classes and functions
  • Changes documented in docs/release.rst
  • tox -e docs passes locally
  • AppVeyor and Travis CI pass
  • Test coverage to 100% (Coveralls passes)

buf = buffer_tobytes(buf)

# do compression
compress = _zlib.compressobj(self.level, wbits=16 + 15)
Review comment (Member):

Looks like Python 2's zlib.compressobj doesn't support keyword arguments. Fortunately it does support wbits; it just comes as the third positional argument. Something like _zlib.compressobj(self.level, _zlib.DEFLATED, 16 + 15) should do the trick and work on both Python 2 and Python 3.
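
For illustration, a minimal Python 2/3 compatible helper using the positional form (the function name is hypothetical; the codec's real method works on self.level and a normalized buffer):

import zlib as _zlib

def gzip_compress(buf, level=5):
    # Python 2's zlib.compressobj rejects keyword arguments, so pass
    # level, method and wbits positionally; 16 + 15 selects a gzip
    # header/trailer with a 32 KiB window.
    compressor = _zlib.compressobj(level, _zlib.DEFLATED, 16 + 15)
    return compressor.compress(buf) + compressor.flush()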

@jakirkham (Member)

Thanks for working on this, Jan. Needs a few tweaks for CI, but otherwise seems fine to me. Let's see if @alimanfoo has anything to add.

Note: There appears to be a test_alias in numcodecs/tests/test_zlib.py, which needs to be updated (or removed?) so as to not treat gzip as an alias.

@funkey (Contributor, Author) commented Oct 18, 2018

Yes, help with the CI would be much appreciated. I couldn't get tox to run locally, so there are probably a few places that need some attention.

@alimanfoo (Member)

Thank you @funkey, this is much appreciated. The implementation looks good to me; I'm just wrapping up some other work but will try to find time to look into the CI issues.

Python 3 supports keyword arguments to `zlib.compressobj`; however, it
seems Python 2 does not. Fortunately Python 2 does support the arguments
that were supplied here, just as positional arguments. This converts
the call to one using positional arguments, which should smooth
over the Python 2/3 differences.
As this PR includes a fix for `gzip` handling (namely that it differs
from `zlib` by its header), this test is no longer accurate, so it is
updated to use the correct ID.
@jakirkham (Member)

Added PR ( funkey#1 ) against this PR, which should fix the test issues.

@jakirkham (Member)

Do we need a fixture for this as well? Is there anything special we need to do to generate it?

@alimanfoo (Member)

The first run of test_backwards_compatibility() should generate a fixture. That fixture (files inside the fixture directory) should be included in this PR. Subsequent runs of that test will then compare with the fixture files to ensure nothing has changed.
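
For anyone following along, the fixture pattern is roughly the following (a sketch only; the helper name and file layout are illustrative, not numcodecs' actual test code):

import os
import numpy as np

def check_fixture(codec, fixture_path, arr):
    # First run: no fixture exists yet, so write the encoded data to disk.
    if not os.path.exists(fixture_path):
        with open(fixture_path, 'wb') as f:
            f.write(codec.encode(arr))
    # Later runs: decode the stored fixture and check it still round-trips,
    # so any change in the codec's on-disk behaviour is caught.
    with open(fixture_path, 'rb') as f:
        stored = f.read()
    decoded = np.frombuffer(codec.decode(stored), dtype=arr.dtype)
    assert np.array_equal(decoded, arr)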

@alimanfoo added this to the 0.6.0 milestone on Oct 19, 2018
@jakirkham (Member)

Does one need specific versions of dependencies in their environment to do this correctly?

def test_alias():
    config = dict(id='gzip', level=1)
    codec = get_codec(config)
    assert GZip(1) == codec
Review comment (Member):

Guess we should drop this one as well.

xref: funkey#1 (review)

These made sense when `gzip` was treated as an alias of `zlib`. However,
as that was incorrect and this PR fixes the issue, there is no longer an
alias and it does not make sense to test for one. Hence these tests are
dropped.
Drop the extra blank line at the end of the file.
@jakirkham (Member)

Submitted PR ( funkey#2 ) against this PR, which drops the alias tests, commits the gzip fixture (used Python 3.6), and fixes the flake8 error that CI is reporting.

@alimanfoo (Member) left a review comment:

Looks good. Just needs API docs and release notes.

Drop unused imports to fix flake8 errors
@jakirkham (Member)

Looks like CI is now green. 😄

The API docs can probably just copy docs/zlib.rst and rename some things. The release notes are in docs/release.rst, which could use a quick sentence on this (unless you want to write more).

@alimanfoo (Member) commented Oct 20, 2018 via email

To handle compression and decompression with the GZip format in a Python
2/3 compatible way, use a `GzipFile` instance backed by an in-memory
`BytesIO` buffer for reading and writing. This offloads the manual
header and footer handling to the Python builtin `gzip` module,
simplifying the code a bit.
As the compression/decompression strategy has changed to handle headers
and footers, update the fixtures accordingly.
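
Roughly, the approach in that commit looks like the following (a simplified sketch with standalone functions; the codec's actual methods differ in detail, e.g. buffer normalization):

import gzip
import io

def gzip_encode(buf, level=5):
    # GzipFile writes the gzip header and the CRC32/size footer for us.
    out = io.BytesIO()
    with gzip.GzipFile(fileobj=out, mode='wb', compresslevel=level) as f:
        f.write(buf)
    return out.getvalue()

def gzip_decode(data):
    # On read, GzipFile parses the header and verifies the footer,
    # raising an error if the stored CRC32 does not match the data.
    with gzip.GzipFile(fileobj=io.BytesIO(data), mode='rb') as f:
        return f.read()
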
@jakirkham (Member)

Just to clarify, the Python GzipFile object always uses CRC32 from zlib. AFAICT there is no option to disable this. My reading of the gzip spec (and please feel free to correct me if I'm wrong) is that including the CRC32 is a required part of the footer. Not sure how tools handle cases where it is missing or the value is incorrect, but I'm guessing it varies from erroring out (ideally nicely) to gracefully handling it somehow (maybe with a warning). Python specifically raises an OSError.

There are some other things that the gzip spec requires, like the decompressed data size modulo 2^32 in the footer. Some optional fields are available as well (file name, file comment, etc.), which we probably don't care about using ourselves, but it would be good to handle them correctly (in the event we or others opt to use them for some reason).

So my take is that the choice is not whether to include CRC32 in the gzip encoding, but whether we let Python do the heavy lifting for us or do it manually (personally I would prefer the former :) ).
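
For reference, the required footer fields can be inspected with the standard library alone; a small sketch:

import gzip
import struct
import zlib

data = b"example payload" * 1000
blob = gzip.compress(data)

# RFC 1952 footer: CRC32 of the uncompressed data, then its size mod 2**32,
# both stored as little-endian 32-bit integers in the last 8 bytes.
crc, isize = struct.unpack('<II', blob[-8:])
assert crc == zlib.crc32(data) & 0xffffffff
assert isize == len(data) % 2**32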

@jakirkham (Member) commented Oct 30, 2018

> Checking the CRC might be a nice feature. Are there any performance costs? I.e., would users always/sometimes/never want to check the CRC?

So admittedly I'm not an expert on this, but after a bit of reading here's what I have come up with. CRCs are the go-to error-detecting code. While there are others (e.g. Fletcher and Adler checksums), their ability to catch errors pales in comparison to CRCs. The CRC algorithm is based on polynomial division, which makes it effective at catching errors, but also makes it slower than the alternatives.

Research has been done to find a few good polynomials, and pretty much all CRC implementations use one of them. There are various implementations (many using precomputed tables, as zlib appears to) and some chips (Intel with SSE4.2 support) have intrinsics that use a hardware implementation of the algorithm. Performance varies depending on the size of the data, whether it is a hardware or software implementation, whether a table is used, how large a chunk of data is processed at a time, which polynomial is used, and compiler optimizations.

Here's a benchmark of common CRC implementations. It's worth noting that it is mainly networking folks who are interested in high performance here, as they run CRCs all the time (and have a lower tolerance for latency than most). It's unclear to me whether it would be that much of a concern for our use cases. However, this could be a deciding factor between gzip and zlib encodings for our users (or perhaps motivation to pick up a zlib patched for better performance, like Intel's or zlib-ng, if using CRC is important).
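
If the CRC overhead ever becomes a question for us, it is easy to measure against decompression time on the data sizes we actually use; a rough sketch (results depend on the machine and zlib build):

import timeit
import zlib
import numpy as np

data = np.random.randint(0, 255, size=10_000_000, dtype=np.uint8).tobytes()
compressed = zlib.compress(data)

crc_time = timeit.timeit(lambda: zlib.crc32(data), number=20) / 20
dec_time = timeit.timeit(lambda: zlib.decompress(compressed), number=20) / 20

print("crc32 of 10 MB:      %.4f s" % crc_time)
print("decompress to 10 MB: %.4f s" % dec_time)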

@alimanfoo (Member) commented Oct 30, 2018 via email

@jakirkham (Member)

Thanks for the clarification. Also thanks for the feedback.

Given that the performance benefit of skipping the CRC on reads is unclear, and the maintenance burden is noticeable but maybe not significant (as there are still some fixes needed for the current implementation), I would lean towards waiting for user feedback that the CRC is causing a performance problem. In the interest of completeness, though, and also of compatibility (as this is being proposed to more closely align N5 and Zarr), it seems worthwhile to look at how N5 deals with this issue for comparison.

FWICT N5 directly uses the Apache Commons Compress library, which also computes the CRC during the read and raises if there is a mismatch. In OpenJDK (other JDKs may differ), this just binds to a vendored zlib (newer OpenJDKs look similar). Given that both Java and Python are using zlib, performance should be equivalent for both.

In short, it looks like N5 users are ok leaving CRC on by default. Zarr users still have an easy path out of using CRC should it cause issues by just using the existing Zlib codec instead. Sticking with Python's default GZip behavior keeps the maintenance effort pretty light for us. If we find in the future that users need some combination of GZip with disabled CRC on read for performance reasons, it's certainly easy to revisit and we have gained some useful insights already. Though it doesn't appear to be a common problem thus far.
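
To make the escape hatch concrete, switching between the two containers is a one-line change on the user's side; a usage sketch (assuming the GZip and Zlib codec classes exported by numcodecs after this PR):

import numpy as np
from numcodecs import GZip, Zlib

arr = np.arange(1000, dtype='<i4')

# GZip writes a gzip container and verifies the CRC32 footer on decode.
gz = GZip(level=5)
assert np.array_equal(np.frombuffer(gz.decode(gz.encode(arr)), dtype='<i4'), arr)

# Zlib writes a zlib container (Adler-32 check) and remains available if
# the gzip CRC handling ever proves to be a bottleneck.
zl = Zlib(level=5)
assert np.array_equal(np.frombuffer(zl.decode(zl.encode(arr)), dtype='<i4'), arr)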

@alimanfoo (Member) commented Oct 30, 2018 via email

Use gzip's GzipFile to manage compression/decompression
@funkey (Contributor, Author) commented Oct 31, 2018

@jakirkham: Thanks a lot for your investigation. Using CRC checks during reading as a default sounds good to me, too.

@jakirkham (Member)

Thanks all.

Seems like CIs are mostly happy with the exception of Python 3.4. Think we are hitting some edge case of that older Python version that the other Pythons either fixed or didn't have in the first place. Did a little bit of digging, but wasn't able to find the exact issue that was fixed in Python. Was debating remedying it; however, we have dropped Python 3.4 from Zarr just recently. So wondering if we should just do the same here. Proposing just dropping Python 3.4 in PR ( #89 ).

@alimanfoo (Member)

Just checking, was the fixture data regenerated after switching to the GzipFile implementation? It looks like it was, but I'm on a mobile and wanted to double-check.

@jakirkham (Member)

Yep, they were regenerated.

Had to explicitly delete the existing gzip fixtures (added earlier in this PR) and rerun the tests to regenerate them. Not sure if that is expected. In any event, seemed to work otherwise.

@jakirkham (Member)

Also just ran one chunk of encoded data through the gzip CLI. First, to verify that gzip could read it without issues; second, to manually verify that the NumPy data used to generate it matched the contents of the decompressed binary file. Both checks passed without issues.
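
The same verification can also be scripted with the gzip module instead of the CLI; a rough sketch where the chunk path, dtype and reference array are hypothetical:

import gzip
import numpy as np

# Hypothetical paths and dtype: substitute the actual fixture chunk and the
# array that was used to generate it.
with open('fixture/gzip/0.0', 'rb') as f:
    chunk = f.read()

decoded = np.frombuffer(gzip.decompress(chunk), dtype='<f8')
reference = np.load('reference.npy')
assert np.array_equal(decoded, reference)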

@alimanfoo (Member) commented Oct 31, 2018 via email

Fix conflicts with upstream `master`
@jakirkham (Member)

Any other thoughts on this or are we ready to merge?

@alimanfoo (Member) left a review comment:

Looks ready to merge I think, just one suggestion for the release notes.

docs/release.rst: review suggestion resolved (now outdated)
@alimanfoo (Member)

Thanks @jakirkham, merge at will.

@alimanfoo (Member)

Btw, coverage is slightly down because of changes that came in from #93. I should have checked that; it just needs a pragma: no cover in the except block. It can be included here or dealt with separately, I don't mind.

@jakirkham (Member)

Thanks for pointing that out. Missed that. Decided to break out the coverage fix into PR ( #96 ) just to keep things easy to follow. HTH

@jakirkham merged commit 3827541 into zarr-developers:master on Nov 6, 2018
@jakirkham (Member)

Thanks all. In it goes. 😄
