Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

numcodecs.Gzip can't read files written with zlib in the gzip format #169

Closed
constantinpape opened this issue Feb 2, 2019 · 23 comments
Closed

Comments

@constantinpape
Copy link

constantinpape commented Feb 2, 2019

The numcodecs.Gzip codec can't read files that were produced by using the zlib c-api to obtain a gzip compressed file. It fails with

  File "/home/cpape/Work/software/conda/miniconda3/envs/main/lib/python3.7/gzip.py", line 411, in _read_gzip_header
    raise OSError('Not a gzipped file (%r)' % magic)
OSError: Not a gzipped file (b'x^')

This is due to the fact that numcodecs.Gzip uses the python gzip library to read and write files. This library uses zlib internally, however it adds additional bytes to the header and it expects these to be present when reading.
These bytes are not present when producing a gzip stream via the zlib c-api:

deflateInit2(&zs, compressionLevel,
                   Z_DEFLATED, MAX_WBITS + 16,
                   MAX_MEM_LEVEL, Z_DEFAULT_STRATEGY)

They are rather part of the gzip file format produced by the unix gzip command.

I would propose to not use python gzip, but rather use python zlib and use it for compression and decompression to gzip compatible format.

Note that this should be backward compatible, because zlib can read files written by unix gzip.
I have only tested this for the zlib c-api, not for python, yet.

@jakirkham
Copy link
Member

Have you read the discussion in PR ( #87 ) already?

@constantinpape
Copy link
Author

constantinpape commented Feb 2, 2019

Thanks for the pointer, I wasn't aware of the discussion, but I had a quick look now.
I disagree with the conclusions drawn though:
the gzip file format including the magic bits in the header and the CRC in the footer is less portable than using the bare gzip stream format produced by zlib.
AFAIK there is no c library that produces the gzip file format, so this places the burden to implement it on any developer wanting to write numcodecs compatible format from c / c++.
The complexity to implement this may not be huge, but still needs thorough testing etc. and I think it's better practice to rely purely on zlib for this.

@jakirkham
Copy link
Member

It sounds like there are some things that are getting confused.

The stream of compressed data is a deflate stream, not a gzip stream. The Zlib codec simply provides deflate if that’s what you want. OTOH the GZip codec creates files that are compatible with the gzip format (this seemed to have the shortest, readable copy of the spec), which require a header and a footer with some specific information that wraps the deflate stream.

To the best of my knowledge, the zlib library does not have a builtin function to handle gzip decompression. However it does have several useful functions for rolling your own.

Previously we had treated GZip as an alias to Zlib. However, as pointed out by @funkey, we were doing this incorrectly and he has since fixed this issue. The CPython implementation of gzip looks similar to the Java implementation that N5 is using. We have used the resulting changes with the gzip command line tools successfully (not possible with our previous implementation). Personally I’ve used the same strategy successfully with other libraries that handle gzip.

@constantinpape
Copy link
Author

constantinpape commented Feb 3, 2019

The stream of compressed data is a deflate stream, not a gzip stream.

If we stick to the nomenclature of zlib.h (which I think is the default implementation for deflate based compression),
there are three different stream formats based on deflate:

  • The zlib stream format
  • The gzip stream format
  • The raw deflate stream format

This is explained here.
So there is indeed a gzip stream format, which is used for .gz files.
In addition, .gzfiles contain a header and footer that is not part of the data obtained by the gzip stream. This is what I called gzip file format before.

To the best of my knowledge, the zlib library does not have a builtin function to handle gzip decompression. However it does have several useful functions for rolling your own.

This is not true. zlib can read gzip stream format AND gzip file format and it can even auto-detect which one is present by giving the correct parameters to zlib::initInflate2.
zlib can also produce streams in zlib and gzip stream format, however it does not provide functionality to produce the gzip file format (because this requires information about filenames and time, which is irrelevant for compression).

Previously we had treated GZip as an alias to Zlib. However, as pointed out by @funkey, we were doing this incorrectly and he has since fixed this issue.

Yes, zlib needs different parameters to produce a gzip stream, so you cannot simply alias the same class without changing this parameter.
However, there is no need to write out gzip file format. The java implementation in n5 works with just dumping gzip stream format.

@alimanfoo
Copy link
Member

alimanfoo commented Feb 6, 2019 via email

@constantinpape
Copy link
Author

What is n5 writing? gzip stream format?

Yes, I think it is writing gzip stream format. I have not explicitly checked this, but I assume this is the case because I can read it with zlib.h without enabling the gzip auto-detect mode, see
https://github.com/constantinpape/z5/pull/97/files#diff-8ccd797cb7fdc54a18fa755c0b912fffR105.
After enabling the autodetect mode (MAX_WBITS + 32) I can read both the output of n5 - gzip and
numcodecs.GZip (not poossible with MAX_WBITS + 16).

If so, then do we need to implement a new codec that reads and writes gzip stream format, and give that a new class and new codec ID separate from the existing GZip codec? Then numcodecs would have three different codecs, one implementing zlib stream format, one implement gzip stream format, and one implementing gzip file format?

That's certainly an option, or you could add a parameter to numcodecs.GZip for this?
Though I don't know if this runs against the design principles of numcodecs.

@alimanfoo
Copy link
Member

Anyone please correct me if I've got any of this wrong, but I think the goal is interoperability between all n5 implementations, including the zarr python package with the new N5Store adapter (call that zarr+N5Store for short), which currently uses numcodecs.GZip internally.

I.e., (1) data created by any current n5 implementation can be read by zarr+N5Store, and (2) data created by zarr+N5Store can be read by any current n5 implementation.

There is currently a mismatch between zarr+N5Store (which writes gzip file format) and other n5 implementations (which write gzip stream format).

The path of least resistance here would seem to be to get zarr+N5Store to write gzip stream format somehow.

A constraint here is that numcodecs.GZip codec is out in the wild, so we can't modify it's default behaviour. We could, however, add a parameter to make it write gzip stream format, as long as it is then able to read gzip stream format also.

@jakirkham have I got this anywhere near right?

@constantinpape
Copy link
Author

constantinpape commented Feb 7, 2019

@alimanfoo
I think you got this mostly right, some additions:

interoperability between all n5 implementations

Yes, this was my motivation behind opening this issue.
The current status with z5 is the following: the master can't read or write files compatible with numcodecs.GZip. I have been working on constantinpape/z5#97 to fix this.
Reading data from numcodecs.GZip is easy, because I just needed to adapt one function call in zlib.h,
that allows to autodetect gzip stream format vs. gzip file format. Writing files that are compatible with numcodecs.GZip would require more changes.

I am not 100 % sure about the status regarding n5-java: to the best of my knowledge, it is writing
gzip stream format (I am deriving this from the fact that z5:master can read and write files compatible with gzip compression of n5-java and not from actually checking the java implementation).

@funkey
Copy link
Contributor

funkey commented Feb 7, 2019

I am a bit confused about what exactly the problem is. @constantinpape, are you saying you can't read N5 datasets created by zarr with gzip compression with your z5 implementation?

The following seems to work fine:

import numpy as np
import z5py
import zarr

if __name__ == "__main__":

    # write with zarr, read with z5py

    zarr_file = zarr.open('test.n5', mode='w')
    zarr_ds = zarr_file.create_dataset(
        'test_zarr',
        data=np.random.random((10, 20)),
        compression='gzip'
    )
    zarr_data = zarr_ds[:]

    z5py_file = z5py.File('test.n5', mode='r')
    z5py_ds = z5py_file['test_zarr']
    z5py_data = z5py_ds[:]

    np.testing.assert_array_equal(z5py_data, zarr_data)

    # write with z5py, read with zarr

    z5py_file = z5py.File('test.n5', mode='r+')
    z5py_ds = z5py_file.create_dataset(
        'test_z5py',
        data=np.random.random((10, 20)),
        compression='gzip'
    )
    z5py_data = z5py_ds[:]

    zarr_file = zarr.open('test.n5', mode='r')
    zarr_ds = zarr_file['test_z5py']
    zarr_data = zarr_ds[:]

    np.testing.assert_array_equal(z5py_data, zarr_data)

@constantinpape
Copy link
Author

constantinpape commented Feb 7, 2019

@funkey
I am using a zarr container with gzip compression and I am writing with z5py and reading with zarr:

import numpy as np
import z5py
import zarr

# write z5py, read zarr - fails
with z5py.File('test2.zr') as f:
    f.create_dataset('test', data=np.random.random((10, 20)),
                     compression='gzip')

zr_file = zarr.open('test2.zr')
zr_ds = zr_file['test']
zrdata = zr_ds[:]

This fails with

Traceback (most recent call last):
  File "test_gzip.py", line 27, in <module>
    zrdata = zr_ds[:]
  File "/home/cpape/Work/software/conda/miniconda3/envs/main/lib/python3.7/site-packages/zarr/core.py", line 559, in __getitem__
    return self.get_basic_selection(selection, fields=fields)
  File "/home/cpape/Work/software/conda/miniconda3/envs/main/lib/python3.7/site-packages/zarr/core.py", line 685, in get_basic_selection
    fields=fields)
  File "/home/cpape/Work/software/conda/miniconda3/envs/main/lib/python3.7/site-packages/zarr/core.py", line 727, in _get_basic_selection_nd
    return self._get_selection(indexer=indexer, out=out, fields=fields)
  File "/home/cpape/Work/software/conda/miniconda3/envs/main/lib/python3.7/site-packages/zarr/core.py", line 1015, in _get_selection
    drop_axes=indexer.drop_axes, fields=fields)
  File "/home/cpape/Work/software/conda/miniconda3/envs/main/lib/python3.7/site-packages/zarr/core.py", line 1597, in _chunk_getitem
    self._compressor.decode(cdata, dest)
  File "/home/cpape/Work/software/conda/miniconda3/envs/main/lib/python3.7/site-packages/numcodecs/gzip.py", line 65, in decode
    decompressor.readinto(out_view)
  File "/home/cpape/Work/software/conda/miniconda3/envs/main/lib/python3.7/gzip.py", line 276, in read
    return self._buffer.read(size)
  File "/home/cpape/Work/software/conda/miniconda3/envs/main/lib/python3.7/_compression.py", line 68, in readinto
    data = self.read(len(byte_view))
  File "/home/cpape/Work/software/conda/miniconda3/envs/main/lib/python3.7/gzip.py", line 463, in read
    if not self._read_gzip_header():
  File "/home/cpape/Work/software/conda/miniconda3/envs/main/lib/python3.7/gzip.py", line 411, in _read_gzip_header
    raise OSError('Not a gzipped file (%r)' % magic)
OSError: Not a gzipped file (b'x^')

This is using zarr 2.2.0, numcodecs: 0.6.3 and z5py on branch gzip.

As I have explained, the issue is that I am not writing the gzip-file-format header.
I am assuming the same issue would occur if one writes a gzip n5 container with z5py and tries to read it with zarr (but I haven't checked this as n5 is not supported in the zarr release yet).

@funkey
Copy link
Contributor

funkey commented Feb 7, 2019

I think the problem is that your filename does not end in .n5. This is used by zarr to determine that the N5Store is to be used, which knows how to handle the N5 chunk header. I believe what you are seeing is plain Zarr trying to decompress a chunk with an N5 header. As far as I can tell, this is not related to the way that the actual gzip compression is done.

@alimanfoo
Copy link
Member

I think the problem is that your filename does not end in .n5. This is used by zarr to determine that the N5Store is to be used, which knows how to handle the N5 chunk header. I believe what you are seeing is plain Zarr trying to decompress a chunk with an N5 header. As far as I can tell, this is not related to the way that the actual gzip compression is done.

Hm, no, if I understand right then in @constantinpape's example above the problem is that z5py is writing out data in zarr format (not n5) using gzip compression, and the zarr python package should be able to read it but can't.

@constantinpape also says the same problem should occur with z5py writing n5 format and zarr+N5Store trying to read it, but mysteriously that does not seem to occur in @funkey's example.

Apologies if I am adding noise here.

@funkey
Copy link
Contributor

funkey commented Feb 7, 2019

You are right, z5py does write in zarr format based on the file extension now. So the extension is not the issue.

But this still leaves me puzzled why gzip is working fine when z5py creates an N5 dataset. Except for the N5 chunk header, there should be no difference in the chunk data.

@constantinpape
Copy link
Author

@alimanfoo Yes, this is exactly correct:

Hm, no, if I understand right then in @constantinpape's example above the problem is that z5py is writing out data in zarr format (not n5) using gzip compression, and the zarr python package should be able to read it but can't.

@funkey

But this still leaves me puzzled why gzip is working fine when z5py creates an N5 dataset. Except for the N5 chunk header, there should be no difference in the chunk data.

Yes, this is puzzling indeed. Maybe the chunk header is weirdly interacting with this?
I will try to check out zarr-developers/zarr-python#309 on the weekend and reproduce
your example.

@jakirkham
Copy link
Member

Sorry to be slow here. Was out of the country and otherwise engaged for a while.

Am happy we are to the point where we are comparing how close these 3 libraries are to generating the same file format. That seems like a win. 😄

To rewind a bit to @alimanfoo's question...

What is n5 writing?

Based on our previous investigation, N5 uses the Apache Commons Compress Library. In particular it uses these two classes, which perform both reading and writing on the gzip file format. AFAICT both N5 and Zarr are using the gzip file format and not the stream format; so, are doing the same thing.

The next step that we should probably do is generate N5 files with all 3 libraries (N5, Zarr, and z5) named clearly as such and place them somewhere (e.g. Dropbox, GitHub, etc.) where everyone can try them with the different implementations. Hopefully we can come out of this with a chart about which directions are compatible and not so we can have a more focused analysis.

Whether it is worth supporting additional formats implemented in zlib IDK. Though that's probably a separate discussion from this one, which seems mainly concerned with compatibility.

@constantinpape
Copy link
Author

Based on our previous investigation, N5 uses the Apache Commons Compress Library. In particular it uses these two classes, which perform both reading and writing on the gzip file format. AFAICT both N5 and Zarr are using the gzip file format and not the stream format; so, are doing the same thing.

Thanks for digging that up. To add some information, I know that n5-java and z5 gzip formats are compatible both ways (i.e. read / write). This is the reason why the issues I had when trying to read /write zarr gzip format confused me and I open this issue. My current suspicion is that the gzip file format written / expected by python / java is not exactly the same, but we should definitely investigate this further.

The next step that we should probably do is generate N5 files with all 3 libraries (N5, Zarr, and z5) named clearly as such and place them somewhere (e.g. Dropbox, GitHub, etc.) where everyone can try them with the different implementations. Hopefully we can come out of this with a chart about which directions are compatible and not so we can have a more focused analysis.

Yes, that's an excellent idea. Maybe open a repo for this in zarr-developers?

Whether it is worth supporting additional formats implemented in zlib IDK. Though that's probably a separate discussion from this one, which seems mainly concerned with compatibility.

I agree. If it turns out that both n5-java and zarr write gzip file format, the case for this is clear.
But first we should figure out if there are any differences first.

@constantinpape
Copy link
Author

@jakirkham I had hoped to discuss this on the call today, but we ran out of time before that.
I would vote for creating an extra repo for zarr data written by different implementations, which could be extended by inter-operability tests later.
If you open this in zarr-developers, we can raise an issue about which example data to use and I can make a PR adding the z5py data.

@constantinpape
Copy link
Author

I went ahead and created a repo to collect zarr / n5 data written by zarr and z5py:
https://github.com/constantinpape/zarr_implementations
Ofc, this can be extended for implementations in other languages.

If you want, we can transfer ownership to zarr-developers (I think I need to become a member to do this). I am also open to any changes you suggest.

Btw, I already profited from this because I found and fixed an issue with zarr edge chunks in z5py.
I will look further into gzip in the coming days.

@jakirkham
Copy link
Member

jakirkham commented Feb 23, 2019

Thanks for doing that @constantinpape.

Have written the following script to attempt loading all of these with Zarr. This could be easily modified to work with z5py instead.

Script:
import numpy
import imageio
import zarr


r = imageio.imread("data/reference_image.png")
l = ['data/n5-java.n5', 'data/z5py.n5', 'data/z5py.zr', 'data/zarr.n5', 'data/zarr.zr']
for e in l:
    print(e)
    g = zarr.open(e, mode='r')
    print(g)
    print(g.store)
    print(list(g.keys()))
    print("")
    for k in g.keys():
        print(f"{k}:")
        try:
            a = g[k]
            print(a)
            d = a[...]
        except Exception as e:
            te = type(e)
            print(f"Exception {te}: {e}")
            continue
        m = numpy.all(r == d)
        print(f"Matches reference: {m}")
    print("")
    print("")

Using the latest work from PR ( zarr-developers/zarr-python#309 ) to support loading N5 files, ran this on the data in zarr_implementations and got the following results.

Output:
data/n5-java.n5
<zarr.hierarchy.Group '/' read-only>
<zarr.n5.N5Store object at 0x104bd3128>
['bzip2', 'gzip', 'lz4', 'raw', 'xz']

bzip2:
<zarr.core.Array '/bzip2' (512, 512, 3) uint8 read-only>
Matches reference: True
gzip:
<zarr.core.Array '/gzip' (512, 512, 3) uint8 read-only>
Matches reference: True
lz4:
<zarr.core.Array '/lz4' (512, 512, 3) uint8 read-only>
Exception <class 'RuntimeError'>: LZ4 decompression error: -31
raw:
<zarr.core.Array '/raw' (512, 512, 3) uint8 read-only>
Matches reference: True
xz:
Exception <class 'RuntimeError'>: Unknown compressor with id xz


data/z5py.n5
<zarr.hierarchy.Group '/' read-only>
<zarr.n5.N5Store object at 0x10ef69518>
['gzip', 'raw']

gzip:
<zarr.core.Array '/gzip' (512, 512, 3) uint8 read-only>
Matches reference: True
raw:
<zarr.core.Array '/raw' (512, 512, 3) uint8 read-only>
Matches reference: True


data/z5py.zr
<zarr.hierarchy.Group '/' read-only>
<zarr.storage.DirectoryStore object at 0x1049bc7b8>
['blosc', 'gzip', 'raw', 'zlib']

blosc:
<zarr.core.Array '/blosc' (512, 512, 3) uint8 read-only>
Matches reference: True
gzip:
<zarr.core.Array '/gzip' (512, 512, 3) uint8 read-only>
Exception <class 'OSError'>: Not a gzipped file (b'x^')
raw:
<zarr.core.Array '/raw' (512, 512, 3) uint8 read-only>
Matches reference: True
zlib:
<zarr.core.Array '/zlib' (512, 512, 3) uint8 read-only>
Matches reference: True


data/zarr.n5
<zarr.hierarchy.Group '/' read-only>
<zarr.n5.N5Store object at 0x104999b38>
['gzip', 'raw']

gzip:
<zarr.core.Array '/gzip' (512, 512, 3) uint8 read-only>
Matches reference: True
raw:
<zarr.core.Array '/raw' (512, 512, 3) uint8 read-only>
Matches reference: True


data/zarr.zr
<zarr.hierarchy.Group '/' read-only>
<zarr.storage.DirectoryStore object at 0x1049bc2e8>
['blosc', 'gzip', 'raw', 'zlib']

blosc:
<zarr.core.Array '/blosc' (512, 512, 3) uint8 read-only>
Matches reference: True
gzip:
<zarr.core.Array '/gzip' (512, 512, 3) uint8 read-only>
Matches reference: True
raw:
<zarr.core.Array '/raw' (512, 512, 3) uint8 read-only>
Matches reference: True
zlib:
<zarr.core.Array '/zlib' (512, 512, 3) uint8 read-only>
Matches reference: True

The high level results are as follows. In nearly all cases Zarr was able to load the data without issues. If Zarr was able to load the data, it matched exactly to the reference image. In the event that Zarr was not able to load the data, it generated an exception. Here are the particular cases that Zarr failed to load.

  1. Unable to load N5 LZ4 compressed data. ( LZ4 in N5 vs Zarr #175 )
  2. Unrecognized XZ codec (This has since been fixed.)
  3. Cannot load GZIPed Zarr files from z5py (though GZIPed N5 files from z5py and N5 work).

@constantinpape
Copy link
Author

3\. Cannot load GZIPed Zarr files from z5py (though GZIPed N5 files from z5py and N5 work).

Yes, I think I will need to change the gzip compression in z5py.
But I am still a bit confused about what's going on there:
Read/write n5 to z5py works both ways, but reading gzip written with z5py from zarr does not work (other direction is ok). I want to understand why this is the case before changing my implementation, but didn't have time to look into this yet

@jakirkham
Copy link
Member

Just to make sure we are on the same page, if z5py writes an .n5 file that uses gzip compression, Zarr reads it without issue. Only when z5py writes a .zarr file with gzip compression, Zarr runs into issues.

It would be useful if others did the same thing I did above with other implementations and reported their results. Most notably would be interested in hearing how the N5 Java implementation and z5py behave.

@constantinpape
Copy link
Author

Just to make sure we are on the same page, if z5py writes an .n5 file that uses gzip compression, Zarr reads it without issue. Only when z5py writes a .zarr file with gzip compression, Zarr runs into issues.

Yes exactly.

@constantinpape
Copy link
Author

I think this has all cleared up, the issue is that I haven't implemented the full extent of the gzip data format.
I raised an issue in the z5 repo, where this is more appropriate and will try to fix this soon.
See constantinpape/z5#105.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants