numcodecs.Gzip can't read files written with zlib in the gzip format #169
Comments
Have you read the discussion in PR ( #87 ) already? |
Thanks for the pointer, I wasn't aware of the discussion, but I had a quick look now. |
It sounds like there are some things that are getting confused. The stream of compressed data is a deflate stream, not a gzip stream. The Zlib codec simply provides deflate if that’s what you want. OTOH the GZip codec creates files that are compatible with the gzip format (this seemed to have the shortest, readable copy of the spec), which require a header and a footer with some specific information that wraps the deflate stream. To the best of my knowledge, the zlib library does not have a builtin function to handle gzip decompression. However it does have several useful functions for rolling your own. Previously we had treated GZip as an alias to Zlib. However, as pointed out by @funkey, we were doing this incorrectly and he has since fixed this issue. The CPython implementation of gzip looks similar to the Java implementation that N5 is using. We have used the resulting changes with the gzip command line tools successfully (not possible with our previous implementation). Personally I’ve used the same strategy successfully with other libraries that handle gzip. |
If we stick to the nomenclature of zlib.h (which I think is the default implementation for deflate based compression),
This is explained here.
This is not true. zlib can read gzip stream format AND gzip file format and it can even auto-detect which one is present by giving the correct parameters to
Yes, zlib needs different parameters to produce a gzip stream, so you cannot simply alias the same class without changing this parameter. |
What is n5 writing? gzip stream format? If so, then numcodecs needs a codec
that can read and write gzip stream format if we want compatibility?
If so, then do we need to implement a new codec that reads and writes gzip
stream format, and give that a new class and new codec ID separate from the
existing GZip codec? Then numcodecs would have three different codecs, one
implementing zlib stream format, one implementing gzip stream format, and one
implementing gzip file format?
Or are there other options we should consider?
On Sun, 3 Feb 2019 at 08:50, Constantin Pape wrote:
The stream of compressed data is a deflate stream, not a gzip stream.
If we stick to the nomenclature of zlib (which I think is the default
implementation for all this),
there are three different stream formats based on deflate:
- The zlib stream format
- The gzip stream format
- The raw deflate stream format
This is explained here
<https://github.com/madler/zlib/blob/master/zlib.h#L59-L69>.
So there is indeed a gzip stream format, which is used for .gz files.
In addition, .gz files contain a header and footer that are not part of
the data obtained by the gzip stream. This is what I called gzip file
format before.
To the best of my knowledge, the zlib library does not have a builtin
function to handle gzip decompression. However it does have several useful
functions for rolling your own.
This is not true. zlib can read gzip stream format AND gzip file format
and it can even auto-detect which one is present by giving the correct
parameters to zlib's inflateInit2.
zlib can also produce streams in zlib and gzip stream format, however it
does not provide functionality to produce the gzip file format (because
this requires information about filenames and write-time, which is
irrelevant for compression).
Previously we had treated GZip as an alias to Zlib. However, as pointed
out by @funkey <https://github.com/funkey>, we were doing this
incorrectly and he has since fixed this issue.
Yes, Zlib needs different parameters to produce a gzip stream, so you
cannot simply alias it.
However, there is no need to write out gzip file format.
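
For reference, here is a minimal sketch of the three deflate-based formats using Python's zlib module. The wbits argument selects the wrapper: 15 for zlib, 16 + 15 for gzip, -15 for raw deflate, and 32 + 15 to auto-detect zlib or gzip when decompressing. The payload and the helper name deflate are made up for illustration. This is not a claim about what numcodecs, z5py, or N5 do internally.

```python
import gzip
import zlib

# Illustrative payload; any bytes would do.
data = b"hello zarr and n5 " * 100


def deflate(payload: bytes, wbits: int) -> bytes:
    """Compress payload with the wrapper selected by wbits (hypothetical helper)."""
    compressor = zlib.compressobj(level=6, method=zlib.DEFLATED, wbits=wbits)
    return compressor.compress(payload) + compressor.flush()


zlib_stream = deflate(data, 15)    # zlib header and Adler-32 trailer
gzip_stream = deflate(data, 31)    # 16 + 15: gzip header and CRC-32 trailer
raw_stream = deflate(data, -15)    # raw deflate, no header or trailer

# 32 + 15 lets zlib auto-detect a zlib or gzip wrapper when decompressing.
assert zlib.decompress(zlib_stream, wbits=47) == data
assert zlib.decompress(gzip_stream, wbits=47) == data
assert zlib.decompress(raw_stream, wbits=-15) == data

# Python's gzip module accepts the gzip-wrapped output produced by zlib.
assert gzip.decompress(gzip_stream) == data

# Asking for a gzip wrapper only (wbits=31) rejects a zlib-wrapped stream.
try:
    zlib.decompress(zlib_stream, wbits=31)
    gzip_only_rejects_zlib = False
except zlib.error:
    gzip_only_rejects_zlib = True
```

The point of the sketch is only that the wrapper is a parameter of the same deflate machinery, which is why a plain alias between two codecs without changing wbits cannot work.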
--
Alistair Miles
|
Yes, I think it is writing gzip stream format. I have not explicitly checked this, but I assume this is the case because I can read it with
That's certainly an option, or you could add a parameter to |
Anyone please correct me if I've got any of this wrong, but I think the goal is interoperability between all n5 implementations, including the zarr python package with the new N5Store adapter (call that zarr+N5Store for short), which currently uses numcodecs.GZip internally. I.e., (1) data created by any current n5 implementation can be read by zarr+N5Store, and (2) data created by zarr+N5Store can be read by any current n5 implementation. There is currently a mismatch between zarr+N5Store (which writes gzip file format) and other n5 implementations (which write gzip stream format). The path of least resistance here would seem to be to get zarr+N5Store to write gzip stream format somehow. A constraint here is that @jakirkham have I got this anywhere near right? |
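
To make the "write gzip via zlib" option concrete, here is a rough sketch of what such a codec could look like. The class name GzipStreamCodec and its codec_id are hypothetical and not part of numcodecs. It only mimics the encode/decode shape of a codec and assumes the input supports conversion with bytes().

```python
import gzip
import zlib


class GzipStreamCodec:
    """Hypothetical codec sketch; not an existing numcodecs class."""

    codec_id = "gzip_stream_sketch"  # made-up ID, for illustration only

    def __init__(self, level: int = 1):
        self.level = level

    def encode(self, buf) -> bytes:
        # 16 + MAX_WBITS asks zlib to emit a gzip wrapper around the deflate data.
        compressor = zlib.compressobj(level=self.level, wbits=16 + zlib.MAX_WBITS)
        return compressor.compress(bytes(buf)) + compressor.flush()

    def decode(self, buf) -> bytes:
        # 32 + MAX_WBITS auto-detects a zlib or gzip wrapper on input.
        return zlib.decompress(bytes(buf), wbits=32 + zlib.MAX_WBITS)


codec = GzipStreamCodec(level=5)
payload = b"chunk bytes " * 256
encoded = codec.encode(payload)

# Round trip through the sketch codec.
assert codec.decode(encoded) == payload
# Data written by Python's gzip module is also accepted on decode.
assert codec.decode(gzip.compress(payload)) == payload
# And Python's gzip module can read what the sketch codec wrote.
assert gzip.decompress(encoded) == payload
```

Whether this should be a new codec or a parameter on an existing one is the open design question above; the sketch does not settle that.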
@alimanfoo
Yes, this was my motivation behind opening this issue. I am not 100 % sure about the status regarding |
I am a bit confused about what exactly the problem is. @constantinpape, are you saying you can't read N5 datasets created by The following seems to work fine:
|
@funkey
This fails with
This is using As I have explained, the issue is that I am not writing the gzip-file-format header. |
I think the problem is that your filename does not end in |
Hm, no, if I understand right then in @constantinpape's example above the problem is that z5py is writing out data in zarr format (not n5) using gzip compression, and the zarr python package should be able to read it but can't. @constantinpape also says the same problem should occur with z5py writing n5 format and zarr+N5Store trying to read it, but mysteriously that does not seem to occur in @funkey's example. Apologies if I am adding noise here. |
You are right, But this still leaves me puzzled why |
@alimanfoo Yes, this is exactly correct:
Yes, this is puzzling indeed. Maybe the chunk header is weirdly interacting with this? |
Sorry to be slow here. Was out of the country and otherwise engaged for a while. Am happy we are to the point where we are comparing how close these 3 libraries are to generating the same file format. That seems like a win. 😄 To rewind a bit to @alimanfoo's question...
Based on our previous investigation, N5 uses the Apache Commons Compress Library. In particular it uses these two classes, which perform both reading and writing on the gzip file format. AFAICT both N5 and Zarr are using the gzip file format and not the stream format; so, are doing the same thing. The next step that we should probably do is generate N5 files with all 3 libraries (N5, Zarr, and z5) named clearly as such and place them somewhere (e.g. Dropbox, GitHub, etc.) where everyone can try them with the different implementations. Hopefully we can come out of this with a chart about which directions are compatible and not so we can have a more focused analysis. Whether it is worth supporting additional formats implemented in zlib IDK. Though that's probably a separate discussion from this one, which seems mainly concerned with compatibility. |
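
When comparing chunks written by the different implementations, a quick first check is which wrapper the compressed bytes start with. The following is a rough sketch only. It assumes the compressed payload has already been separated from any chunk header that precedes it, and the function name sniff_container is hypothetical.

```python
import gzip
import zlib


def sniff_container(buf: bytes) -> str:
    """Guess the deflate wrapper from the first two bytes (hypothetical helper)."""
    if len(buf) >= 2 and buf[0] == 0x1F and buf[1] == 0x8B:
        # gzip magic number
        return "gzip"
    if (
        len(buf) >= 2
        and (buf[0] & 0x0F) == 8                  # compression method: deflate
        and ((buf[0] << 8) | buf[1]) % 31 == 0    # zlib header check value
    ):
        return "zlib"
    # No recognizable wrapper; this may be raw deflate or something else.
    return "unknown"


sample = b"some chunk data " * 64
raw = zlib.compressobj(wbits=-15)
raw_bytes = raw.compress(sample) + raw.flush()

print(sniff_container(gzip.compress(sample)))
print(sniff_container(zlib.compress(sample)))
print(sniff_container(raw_bytes))
```

This only identifies the wrapper. It says nothing about which optional gzip header fields are present, which is the other thing worth comparing between implementations.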
Thanks for digging that up. To add some information, I know that n5-java and z5 gzip formats are compatible both ways (i.e. read / write). This is the reason why the issues I had when trying to read / write zarr gzip format confused me and why I opened this issue. My current suspicion is that the gzip file format written / expected by python / java is not exactly the same, but we should definitely investigate this further.
Yes, that's an excellent idea. Maybe open a repo for this in zarr-developers?
I agree. If it turns out that both n5-java and zarr write gzip file format, the case for this is clear. |
@jakirkham I had hoped to discuss this on the call today, but we ran out of time before that. |
I went ahead and created a repo to collect zarr / n5 data written by zarr and z5py: If you want, we can transfer ownership to zarr-developers (I think I need to become a member to do this). I am also open to any changes you suggest. Btw, I already profited from this because I found and fixed an issue with zarr edge chunks in z5py. |
Thanks for doing that @constantinpape. Have written the following script to attempt loading all of these with Zarr. This could be easily modified to work with z5py instead.

Script:

import numpy
import imageio
import zarr

r = imageio.imread("data/reference_image.png")
l = ['data/n5-java.n5', 'data/z5py.n5', 'data/z5py.zr', 'data/zarr.n5', 'data/zarr.zr']
for e in l:
    print(e)
    g = zarr.open(e, mode='r')
    print(g)
    print(g.store)
    print(list(g.keys()))
    print("")
    for k in g.keys():
        print(f"{k}:")
        try:
            a = g[k]
            print(a)
            d = a[...]
        except Exception as e:
            te = type(e)
            print(f"Exception {te}: {e}")
            continue
        m = numpy.all(r == d)
        print(f"Matches reference: {m}")
        print("")
    print("")

Using the latest work from PR ( zarr-developers/zarr-python#309 ) to support loading N5 files, ran this on the data in zarr_implementations and got the following results. Output:
The high level results are as follows. In nearly all cases Zarr was able to load the data without issues. If Zarr was able to load the data, it matched exactly to the reference image. In the event that Zarr was not able to load the data, it generated an exception. Here are the particular cases that Zarr failed to load.
|
Yes, I think I will need to change the gzip compression in z5py. |
Just to make sure we are on the same page, if z5py writes an It would be useful if others did the same thing I did above with other implementations and reported their results. Most notably would be interested in hearing how the N5 Java implementation and z5py behave. |
Yes exactly. |
I think this has all cleared up, the issue is that I haven't implemented the full extent of the gzip data format. |
The numcodecs.Gzip codec can't read files that were produced by using the zlib C API to obtain a gzip compressed file. It fails with an error.

This is due to the fact that numcodecs.Gzip uses the Python gzip library to read and write files. This library uses zlib internally, however it adds additional bytes to the header and it expects these to be present when reading.

These bytes are not present when producing a gzip stream via the zlib C API. They are rather part of the gzip file format produced by the Unix gzip command.

I would propose to not use Python gzip, but rather use Python zlib and use it for compression and decompression to a gzip compatible format.

Note that this should be backward compatible, because zlib can read files written by Unix gzip.

I have only tested this for the zlib C API, not for Python, yet.
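
The backward-compatibility claim can be checked from Python with a small sketch like the one below. It writes gzip data with Python's gzip module, once with the optional file name field in the header and once without, and reads both back through zlib. The file name chunk.bin and the payload are made up for illustration.

```python
import gzip
import io
import zlib

data = bytes(range(256)) * 64

# gzip data without an embedded file name.
plain = gzip.compress(data)

# gzip data with the optional FNAME header field set.
buf = io.BytesIO()
with gzip.GzipFile(filename="chunk.bin", mode="wb", fileobj=buf, mtime=0) as f:
    f.write(data)
with_name = buf.getvalue()

# Byte 3 of a gzip header is the flags byte; bit 0x08 marks an embedded file name.
has_name_flag = bool(with_name[3] & 0x08)

# zlib with a gzip wrapper (16 + MAX_WBITS) parses both headers.
assert zlib.decompress(plain, wbits=16 + zlib.MAX_WBITS) == data
assert zlib.decompress(with_name, wbits=16 + zlib.MAX_WBITS) == data
```

This covers only the reading direction (zlib reading what gzip wrote) for single-member data.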