zarr slower than npy, hdf5 etc? #519
Thanks for posting, interesting. A couple of factors may be at work here.
In the read test you use zarr in a zip file. IIUC reading from zip adds
some overhead because the CRCs are checked. I'd suggest comparing with zarr
stored as files in a directory (as you do for the write test).
Also when comparing zarr versus h5, if you wanted to be totally fair you'd
need to use the same chunk sizes and the same compressor. In your benchmark
you don't explicitly set the chunk size, so zarr and h5py will guess a
chunk size from the data size, and both use the same algorithm for guessing
chunk size but zarr is tuned to guess larger chunks. This may not make a
difference in your benchmark but thought I'd mention. Also you don't
explicitly set the compressor, and h5py and zarr have different defaults.
Again may not make a difference here, but worth noting.
I've updated the original post to include zipped zarr vs directory zarr; same results. I suppose it must be down to the default compressor then.
I am doing similar testing and was trying to understand whether zarr can read a file partially to get only some slices of the array. Compared with partial retrieval from HDF5 files, where I know this is possible, reading small pieces of the data (smaller than the zarr chunk size) takes much longer, and this seems to come from the fact that zarr reads an entire chunk in order to return part of it. Is there any way to get partial file reading in zarr?
@nschloe, are these using the same compressor and chunk size? It seems like they may not be, which could cause huge variability.

@jakirkham I don't know; I just used the default values. See the code for how to reproduce the plot.

Afaik, zarr uses Blosc compression by default; h5py does not compress by default. FWIW, when I benchmarked z5, which implements the zarr spec in C++, I found the performance ...
Can I adjust chunking and compression in zarr.save?

You will need to use zarr.save_array instead and call it like this:
zarr.save_array('out.zr', data, chunks=chunks, compressor=compressor)
Chunks needs to be a tuple with the chunk size you want. Compressor can either be None (= no compression) or one of the compressors from numcodecs.

Can I enable multithreading in Python?

As far as I am aware there is no convenience function to read/write multi-threaded in zarr-python, and you will need to implement this yourself. (I might be wrong about this though; zarr-python has grown quite a bit since I last looked at it in detail.)
To get multithreading in python, use zarr together with dask. That’s the recommended way to go.
I see. I suppose you're using dask arrays then, which is perhaps easy to do since they support memory segmentation with their blocked arrays. In my opinion, it's still worthwhile exploring the possibility of multithreaded reads into numpy arrays. Those are the de facto standard, every Python programmer knows them, and lots of code is optimized for them. It should be possible, too: you know in advance how large the block in memory will be and where each chunk will go.
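As an illustration of that idea (not zarr's actual internals), here is a sketch of multithreaded chunk reads into a preallocated numpy array, using a hypothetical one-.npy-file-per-chunk layout and a stdlib thread pool:

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor

import numpy as np

# Hypothetical chunked layout: one .npy file per chunk of 100 rows.
tmp = tempfile.mkdtemp()
data = np.random.rand(400, 100)
n_chunks, rows = 4, 100
for i in range(n_chunks):
    np.save(os.path.join(tmp, f"chunk{i}.npy"), data[i * rows:(i + 1) * rows])

# Preallocate the target array; since each chunk's destination is known
# in advance, every thread fills its own slice in place.
out = np.empty_like(data)

def read_chunk(i):
    out[i * rows:(i + 1) * rows] = np.load(os.path.join(tmp, f"chunk{i}.npy"))

with ThreadPoolExecutor() as pool:
    list(pool.map(read_chunk, range(n_chunks)))
```

Because np.load releases the GIL during file I/O, threads can overlap reads even in pure Python; for decompression-heavy formats the picture depends on whether the codec releases the GIL too.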
If you don’t mind the dask dependency, it would be simple to just use dask for the reading step. If you call dask.array.from_zarr and then coerce immediately to a numpy array, you will effectively get what you’re looking for.
Would add that I'm not aware of NumPy or pickle doing multithreaded reads or writes. Though I'm not sure we are still discussing benchmarking.

They don't, and yes. I want to see zarr benchmarks that show file sizes smaller than pickle/npy (they already are) and (almost) equally fast reads (not possible without multithreading, it seems).

Well, they are also writing a single file; I suppose one could have a zarr file with a single chunk. Anyway, I think before we jump to conclusions we need a good benchmark. The first pass gave us a starting point, albeit with the issues pointed out above. Can we do a second pass that integrates this feedback?
Well, how do I set the compressor? This

import numpy
from numcodecs import Blosc
import zarr

data = numpy.random.rand(10)
compressor = Blosc(cname="zlib", clevel=4)
zarr.save("out.zip", data, compressor=compressor)

gives the cryptic ...
What is data?
A numpy array. (Edited the above code.)
Is this still the recommendation? Maybe this issue is mature for closing? @joshmoore
@vedal: wow. Yes. This is mature for closing, indeed! Nevertheless, for multithreading, etc. zarr-python ❤️ dask.
Great, @joshmoore! Do you happen to know of a piece of modern example code or another source on how to combine them efficiently? I saw that xarray and dask seem to work well together, but I suppose by zarr-python you mean instead of xarray. I couldn't find much about dask in zarr-python's official docs. I noticed there is an open issue on zarr+dask as well, which made me unsure of the maturity of the duo: #962
Yes, exactly.
The dask.array.from_zarr and dask.array.to_zarr methods are likely what you are looking for.
This is a good point. At the moment, Zarr is "low-level" and so doesn't try to explain how to use the other tools with it. But links from the documentation would definitely be useful. Sorry about that! We'll look into it.
@joshmoore thanks a lot for these clarifications and for being concrete! I'm surprised that zarr+dask would be a lot better, since xarray also seems to use dask arrays under the hood.
I'll update the description of this issue. Here the problem is that someone tried to wrap a dask in a zarr, but you should put a zarr in your dask. 😄
I got interested in the performance of zarr and did a comparison with npy, pickle, hdf5 etc. See https://stackoverflow.com/a/58942584/353337. To my surprise, I found zarr reads large arrays slower than npy. This is for random float data as well as more structured mesh data. I had expected zarr to take the cake using multiple cores. Perhaps this isn't a good test for zarr to show its strength either.
Code to reproduce the plot: https://gist.github.com/nschloe/3d3b1adb9ce9e2d68d1c2d1a23ffa06d