
zarr slower than npy, hdf5 etc? #519

Closed
nschloe opened this issue Nov 19, 2019 · 24 comments

@nschloe

nschloe commented Nov 19, 2019

I got interested in the performance of zarr and did a comparison with npy, pickle, hdf5 etc.; see https://stackoverflow.com/a/58942584/353337. To my surprise, I found that zarr reads large arrays more slowly than npy. This holds for random float data as well as for more structured mesh data. I had expected zarr to take the cake by using multiple cores. Then again, perhaps this isn't a good test for showing zarr's strengths.

[plot: read performance comparison across formats]

Code to reproduce the plot: https://gist.github.com/nschloe/3d3b1adb9ce9e2d68d1c2d1a23ffa06d

@alimanfoo
Member

alimanfoo commented Nov 19, 2019 via email

@nschloe
Author

nschloe commented Nov 19, 2019

I've updated the original post to include zipped zarr vs directory zarr; same results.

I suppose it must be the default compressor then.

@vigji

vigji commented Nov 22, 2019

I am doing similar testing and was trying to understand whether zarr can read a file partially, to get only some slices of the array. When I compare partial-retrieval performance with hdf5 files, where I know this is possible, reading small chunks of the data (smaller than the zarr chunk size) takes much longer, and this seems to come from the fact that zarr reads an entire chunk in order to return part of it. Is there any way to get partial file reading in zarr?
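A minimal sketch of what I mean (the file name and chunk size here are arbitrary choices):

import numpy as np
import zarr

# One million doubles, stored in chunks of 100k elements.
z = zarr.open("partial.zarr", mode="w", shape=(1_000_000,),
              chunks=(100_000,), dtype="f8")
z[:] = np.random.rand(1_000_000)

# The slice below returns just 10 values, but zarr still reads and
# decompresses the entire 100k-element chunk that contains them.
small = z[500:510]
print(small.shape)  # (10,)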

@jakirkham
Member

@nschloe, are these using the same compressor and chunk size? It seems like they may not be, which could cause huge variability.

@nschloe
Author

nschloe commented Nov 22, 2019

@jakirkham I don't know; I just used the default values. See the code for how to reproduce the plot.

@constantinpape

I don't know; I just used the default values.

Afaik, zarr uses blosc compression by default, while h5py does not compress by default.
Also, h5py does not chunk the data unless you specify chunks=True (or enable compression).
Numpy and pickle neither compress nor chunk; I don't know about pytables.
So the comparison is not very fair.
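You can inspect these defaults directly (a quick sketch; the exact printed values depend on the installed versions):

import h5py
import numpy as np
import zarr

data = np.random.rand(1000)

# zarr chunks and compresses by default (Blosc, as of zarr 2.x):
z = zarr.array(data)
print(z.compressor)  # e.g. Blosc(cname='lz4', clevel=5, ...)
print(z.chunks)

# h5py stores contiguous, uncompressed data by default:
with h5py.File("check.h5", "w") as f:
    ds = f.create_dataset("data", data=data)
    print(ds.compression)  # None
    print(ds.chunks)       # None, i.e. contiguous layout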

FWIW, when I benchmarked z5, which implements the zarr spec in C++, I found the performance on par with hdf5 single-threaded and better multi-threaded. Unfortunately I don't have the results at hand right now; the code is here.

@nschloe
Author

nschloe commented Nov 23, 2019

Can I adjust chunking and compression in zarr.save? Can I enable multithreading in Python?

@constantinpape

Can I adjust chunking and compression in zarr.save?

You will need to use zarr.save_array instead and call it like this:

zarr.save_array('out.zr', data, chunks=chunks, compressor=compressor)

chunks needs to be a tuple with the chunk shape you want. compressor can either be None (= no compression) or one of the compressors from numcodecs.
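Spelled out as a self-contained snippet (the chunk shape and compressor choice are arbitrary examples):

import numpy as np
import zarr
from numcodecs import Blosc

data = np.random.rand(1_000_000)
zarr.save_array(
    "out.zr",
    data,
    chunks=(100_000,),                         # one entry per axis
    compressor=Blosc(cname="zstd", clevel=3),  # or compressor=None
)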

Can I enable multithreading in Python?

As far as I am aware, there is no convenience function for multi-threaded reads/writes in zarr-python, so you will need to implement this yourself. (I might be wrong about this, though; zarr-python has grown quite a bit since I last looked at it in detail...)

@rabernat
Contributor

rabernat commented Nov 23, 2019 via email

@nschloe
Author

nschloe commented Nov 24, 2019

use zarr together with dask

I see. I suppose you're using dask arrays then, which is perhaps easy to do since they support memory segmentation with their blocked arrays.

In my opinion, it's still worthwhile to explore multithreaded reads into plain numpy arrays. Those are the de facto standard, every Python programmer knows them, and lots of code is optimized for them. It should be possible, too: you know in advance how large the block in memory will be and where each chunk will go.
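Something along these lines, say (a rough sketch for a 1-D array; blosc decompression in numcodecs should release the GIL, so threads can overlap):

from concurrent.futures import ThreadPoolExecutor

import numpy as np
import zarr

z = zarr.open("out.zarr", mode="r")
out = np.empty(z.shape, dtype=z.dtype)  # final array, allocated up front

def read_block(start, stop):
    # Ranges are aligned to chunk boundaries, so each thread
    # decompresses disjoint chunks straight into the target array.
    out[start:stop] = z[start:stop]

step = z.chunks[0]
ranges = [(i, min(i + step, z.shape[0])) for i in range(0, z.shape[0], step)]
with ThreadPoolExecutor() as pool:
    list(pool.map(lambda r: read_block(*r), ranges))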

@rabernat
Contributor

rabernat commented Nov 24, 2019 via email

@jakirkham
Member

I would add that I'm not aware of NumPy or pickle doing multithreaded reads or writes. Though I'm not sure whether we are still discussing benchmarking.

@nschloe
Author

nschloe commented Nov 24, 2019

I would add that I'm not aware of NumPy or pickle doing multithreaded reads or writes. Though I'm not sure whether we are still discussing benchmarking.

They don't, and yes. I want to see zarr benchmarks that show file sizes smaller than pickle/npy (they already are) and (almost) equally fast reads (which doesn't seem possible without multithreading).

@jakirkham
Member

Well, they are also writing a single file. I suppose one could have a Zarr file with a single chunk.

Anyway, I think we need a good benchmark before we jump to conclusions. The first pass gave us a starting point, albeit with the issues pointed out above. Can we do a second pass that integrates this feedback?

@nschloe
Author

nschloe commented Dec 5, 2019

Okay, compression does make a difference after all, at least for file sizes. In particular with structured data, zarr is way ahead of the competition. The compression also explains the slower read times.

[plots: file sizes vs. in-memory size for random int, random float, structured int, and structured float data]


import pickle
from pathlib import Path

import h5py
import matplotlib.pyplot as plt
import matplotx
import meshzoo
import numpy as np
import tables
import zarr


def save(data):
    # plain .npy
    np.save("out.npy", data)

    # HDF5 via h5py (contiguous and uncompressed by default)
    f = h5py.File("out.h5", "w")
    f.create_dataset("data", data=data)
    f.close()

    # pickle
    with open("out.pkl", "wb") as f:
        pickle.dump(data, f)

    # HDF5 via pytables
    f = tables.open_file("pytables.h5", mode="w")
    gcolumns = f.create_group(f.root, "columns", "data")
    f.create_array(gcolumns, "data", data, "data")
    f.close()

    # zarr, once as a zip store and once as a directory store
    zarr.save("out.zip", data)
    zarr.save("out.zarr", data)


def check_file_sizes(get_data):
    file_sizes = []
    mem_sizes = []
    for n in [2**k for k in range(24)]:
        print(n)
        data = get_data(n)
        mem_sizes.append(data.nbytes)
        save(data)
        file_sizes.append(
            [
                Path("out.npy").stat().st_size,
                Path("out.h5").stat().st_size,
                Path("out.pkl").stat().st_size,
                Path("pytables.h5").stat().st_size,
                Path("out.zip").stat().st_size,
                sum(
                    f.stat().st_size
                    for f in Path("out.zarr").glob("**/*")
                    if f.is_file()
                ),
            ]
        )

    labels = ["npy", "h5", "pickle", "pytables", "zarr zip", "zarr dir"]

    file_sizes = np.array(file_sizes).T

    # sort traces by file size at the largest n, descending
    idx = np.argsort(file_sizes[:, -1])[::-1]
    file_sizes = [file_sizes[i] for i in idx]
    labels = [labels[i] for i in idx]

    with plt.style.context(matplotx.styles.dufte):
        for fs, label in zip(file_sizes, labels):
            plt.loglog(mem_sizes, fs, label=label)

        plt.xlabel("size in memory [B]")
        plt.title(f"file sizes ({get_data.name})")
        matplotx.ylabel_top("filesize [B]")
        matplotx.line_labels()

        r = "-".join(get_data.name.split(" "))
        plt.savefig(f"filesizes-{r}.png", transparent=True, bbox_inches="tight")
        # plt.show()
        plt.close()


def get_data_random_int(n):
    get_data_random_int.name = "random int"
    return np.random.randint(0, 1000, size=n)


def get_data_random_float(n):
    get_data_random_float.name = "random float"
    return np.random.rand(n)


def get_data_structured_int(n):
    get_data_structured_int.name = "structured int"
    points, cells = meshzoo.disk(6, int(np.sqrt(n)))
    return cells


def get_data_structured_float(n):
    get_data_structured_float.name = "structured float"
    points, cells = meshzoo.disk(6, int(np.sqrt(n)))
    return points


for f in [
    get_data_random_int,
    get_data_random_float,
    get_data_structured_int,
    get_data_structured_float,
]:
    check_file_sizes(f)

@nschloe
Author

nschloe commented Dec 5, 2019

Well, how do I set the compressor? This

import numpy
from numcodecs import Blosc
import zarr

data = numpy.random.rand(10)

compressor = Blosc(cname="zlib", clevel=4)
zarr.save("out.zip", data, compressor=compressor)

gives the cryptic

ValueError: missing object_codec for object array

@jakirkham
Member

What is data?

@nschloe
Author

nschloe commented Dec 5, 2019

A numpy array. (Edited the above code.)
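For what it's worth, the error seems to come from zarr.save treating extra keyword arguments as additional named arrays to store, so the Blosc object ends up treated as data. Going through zarr.save_array, as suggested above, accepts the compressor:

import numpy as np
import zarr
from numcodecs import Blosc

data = np.random.rand(10)

# save_array forwards compressor to array creation instead of
# storing it as an extra array named "compressor".
zarr.save_array(
    "out.zarr",
    data,
    compressor=Blosc(cname="zlib", clevel=4),
)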

@vedal

vedal commented Nov 14, 2022

@rabernat

To get multithreading in python, use zarr together with dask. That’s the recommended way to go.

Is this still the recommendation? Maybe this issue is mature for closing? @joshmoore

@joshmoore
Member

@vedal: wow. Yes. This is mature for closing, indeed! Nevertheless, for multithreading, etc. zarr-python ❤️ dask.

@vedal

vedal commented Nov 14, 2022

@vedal: wow. Yes. This is mature for closing, indeed! Nevertheless, for multithreading, etc. zarr-python ❤️ dask.

Great, @joshmoore! Do you happen to know of a piece of modern example code or another source on how to combine them efficiently? I saw that xarray and dask seem to work well together, but I suppose by zarr-python you mean instead of xarray. I couldn't find much about Dask in zarr-python's official docs.

I noticed there is an open issue on zarr+dask as well, so it made me unsure of the maturity of the duo: #962

@joshmoore
Member

I suppose by zarr-python you mean instead of xarray

Yes, exactly.

Do you happen to know of a piece of modern example code or other source for how to combine them efficiently?

The dask.array.from_zarr and dask.array.to_zarr methods are likely what you are looking for.
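For example (a minimal sketch; paths and chunk sizes are placeholders):

import dask.array as da
import numpy as np

x = da.from_array(np.random.rand(1_000_000), chunks=100_000)
da.to_zarr(x, "example.zarr", overwrite=True)  # chunked, parallel write

y = da.from_zarr("example.zarr")  # lazy; dask chunks mirror the zarr chunks
print(y.mean().compute())         # chunks are read on dask's thread pool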

I couldn't find much about Dask in zarr-python's official docs

This is a good point. At the moment, Zarr is "low-level" and so doesn't try to explain how to use the other tools with it. But links from the documentation would definitely be useful. Sorry about that! We'll look into it.

@vedal

vedal commented Nov 15, 2022

@joshmoore thanks a lot for these clarifications and for being concrete! I'm surprised that zarr+dask would be a lot better, since xarray also seems to use dask arrays under the hood.
I will look into this further. Have a very nice day :)

@joshmoore
Member

I noticed there is an open issue on zarr+dask as well, so it made me unsure of the maturity of the duo: #962

I'll update the description of this issue. Here the problem is that someone tried to wrap a dask in a zarr, but you should put a zarr in your dask. 😄
