Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Read performance regression via Store backed by FSMap (fsspec, GCS) #1296

Closed
ravwojdyla opened this issue Dec 15, 2022 · 7 comments
Closed
Labels
bug Potential issues with the zarr-python library

Comments

@ravwojdyla
Copy link
Contributor

Zarr version

2.11.0 and up (including current main)

Numcodecs version

0.10.2

Python Version

3.10

Operating System

Mac and Linux

Installation

conda, pip and from source

Description

It appears that #789, commit: 5c71212 so from zarr 0.11.0, there's a performance regression that affects reading zarr data via Store backed by fsspec/FSMap.

In our test example (in practice we use xarray), we have a zarr array made of 2K files (total 1GB compressed), reading it via:

np.asarray(zarr.open(fsspec.get_mapper(...), mode="r"))
  • on zarr 0.10.3 took about 12 seconds
  • on zarr 0.13.3 took about 90 seconds (so roughly 7x longer)
    • the same problem exists starting from version 0.11.0

Looking at the stacktraces from the different versions, looks like 0.10.3 was asynchronous fetching multiple items, while 0.13.3 is synchronized per storage item?

zarr 0.13.3
Thread 3529282 (idle): "MainThread"
    do_futex_wait.constprop.0 (libpthread-2.31.so)
    __new_sem_wait_slow.constprop.0 (libpthread-2.31.so)
    PyThread_acquire_lock_timed.localalias (python3.10)
    lock_PyThread_acquire_lock (python3.10)
    wait (threading.py:324)
    wait (threading.py:607)
    sync (fsspec/asyn.py:86)
    wrapper (fsspec/asyn.py:113)
    __getitem__ (fsspec/mapping.py:143)
    __getitem__ (zarr/storage.py:724)
    _chunk_getitem (zarr/core.py:1966)
    _get_selection (zarr/core.py:1267)
    _get_basic_selection_nd (zarr/core.py:976)
    get_basic_selection (zarr/core.py:933)
    __getitem__ (zarr/core.py:807)
    __array__ (zarr/core.py:589)
    PyArray_FromArrayAttr_int (numpy/core/_multiarray_umath.cpython-310-x86_64-linux-gnu.so)
    _array_from_array_like (numpy/core/_multiarray_umath.cpython-310-x86_64-linux-gnu.so)
    PyArray_DiscoverDTypeAndShape_Recursive (numpy/core/_multiarray_umath.cpython-310-x86_64-linux-gnu.so)
    PyArray_DiscoverDTypeAndShape (numpy/core/_multiarray_umath.cpython-310-x86_64-linux-gnu.so)
    PyArray_FromAny (numpy/core/_multiarray_umath.cpython-310-x86_64-linux-gnu.so)
    PyArray_CheckFromAny (numpy/core/_multiarray_umath.cpython-310-x86_64-linux-gnu.so)
    _array_fromobject_generic (numpy/core/_multiarray_umath.cpython-310-x86_64-linux-gnu.so)
    array_asarray (numpy/core/_multiarray_umath.cpython-310-x86_64-linux-gnu.so)
    <module> (<stdin>:2)
zarr 0.10.3
Thread 3530062 (idle): "MainThread"
    do_futex_wait.constprop.0 (libpthread-2.31.so)
    __new_sem_wait_slow.constprop.0 (libpthread-2.31.so)
    PyThread_acquire_lock_timed.localalias (python3.10)
    lock_PyThread_acquire_lock (python3.10)
    wait (threading.py:324)
    wait (threading.py:607)
    sync (fsspec/asyn.py:86)
    wrapper (fsspec/asyn.py:113)
    getitems (fsspec/mapping.py:93)
    _chunk_getitems (zarr/core.py:1847)
    _get_selection (zarr/core.py:1136)
    _get_basic_selection_nd (zarr/core.py:841)
    get_basic_selection (zarr/core.py:798)
    __getitem__ (zarr/core.py:673)
    __array__ (zarr/core.py:469)
    PyArray_FromArrayAttr_int (numpy/core/_multiarray_umath.cpython-310-x86_64-linux-gnu.so)
    _array_from_array_like (numpy/core/_multiarray_umath.cpython-310-x86_64-linux-gnu.so)
    PyArray_DiscoverDTypeAndShape_Recursive (numpy/core/_multiarray_umath.cpython-310-x86_64-linux-gnu.so)
    PyArray_DiscoverDTypeAndShape (numpy/core/_multiarray_umath.cpython-310-x86_64-linux-gnu.so)
    PyArray_FromAny (numpy/core/_multiarray_umath.cpython-310-x86_64-linux-gnu.so)
    PyArray_CheckFromAny (numpy/core/_multiarray_umath.cpython-310-x86_64-linux-gnu.so)
    _array_fromobject_generic (numpy/core/_multiarray_umath.cpython-310-x86_64-linux-gnu.so)
    array_asarray (numpy/core/_multiarray_umath.cpython-310-x86_64-linux-gnu.so)
    <module> (<stdin>:2)

Steps to reproduce

And we need to an existing zarr array to read:

np.asarray(zarr.open(fsspec.get_mapper(...), mode="r"))

Additional output

No response

@ravwojdyla ravwojdyla added the bug Potential issues with the zarr-python library label Dec 15, 2022
@rabernat
Copy link
Contributor

@ravwojdyla I am very interested in optimizing the performance of this code path. My first recommendation would be to try your test instead with

store = zarr.FSStore(...)
np.asarray(zarr.open(store))

rather than fsspec.get_mapper. Does this make any difference?

@ravwojdyla
Copy link
Contributor Author

@rabernat thanks for a quick response. Yes that will bypass the issue the same way reading via np.asarray(zarr.open(<path>)) would also avoid this issue.

@rabernat
Copy link
Contributor

Ok, thanks for confirming that the problem goes away if you pass an FSStore rather than an fsspec.FSMap.

We could resolve the regression by automatically promoting an FSMap to an FSStore. We discussed this a bit in #911 (comment). I believe that the only blocker to that idea was resolved in fsspec/filesystem_spec#939.

@ravwojdyla
Copy link
Contributor Author

@rabernat that sounds great! +1

@rabernat
Copy link
Contributor

Now this the time when I try to nerd-swipe you into making the PR yourself! 🤓

Basically, we just need to add a block in this function (and also in the equivalent V3 version):

zarr-python/zarr/storage.py

Lines 132 to 156 in e7c0eb4

def _normalize_store_arg_v2(store: Any, storage_options=None, mode="r") -> BaseStore:
# default to v2 store for backward compatibility
zarr_version = getattr(store, '_store_version', 2)
if zarr_version != 2:
raise ValueError("store must be a version 2 store")
if store is None:
store = KVStore(dict())
return store
if isinstance(store, os.PathLike):
store = os.fspath(store)
if isinstance(store, str):
if "://" in store or "::" in store:
return FSStore(store, mode=mode, **(storage_options or {}))
elif storage_options:
raise ValueError("storage_options passed with non-fsspec path")
if store.endswith('.zip'):
return ZipStore(store, mode=mode)
elif store.endswith('.n5'):
from zarr.n5 import N5Store
return N5Store(store)
else:
return DirectoryStore(store)
else:
store = Store._ensure_store(store)
return store

which checks isinstance(store, fsspec.FSMap) and then uses the FSMap attributes to initialize an FSStore.

Is this something you think you could take on? We would be happy to help guide you through your first PR to Zarr. We are always looking for new contributors, and it seems like you understand this stuff pretty well already.

@ravwojdyla
Copy link
Contributor Author

@rabernat well played :P Sure, will give it a try. Thank for the context ^

@rabernat
Copy link
Contributor

rabernat commented Feb 2, 2023

This was closed by #1304.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Potential issues with the zarr-python library
Projects
None yet
Development

No branches or pull requests

2 participants