Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Regression with 3.0.0rc1: reading zarr with tensorstore #2647

Closed
psobolewskiPhD opened this issue Jan 4, 2025 · 14 comments · Fixed by #2655
Closed

Regression with 3.0.0rc1: reading zarr with tensorstore #2647

psobolewskiPhD opened this issue Jan 4, 2025 · 14 comments · Fixed by #2655
Labels
bug Potential issues with the zarr-python library
Milestone

Comments

@psobolewskiPhD
Copy link

psobolewskiPhD commented Jan 4, 2025

Zarr version

3.0.0rc1

Numcodecs version

0.14.1

Python Version

3.12

Operating System

macOS, but also all on CI

Installation

pip

Description

With 3.0.0rc1 we observed some fails in napari --pre tests. Juan is already addressing one, but there is also a second one:
https://github.com/napari/napari/actions/runs/12610293301/job/35144717871#step:9:369

Here is a script that recapitulates the failure without napari or our pytest parameterization. This script below works with zarr<3 and zarr 3.0.0b3, but fails with rc1
I've tested tensorstore 0.1.69 and 0.1.71 and numcodecs 0.13.1 and 0.14.1 with the same behavior.

ValueError: FAILED_PRECONDITION: Error opening "zarr" driver: Error reading local file "/tmp/labels.zarr/.zarray": Error parsing object member "compressor": Object includes extra members: "checksum" [source locations='tensorstore/internal/json_binding/json_binding.h:830\ntensorstore/internal/json_binding/json_binding.h:865\ntensorstore/internal/json_binding/json_binding.h:830\ntensorstore/internal/json_binding/json_binding.h:388\ntensorstore/driver/zarr/driver.cc:108\ntensorstore/driver/kvs_backed_chunk_driver.cc:1162\ntensorstore/internal/cache/kvs_backed_cache.h:208\ntensorstore/driver/driver.cc:112'] [tensorstore_spec='{\"context\":{\"cache_pool\":{},\"data_copy_concurrency\":{},\"file_io_concurrency\":{},\"file_io_locking\":{},\"file_io_memmap\":false,\"file_io_sync\":true},\"driver\":\"zarr\",\"kvstore\":{\"driver\":\"file\",\"path\":\"/tmp/labels.zarr/\"}}']

Steps to reproduce

import zarr
import numpy as np
import tensorstore as ts
from pathlib import Path

tmp_path = Path('/tmp')
zarr_version = 2
zarr_driver = 'zarr'
labels = np.zeros((5, 7, 8, 9), dtype=int)
labels[1, 2:4, 4:6, 4:6] = 1
labels[1, 3:5, 5:7, 6:8] = 2
labels[2, 3:5, 5:7, 6:8] = 3

file_path = str(tmp_path / 'labels.zarr')

labels_temp = zarr.open(
    store=file_path,
    mode='w',
    shape=labels.shape,
    dtype=np.uint32,
    chunks=(1, 1, 8, 9),
    zarr_version=zarr_version,
)
labels_temp[:] = labels
labels_ts_spec = {
    'driver': zarr_driver,
    'kvstore': {'driver': 'file', 'path': file_path},
    'path': '',
}
data = ts.open(labels_ts_spec, create=False, open=True).result()

Additional output

No response

@psobolewskiPhD psobolewskiPhD added the bug Potential issues with the zarr-python library label Jan 4, 2025
@psobolewskiPhD psobolewskiPhD changed the title Regression with 3.0.0rc1 Regression with 3.0.0rc1: reading zarr with tensorstore Jan 4, 2025
@jhamman
Copy link
Member

jhamman commented Jan 4, 2025

Thanks for the nice reproducer. The array metadata being produced here looks like:

{
  "shape": [
    5,
    7,
    8,
    9
  ],
  "chunks": [
    1,
    1,
    8,
    9
  ],
  "fill_value": 0,
  "order": "C",
  "filters": null,
  "dimension_separator": ".",
  "compressor": {
    "id": "zstd",
    "level": 0,
    "checksum": false
  },
  "zarr_format": 2,
  "dtype": "<u4"
}

Based on the error, tensorstore doesn't seem to like the checksum flag in the compressor args. Not sure if this is a Zarr-Python problem or a Tensorstore problem. The reason this wasn't triggered earlier is that we recently enabled Zstd as the default compressor.

@psobolewskiPhD
Copy link
Author

psobolewskiPhD commented Jan 4, 2025

Looking at TS code, it looks like they support zstd checksum for zarr3:
https://github.com/google/tensorstore/blob/63417dfc6c81c6fe8884f9d2c62b6fd45966c551/tensorstore/driver/zarr3/codec/zstd_codec.cc#L41
Our test with zarr = 3 , zarr_diver = 'zarr3' passes.
napari/layers/labels/_tests/test_labels.py::test_fill_tensorstore[3-zarr3] PASSED
For zarr (zarr v2), TS has only level:
https://github.com/google/tensorstore/blob/63417dfc6c81c6fe8884f9d2c62b6fd45966c551/tensorstore/driver/zarr/zstd_compressor.cc#L36

Edit: so at the end of the day i'm not sure where this issue belongs. I guess we can use something other than zstd in the test and it should be fine?

@jbms
Copy link

jbms commented Jan 5, 2025

I'd consider this a zarr-python bug --- I proposed a zarr v3 zstd codec with a checksum parameter that is implemented in TensorStore, but the zarr v2 zstd codec in TensorStore follows the zstd codec in numcodecs (at the time it was implemented) and lacks support for a checksum parameter.

@jni
Copy link
Contributor

jni commented Jan 5, 2025

To help us fix the napari side, can someone point to a compressor= dict that we can pass to zarr.open that will work on zarr-python 2.x and 3.x with zarr v2 and v3 arrays? 🙏

@normanrz
Copy link
Member

normanrz commented Jan 5, 2025

I'd consider this a zarr-python bug --- I proposed a zarr v3 zstd codec with a checksum parameter that is implemented in TensorStore, but the zarr v2 zstd codec in TensorStore follows the zstd codec in numcodecs (at the time it was implemented) and lacks support for a checksum parameter.

zstd in numcodecs has since evolved and now also supports the checksum parameter when used as a v2 codec. See https://numcodecs.readthedocs.io/en/stable/release.html#id8

I acknowledge that a major problem with v2 is that we don't have a good spec process for codecs, apart from following what numcodecs does.

@d-v-b
Copy link
Contributor

d-v-b commented Jan 5, 2025

To help us fix the napari side, can someone point to a compressor= dict that we can pass to zarr.open that will work on zarr-python 2.x and 3.x with zarr v2 and v3 arrays? 🙏

I'm not sure there is 1 dict that will work for v2 and v3 🙃 !

Please someone correct me here. Zarr v2 and v3 use different JSON schemas for codecs; the zarr v2 spec requires that codec dicts take the form {"id": <str>, **params}, while the v3 spec makes no formal requirements of the JSON structure of codecs, and even states that the v3 spec defines no codecs at all, seemingly leaving this to extensions.

Nonetheless, there are codecs defined "with" the v3 spec, including some compression routines used in zarr v2 (GZip, and Blosc, Zstd is submitted as a PR) .... but as all the v3-flavored codec JSON documents take the form {"name": <str>, "configuration": {**params}}, they are incompatible with zarr v2. Accordingly, inside zarr-python we distinguish the "zarr v2 gzip" from the "zarr v3 gzip". I think if this is a big problem for users, then we might have to reconsider that distinction.

@dstansby
Copy link
Contributor

dstansby commented Jan 5, 2025

zstd in numcodecs has since evolved and now also supports the checksum parameter when used as a v2 codec. See numcodecs.readthedocs.io/en/stable/release.html#id8

Is the pragmatic solution here to revert this, to avoid producing v2 data that has different metadata (the extra checksum field) that has been the status quo for a long time, and (somehow) just include the checksum field in the v3 codec output?

@dstansby
Copy link
Contributor

dstansby commented Jan 5, 2025

Or perhaps do some special casing here in zarr-python before serialising with zstd that looks like:

if zarr_format == 2 and compressor == zstd:
    assert checksum == False
    compressor_dict.pop('checksum')

@dstansby dstansby added this to the 3.0.0 milestone Jan 5, 2025
@psobolewskiPhD
Copy link
Author

psobolewskiPhD commented Jan 5, 2025

I can't easily check at the moment, but can zarr-python 2.x open these zarr-python 3.x zarr_format = 2 produced files?
Or does that depend on the numcodecs installed? My concern relates to a containerized pipeline, where updating isn't trivial due to various transitive dependencies.

@normanrz
Copy link
Member

normanrz commented Jan 5, 2025

That should depend on the version of numcodecs.

@psobolewskiPhD
Copy link
Author

psobolewskiPhD commented Jan 5, 2025

So older than numcodecs 13, means no go? 😬

@dstansby
Copy link
Contributor

dstansby commented Jan 5, 2025

With numcodecs 0.13.1, writing with zarr 3 and then reading with zarr 2 seems to work fine:

import zarr

if zarr.__version__[0] == "3":
    print("writing array")
    arr = zarr.create_array(
        "test_data.zarr",
        shape=(100, 100),
        chunks=(10, 10),
        dtype="int32",
        zarr_format=2,
        overwrite=True,
    )
    arr[:] = 0

print(zarr.__version__)
arr = zarr.open_array("test_data.zarr")
print(arr[0, 0])

@jbms
Copy link

jbms commented Jan 5, 2025

I would say that zarr-python should definitely not include checksum=false in the zarr v2 json metadata since that breaks other implementations that don't support the checksum parameter for zarr v2, including older versions of numcodecs itself.

I'm a bit wary of adding support for this new parameter in zarr v2 at all since it creates compatibility issues but it is less of an issue if it requires users to opt in explicitly.

@jni
Copy link
Contributor

jni commented Jan 5, 2025

I'm not sure there is 1 dict that will work for v2 and v3 🙃 !

Fixed by setting compressor=None 😂😭😂

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Potential issues with the zarr-python library
Projects
None yet
Development

Successfully merging a pull request may close this issue.

7 participants