Regression with 3.0.0rc1: reading zarr with tensorstore #2647

psobolewskiPhD · 2025-01-04T14:54:14Z

Zarr version

3.0.0rc1

Numcodecs version

0.14.1

Python Version

3.12

Operating System

macOS, but also all on CI

Installation

pip

Description

With 3.0.0rc1 we observed some fails in napari --pre tests. Juan is already addressing one, but there is also a second one:
https://github.com/napari/napari/actions/runs/12610293301/job/35144717871#step:9:369

Here is a script that recapitulates the failure without napari or our pytest parameterization. This script below works with zarr<3 and zarr 3.0.0b3, but fails with rc1
I've tested tensorstore 0.1.69 and 0.1.71 and numcodecs 0.13.1 and 0.14.1 with the same behavior.

ValueError: FAILED_PRECONDITION: Error opening "zarr" driver: Error reading local file "/tmp/labels.zarr/.zarray": Error parsing object member "compressor": Object includes extra members: "checksum" [source locations='tensorstore/internal/json_binding/json_binding.h:830\ntensorstore/internal/json_binding/json_binding.h:865\ntensorstore/internal/json_binding/json_binding.h:830\ntensorstore/internal/json_binding/json_binding.h:388\ntensorstore/driver/zarr/driver.cc:108\ntensorstore/driver/kvs_backed_chunk_driver.cc:1162\ntensorstore/internal/cache/kvs_backed_cache.h:208\ntensorstore/driver/driver.cc:112'] [tensorstore_spec='{\"context\":{\"cache_pool\":{},\"data_copy_concurrency\":{},\"file_io_concurrency\":{},\"file_io_locking\":{},\"file_io_memmap\":false,\"file_io_sync\":true},\"driver\":\"zarr\",\"kvstore\":{\"driver\":\"file\",\"path\":\"/tmp/labels.zarr/\"}}']

Steps to reproduce

import zarr
import numpy as np
import tensorstore as ts
from pathlib import Path

tmp_path = Path('/tmp')
zarr_version = 2
zarr_driver = 'zarr'
labels = np.zeros((5, 7, 8, 9), dtype=int)
labels[1, 2:4, 4:6, 4:6] = 1
labels[1, 3:5, 5:7, 6:8] = 2
labels[2, 3:5, 5:7, 6:8] = 3

file_path = str(tmp_path / 'labels.zarr')

labels_temp = zarr.open(
    store=file_path,
    mode='w',
    shape=labels.shape,
    dtype=np.uint32,
    chunks=(1, 1, 8, 9),
    zarr_version=zarr_version,
)
labels_temp[:] = labels
labels_ts_spec = {
    'driver': zarr_driver,
    'kvstore': {'driver': 'file', 'path': file_path},
    'path': '',
}
data = ts.open(labels_ts_spec, create=False, open=True).result()

Additional output

No response

The text was updated successfully, but these errors were encountered:

jhamman · 2025-01-04T19:48:55Z

Thanks for the nice reproducer. The array metadata being produced here looks like:

{
  "shape": [
    5,
    7,
    8,
    9
  ],
  "chunks": [
    1,
    1,
    8,
    9
  ],
  "fill_value": 0,
  "order": "C",
  "filters": null,
  "dimension_separator": ".",
  "compressor": {
    "id": "zstd",
    "level": 0,
    "checksum": false
  },
  "zarr_format": 2,
  "dtype": "<u4"
}

Based on the error, tensorstore doesn't seem to like the checksum flag in the compressor args. Not sure if this is a Zarr-Python problem or a Tensorstore problem. The reason this wasn't triggered earlier is that we recently enabled Zstd as the default compressor.

psobolewskiPhD · 2025-01-04T20:23:54Z

Looking at TS code, it looks like they support zstd checksum for zarr3:
https://github.com/google/tensorstore/blob/63417dfc6c81c6fe8884f9d2c62b6fd45966c551/tensorstore/driver/zarr3/codec/zstd_codec.cc#L41
Our test with zarr = 3 , zarr_diver = 'zarr3' passes.
napari/layers/labels/_tests/test_labels.py::test_fill_tensorstore[3-zarr3] PASSED
For zarr (zarr v2), TS has only level:
https://github.com/google/tensorstore/blob/63417dfc6c81c6fe8884f9d2c62b6fd45966c551/tensorstore/driver/zarr/zstd_compressor.cc#L36

Edit: so at the end of the day i'm not sure where this issue belongs. I guess we can use something other than zstd in the test and it should be fine?

jbms · 2025-01-05T00:32:45Z

I'd consider this a zarr-python bug --- I proposed a zarr v3 zstd codec with a checksum parameter that is implemented in TensorStore, but the zarr v2 zstd codec in TensorStore follows the zstd codec in numcodecs (at the time it was implemented) and lacks support for a checksum parameter.

jni · 2025-01-05T00:38:47Z

To help us fix the napari side, can someone point to a compressor= dict that we can pass to zarr.open that will work on zarr-python 2.x and 3.x with zarr v2 and v3 arrays? 🙏

normanrz · 2025-01-05T07:43:10Z

I'd consider this a zarr-python bug --- I proposed a zarr v3 zstd codec with a checksum parameter that is implemented in TensorStore, but the zarr v2 zstd codec in TensorStore follows the zstd codec in numcodecs (at the time it was implemented) and lacks support for a checksum parameter.

zstd in numcodecs has since evolved and now also supports the checksum parameter when used as a v2 codec. See https://numcodecs.readthedocs.io/en/stable/release.html#id8

I acknowledge that a major problem with v2 is that we don't have a good spec process for codecs, apart from following what numcodecs does.

d-v-b · 2025-01-05T15:11:16Z

To help us fix the napari side, can someone point to a compressor= dict that we can pass to zarr.open that will work on zarr-python 2.x and 3.x with zarr v2 and v3 arrays? 🙏

I'm not sure there is 1 dict that will work for v2 and v3 🙃 !

Please someone correct me here. Zarr v2 and v3 use different JSON schemas for codecs; the zarr v2 spec requires that codec dicts take the form {"id": <str>, **params}, while the v3 spec makes no formal requirements of the JSON structure of codecs, and even states that the v3 spec defines no codecs at all, seemingly leaving this to extensions.

Nonetheless, there are codecs defined "with" the v3 spec, including some compression routines used in zarr v2 (GZip, and Blosc, Zstd is submitted as a PR) .... but as all the v3-flavored codec JSON documents take the form {"name": <str>, "configuration": {**params}}, they are incompatible with zarr v2. Accordingly, inside zarr-python we distinguish the "zarr v2 gzip" from the "zarr v3 gzip". I think if this is a big problem for users, then we might have to reconsider that distinction.

dstansby · 2025-01-05T16:04:54Z

zstd in numcodecs has since evolved and now also supports the checksum parameter when used as a v2 codec. See numcodecs.readthedocs.io/en/stable/release.html#id8

Is the pragmatic solution here to revert this, to avoid producing v2 data that has different metadata (the extra checksum field) that has been the status quo for a long time, and (somehow) just include the checksum field in the v3 codec output?

dstansby · 2025-01-05T16:07:27Z

Or perhaps do some special casing here in zarr-python before serialising with zstd that looks like:

if zarr_format == 2 and compressor == zstd:
    assert checksum == False
    compressor_dict.pop('checksum')

psobolewskiPhD · 2025-01-05T18:46:08Z

I can't easily check at the moment, but can zarr-python 2.x open these zarr-python 3.x zarr_format = 2 produced files?
Or does that depend on the numcodecs installed? My concern relates to a containerized pipeline, where updating isn't trivial due to various transitive dependencies.

normanrz · 2025-01-05T19:09:56Z

That should depend on the version of numcodecs.

psobolewskiPhD · 2025-01-05T19:18:23Z

So older than numcodecs 13, means no go? 😬

dstansby · 2025-01-05T20:33:33Z

With numcodecs 0.13.1, writing with zarr 3 and then reading with zarr 2 seems to work fine:

import zarr

if zarr.__version__[0] == "3":
    print("writing array")
    arr = zarr.create_array(
        "test_data.zarr",
        shape=(100, 100),
        chunks=(10, 10),
        dtype="int32",
        zarr_format=2,
        overwrite=True,
    )
    arr[:] = 0

print(zarr.__version__)
arr = zarr.open_array("test_data.zarr")
print(arr[0, 0])

jbms · 2025-01-05T21:38:20Z

I would say that zarr-python should definitely not include checksum=false in the zarr v2 json metadata since that breaks other implementations that don't support the checksum parameter for zarr v2, including older versions of numcodecs itself.

I'm a bit wary of adding support for this new parameter in zarr v2 at all since it creates compatibility issues but it is less of an issue if it requires users to opt in explicitly.

jni · 2025-01-05T23:52:23Z

I'm not sure there is 1 dict that will work for v2 and v3 🙃 !

Fixed by setting compressor=None 😂😭😂

psobolewskiPhD added the bug Potential issues with the zarr-python library label Jan 4, 2025

psobolewskiPhD changed the title ~~Regression with 3.0.0rc1~~ Regression with 3.0.0rc1: reading zarr with tensorstore Jan 4, 2025

This was referenced Jan 4, 2025

[Maint] Update version_denylist.txt to block zarr rc1 in --pre tests napari/napari#7489

Merged

[test-bot] pip install --pre is failing napari/napari#7484

Closed

LDeakin mentioned this issue Jan 5, 2025

Add zstd codec zarr-developers/zarr-specs#256

Open

dstansby added this to the 3.0.0 milestone Jan 5, 2025

jni mentioned this issue Jan 5, 2025

Use no compression when writing zarr for tensorstore napari/napari#7490

Closed

d-v-b mentioned this issue Jan 6, 2025

should we abstract over v2 and v3 codecs #2654

Open

This was referenced Jan 6, 2025

Zstd: Don't persist the checksum param if false zarr-developers/numcodecs#681

Closed

Zarr v2 + Zstd: Don't persist the checksum param if false #2655

Merged

normanrz closed this as completed in #2655 Jan 7, 2025

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Regression with 3.0.0rc1: reading zarr with tensorstore #2647

Regression with 3.0.0rc1: reading zarr with tensorstore #2647

psobolewskiPhD commented Jan 4, 2025 •

edited

Loading

jhamman commented Jan 4, 2025 •

edited by jni

Loading

psobolewskiPhD commented Jan 4, 2025 •

edited

Loading

jbms commented Jan 5, 2025

jni commented Jan 5, 2025

normanrz commented Jan 5, 2025

d-v-b commented Jan 5, 2025 •

edited

Loading

dstansby commented Jan 5, 2025

dstansby commented Jan 5, 2025

psobolewskiPhD commented Jan 5, 2025 •

edited

Loading

normanrz commented Jan 5, 2025

psobolewskiPhD commented Jan 5, 2025 •

edited

Loading

dstansby commented Jan 5, 2025

jbms commented Jan 5, 2025

jni commented Jan 5, 2025

Regression with 3.0.0rc1: reading zarr with tensorstore #2647

Regression with 3.0.0rc1: reading zarr with tensorstore #2647

Comments

psobolewskiPhD commented Jan 4, 2025 • edited Loading

Zarr version

Numcodecs version

Python Version

Operating System

Installation

Description

Steps to reproduce

Additional output

jhamman commented Jan 4, 2025 • edited by jni Loading

psobolewskiPhD commented Jan 4, 2025 • edited Loading

jbms commented Jan 5, 2025

jni commented Jan 5, 2025

normanrz commented Jan 5, 2025

d-v-b commented Jan 5, 2025 • edited Loading

dstansby commented Jan 5, 2025

dstansby commented Jan 5, 2025

psobolewskiPhD commented Jan 5, 2025 • edited Loading

normanrz commented Jan 5, 2025

psobolewskiPhD commented Jan 5, 2025 • edited Loading

dstansby commented Jan 5, 2025

jbms commented Jan 5, 2025

jni commented Jan 5, 2025

psobolewskiPhD commented Jan 4, 2025 •

edited

Loading

jhamman commented Jan 4, 2025 •

edited by jni

Loading

psobolewskiPhD commented Jan 4, 2025 •

edited

Loading

d-v-b commented Jan 5, 2025 •

edited

Loading

psobolewskiPhD commented Jan 5, 2025 •

edited

Loading

psobolewskiPhD commented Jan 5, 2025 •

edited

Loading