Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Conformant ZarrV3 codecs and fill values #193

Merged

Conversation

ghidalgo3
Copy link
Contributor

@ghidalgo3 ghidalgo3 commented Jul 17, 2024

This change serializes the ZArray codec attributes filters, compressor, and order according to the ZarrV3 specification, along with the default fill value of the array. This should hopefully allow a theoretical ZarrV3 reader to read a zarr store produced by VirtualiZarr, one day.

I would consider this a breaking change because any existing VirtualiZarr stores that used the "before" format shown here would be unreadable now, but given that those metadata files were invalid zarrv3 anyway I'm not sure if it's worth handling that at possibility at read time. If you think it necessary, reviewer, I will implement it.

I'm also not too sure if adding the transpose and bytes codec unconditionally is necessary. I think readers can assume that no transpose codec means no data transposition obviously, but it's unclear to me if the codec pipeline must declare either one of:

Also, what happens if a source file uses a codec that is not one of the specified codecs of ZarrV3? Does that mean the file cannot be represented in ZarrV3? Seems rather onerous.

It may make sense to also change _check_same_codecs to use the codec_pipeline list declared in this PR.

Before

{
    "attributes": {},
    "chunk_grid": {
        "configuration": {
            "chunk_shape": [
                2,
                3
            ]
        },
        "name": "regular"
    },
    "chunk_key_encoding": {
        "configuration": {
            "separator": "/"
        },
        "name": "default"
    },
    "codecs": null,
    "data_type": "int64",
    "dimension_names": [
        "x",
        "y"
    ],
    "fill_value": null,
    "node_type": "array",
    "shape": [
        2,
        3
    ],
    "storage_transformers": [
        {
            "configuration": {
                "manifest": "./manifest.json"
            },
            "name": "chunk-manifest-json"
        }
    ],
    "zarr_format": 3
}

After

{
    "attributes": {},
    "chunk_grid": {
        "configuration": {
            "chunk_shape": [
                2,
                3
            ]
        },
        "name": "regular"
    },
    "chunk_key_encoding": {
        "configuration": {
            "separator": "/"
        },
        "name": "default"
    },
    "codecs": [
        {
            "configuration": {
                "order": [
                    0,
                    1
                ]
            },
            "name": "transpose"
        },
        {
            "configuration": {},
            "name": "bytes"
        },
        {
            "configuration": {
                "level": 1
            },
            "name": "gzip"
        }
    ],
    "data_type": "int64",
    "dimension_names": [
        "x",
        "y"
    ],
    "fill_value": 0,
    "node_type": "array",
    "shape": [
        2,
        3
    ],
    "storage_transformers": [
        {
            "configuration": {
                "manifest": "./manifest.json"
            },
            "name": "chunk-manifest-json"
        }
    ],
    "zarr_format": 3
}
  • Tests added
  • Tests passing
  • Full type hint coverage
  • Changes are documented in docs/releases.rst

Issues:

sharkinsspatial and others added 30 commits April 19, 2024 13:31
@TomNicholas
Copy link
Collaborator

Also, what happens if a source file uses a codec that is not one of the specified codecs of ZarrV3? Does that mean the file cannot be represented in ZarrV3? Seems rather onerous.

I'm going to tag @sharkinsspatial here who I think has thought about this more than I have

elif self.dtype is np.dtype("int"):
return 0
elif self.dtype is np.dtype("float"):
return "NaN"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems like the default fill value for float is 0.0:

import zarr
import json
store = zarr.store.MemoryStore(mode="w")
z = zarr.empty((1, 1), store=store)
z[:]

array([[0.]])

(I'm not sure where on the Array / Store / Other object that information lives.)

It'd be nice if zarr-python had this as a constant that we could reuse. Would that make sense, or is there some reason not to I'm missing?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the specification doesn't require a specific number, just that it not be null. See the note at the bottom of the fill_value section https://zarr-specs.readthedocs.io/en/latest/v3/core/v3.0.html#fill-value . However zarr-python does default to 0 eventually and not NaN https://github.com/zarr-developers/zarr-python/blob/37a8441c20dae3b284803bb1b0d2e6c8f040fb3e/src/zarr/array.py#L231C9-L235C31 . I may have had some trouble with the unit tests, but I think it's better to be as similar as possible to zarr-python, I'll change the defaults to 0s.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this could also be deferred to a later PR, especially if the true solution is to make it clearer what the default is upstream.

virtualizarr/zarr.py Show resolved Hide resolved
virtualizarr/zarr.py Show resolved Hide resolved
ghidalgo3 and others added 5 commits July 17, 2024 21:19
Co-authored-by: Tom Augspurger <tom.augspurger88@gmail.com>
Co-authored-by: Tom Augspurger <tom.augspurger88@gmail.com>
Copy link
Contributor

@TomAugspurger TomAugspurger left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me overall. One question about the default compressor in the new codecs pipeline.

virtualizarr/tests/test_zarr.py Outdated Show resolved Hide resolved
virtualizarr/tests/test_zarr.py Outdated Show resolved Hide resolved
virtualizarr/zarr.py Outdated Show resolved Hide resolved
@ghidalgo3
Copy link
Contributor Author

ghidalgo3 commented Jul 19, 2024

Thanks @t-mcneely for pairing on this with me! I'll write up the results of our PoC with GOES data. Sorry for the messy commit and revert, we accidentally merged Sean's HDF5 branch to make reading GOES HDF5 files work.

Tria McNeely and others added 2 commits July 19, 2024 14:43
@TomNicholas TomNicholas added the zarr-python Relevant to zarr-python upstream label Jul 21, 2024
virtualizarr/zarr.py Outdated Show resolved Hide resolved
virtualizarr/zarr.py Outdated Show resolved Hide resolved
@TomNicholas
Copy link
Collaborator

Thanks for going into these weeds guys! This is amazing. Just a couple of extremely minor comments then I will merge it.

@TomNicholas TomNicholas merged commit 10bd53d into zarr-developers:main Jul 22, 2024
7 checks passed
@TomNicholas
Copy link
Collaborator

Thanks @ghidalgo3 !!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
zarr-python Relevant to zarr-python upstream
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants