Skip to content

codec registration silently fails #3261

@d-v-b

Description

@d-v-b

When trying to register a codec with the same name as a numcodecs codec (e.g., to override the behavior of that codec), the new codec cannot be retrieved from the registry.

here's a reproducer. The expected behavior is that, after registering my codec under a key, I should be able to pull it out under the same key. That fails, and the (broken) numcodecs codec is retrieved instead, which kind of defeats the entire purpose of a dynamic registry.

# /// script
# requires-python = ">=3.11"
# dependencies = [
#   "zarr@git+https://github.com/zarr-developers/zarr-python.git@main",
# ]
# ///
#
# This script automatically imports the development branch of zarr to check for issues


from numcodecs.zarr3 import Delta
import numpy as np
import zarr
from zarr.core.array_spec import ArraySpec
from zarr.dtype import parse_data_type
from zarr.registry import register_codec, get_codec_class
from dataclasses import replace

class PatchedDelta(Delta):
    def resolve_metadata(self, chunk_spec: ArraySpec) -> ArraySpec:
        if astype := self.codec_config.get("astype"):
            return replace(chunk_spec, dtype=parse_data_type(astype, zarr_format=3))  # type: ignore[call-overload]
        return chunk_spec

key = 'numcodecs.delta'
value = PatchedDelta
register_codec(key, value)
print(get_codec_class(key))
# <class 'numcodecs.zarr3.Delta'>
assert get_codec_class(key) == value

The underlying problem stems from the design of the codec registry: the same codec name can be associated with multiple codecs, and in this case, a codec is retrieved in FILO order? The end result is that a numcodecs codec always wins, and the user-defined codec can never be retrieved:

config_entry = config.get("codecs", {}).get(key)
if config_entry is None:
if len(codec_classes) == 1:
return next(iter(codec_classes.values()))
warnings.warn(
f"Codec '{key}' not configured in config. Selecting any implementation.",
stacklevel=2,
)
return list(codec_classes.values())[-1]

I think the logic here was to support e.g. CPU and GPU implementations for the same codec, but our current solution is bad. The registry should be a 1:1 mapping between identifiers and codecs. If we need to support CPU codecs and GPU codecs, then we have 2 registries, not a single broken registry.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions