Special case `str` dtype in array creation #2323

TomAugspurger · 2024-10-09T14:33:02Z

Special cases str, interpreting it as np.dtype("object"), to match zarr-python 2.x's behavior.

TODO:

Add unit tests and/or doctests in docstrings
Add docstrings and API docs for any new/modified user-facing classes and functions
New/modified features documented in docs/tutorial.rst
Changes documented in docs/release.rst
GitHub Actions have all passed
Test coverage is 100% (Codecov passes)

TomAugspurger · 2024-10-09T14:36:40Z

One thing to figure out: what do we want for the dtype in the array metadata document?

>>> import json
>>> import zarr
>>> arr = zarr.create(shape=10, dtype="str")
>>> json.loads(arr.store._store_dict["zarr.json"].to_bytes())
{'shape': [10],
 'fill_value': 0,
 'chunk_grid': {'name': 'regular', 'configuration': {'chunk_shape': [10]}},
 'attributes': {},
 'zarr_format': 3,
 'data_type': 'object',
 'chunk_key_encoding': {'name': 'default',
  'configuration': {'separator': '/'}},
 'codecs': [{'name': 'bytes', 'configuration': {'endian': 'little'}}],
 'node_type': 'array',
 'storage_transformers': []}

Right now it's object. In zarr-python 2.x that was |O. I'm not immediately sure which is correct but I suspect either works (since presumably it's just passed to np.dtype() and those are the same).
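
A quick check of that assumption on the NumPy side only: both spellings resolve to the same dtype, so either string should survive a trip through np.dtype(). This is a minimal sketch and says nothing about what other Zarr implementations accept in the metadata document.

```python
import numpy as np

# Both spellings name NumPy's object dtype.
from_name = np.dtype("object")
from_typestr = np.dtype("|O")

same = from_name == from_typestr
print(same)           # True
print(from_name.str)  # |O
```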

TomAugspurger · 2024-10-09T14:38:34Z

src/zarr/core/common.py

@@ -162,3 +164,10 @@ def parse_order(data: Any) -> Literal["C", "F"]:
    if data in ("C", "F"):
        return cast(Literal["C", "F"], data)
    raise ValueError(f"Expected one of ('C', 'F'), got {data} instead.")
+
+
+def parse_dtype(dtype: Any) -> np.dtype[Any]:


This type could be greatly improved. Really, we want the same signature as np.dtype, but with an overload for str -> dtype(object). But maybe as a follow up.
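
A rough sketch of what such an overload could look like, assuming the behavior in the PR description at this point (str maps to the object dtype). The body here is illustrative, not the code in src/zarr/core/common.py, and a type checker may still want the overloads tightened.

```python
from typing import Any, Literal, Union, overload

import numpy as np


@overload
def parse_dtype(dtype: Union[type[str], Literal["str"]]) -> np.dtype[np.object_]: ...
@overload
def parse_dtype(dtype: Any) -> np.dtype[Any]: ...
def parse_dtype(dtype: Any) -> np.dtype[Any]:
    # Special case: the builtin `str` (or the literal name "str") means object.
    # Without it, np.dtype(str) would give a zero-width unicode dtype.
    if dtype is str or (isinstance(dtype, str) and dtype == "str"):
        return np.dtype("object")
    # Everything else is handed to NumPy unchanged.
    return np.dtype(dtype)
```

With this, parse_dtype(str) and parse_dtype("str") both return the object dtype, while parse_dtype("int32") behaves exactly like np.dtype("int32").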

rabernat

Thanks for working on this!

However, I'd love it if we could scope this just to V2 and not touch V3.

rabernat · 2024-10-09T14:46:43Z

src/zarr/core/metadata/v3.py

@@ -504,6 +504,7 @@ class DataType(Enum):
    complex128 = "complex128"
    string = "string"
    bytes = "bytes"
+    object = "object"


This feels like a dealbreaker to me. Python objects are not an allowable V3 datatype. Eliminating Python objects was one of the main goals of the V3 evolution!

In #2036 I did a lot of work towards more nuanced string support in V3, and that is now mostly working with Xarray.

d-v-b · 2024-10-09

> Python objects are not an allowable V3 datatype.

Seconding this: we definitely don't want an object dtype for v3 data.

TomAugspurger · 2024-10-09

What behavior do people want, in terms of the in-memory representation and the on-disk metadata? Is this correct?

1. No change for zarr_format=2. We use object dtype in-memory and |O or object in the metadata document.

2. We use the new variable-width string handling for zarr_format=3. StringDType() (maybe with a fallback to object if NumPy<2?) in memory and string in the metadata document.
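
The two cases above can be written down as a small lookup. This is only a sketch of the proposal: resolve_str_dtype is a hypothetical helper name, and the NumPy<2 fallback to object is the tentative idea from the question, not settled behavior.

```python
import numpy as np


def resolve_str_dtype(zarr_format: int) -> tuple[np.dtype, str]:
    """Return (in-memory dtype, metadata data type name) for dtype=str."""
    if zarr_format == 2:
        # Unchanged from zarr-python 2.x: Python objects, recorded as "|O".
        return np.dtype("object"), "|O"
    if zarr_format == 3:
        # StringDType only exists in NumPy >= 2.0; fall back to object otherwise.
        dtypes_mod = getattr(np, "dtypes", None)
        string_dtype = getattr(dtypes_mod, "StringDType", None)
        in_memory = string_dtype() if string_dtype is not None else np.dtype("object")
        return in_memory, "string"
    raise ValueError(f"zarr_format must be 2 or 3, got {zarr_format!r}")
```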

rabernat · 2024-10-09

Yes, that is what I want. I believe 2 is already implemented and tested for V3 here:

zarr-python/tests/v3/test_codecs/test_vlen.py, lines 49 to 52 in aa46b45:

    a[:, :] = data
    assert np.array_equal(data, a[:, :])
    assert a.metadata.data_type == DataType.string
    assert a.dtype == expected_zarr_string_dtype

I didn't touch V2, however.

rabernat · 2024-10-09

See also the logic in https://github.com/zarr-developers/zarr-python/blob/v3/src/zarr/core/strings.py

rabernat · 2024-10-09T17:55:01Z

This is still not quite working with Xarray the way I was hoping. Trying to track down why.

This is with this branch merged to xarray-compat:

import xarray as xr
import zarr
import numpy as np

ds = xr.Dataset({"strings": ("b", np.array(["ab", "cdef", "g"], dtype=object))})
store = zarr.storage.MemoryStore({}, mode="w")
ds.to_zarr(store, zarr_version=2)
zarr.open_group(store, zarr_version=2, mode="r")["strings"].dtype
# -> dtype('<U')

With Zarr 2.18.3

ds = xr.Dataset({"strings": ("b", np.array(["ab", "cdef", "g"], dtype=object))})
store = zarr.storage.MemoryStore({})
ds.to_zarr(store)
zarr.open_group(store, mode="r")["strings"].dtype
# -> dtype('O')


TomAugspurger · 2024-10-10T16:37:40Z

@rabernat d8f24a8 fixes the issue you raised in #2323 (comment) when using this through xarray. We also need to special case filters when zarr_format=2 and dtype=str to automatically add the vlen-utf8 codec.
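
As a rough illustration of that filter special case (not the code in d8f24a8): the helper below works on plain filter config dicts, with_default_string_filter is a hypothetical name, and putting the codec first in the list is an assumption.

```python
from typing import Any, Optional

# Config dict for the variable-length UTF-8 codec, identified by "vlen-utf8".
VLEN_UTF8 = {"id": "vlen-utf8"}


def with_default_string_filter(
    dtype: Any, zarr_format: int, filters: Optional[list[dict]] = None
) -> list[dict]:
    """Return the filter configs to use, adding vlen-utf8 for v2 str arrays."""
    result = list(filters or [])
    is_str = dtype is str or (isinstance(dtype, str) and dtype == "str")
    if zarr_format == 2 and is_str and VLEN_UTF8 not in result:
        # Assumption: the string codec goes ahead of any user-supplied filters.
        result.insert(0, dict(VLEN_UTF8))
    return result
```

A caller who already passes the codec explicitly gets the list back unchanged, so the special case stays idempotent.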

With that change, plus an updated xarray-compat branch that I'll push up later, I get the expected behavior:

In [2]: import xarray as xr
   ...: import zarr
   ...: import numpy as np
   ...: 
   ...: ds = xr.Dataset({"strings": ("b", np.array(["ab", "cdef", "g"], dtype=object))})
   ...: store = zarr.storage.MemoryStore({}, mode="w")
   ...: ds.to_zarr(store, zarr_format=2)
   ...: zarr.open_group(store, zarr_format=2, mode="r")["strings"].dtype
Out[2]: dtype('O')

Ah, unfortunately reading the data doesn't quite work yet:

In [9]: zarr.open_group(store, mode="r")["strings"][:]

eventually raises with

File ~/gh/zarr-developers/zarr-python/src/zarr/codecs/_v2.py:40, in V2Compressor._decode_single(self, chunk_bytes, chunk_spec)
     38 # ensure correct dtype
     39 if str(chunk_numpy_array.dtype) != chunk_spec.dtype:
---> 40     chunk_numpy_array = chunk_numpy_array.view(chunk_spec.dtype)
     42 return get_ndbuffer_class().from_numpy_array(chunk_numpy_array)

File ~/gh/pydata/xarray/.direnv/python-3.12/lib/python3.12/site-packages/numpy/_core/_internal.py:564, in _view_is_safe(oldtype, newtype)
    561     return
    563 if newtype.hasobject or oldtype.hasobject:
--> 564     raise TypeError("Cannot change data-type for array of references.")
    565 return

TypeError: Cannot change data-type for array of references.

I'll look into that too.

TomAugspurger · 2024-10-10T19:55:57Z

Any codec experts want to chime in on whether 509a5c1 is appropriate? If I just remove that .view() we get some failures with the chunk shape not matching. So I guess we're somehow relying on that for correctness. But .view(object) isn't valid so I think we're OK to skip it?

d-v-b · 2024-10-10T20:01:33Z

> Any codec experts want to chime in on whether 509a5c1 is appropriate? If I just remove that .view() we get some failures with the chunk shape not matching. So I guess we're somehow relying on that for correctness. But .view(object) isn't valid so I think we're OK to skip it?

If you leave the .view(object), do we get runtime errors? Might be interesting to know where those are.

TomAugspurger · 2024-10-10T20:07:27Z

Yeah, the traceback above has the relevant bit: TypeError: Cannot change data-type for array of references.
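
To make the failure mode concrete, here is a standalone NumPy sketch. ensure_dtype is a hypothetical helper showing one possible shape of such a guard, not necessarily what 509a5c1 does; whether skipping the view is right for every codec pipeline is the open question above.

```python
from typing import Any

import numpy as np


def ensure_dtype(chunk: np.ndarray, target: Any) -> np.ndarray:
    """Reinterpret chunk as target dtype, skipping the view for object dtypes."""
    target = np.dtype(target)
    if chunk.dtype == target:
        return chunk
    if chunk.dtype.hasobject or target.hasobject:
        # Object arrays hold references, not raw bytes, so NumPy refuses to
        # reinterpret them; return the chunk untouched instead of raising.
        return chunk
    return chunk.view(target)


# Plain numeric data: the view restores the intended dtype (and the shape).
raw = np.arange(4, dtype="<i4")
restored = ensure_dtype(raw.view("u1"), "<i4")

# Object data: a direct view raises, the guarded helper does not.
strings = np.array(["ab", "cdef", "g"], dtype=object)
try:
    strings.view(np.intp)
except TypeError as exc:
    view_error = exc
unchanged = ensure_dtype(strings, object)
```

The numeric case also shows why dropping the .view() outright changes chunk shapes: the byte view has four times as many elements as the int32 array it encodes.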

d-v-b

This looks good! As a non-string-dtype user I'm pretty surprised by the complexity involved in getting strings working in v3. Do you think we should have a section of the docs that specifically covers strings in zarr 2 / 3? That would be a separate effort from this PR of course.

TomAugspurger · 2024-10-11T14:44:05Z

Happy to write up those docs. One question on the intended behavior for v3: AFAICT, we don't support fixed-width unicode strings. Requesting one (e.g. dtype="U3") gives StringDType() instead:

In [1]: import zarr

In [2]: arr = zarr.create(shape=(3,), dtype="U3")

In [3]: arr[:] = ['a', 'bb', 'ccc']

In [4]: arr[:]
Out[4]: array(['a', 'bb', 'ccc'], dtype=StringDType())

There are some advantages to fixed-width strings when you know you have fixed-width data. Do we want to try to support that?

rabernat · 2024-10-11T14:54:05Z

> There are some advantages to fixed-width strings when you know you have fixed-width data. Do we want to try to support that?

Yes, I think it would be nice to support fixed-width data. In principle this can be done with just raw byte / int arrays, right? Logical vs. physical dtype etc.
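
One way to picture the logical vs. physical split, using only NumPy: a fixed-width "U" array is stored as UCS-4 code points, so it can be viewed as a uint32 array with one extra trailing axis and viewed back without copying the characters. The helper names below are hypothetical and this is a sketch of the idea, not a proposed Zarr API.

```python
import numpy as np


def to_code_points(strings: np.ndarray) -> np.ndarray:
    """Physical form: one little-endian uint32 code point per character slot."""
    width = strings.dtype.itemsize // 4  # NumPy stores "U" data as UCS-4
    fixed = np.ascontiguousarray(strings, dtype=f"<U{width}")
    return fixed.view("<u4").reshape(*strings.shape, width)


def from_code_points(codes: np.ndarray) -> np.ndarray:
    """Logical form: reinterpret the trailing axis as one fixed-width string."""
    width = codes.shape[-1]
    fixed = np.ascontiguousarray(codes, dtype="<u4")
    return fixed.view(f"<U{width}").reshape(codes.shape[:-1])


logical = np.array(["a", "bb", "ccc"], dtype="U3")
physical = to_code_points(logical)      # shape (3, 3), dtype uint32
roundtrip = from_code_points(physical)  # back to dtype <U3
```

Short strings are padded with zero code points, which is how NumPy itself pads fixed-width unicode, so the round trip is lossless.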

jhamman · 2024-10-11T15:26:42Z

Can we move the fixed-width string conversation to a separate issue so we can merge this?

TomAugspurger · 2024-10-11T16:00:15Z

Yep, definitely.

Special case object dtype

483681b

Closes zarr-developers#2315

TomAugspurger commented Oct 9, 2024

rabernat requested changes Oct 9, 2024

fixup

7e76e9e

TomAugspurger force-pushed the tom/fix/dtype-str-special-case branch from f85bb19 to 7e76e9e Compare October 9, 2024 15:09

TomAugspurger marked this pull request as ready for review October 9, 2024 15:10

rabernat approved these changes Oct 9, 2024

TomAugspurger added the downstream Downstream libraries using zarr label Oct 9, 2024

jhamman approved these changes Oct 9, 2024

jhamman added this to the 3.0.0.beta milestone Oct 9, 2024

jhamman added the V3 label Oct 9, 2024

TomAugspurger added 2 commits October 9, 2024 16:17

Merge remote-tracking branch 'upstream/v3' into tom/fix/dtype-str-spe…

add45e6

…cial-case

remove dead code

2db00ff

jhamman mentioned this pull request Oct 9, 2024

Zarr-v3 Consolidated Metadata #2113

Merged

6 tasks

jhamman and others added 3 commits October 9, 2024 20:46

Merge branch 'v3' into tom/fix/dtype-str-special-case

df92bad

Merge remote-tracking branch 'refs/remotes/origin/tom/fix/dtype-str-s…

494e006

…pecial-case' into tom/fix/dtype-str-special-case

fixup

4b0a39e

TomAugspurger commented Oct 10, 2024

automatically add filter

d8f24a8

TomAugspurger added 2 commits October 10, 2024 14:50

Merge remote-tracking branch 'upstream/v3' into tom/fix/dtype-str-spe…

9603b0e

…cial-case

maybe fixed

509a5c1

Merge branch 'v3' into tom/fix/dtype-str-special-case

6ea15ea

jhamman approved these changes Oct 11, 2024

d-v-b approved these changes Oct 11, 2024

TomAugspurger merged commit 0c679c8 into zarr-developers:v3 Oct 11, 2024
20 checks passed

TomAugspurger deleted the tom/fix/dtype-str-special-case branch October 11, 2024 16:00

TomAugspurger mentioned this pull request Oct 12, 2024

[v3] Fixed-width unicode string support in zarr v3 #2347

Open
