Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Special case str dtype in array creation #2323

Merged

Conversation

TomAugspurger
Copy link
Contributor

Special cases str, interpreting it as np.dtype("object"), to match zarr-python 2.x's behavior.

Closes #2315

TODO:

  • Add unit tests and/or doctests in docstrings
  • Add docstrings and API docs for any new/modified user-facing classes and functions
  • New/modified features documented in docs/tutorial.rst
  • Changes documented in docs/release.rst
  • GitHub Actions have all passed
  • Test coverage is 100% (Codecov passes)

@TomAugspurger
Copy link
Contributor Author

One thing to figure out: what do we want for the dtype in the array metadata document?

>>> arr = zarr.create(shape=10, dtype="str")
>>> json.loads(arr.store._store_dict["zarr.json"].to_bytes())
{'shape': [10],
 'fill_value': 0,
 'chunk_grid': {'name': 'regular', 'configuration': {'chunk_shape': [10]}},
 'attributes': {},
 'zarr_format': 3,
 'data_type': 'object',
 'chunk_key_encoding': {'name': 'default',
  'configuration': {'separator': '/'}},
 'codecs': [{'name': 'bytes', 'configuration': {'endian': 'little'}}],
 'node_type': 'array',
 'storage_transformers': []}

Right now it's object. In zarr-python 2.x that was |O. I'm not immediately sure which is correct but I suspect either works (since presumably it's just passed to np.dtype() and those are the same).

@@ -162,3 +164,10 @@ def parse_order(data: Any) -> Literal["C", "F"]:
if data in ("C", "F"):
return cast(Literal["C", "F"], data)
raise ValueError(f"Expected one of ('C', 'F'), got {data} instead.")


def parse_dtype(dtype: Any) -> np.dtype[Any]:
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This type could be greatly improved. Really, we want the same signature as np.dtype, but with an overload for str -> dtype(object). But maybe as a follow up.

Copy link
Contributor

@rabernat rabernat left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for working on this!

However, I'd love if if we could scope this just to V2 and not touch V3.

@@ -504,6 +504,7 @@ class DataType(Enum):
complex128 = "complex128"
string = "string"
bytes = "bytes"
object = "object"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This feels like a dealbreaker to me. Python objects are not an allowable V3 datatype. Eliminating Python objects was one of the main goals of the V3 evolution!

In #2036 I did a lot of work towards more nuanced string support in V3, and that is now mostly working with Xarray.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Python objects are not an allowable V3 datatype.

seconding this -- we definitely don't want an object dtype for v3 data.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What behavior do people want, in terms of the in-memory representation and the on-disk metadata? Is this correct?

  1. No change for zarr_format=2. We use object dtype in-memory and |O or object in the metadata document.
  2. We use the new variable-width string handling for zarr_format=3. StringDtype() (maybe with a fallback to object if NumPy<2?) in memory and string in the metadata document

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes that is what I want. I believe 2 is already implemented and tested for V3 here:

a[:, :] = data
assert np.array_equal(data, a[:, :])
assert a.metadata.data_type == DataType.string
assert a.dtype == expected_zarr_string_dtype

I didn't touch V2 however.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@TomAugspurger TomAugspurger force-pushed the tom/fix/dtype-str-special-case branch from f85bb19 to 7e76e9e Compare October 9, 2024 15:09
@TomAugspurger TomAugspurger marked this pull request as ready for review October 9, 2024 15:10
@rabernat
Copy link
Contributor

rabernat commented Oct 9, 2024

This is still not quite working with Xarray the way I was hoping. Trying to track down why.

This is with this branch merged to xarray-compat:

import xarray as xr
import zarr
import numpy as np

ds = xr.Dataset({"strings": ("b", np.array(["ab", "cdef", "g"], dtype=object))})
store = zarr.storage.MemoryStore({}, mode="w")
ds.to_zarr(store, zarr_version=2)
zarr.open_group(store, zarr_version=2, mode="r")["strings"].dtype
# -> dtype('<U')

With Zarr 2.18.3

ds = xr.Dataset({"strings": ("b", np.array(["ab", "cdef", "g"], dtype=object))})
store = zarr.storage.MemoryStore({})
ds.to_zarr(store)
zarr.open_group(store, mode="r")["strings"].dtype
# -> dtype('O')

@TomAugspurger TomAugspurger added the downstream Downstream libraries using zarr label Oct 9, 2024
src/zarr/core/array.py Outdated Show resolved Hide resolved
@jhamman jhamman added this to the 3.0.0.beta milestone Oct 9, 2024
@jhamman jhamman added the V3 label Oct 9, 2024
@jhamman jhamman mentioned this pull request Oct 9, 2024
6 tasks
@TomAugspurger
Copy link
Contributor Author

@rabernat d8f24a8 fixes the issue you raised in #2323 (comment) when using this through xarray. We also need to special case filters when zarr_format=2 and dtype=str to automatically add the vlen-utf8 codec.

With an updated xarray-compat branch I'll push up later + that change I get the expected behavior:

In [2]: import xarray as xr
   ...: import zarr
   ...: import numpy as np
   ...: 
   ...: ds = xr.Dataset({"strings": ("b", np.array(["ab", "cdef", "g"], dtype=object))})
   ...: store = zarr.storage.MemoryStore({}, mode="w")
   ...: ds.to_zarr(store, zarr_format=2)
   ...: zarr.open_group(store, zarr_format=2, mode="r")["strings"].dtype
Out[2]: dtype('O')

Ah, unfortunately reading the data doesn't quite work yet:

In [9]: zarr.open_group(store, mode="r")["strings"][:]

eventually raises with

File ~/gh/zarr-developers/zarr-python/src/zarr/codecs/_v2.py:40, in V2Compressor._decode_single(self, chunk_bytes, chunk_spec)
     38 # ensure correct dtype
     39 if str(chunk_numpy_array.dtype) != chunk_spec.dtype:
---> 40     chunk_numpy_array = chunk_numpy_array.view(chunk_spec.dtype)
     42 return get_ndbuffer_class().from_numpy_array(chunk_numpy_array)

File ~/gh/pydata/xarray/.direnv/python-3.12/lib/python3.12/site-packages/numpy/_core/_internal.py:564, in _view_is_safe(oldtype, newtype)
    561     return
    563 if newtype.hasobject or oldtype.hasobject:
--> 564     raise TypeError("Cannot change data-type for array of references.")
    565 return

TypeError: Cannot change data-type for array of references.

I'll look into that too.

@TomAugspurger
Copy link
Contributor Author

Any codec experts want to chime in on whether 509a5c1 is appropriate? If I just remove that .view() we get some failures with the chunk shape not matching. So I guess we're somehow relying on that for correctness. But .view(object) isn't valid so I think we're OK to skip it?

@d-v-b
Copy link
Contributor

d-v-b commented Oct 10, 2024

Any codec experts want to chime in on whether 509a5c1 is appropriate? If I just remove that .view() we get some failures with the chunk shape not matching. So I guess we're somehow relying on that for correctness. But .view(object) isn't valid so I think we're OK to skip it?

If you leave the .view(object) do we get runtime errors? might be interesting to know where those are.

@TomAugspurger
Copy link
Contributor Author

Yeah, the traceback above has the relevant bit: TypeError: Cannot change data-type for array of references.

Copy link
Contributor

@d-v-b d-v-b left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this looks good! As a non-string-dtype user I'm pretty surprised by the complexity involved in getting strings working in v3. Do you think we should have a section of the docs that specifically covers strings in zarr 2 / 3? That would be a separate effort from this PR of course.

@TomAugspurger
Copy link
Contributor Author

Happy to write up those docs. One question on the intended behavior for v3. AFAICT, we don't support fixed-width, unicode strings. That uses StringDType().

In [1]: import zarr

In [2]: arr = zarr.create(shape=(3,), dtype="U3")

In [3]: arr[:] = ['a', 'bb', 'ccc']

In [4]: arr[:]
Out[4]: array(['a', 'bb', 'ccc'], dtype=StringDType())

There are some advantages to fixed-width strings when you know you have fixed-width data. Do we want to try to support that?

@rabernat
Copy link
Contributor

There are some advantages to fixed-width strings when you know you have fixed-width data. Do we want to try to support that?

Yes, I think it would be nice to support fixed-width data. In principle this can be done with just raw byte / int arrays, right? Logical vs. physical dtype etc.

@jhamman
Copy link
Member

jhamman commented Oct 11, 2024

Can we move the fixed-width string conversation to a separate issue so we can merge this?

@TomAugspurger
Copy link
Contributor Author

Yep, definitely.

@TomAugspurger TomAugspurger merged commit 0c679c8 into zarr-developers:v3 Oct 11, 2024
20 checks passed
@TomAugspurger TomAugspurger deleted the tom/fix/dtype-str-special-case branch October 11, 2024 16:00
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
downstream Downstream libraries using zarr
Projects
Status: Done
Development

Successfully merging this pull request may close these issues.

zarr.create(dtype=str) has different dtype in 3 (<U) vs 2 (O)
4 participants