Fix upcasting with python builtin numbers and numpy 2 #8946

djhoese · 2024-04-15T20:07:42Z

See #8402 for more discussion. Bottom line is that numpy 2 changes the rules for casting between two inputs. Due to this and xarray's preference for promoting python scalars to 0d arrays (scalar arrays), xarray objects are being upcast to higher data types when they previously didn't.

I'm mainly opening this PR for further and more detailed discussion.

CC @dcherian

Closes where dtype upcast with numpy 2 #8402, closes ⚠️ Nightly upstream-dev CI failed ⚠️ #8844
Tests added
User visible changes (including notable bug fixes) are documented in whats-new.rst
New functions/methods are listed in api.rst

djhoese · 2024-04-15T20:11:30Z

Ugh my local clone was so old it was pointing to master. One sec...

for more information, see https://pre-commit.ci

djhoese · 2024-04-15T21:00:03Z

Ok so the failing test is the array-api version (https://github.com/data-apis/array-api-compat) where it expects both the x and y inputs of the where function to be .dtype. Since we're skipping scalar->array conversion in this PR those objects won't have a .dtype. I'm not sure what the rules are for the strict array API having scalar inputs.

dcherian · 2024-04-15T21:07:46Z

Looks like the array api strictly wants arrays: https://data-apis.org/array-api/latest/API_specification/generated/array_api.where.html

djhoese · 2024-04-15T21:07:48Z

Related but I don't fully understand it: data-apis/array-api-compat#85

djhoese · 2024-04-16T02:06:47Z

I guess it depends how you interpret the array API standard then. I can file an issue if needed. To me, depending on how you read the standard, it means either:

This test is flawed as it tests scalar inputs when the array API specifically defines Array inputs.
The Array API package is flawed because it assumes and requires Array inputs when the standard allows for scalar inputs (I don't think this is true if I'm understanding the description).

The other point is that maybe numpy compatibility is more important until numpy more formally conforms to the array API standard (see the first note on https://data-apis.org/array-api/latest/API_specification/array_object.html#api-specification-array-object--page-root). But also type promotion seems wishy-washy and not super strict: https://data-apis.org/array-api/latest/API_specification/type_promotion.html#mixing-arrays-with-python-scalars

I propose, because it works best for me and matches numpy compatibility, that I update the test to have a numpy case only but add a new test function with numpy and array api cases with array inputs to .where instead of scalars.

This reverts commit 3f7670b.

* main: (feat): Support for `pandas` `ExtensionArray` (pydata#8723) Migrate datatree mapping.py (pydata#8948) Add mypy to dev dependencies (pydata#8947) Convert 360_day calendars by choosing random dates to drop or add (pydata#8603)

dcherian · 2024-04-18T14:25:17Z

I lean towards (1).

I looked at this for a while, and we'll need major changes around handling array API dtype objects to do this properly.

cc @keewis

keewis · 2024-04-22T09:42:26Z

we'll need major changes around handling array API dtype objects to do this properly.

I think the change could be limited to xarray.core.duck_array_ops.as_shared_dtype. According to the Array API section on mixing scalars and arrays, we should to use the dtype of the array (though it only looks at scalar + 1 array, so we'd need to extend that).

However, what we currently do is cast all scalars to arrays using asarray, which means python scalars use the OS default dtype (e.g. float64 on most 64-bit systems).

As a algorithm, maybe this could work:

separate the input into python scalars and arrays / scalars with dtype
determine result_type using just the arrays / scalars with dtype
check that all python scalars are compatible with the result (otherwise might have to return object?)
cast all input to arrays with the dtype

djhoese · 2024-04-22T15:46:33Z

According to the Array API section on mixing scalars and arrays, we should to use the dtype of the array (though it only looks at scalar + 1 array, so we'd need to extend that).

Do you know if this is inline with numpy 2 dtype casting behavior?

keewis · 2024-04-22T16:00:54Z

The main numpy namespace is supposed to be Array API compatible, so it should? I don't know for certain, though.

djhoese · 2024-04-22T19:53:36Z

check that all python scalars are compatible with the result (otherwise might have to return object?)

How do we check this?

djhoese · 2024-04-22T20:11:25Z

Here's what I have locally which seems to pass:

Subject: [PATCH] Cast scalars as arrays with result type of only arrays
---
Index: xarray/core/duck_array_ops.py
IDEA additional info:
Subsystem: com.intellij.openapi.diff.impl.patch.CharsetEP
<+>UTF-8
===================================================================
diff --git a/xarray/core/duck_array_ops.py b/xarray/core/duck_array_ops.py
--- a/xarray/core/duck_array_ops.py	(revision e27f572585a6386729a5523c1f9082c72fa8d178)
+++ b/xarray/core/duck_array_ops.py	(date 1713816523554)
@@ -239,20 +239,30 @@
         import cupy as cp
 
         arrays = [asarray(x, xp=cp) for x in scalars_or_arrays]
+        # Pass arrays directly instead of dtypes to result_type so scalars
+        # get handled properly.
+        # Note that result_type() safely gets the dtype from dask arrays without
+        # evaluating them.
+        out_type = dtypes.result_type(*arrays)
     else:
-        arrays = [
-            # https://github.com/pydata/xarray/issues/8402
-            # https://github.com/pydata/xarray/issues/7721
-            x if isinstance(x, (int, float, complex)) else asarray(x, xp=xp)
-            for x in scalars_or_arrays
-        ]
-    # Pass arrays directly instead of dtypes to result_type so scalars
-    # get handled properly.
-    # Note that result_type() safely gets the dtype from dask arrays without
-    # evaluating them.
-    out_type = dtypes.result_type(*arrays)
+        # arrays = [
+        #     # https://github.com/pydata/xarray/issues/8402
+        #     # https://github.com/pydata/xarray/issues/7721
+        #     x if isinstance(x, (int, float, complex)) else asarray(x, xp=xp)
+        #     for x in scalars_or_arrays
+        # ]
+        objs_with_dtype = [obj for obj in scalars_or_arrays if hasattr(obj, "dtype")]
+        if objs_with_dtype:
+            # Pass arrays directly instead of dtypes to result_type so scalars
+            # get handled properly.
+            # Note that result_type() safely gets the dtype from dask arrays without
+            # evaluating them.
+            out_type = dtypes.result_type(*objs_with_dtype)
+        else:
+            out_type = dtypes.result_type(*scalars_or_arrays)
+        arrays = [asarray(x, xp=xp) for x in scalars_or_arrays]
     return [
-        astype(x, out_type, copy=False) if hasattr(x, "dtype") else x for x in arrays
+        astype(x, out_type, copy=False) for x in arrays
     ]

I just through it together to see if it would work. I'm not sure it is accurate, but the fact that it is almost exactly like the existing solution with the only difference being the out_type = changes makes me feel this is going in a good direction.

Note I had to do if objs_with_dtype: because the test passes two python scalars so there are no arrays to determine the result type.

keewis · 2024-04-22T21:10:25Z

How do we check this?

Not sure... but there are only so many builtin types that can be involved without requiring object dtype, so we could just enumerate all of them? As far as I can tell, that would be: bool, int, float, str, datetime/date, and timedelta

djhoese · 2024-04-26T01:46:41Z

check that all python scalars are compatible with the result (otherwise might have to return object?)

How do we check this?

@keewis Do you have a test that I can add to verify any fix I attempt for this? What do you mean by python scalar being compatible with the result?

keewis · 2024-04-26T19:36:48Z

well, for example, what should happen for this:

a = xr.DataArray(np.array([1, 2, 3], dtype="int8"), dim="x")
xr.where(a % 2 == 1, a, 1.2)

according to the algorithm above, we have one array of dtype int8, so that means we'd have to check if 1.2 (a float) is compatible with int8. It is not, so we should promote everything to float (the default would be to use float64, which might be a bit weird).

Something similar:

a = xr.DataArray(np.array(["2019-01-01", "2020-01-01"], dtype="datetime64[ns]"), dim="x")
xr.where(a.x % 2 == 1, a, datetime.datetime(2019, 6, 30))

in that case, the check should succeed, because we can convert a builtin datetime object to datetime64[ns].

djhoese · 2024-04-28T02:06:31Z

I committed my (what I consider ugly) implementation of your original approach @keewis. I'm still not sure I understand how to approach the scalar compatibility so if someone has some ideas then please make some suggestion comments or commits directly if you have the permissions.

keewis · 2024-04-28T10:16:45Z

this might be cleaner:

def asarray(data, xp=np, dtype=None):
    return data if is_duck_array(data) else xp.asarray(data, dtype=dtype)


def as_shared_dtype(scalars_or_arrays, xp=np):
    """Cast a arrays to a shared dtype using xarray's type promotion rules."""
    if any(is_extension_array_dtype(x) for x in scalars_or_arrays):
        # as soon as extension arrays are involved we only use this:
        extension_array_types = [
            x.dtype for x in scalars_or_arrays if is_extension_array_dtype(x)
        ]
        if len(extension_array_types) == len(scalars_or_arrays) and all(
            isinstance(x, type(extension_array_types[0])) for x in extension_array_types
        ):
            return scalars_or_arrays
        raise ValueError(
            f"Cannot cast arrays to shared type, found array types {[x.dtype for x in scalars_or_arrays]}"
        )

    if array_type_cupy := array_type("cupy") and any(  # noqa: F841
        isinstance(x, array_type_cupy) for x in scalars_or_arrays  # noqa: F821
    ):
        import cupy as cp

        xp_ = cp
    else:
        xp_ = xp

    # split into python scalars and arrays / numpy scalars (i.e. into weakly and strongly dtyped)
    with_dtype = {}
    python_scalars = {}
    for index, elem in enumerate(scalars_or_arrays):
        append_to = with_dtype if hasattr(elem, "dtype") else python_scalars
        append_to[index] = elem

    if with_dtype:
        to_convert = with_dtype
    else:
        # can't avoid using the default dtypes if we only get weak dtypes
        to_convert = python_scalars
        python_scalars = {}

    arrays = {index: asarray(x, xp=xp_) for index, x in to_convert.items()}

    common_dtype = dtypes.result_type(*arrays.values())
    # TODO(keewis): check that all python scalars are compatible. If not, change the dtype or raise.

    # cast arrays
    cast = {index: astype(x, dtype=common_dtype, copy=False) for index, x in arrays.items()}
    # convert python scalars to arrays with a specific dtype
    converted = {index: asarray(x, xp=xp_, dtype=common_dtype) for index, x in python_scalars.items()}

    # merge both
    combined = cast | converted
    return [x for _, x in sorted(combined.items(), key=lambda x: x[0])]

This is still missing the dtype fallbacks, though.

keewis · 2024-04-28T11:13:36Z

I see now why the dtype fallbacks for scalars is tricky... we basically need to enumerate the casting rules, and decide when to return a different dtype (like object). numpy has can_cast with the option to choose the strictness (so we could use "same_kind") and it accepts python scalar types, while the Array API does not allow that choice, and we also can't pass in python scalar types.

To start, here's the rules from the Array API:

complex dtypes are compatible with int, float, or complex
float dtypes are compatible with any int or float
int dtypes are compatible with int (but beware: python uses BigInt, so the value might exceed the maximum of the dtype)
the bool dtype is only compatible with bool

From numpy, we also have these (numpy casting is even more relaxed than this, but that behavior may also cause some confusing issues):

bool can be cast to int, so it is compatible with anything int is compatible with
str dtypes are only compatible with str. Anything else, like formatting and casting to other types, has to be done explicitly before calling as_shared_dtype.
datetime dtypes (precisions) are compatible with datetime.datetime, datetime.date, and pd.Timestamp
timedelta dtypes (precisions) are compatible with datetime.timedelta and pd.Timedelta. Casting to int is possible, but has to be done explicitly (i.e. we can ignore it here)
anything else results in a object dtype

Edit: it appears NEP 50 describes the changes in detail. I didn't see that before writing both the list above and implementing the changes, so I might have to change both.

keewis · 2024-04-28T16:23:06Z

here's my shot at the ~~scalar dtype verification~~ (the final implementation we settled on in the end is much better). I'm pretty sure it can be cleaned up further (and we need more tests), but it does fix all the casting issues. Edit: note that this depends on the Array API fixes for numpy>=2.

What I don't like is that we're essentially hard-coding the dtype casting hierarchy, but I couldn't figure out a way to make it work without that.

djhoese · 2024-05-12T16:55:49Z

FYI to everyone watching this, I'm going to be switching to a heavier paternity leave than I was already starting this week. I think someone else should take this PR over as I don't think I'll have time to finish it in time for the numpy 2 final release.

xarray/tests/test_array_api.py

keewis · 2024-06-07T20:18:49Z

I decided to do this now rather than later. Good news is that this is finally ready for a review and possibly even merging (cc @shoyer).

Edit: ~~actually, no. Trying to run this using a default environment but with the release candidate reveals a lot of errors. I'm investigating.~~ looks like most of this was just numba being incompatible with numpy>=2.0.0.rc1. ~~Still investigating the rest.~~ more environment issues, but there's a couple of issues which should be fixed in the final release of numpy=2:

FAILED xarray/tests/test_dtypes.py::test_maybe_promote[q-expected19] - AssertionError: assert dtype('O') == <class 'numpy.float64'>
FAILED xarray/tests/test_dtypes.py::test_maybe_promote[Q-expected20] - AssertionError: assert dtype('O') == <class 'numpy.float64'>

However, there's also these:

FAILED xarray/tests/test_conventions.py::TestCFEncodedDataStore::test_roundtrip_mask_and_scale[dtype0-create_unsigned_masked_scaled_data-create_encoded_unsigned_masked_scaled_data] - OverflowError: Failed to decode variable 'x': Python integer -1 out of bounds for uint8
FAILED xarray/tests/test_conventions.py::TestCFEncodedDataStore::test_roundtrip_mask_and_scale[dtype1-create_unsigned_masked_scaled_data-create_encoded_unsigned_masked_scaled_data] - OverflowError: Failed to decode variable 'x': Python integer -1 out of bounds for uint8

~~which require further investigation~~

Edit: all good, the last one is skipped in the nightly builds because we don't have numpy>=2-compatible builds of netcdf4, yet. Once we do, we'll have to revisit.

xarray/core/duck_array_ops.py

shoyer

Looks great. Thanks @keewis for your patience here and getting this to the finish line!

xarray/core/dtypes.py

xarray/core/duck_array_ops.py

xarray/tests/test_array_api.py

keewis · 2024-06-10T10:53:53Z

if my most recent changes are fine, this should be ready for merging (the remaining upstream-dev test failures will be fixed by #9081).

Once that is done, I will cut a release to have at least one release that is compatible with numpy>=2 before that is released.

xarray/core/npcompat.py

xarray/core/array_api_compat.py

dcherian · 2024-06-10T17:54:21Z

Wow. Thanks @keewis 👏 👏

keewis · 2024-06-11T16:43:38Z

thanks for all the help and advice, @shoyer! And thanks for kicking this off, @djhoese.

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by: Deepak Cherian <deepak@cherian.net> Co-authored-by: Justus Magin <keewis@posteo.de> Co-authored-by: Justus Magin <keewis@users.noreply.github.com>

dcherian added the run-upstream Run upstream CI label Apr 15, 2024

Fix upcasting with python builtin numbers and numpy 2

f3c2c93

djhoese force-pushed the bugfix-scalar-arr-casting branch from 88e778a to f3c2c93 Compare April 15, 2024 20:14

pre-commit-ci bot and others added 2 commits April 15, 2024 20:14

[pre-commit.ci] auto fixes from pre-commit.com hooks

2c8a607

for more information, see https://pre-commit.ci

Remove old commented code

fcbd821

dcherian added 4 commits April 18, 2024 08:23

Try updating result_type for array API

3f7670b

xfail

6115dd8

Revert "Try updating result_type for array API"

e3493b0

This reverts commit 3f7670b.

Merge branch 'main' into bugfix-scalar-arr-casting

e27f572

* main: (feat): Support for `pandas` `ExtensionArray` (pydata#8723) Migrate datatree mapping.py (pydata#8948) Add mypy to dev dependencies (pydata#8947) Convert 360_day calendars by choosing random dates to drop or add (pydata#8603)

dcherian requested a review from keewis April 18, 2024 14:25

Determine python scalar dtype from array arguments

5b4384f

keewis reviewed Jun 7, 2024

View reviewed changes

xarray/tests/test_array_api.py Outdated Show resolved Hide resolved

keewis reviewed Jun 7, 2024

View reviewed changes

xarray/core/duck_array_ops.py Outdated Show resolved Hide resolved

keewis and others added 4 commits June 7, 2024 22:25

remove obsolete comment

4e86aa2

proper placement of the result_type comment

c28233c

back to using a list instead of a tuple

0977707

expect the platform-specific default dtype for integers

57b4eec

shoyer approved these changes Jun 9, 2024

View reviewed changes

xarray/core/dtypes.py Outdated Show resolved Hide resolved

xarray/core/duck_array_ops.py Show resolved Hide resolved

xarray/tests/test_array_api.py Outdated Show resolved Hide resolved

keewis added 7 commits June 10, 2024 11:09

move _future_array_api_result_type to npcompat

96e75c6

rename is_scalar_type to is_weak_scalar_type

afa495e

move the dispatch to _future_array_api_result_type to npcompat

1dc9d46

support passing *only* weak dtypes to _future_array_api_result_type

b405916

don't xfail the Array API where test

66233a1

determine the array namespace if not passed and no cupy arrays present

445195b

comment on what to do with npcompat.result_type

b2a6379

keewis added the plan to merge Final call for comments label Jun 10, 2024

Merge branch 'main' into bugfix-scalar-arr-casting

79d4a26

dcherian mentioned this pull request Jun 10, 2024

Release 2024.06.0 #9084

Closed

shoyer reviewed Jun 10, 2024

View reviewed changes

xarray/core/npcompat.py Outdated Show resolved Hide resolved

keewis added 2 commits June 10, 2024 18:39

move the result_type compat code to array_api_compat

ab284f7

also update the comment

cf11228

shoyer approved these changes Jun 10, 2024

View reviewed changes

xarray/core/array_api_compat.py Show resolved Hide resolved

keewis added 3 commits June 10, 2024 20:10

mention the Array API issue on python scalars in a comment

8dbacbc

Merge branch 'main' into bugfix-scalar-arr-casting

d0e08d8

Merge branch 'main' into bugfix-scalar-arr-casting

4c094c1

flamingbear merged commit 2013e7f into pydata:main Jun 11, 2024
27 of 28 checks passed

Uh oh!

Fix upcasting with python builtin numbers and numpy 2 #8946

Fix upcasting with python builtin numbers and numpy 2 #8946

Uh oh!

Conversation

djhoese commented Apr 15, 2024 • edited by keewis Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

djhoese commented Apr 15, 2024

Uh oh!

djhoese commented Apr 15, 2024

Uh oh!

dcherian commented Apr 15, 2024

Uh oh!

djhoese commented Apr 15, 2024

Uh oh!

djhoese commented Apr 16, 2024

Uh oh!

dcherian commented Apr 18, 2024

Uh oh!

keewis commented Apr 22, 2024

Uh oh!

djhoese commented Apr 22, 2024

Uh oh!

keewis commented Apr 22, 2024

Uh oh!

djhoese commented Apr 22, 2024

Uh oh!

djhoese commented Apr 22, 2024

Uh oh!

keewis commented Apr 22, 2024

Uh oh!

djhoese commented Apr 26, 2024

Uh oh!

keewis commented Apr 26, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

djhoese commented Apr 28, 2024

Uh oh!

keewis commented Apr 28, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

keewis commented Apr 28, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

keewis commented Apr 28, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

djhoese commented May 12, 2024

Uh oh!

Uh oh!

keewis commented Jun 7, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Uh oh!

shoyer left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

keewis commented Jun 10, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Uh oh!

Uh oh!

dcherian commented Jun 10, 2024

Uh oh!

Uh oh!

keewis commented Jun 11, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Uh oh!

djhoese commented Apr 15, 2024 •

edited by keewis

Loading

keewis commented Apr 26, 2024 •

edited

Loading

keewis commented Apr 28, 2024 •

edited

Loading

keewis commented Apr 28, 2024 •

edited

Loading

keewis commented Apr 28, 2024 •

edited

Loading

keewis commented Jun 7, 2024 •

edited

Loading

keewis commented Jun 10, 2024 •

edited

Loading

keewis commented Jun 11, 2024 •

edited

Loading