introspecting nested duck arrays #843

keewis · 2024-09-21T19:11:18Z

keewis
Sep 21, 2024

This is only loosely related to the array API standard, but over the past year I've been thinking on and off about how to deal with dask-wrapped cupy arrays in xarray. While I am focusing on dask+cupy here, I believe this would also be helpful for any other kind of (possibly deeper) nested array, such as a unit-aware, masked and chunked sparse array, which involves layering four types: the unit-aware type, the masked array type, the chunking type, and the sparse array type.

The main issue is that cupy itself absolutely refuses to interact with any kind of numpy array, even 0d arrays. This is an issue, because most functions in the array API standard are explicitly defined in terms of arrays, which means that for scalars xarray will have to figure out which type to convert to. For simple arrays this would simply defer to xp.asarray (as pointed out in the issue above by @rgommers), but as soon as we have a more deeply nested array this becomes impractical.

Instead, I believe the best way to resolve this is to figure out which of the layers of array type is responsible for the actual bytes in memory (usually, this will be the innermost array layer). For this purpose I've come up with a recursive protocol (tentatively named __array_layers__) that is only supposed to be defined by array-wrapping libraries.

Calling that protocol would return a tuple of type objects, one for each layer and with the outermost layer at index 0. For the example above, using pint, dask, marray and sparse (marray, I believe, is still experimental so this might not actually work):

>>> x.__array_layers__()
(pint.Quantity, dask.array.Array, marray.MaskedArray, sparse.COO)

This would allow any library to inspect the stack of array types, and would allow xarray to find cupy (and the cupy namespace) underneath other layers of duck arrays.
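
A minimal sketch of how such a recursive protocol might look, using made-up stand-in classes rather than the real pint, dask or sparse types. Only the `__array_layers__` name comes from the proposal above; `DeviceArray`, `LayeredWrapper`, `UnitWrapper`, `ChunkWrapper`, `innermost_layer` and the `_data` attribute are hypothetical and exist only for illustration.

```python
class DeviceArray:
    """Stand-in for a concrete array type that owns the memory.

    It is not a wrapper, so it does not define __array_layers__.
    """


class LayeredWrapper:
    """Hypothetical base class for array-wrapping types."""

    def __init__(self, data):
        self._data = data  # the array one level below this layer

    def __array_layers__(self):
        inner = getattr(self._data, "__array_layers__", None)
        # Recurse if the wrapped object is itself a wrapper; otherwise it is
        # the innermost layer and contributes only its own type.
        inner_layers = inner() if inner is not None else (type(self._data),)
        return (type(self), *inner_layers)


class UnitWrapper(LayeredWrapper):
    """Stand-in for a unit-aware layer."""


class ChunkWrapper(LayeredWrapper):
    """Stand-in for a chunked / lazy layer."""


def innermost_layer(x):
    """Return the last layer, assumed here to be the one owning the memory."""
    layers = getattr(x, "__array_layers__", None)
    return layers()[-1] if layers is not None else type(x)


x = UnitWrapper(ChunkWrapper(DeviceArray()))
layers = x.__array_layers__()  # outermost layer at index 0
```

With this shape, a consumer such as xarray would only need to look at the last entry of the tuple to decide which library should create a scalar. How to get from that type to its namespace is left open here.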

What I'm hoping for in opening this discussion is feedback, mostly on the protocol (the name and the mechanism used to define it), but maybe also whether you can see a way to resolve my issue without a new protocol.

cc @shoyer, @tomwhite, @TomNicholas, @rgommers

lucascolley · 2024-09-21T19:42:21Z

lucascolley
Sep 21, 2024
Collaborator

For simple arrays this would simply defer to xp.asarray (as pointed out in the issue above by @rgommers), but as soon as we have a more deeply nested array this becomes impractical.

If I understand you correctly, for the (pint.Quantity, dask.array.Array, marray.MaskedArray, sparse.COO) example you are wanting to access the sparse namespace to work with scalars. The problem is that x.__array_namespace__ returns a pint namespace rather than a, say, "pint(dask(marray(sparse)))" namespace, correct? If (hypothetically) you did get the latter, that namespace should still be compatible with the underlying sparse type for standard operations, right? So perhaps this is implementable just by array-wrapping libraries returning a different namespace based on the namespace of the library one-level below (the one they directly wrap)?

I may be completely misunderstanding!

7 replies

keewis Sep 21, 2024
Author

let's look at the original motivating example, which is dask(cupy). In that case, we don't really need the dask layer since we already have the scalar in memory. Instead, we only want a 0d cupy array without having to explicitly tell the processing library how to find the correct namespace. Also, most of the time the scalars are default values like np.nan for xp.where (which really is the motivating use case).

What I want to do is:

xp = some_nested_array_with_cupy.__array_namespace__()
arr = xp.asarray(known_scalar_input)  # mostly `nan`, but could be something else
xp.where(cond, some_nested_array_with_cupy, arr)  # doesn't work at the moment, because `xp` is the outermost wrapping library's namespace

You're correct, though, that in some cases we do want to reproduce the full stack of layers, for example with xp.full_like. I do feel like that's something to resolve when it is actually needed; I currently don't have an actual use case.

lucascolley Sep 21, 2024
Collaborator

Thanks, I think I understand. I think you may be technically straying into undefined behaviour here? The standard docs say

Non-goals for the API standard include:

Making it possible to mix multiple array libraries in function calls.

Most array libraries do not know about other libraries, and the functions they implement may try to convert “foreign” input, or raise an exception. This behaviour is hard to specify; ensuring only a single array type is used is best left to the end user.

Of course, this is one of those cases where Dask is actually well-aware that CuPy exists (in fact it wants you to provide a CuPy array). But strictly xp.where(..., some_dask_cupy_array, some_cupy_array) is undefined I think.

From the standard's perspective, the bug is:

xp.where(cond, some_nested_array_with_cupy, arr)  # doesn't work at the moment, because `xp` is the outermost wrapping library's namespace

That should work if xp conforms to the standard. For the dask(cupy) example, is the problem that

>>> some_dask_cupy_array.__array_namespace__().asarray(float(5.0))
np.float64(5.0)

and np.float64(5.0) is not interoperable with the CuPy-backed array in where? (I'm just guessing here, I haven't checked if this is how Dask behaves).

EDIT: changed example from nan to numerical float

keewis Sep 21, 2024
Author

I think you may be technically straying into undefined behaviour here?

yep, that's why I'm saying this is related, but am trying to solve this outside of the array API standard (i.e. this is not a request to extend the standard, instead I'm using this as a discussion forum about array standardization).

For the dask(cupy) example, is the problem that

No, this is not the problem. Rather, xp.where only supports array inputs, so a simple np.nan (which really is an alias of float("nan")) or 0 / 1 has to be converted to an array, and it is choosing the right array type for this that is my issue.

I guess this example is further complicated by dask not actually implementing __array_namespace__, which means my motivating use case does not have a lot to do with the array API. However, xarray is trying to be able to dispatch to multiple lazy array computation frameworks, so if you replace dask with cubed that example should still work.

In any case, with xarray being pretty old by now, this is all a mix between the array API and the older __array_{ufunc,function}__ protocols, so forgive me if I'm a bit fuzzy here. I guess this is the result of xarray trying to support all of these protocols at the same time.

lucascolley Sep 21, 2024
Collaborator

I guess this example is further complicated by dask not actually implementing __array_namespace__, which means my motivating use case does not have a lot to do with the array API. However, xarray is trying to be able to dispatch to multiple lazy array computation frameworks, so if you replace dask with cubed that example should still work.

Right, I guess the key question is whether my initial assumption that every wrapper-library in the chain can implement __array_namespace__ is a reasonable one. If you are specifically considering support for libraries that will not be able to provide __array_namespace__, then I am barking up the wrong tree :)

I suppose I was motivated by the mention of marray, which is specifically designed to wrap standard-compatible arrays and provide a compatible namespace.

Rather, xp.where only supports array inputs, so a simple np.nan (which really is an alias of float("nan")) or 0 / 1 has to be converted to an array, and it is choosing the right array type for this that is my issue.

Could you show the code that wouldn't work? If some_dask_cupy_array.__array_namespace__().asarray(float(5.0)) returned a CuPy-backed Dask array, then your xp.where example above should work fine?

I know that Dask doesn't implement __array_namespace__, but there is partial support in array-api-compat. We're not a million miles off getting it mostly working in SciPy.

keewis Sep 22, 2024
Author

If some_dask_cupy_array.__array_namespace__().asarray(float(5.0)) returned a CuPy-backed Dask array, then your xp.where example above should work fine?

It should, albeit slightly less efficiently (though dask should be able to optimize that inefficiency away).

rgommers · 2024-09-22T08:08:11Z

rgommers
Sep 22, 2024
Maintainer

The main issue is that cupy itself absolutely refuses to interact with any kind of numpy array, even 0d arrays. This is an issue, because most functions in the array API standard are explicitly defined in terms of arrays, which means that for scalars xarray will have to figure out which type to convert to

I'll note that @shoyer brought this up as the key issue making it hard for Xarray to fully implement the standard in gh-807, and the conclusion of that discussion is that we will allow scalars more broadly in all functions (in the next revision of the standard, v2024), as long as there is still a single input argument that's an array so that the output type/device/dtype/etc. can be determined.

Does that address the problem well enough from your perspective @keewis?
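
Until scalars are accepted directly, a consumer has to do the conversion itself. The sketch below shows one way that might look: Python scalars are turned into 0-d arrays that match the first array argument. `coerce_scalars`, `ToyArray` and `toy_xp` are hypothetical names; the toy namespace only stands in for a real one so the example runs without a GPU. It assumes the namespace's `asarray` accepts `dtype` and `device` keywords, and it deliberately ignores type promotion (for example, `nan` with an integer array).

```python
from dataclasses import dataclass

PY_SCALARS = (bool, int, float, complex)


@dataclass
class ToyArray:
    """Stand-in for a device array; records its dtype and device."""
    value: object
    dtype: str = "float64"
    device: str = "gpu"


class toy_xp:
    """Stand-in namespace exposing only the function used below."""

    @staticmethod
    def asarray(obj, dtype=None, device=None):
        return ToyArray(obj, dtype or "float64", device or "cpu")


def coerce_scalars(xp, *args):
    """Convert Python scalars in args to arrays matching the first array."""
    refs = [a for a in args if not isinstance(a, PY_SCALARS)]
    if not refs:
        # Mirrors the rule that at least one input must be an array, so the
        # output type, device and dtype can be determined.
        raise TypeError("at least one argument must be an array")
    ref = refs[0]
    return tuple(
        xp.asarray(a, dtype=ref.dtype, device=ref.device)
        if isinstance(a, PY_SCALARS)
        else a
        for a in args
    )


data = ToyArray([1.0, 2.0])
data_out, fill = coerce_scalars(toy_xp, data, float("nan"))
```

Here `fill` ends up on the same device as `data`, which is the property the dask+cupy case needs.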

1 reply

keewis Sep 22, 2024
Author

I do think the protocol I've come up with is still useful in other situations, but this could potentially resolve the particular issue I had with dask+cupy, yes. I'll have to figure out how to combine that with xarray's casting rules for some of the numpy dtypes (which are all not part of the array API standard), though.

asmeurer · 2024-09-23T21:32:02Z

asmeurer
Sep 23, 2024
Maintainer

If some_dask_cupy_array.__array_namespace__().asarray(float(5.0)) returned a CuPy-backed Dask array, then your xp.where example above should work fine?

I agree with this notion. An advantage of the x.__array_namespace__() API is that it doesn't have to be just the single top-level namespace every time. It could be parameterized per array. So if x is a CuPy-backed Dask array, x.__array_namespace__() could be a namespace with "cupy-mode" enabled that configures the creation functions to return CuPy arrays (or at least configures operations to not try to do things that will cause CuPy to error out).

More generally, it seems to me that if Dask knows it is wrapping CuPy arrays, then it should be Dask's responsibility to make sure functions like asarray return CuPy-wrapped arrays. Trying to get the array consumer at the top to handle everything sounds like a nightmare for every array library to have to handle, vs. just requiring each "array wrapping library" to (recursively) do the right thing with the array types it is wrapping.
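
A rough sketch of what a per-array, recursively parameterized namespace could look like. This is not how Dask or any existing library behaves; `Chunked`, `ChunkedNamespace`, `DeviceArray`, `device_xp` and the `_data` attribute are hypothetical stand-ins, and only `asarray` is shown.

```python
from types import SimpleNamespace


class DeviceArray:
    """Stand-in for a backend array type such as a GPU array."""

    def __init__(self, value):
        self.value = value

    def __array_namespace__(self, *, api_version=None):
        return device_xp


# Stand-in backend namespace with a single creation function.
device_xp = SimpleNamespace(
    asarray=lambda obj: obj if isinstance(obj, DeviceArray) else DeviceArray(obj)
)


class ChunkedNamespace:
    """Wrapper namespace, parameterized by the namespace it wraps."""

    def __init__(self, inner_xp):
        self._inner_xp = inner_xp

    def asarray(self, obj):
        if isinstance(obj, Chunked):
            return obj
        # Create the payload with the wrapped library, then re-wrap it, so
        # the result has the same stack of layers as the original array.
        return Chunked(self._inner_xp.asarray(obj))


class Chunked:
    """Stand-in for an array-wrapping (e.g. chunked) type."""

    def __init__(self, data):
        self._data = data

    def __array_namespace__(self, *, api_version=None):
        # Each wrapper asks the layer below for its namespace, so the
        # recursion reaches the innermost library automatically.
        return ChunkedNamespace(self._data.__array_namespace__())


x = Chunked(DeviceArray([1.0, 2.0]))
xp = x.__array_namespace__()
fill = xp.asarray(float("nan"))  # a Chunked wrapping a DeviceArray
```

The consumer never inspects the layers in this design; each wrapper only needs to know about the layer directly below it.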

As far as introspection, I would worry whether introspection APIs might limit the sorts of wrapping that can happen.

0 replies
