-
Notifications
You must be signed in to change notification settings - Fork 52
Boolean array indexing is not compatible with static memory allocation #84
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Comments
Dask’s lazy evaluation struggles with this a bit too. The output shape of a Boolean mask will have NaN for the shape in that dimension, and .size will be NaN.
… On Nov 11, 2020, at 17:11, Stephan Hoyer ***@***.***> wrote:
JAX is built on top of XLA which (currently) requires static memory allocation for operations. This means it is not possible to express operations like x[y] where y is a boolean, because the resulting array has a dynamic size: https://data-apis.github.io/array-api/latest/API_specification/indexing.html#boolean-array-indexing
It's definitely possible that the static memory allocation requirement could be relaxed in the future, but dynamic memory allocation is always going to be harder to implement in a performant way. For example, I would guess Numba also struggles with this sort of operation. I don't think we should require it for array libraries implementing our standard, since it isn't needed for the majority of array operations.
—
You are receiving this because you are subscribed to this thread.
Reply to this email directly, view it on GitHub, or unsubscribe.
|
Numba does support boolean indexing, but only for 1-D arrays - and you cannot mix it with other advanced indexing. |
Related JAX issues with discussion:
It does seem that there's very limited support for boolean indexing (only static boolean arrays, which are not all that useful in practice), and none for jitted functions. This led me to wonder about other functions where output shape depends on values in the input, such as TensorFlow is in a similar boat, but provides a boolean_mask function instead. And there's an open PR for JAX to add EDIT: add the most informative comment from the JAX PR: |
|
I think that's the complete list of functionality where shapes depend on values - so maybe that's a good set to separate out somehow. They are fairly heavily used functions, so would be good to dig up some examples - say in SciPy or scikit-learn - and figure out if those are easy to rewrite. |
Of course this can happen with basic indexing as well:
So there needs to be something said about the principle of shapes depending on values. |
The summary of a discussion we had a few days ago:
|
This is true. Though there are some subtleties (which I'm sure Tom also knows), but others here may not. For example, if we know the values in the mask beforehand (it's not a Dask Array) and raveling is not involved (it's applied along a single dimension), we can produce a Dask Array of known shape. However if one of those things is not true, then the Dask Array's shape is undetermined. In [1]: import numpy as np
In [2]: import dask.array as da
In [3]: X = da.ones((10, 11, 12),chunks=(2, 3, 4))
In [4]: Y = np.arange(10) > 5
In [5]: X[Y].chunks
Out[5]: ((2, 2), (3, 3, 3, 2), (4, 4, 4))
In [6]: Y2 = np.arange(10 * 11 * 12).reshape(10, 11, 12) > (10 * 11 * 12 / 2)
In [7]: X[Y2].chunks
Out[7]: ((nan, nan, nan, nan, nan, nan, nan, nan, nan, nan),)
In [8]: Y3 = da.asarray(Y, chunks=2)
In [9]: X[Y3].chunks
Out[9]: ((nan, nan, nan, nan, nan), (3, 3, 3, 2), (4, 4, 4)) |
Sorry, looks like I overlooked this discussion. I don't think CuPy has full support over boolean masks either: cupy/cupy#4693 (comment). |
From the CuPy issue: Combining a multidimensional boolean array with other indexing causes an IndexError. https://data-apis.github.io/array-api/latest/API_specification/indexing.html#boolean-array-indexing doesn't say it's supported to combine boolean indexing with another indexing mode, however it's not stated clearly. The whole indexing section needs to be easier to understand. For boolean indexing in particular, the way I read what's there now is: for a boolean array index I believe the intent was to allow combining a single boolean index with regular basic indexing. |
Indeed. I had interpreted it to mean that boolean indices were only allowed as the sole index, and were not required to be supported as part of a larger multidimensional index. I believe this was intentional, because mixing a boolean array index with an slice and integer index in NumPy can trigger the weird "advanced indices separated by slice, ellipsis, or newaxis" rule (https://numpy.org/doc/stable/reference/arrays.indexing.html#combining-advanced-and-basic-indexing). Note that for the purposes of this rule, integer indices are treated like 0-dimensional integer array indices: >>> b = np.array([[True, False], [True, True]])
>>> a = np.arange(1*2*2*4*5).reshape((1, 2, 2, 4, 5))
>>> a[:, b, :].shape
(1, 3, 4, 5)
>>> a[:, b, :, 0].shape
(3, 1, 4) (note how the My understanding is that we want to avoid making this rule part of the spec (Travis had expressed to me that he wished it to not be there, and that he would prefer if NumPy didn't do it in the first place). So if we do allow boolean arrays as part of larger multidimensional indices, it needs to be with caveats that disallow this case. Another oddity I just noticed the boolean indexing section: it says "this is equivalent to |
(and to be clear, either way, the spec should be clarified, because I agree the way it's written now is a bit ambiguous) |
dask.array has a restriction that bool indexing can be combined with other indexing only when it has dimension 1. However, it implements different semantics from NumPy in the above case: >>> anp = np.arange(3*4*5*6).reshape((3, 4, 5, 6))
>>> bnp = np.array([True, True, False, False])
>>> anp[:, bnp, :].shape
(3, 2, 5, 6)
>>> anp[:, bnp, :, 0].shape
(2, 3, 5)
>>> a = dask.array.arange(3*4*5*6).reshape((3, 4, 5, 6))
>>> b = dask.array.array([True, True, False, False])
>>> a[:, b, :].shape
(3, nan, 5, 6)
>>> a[:, b, :, 0].shape
(3, nan, 5) Pytorch also appears to implement this but with different semantics >>> at = torch.arange(3*4*5*6).reshape((3, 4, 5, 6))
>>> bt = torch.tensor([True, True, False, False])
>>> at[:, bt, :].shape
torch.Size([3, 2, 5, 6])
>>> at[:, bt, :, 0].shape
torch.Size([3, 2, 5]) jax.numpy appears to follow numpy semantics >>> aj = jax.numpy.arange(3*4*5*6).reshape((3, 4, 5, 6))
>>> bj = jax.numpy.asarray([True, True, False, False])
>>> aj[:, bj, :].shape
(3, 2, 5, 6)
>>> aj[:, bj, :, 0].shape
(2, 3, 5) Neither jax nor pytorch seem to have the same restriction as dask on what kinds of boolean arrays are allowed. @jakirkham what is the purpose of the restrictions dask has on boolean array indices? |
Sorry I'm not following Aaron. Could you please share what is being tried that doesn't work? |
What is the purpose of this restriction in dask? >>> a = dask.array.arange(2*2*4*5).reshape((2, 2, 4, 5))
>>> b = dask.array.array([[True, False], [True, True]])
>>> a[b]
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Users/aaronmeurer/anaconda3/envs/array-apis/lib/python3.9/site-packages/dask/array/core.py", line 1705, in __getitem__
self, index2 = slice_with_bool_dask_array(self, index2)
File "/Users/aaronmeurer/anaconda3/envs/array-apis/lib/python3.9/site-packages/dask/array/slicing.py", line 1157, in slice_with_bool_dask_array
raise NotImplementedError(
NotImplementedError: Slicing with dask.array of bools only permitted when the indexer has only one dimension or when it has the same dimension as the sliced array This already isn't sufficient for the conservative interpretation of what is currently in the spec about boolean indexing, and it would presumably make it even harder to support indices combining boolean and non-boolean indexing like |
Raised as issue ( dask/dask#7550 ). That said, I wouldn't be surprised if there are certain combinations passed to slicing that don't work. There is not one reason that captures all of these. In some cases there may be a lack of interest (no one has done a PR), some are technically challenging (so may not be implementable), and in some cases they are solvable, but following NumPy exactly leads to performance issues or doesn't make sense for Dask. The latter is particularly noticeable when raveling happens implicitly ( see issue dask/dask#4855 ) |
Well the whole point of this issue is to figure out what we want to not include in the spec because it is too hard for libraries like dask, specifically relating to boolean indexing (integer array indexing is the other "hard" case, but it's already not in the spec). The spec does specify that boolean indices have to give C row-first order, so that is possibly relevant to dask/dask#4855. The specific question here is if combining (a single) boolean index with other indices is any harder than only supporting a single boolean index (sans the weird "boolean and integer separated by a slice" case that dask does differently from NumPy, and shouldn't be part of the spec regardless). Going back to the case not supported by dask, does the fact that the array must be broadcasted make it harder? Should we restrict what the spec says even further by requiring not only that a boolean index be the sole index, but that it be the exact same shape as the indexed array? |
We only require that boolean array indexing be supported when the array is the sole index. The previous text was ambiguous, and could be read as supporting boolean arrays plus other non-array indices as part of a larger multidimensional index. However, this is complicated in general, as it can lead to some corner cases. See the discussion in data-apis#84. With that being said, it is still an open discussion whether we should allow some subset of boolean array indices along with other non-array indices in the same index. I believe regardless, what we do not want to allow is: - multiple boolean arrays (this requires broadcasting semantics, which are confusing for boolean arrays) - boolean arrays and integers separated by a non-integer index (for example a[b, :, 0] where b is a boolean array). This is the corner case currently implemented by NumPy which we want to avoid for the standard.
I opened a PR clarifying that the spec only requires boolean arrays as the sole index, not as part of a larger multidimensional index. #164 If we do want to allow them as part of a multidimensional index, we should discuss that further. The big open question is whether that causes any issues for any libraries, beyond the issues that are already caused by supporting boolean array indices in the first place. |
We only require that boolean array indexing be supported when the array is the sole index. The previous text was ambiguous, and could be read as supporting boolean arrays plus other non-array indices as part of a larger multidimensional index. However, this is complicated in general, as it can lead to some corner cases. See the discussion in #84. With that being said, it is still an open discussion whether we should allow some subset of boolean array indices along with other non-array indices in the same index. I believe regardless, what we do not want to allow is: - multiple boolean arrays (this requires broadcasting semantics, which are confusing for boolean arrays) - boolean arrays and integers separated by a non-integer index (for example a[b, :, 0] where b is a boolean array). This is the corner case currently implemented by NumPy which we want to avoid for the standard.
As this issue should be addressed (boolean indexing is optional and admonitions have been added to the specification), will close this out. If additional discussion is necessary, we can either reopen this issue or open a new one. |
JAX is built on top of XLA which (currently) requires static memory allocation for operations. This means it is not possible to express operations like
x[y]
wherey
is a boolean, because the resulting array has value dependent size: https://data-apis.github.io/array-api/latest/API_specification/indexing.html#boolean-array-indexingIt's definitely possible that the static memory allocation requirement could be relaxed in the future, but dynamic memory allocation is always going to be harder to implement in a performant way. For example, I would guess Numba also struggles with this sort of operation. I don't think we should require it for array libraries implementing our standard, since it isn't needed for the majority of array operations.
The text was updated successfully, but these errors were encountered: