Skip to content

Boolean array indexing is not compatible with static memory allocation #84

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
shoyer opened this issue Nov 11, 2020 · 19 comments
Closed
Labels
topic: Dynamic Shapes Data-dependent shapes. topic: Indexing Array indexing. topic: Lazy/Graph Lazy and graph-based array implementations.

Comments

@shoyer
Copy link
Contributor

shoyer commented Nov 11, 2020

JAX is built on top of XLA which (currently) requires static memory allocation for operations. This means it is not possible to express operations like x[y] where y is a boolean, because the resulting array has value dependent size: https://data-apis.github.io/array-api/latest/API_specification/indexing.html#boolean-array-indexing

It's definitely possible that the static memory allocation requirement could be relaxed in the future, but dynamic memory allocation is always going to be harder to implement in a performant way. For example, I would guess Numba also struggles with this sort of operation. I don't think we should require it for array libraries implementing our standard, since it isn't needed for the majority of array operations.

@TomAugspurger
Copy link

TomAugspurger commented Nov 12, 2020 via email

@rgommers
Copy link
Member

rgommers commented Dec 3, 2020

For example, I would guess Numba also struggles with this sort of operation

Numba does support boolean indexing, but only for 1-D arrays - and you cannot mix it with other advanced indexing.

@rgommers
Copy link
Member

rgommers commented Dec 3, 2020

Related JAX issues with discussion:

It does seem that there's very limited support for boolean indexing (only static boolean arrays, which are not all that useful in practice), and none for jitted functions.

This led me to wonder about other functions where output shape depends on values in the input, such as unique. That is implemented, but also not jittable: jax-ml/jax#2760

TensorFlow is in a similar boat, but provides a boolean_mask function instead. And there's an open PR for JAX to add index_mask which looks similar.

EDIT: add the most informative comment from the JAX PR:
We sometimes need to perform boolean indexing inside of jax.jit, and found that it's useful to have a function which does that while padding the output to the same shape as the original input.

@rgommers
Copy link
Member

rgommers commented Dec 3, 2020

nonzero and where are also problematic here. Both are implemented in jax.numpy but are not jittable.

@rgommers
Copy link
Member

rgommers commented Dec 3, 2020

I think that's the complete list of functionality where shapes depend on values - so maybe that's a good set to separate out somehow. They are fairly heavily used functions, so would be good to dig up some examples - say in SciPy or scikit-learn - and figure out if those are easy to rewrite.

@rgommers
Copy link
Member

rgommers commented Dec 3, 2020

Of course this can happen with basic indexing as well:

x = arange(5)
idx_max = random.randint(0, 5)
x2 = x[:idx_max]

So there needs to be something said about the principle of shapes depending on values.

@rgommers
Copy link
Member

rgommers commented Dec 7, 2020

The summary of a discussion we had a few days ago:

  • boolean indexing is fairly fundamental to how people use NumPy; leaving it out completely may cause people to simply ignore the standard.
  • given that this type of data-dependent shape can be produced by several other functions, we need a way to clearly mark that as a property of those functions and of boolean indexing. I.e. with color-coding or a custom admonition in the API standard document.

@jakirkham
Copy link
Member

Dask’s lazy evaluation struggles with this a bit too. The output shape of a Boolean mask will have NaN for the shape in that dimension, and .size will be NaN.

This is true. Though there are some subtleties (which I'm sure Tom also knows), but others here may not. For example, if we know the values in the mask beforehand (it's not a Dask Array) and raveling is not involved (it's applied along a single dimension), we can produce a Dask Array of known shape. However if one of those things is not true, then the Dask Array's shape is undetermined.

In [1]: import numpy as np

In [2]: import dask.array as da

In [3]: X = da.ones((10, 11, 12),chunks=(2, 3, 4))

In [4]: Y = np.arange(10) > 5

In [5]: X[Y].chunks
Out[5]: ((2, 2), (3, 3, 3, 2), (4, 4, 4))

In [6]: Y2 = np.arange(10 * 11 * 12).reshape(10, 11, 12) > (10 * 11 * 12 / 2)

In [7]: X[Y2].chunks
Out[7]: ((nan, nan, nan, nan, nan, nan, nan, nan, nan, nan),)

In [8]: Y3 = da.asarray(Y, chunks=2)

In [9]: X[Y3].chunks
Out[9]: ((nan, nan, nan, nan, nan), (3, 3, 3, 2), (4, 4, 4))

@leofang
Copy link
Contributor

leofang commented Mar 19, 2021

Sorry, looks like I overlooked this discussion. I don't think CuPy has full support over boolean masks either: cupy/cupy#4693 (comment).

@rgommers
Copy link
Member

From the CuPy issue: Combining a multidimensional boolean array with other indexing causes an IndexError.

https://data-apis.github.io/array-api/latest/API_specification/indexing.html#boolean-array-indexing doesn't say it's supported to combine boolean indexing with another indexing mode, however it's not stated clearly. The whole indexing section needs to be easier to understand.

For boolean indexing in particular, the way I read what's there now is: for a boolean array index ix, only x[ix] is supported (also for multi-dimensional ix), but this is not: x[ix, 0]. Which is odd of course, because writing it as x[:, 0][ix] is a more clumsy way to do the same thing and is supported.

I believe the intent was to allow combining a single boolean index with regular basic indexing.

@asmeurer
Copy link
Member

asmeurer commented Apr 1, 2021

Indeed. I had interpreted it to mean that boolean indices were only allowed as the sole index, and were not required to be supported as part of a larger multidimensional index. I believe this was intentional, because mixing a boolean array index with an slice and integer index in NumPy can trigger the weird "advanced indices separated by slice, ellipsis, or newaxis" rule (https://numpy.org/doc/stable/reference/arrays.indexing.html#combining-advanced-and-basic-indexing). Note that for the purposes of this rule, integer indices are treated like 0-dimensional integer array indices:

>>> b = np.array([[True, False], [True, True]])
>>> a = np.arange(1*2*2*4*5).reshape((1, 2, 2, 4, 5))
>>> a[:, b, :].shape
(1, 3, 4, 5)
>>> a[:, b, :, 0].shape
(3, 1, 4)

(note how the 3 dimension is moved to the front when the boolean and integer indices are separated by a slice).

My understanding is that we want to avoid making this rule part of the spec (Travis had expressed to me that he wished it to not be there, and that he would prefer if NumPy didn't do it in the first place). So if we do allow boolean arrays as part of larger multidimensional indices, it needs to be with caveats that disallow this case.

Another oddity I just noticed the boolean indexing section: it says "this is equivalent to A[nonzero(B)]", but that only works if we support integer array indexing, which we don't. So this should be rephrased in a way that doesn't assume that integer array indexing has been specified already.

@asmeurer
Copy link
Member

asmeurer commented Apr 1, 2021

(and to be clear, either way, the spec should be clarified, because I agree the way it's written now is a bit ambiguous)

@asmeurer
Copy link
Member

dask.array has a restriction that bool indexing can be combined with other indexing only when it has dimension 1. However, it implements different semantics from NumPy in the above case:

>>> anp = np.arange(3*4*5*6).reshape((3, 4, 5, 6))
>>> bnp = np.array([True, True, False, False])
>>> anp[:, bnp, :].shape
(3, 2, 5, 6)
>>> anp[:, bnp, :, 0].shape
(2, 3, 5)
>>> a = dask.array.arange(3*4*5*6).reshape((3, 4, 5, 6))
>>> b = dask.array.array([True, True, False, False])
>>> a[:, b, :].shape
(3, nan, 5, 6)
>>> a[:, b, :, 0].shape
(3, nan, 5)

Pytorch also appears to implement this but with different semantics

>>> at = torch.arange(3*4*5*6).reshape((3, 4, 5, 6))
>>> bt = torch.tensor([True, True, False, False])
>>> at[:, bt, :].shape
torch.Size([3, 2, 5, 6])
>>> at[:, bt, :, 0].shape
torch.Size([3, 2, 5])

jax.numpy appears to follow numpy semantics

>>> aj = jax.numpy.arange(3*4*5*6).reshape((3, 4, 5, 6))
>>> bj = jax.numpy.asarray([True, True, False, False])
>>> aj[:, bj, :].shape
(3, 2, 5, 6)
>>> aj[:, bj, :, 0].shape
(2, 3, 5)

Neither jax nor pytorch seem to have the same restriction as dask on what kinds of boolean arrays are allowed. @jakirkham what is the purpose of the restrictions dask has on boolean array indices?

@jakirkham
Copy link
Member

Sorry I'm not following Aaron. Could you please share what is being tried that doesn't work?

@asmeurer
Copy link
Member

What is the purpose of this restriction in dask?

>>> a = dask.array.arange(2*2*4*5).reshape((2, 2, 4, 5))
>>> b = dask.array.array([[True, False], [True, True]])
>>> a[b]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/aaronmeurer/anaconda3/envs/array-apis/lib/python3.9/site-packages/dask/array/core.py", line 1705, in __getitem__
    self, index2 = slice_with_bool_dask_array(self, index2)
  File "/Users/aaronmeurer/anaconda3/envs/array-apis/lib/python3.9/site-packages/dask/array/slicing.py", line 1157, in slice_with_bool_dask_array
    raise NotImplementedError(
NotImplementedError: Slicing with dask.array of bools only permitted when the indexer has only one dimension or when it has the same dimension as the sliced array

This already isn't sufficient for the conservative interpretation of what is currently in the spec about boolean indexing, and it would presumably make it even harder to support indices combining boolean and non-boolean indexing like a[b, 0].

@jakirkham
Copy link
Member

Raised as issue ( dask/dask#7550 ). That said, I wouldn't be surprised if there are certain combinations passed to slicing that don't work. There is not one reason that captures all of these. In some cases there may be a lack of interest (no one has done a PR), some are technically challenging (so may not be implementable), and in some cases they are solvable, but following NumPy exactly leads to performance issues or doesn't make sense for Dask. The latter is particularly noticeable when raveling happens implicitly ( see issue dask/dask#4855 )

@asmeurer
Copy link
Member

Well the whole point of this issue is to figure out what we want to not include in the spec because it is too hard for libraries like dask, specifically relating to boolean indexing (integer array indexing is the other "hard" case, but it's already not in the spec). The spec does specify that boolean indices have to give C row-first order, so that is possibly relevant to dask/dask#4855. The specific question here is if combining (a single) boolean index with other indices is any harder than only supporting a single boolean index (sans the weird "boolean and integer separated by a slice" case that dask does differently from NumPy, and shouldn't be part of the spec regardless).

Going back to the case not supported by dask, does the fact that the array must be broadcasted make it harder? Should we restrict what the spec says even further by requiring not only that a boolean index be the sole index, but that it be the exact same shape as the indexed array?

asmeurer added a commit to asmeurer/array-api that referenced this issue Apr 14, 2021
We only require that boolean array indexing be supported when the array is the
sole index. The previous text was ambiguous, and could be read as supporting
boolean arrays plus other non-array indices as part of a larger
multidimensional index. However, this is complicated in general, as it can
lead to some corner cases. See the discussion in data-apis#84.

With that being said, it is still an open discussion whether we should allow
some subset of boolean array indices along with other non-array indices in the
same index. I believe regardless, what we do not want to allow is:

- multiple boolean arrays (this requires broadcasting semantics, which are
  confusing for boolean arrays)
- boolean arrays and integers separated by a non-integer index (for example
  a[b, :, 0] where b is a boolean array). This is the corner case currently
  implemented by NumPy which we want to avoid for the standard.
@asmeurer
Copy link
Member

I opened a PR clarifying that the spec only requires boolean arrays as the sole index, not as part of a larger multidimensional index. #164

If we do want to allow them as part of a multidimensional index, we should discuss that further. The big open question is whether that causes any issues for any libraries, beyond the issues that are already caused by supporting boolean array indices in the first place.

rgommers pushed a commit that referenced this issue Apr 19, 2021
We only require that boolean array indexing be supported when the array is the
sole index. The previous text was ambiguous, and could be read as supporting
boolean arrays plus other non-array indices as part of a larger
multidimensional index. However, this is complicated in general, as it can
lead to some corner cases. See the discussion in #84.

With that being said, it is still an open discussion whether we should allow
some subset of boolean array indices along with other non-array indices in the
same index. I believe regardless, what we do not want to allow is:

- multiple boolean arrays (this requires broadcasting semantics, which are
  confusing for boolean arrays)
- boolean arrays and integers separated by a non-integer index (for example
  a[b, :, 0] where b is a boolean array). This is the corner case currently
  implemented by NumPy which we want to avoid for the standard.
@rgommers rgommers added the topic: Indexing Array indexing. label Jun 30, 2021
@kgryte
Copy link
Contributor

kgryte commented Nov 10, 2021

As this issue should be addressed (boolean indexing is optional and admonitions have been added to the specification), will close this out.

If additional discussion is necessary, we can either reopen this issue or open a new one.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
topic: Dynamic Shapes Data-dependent shapes. topic: Indexing Array indexing. topic: Lazy/Graph Lazy and graph-based array implementations.
Projects
None yet
Development

No branches or pull requests

7 participants