
Data dependent/unknown shapes #97

Closed
shoyer opened this issue Dec 3, 2020 · 18 comments · Fixed by #289

@shoyer
Contributor

shoyer commented Dec 3, 2020

Some libraries, particularly those with a graph-based computational model (e.g., Dask and TensorFlow), have support for "unknown" or "data dependent" shapes, e.g., due to boolean indexing such as x[y > 0] (#84). Other libraries (e.g., JAX and Dask in some cases) do not support some operations because they would produce such data dependent shapes.

We should consider a standard way to represent these shapes in shape attributes, ideally some extension of the "tuple of integer" format used for fully known shapes. For example, TensorFlow and Dask currently use different representations:

  • TensorFlow uses a custom TensorShape object (which acts very similarly to tuple), where some values may be None
  • Dask uses tuples of integers, where some values may be nan
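For illustration, the two conventions can be mimicked with plain Python values (a sketch only; TensorShape is a custom class, not a literal tuple):

```python
import math

# TensorFlow-style static shape: tuple-like, with None for unknown dimensions
tf_style_shape = (3, None, 100)

# Dask-style shape: a tuple of numbers, with nan marking unknown dimensions
dask_style_shape = (3.0, math.nan, 100.0)

# Checking for an unknown dimension differs between the two conventions:
assert tf_style_shape[1] is None
assert math.isnan(dask_style_shape[1])
```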
@rgommers
Member

rgommers commented Dec 3, 2020

That makes sense. A custom tuple-like object, with None for not-yet-known dimensions, seems a bit more flexible and appropriate to me.

@rgommers
Member

rgommers commented Dec 3, 2020

We actually already allow for either a tuple or a custom shape object (https://data-apis.github.io/array-api/latest/API_specification/array_object.html#shape), so it's None vs. nan that needs deciding.

@shoyer
Contributor Author

shoyer commented Dec 3, 2020 via email

@rgommers
Member

rgommers commented Dec 3, 2020

IIRC because using Python objects (ints, tuples, etc.) rather than custom ones can force extra synchronizations. PyTorch also uses a custom object:

>>> t = torch.tensor([[1, 2], [3, 4]])
>>> t.shape
torch.Size([2, 2])
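torch.Size is in fact a subclass of tuple, so a minimal stand-in (a hypothetical sketch, not PyTorch's actual implementation) could look like:

```python
class Size(tuple):
    """Hypothetical tuple subclass in the spirit of torch.Size."""
    def __repr__(self):
        return f"Size({list(self)})"

s = Size((2, 2))
assert isinstance(s, tuple)      # code expecting a plain tuple keeps working
assert len(s) == 2 and s[0] == 2
```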

@shoyer
Contributor Author

shoyer commented Dec 6, 2020

IIRC because using Python objects (ints, tuples, etc.) rather than custom ones can force extra synchronizations

I guess this is for cases where some parts of shape may require on-device computation? E.g., z = x[:, y > 0], then len(z.shape) == 2 and z.shape[0] == x.shape[0] (no synchronization required), but z.shape[1] may be expensive to calculate.

Interestingly, I think TensorFlow has this for different reasons, possibly just for preserving backwards compatibility at this point. It has two ways of accessing shape information: tf.shape(x) which returns the dynamic shape as a Tensor and x.shape which returns the static shape (possibly including None for dynamic sizes).
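In eager NumPy the data-dependent dimension is simply computed; a lazy library would have to evaluate y > 0 before it could report it:

```python
import numpy as np

x = np.ones((4, 5))
y = np.array([1, -1, 2, 0, 3])
z = x[:, y > 0]

assert z.ndim == 2               # the rank is always statically known
assert z.shape[0] == x.shape[0]  # first dimension: no synchronization needed
assert z.shape[1] == 3           # data dependent: requires evaluating y > 0
```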

@rgommers
Member

rgommers commented Dec 7, 2020

It may be due to legacy reasons in PyTorch as well. From pytorch/pytorch#20823 (comment):
Maybe it's time to consider removing torch.Size and make tensor.shape return a tuple? I believe torch.Size was necessary to support legacy constructors, like torch.Tensor(a.size()), which overloaded torch.Tensor with both a size and a data constructor.

I remember a discussion about this when we reviewed the array object API, and someone (probably @alextp or @apaszke) suggested adding the custom object.

@alextp

alextp commented Dec 7, 2020

I think the custom object is only useful when you want to deal with staged computation like @shoyer mentions above. TF2 stopped using the wrapped Dimension type and just has its TensorShape be a tuple of either integers or None to indicate shapes which we do not know at tracing time. So I think saying shapes are tuples here is probably a good idea.

@agarwal-ashish

TensorFlow (and TF NumPy) will be returning a TensorShape object, and tensor.shape is None will not work when checking for unknown rank.

Perhaps we should reconsider allowing custom objects.
Alternatively, users could check for ndarray.ndim is None, leaving ndarray.shape undefined for unknown-rank cases.

@alextp

alextp commented Dec 10, 2020 via email

@rgommers
Member

rgommers commented Dec 16, 2020

Alternatively, users could check for ndarray.ndim is None, leaving ndarray.shape undefined for unknown-rank cases.

I don't think we included anything in the standard that could cause this. Indexing may make the size of one or more dimensions unknown, but ndim should always be known because for data-dependent operations there's no squeezing:

>>> x = np.ones((3,2))
>>> x[:0, np.zeros(2, dtype=bool)]
array([], shape=(0, 0), dtype=float64)

After writing this: we did include squeeze() so a combination of indexing and explicit squeezing could indeed cause this.
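A sketch of how that combination loses static rank information:

```python
import numpy as np

x = np.ones((3, 2))
mask = np.array([True, False])      # data dependent in a lazy setting
z = x[:, mask]                      # shape (3, 1) only because of mask's values
assert np.squeeze(z).shape == (3,)  # output rank now depends on the data in mask
```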

Leaving any property (like .shape) undefined seems unhealthy. I'd say in this case, use ndim = None and shape is None. If dimensionality is known but exact shape isn't, use None for the unknown dimensions (e.g., shape = (3, None, 100)).

and just has its TensorShape be a tuple of either integers or None to indicate shapes which we do not know at tracing time. So I think saying shapes are tuples here is probably a good idea.

Maybe we should say "shape is tuple" and add a note that if for backwards compat reasons an implementation is using a custom object, then it should make sure that it is a subtype of tuple so that it works when users annotate code using .shape with Tuple.

@shoyer
Contributor Author

shoyer commented Dec 17, 2020

After writing this: we did include squeeze() so a combination of indexing and explicit squeezing could indeed cause this.

This might be an argument for requiring an axis argument for squeeze(). In my opinion, that's pretty much always a good idea for library code, anyways. I've found cases where libraries use squeeze (like matplotlib's plt.subplots) to be quite annoying/inconsistent.
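A small NumPy sketch of why an explicit axis keeps output rank predictable:

```python
import numpy as np

a = np.ones((1, 3, 1))
assert np.squeeze(a, axis=0).shape == (3, 1)  # explicit axis: rank known up front
assert np.squeeze(a, axis=2).shape == (1, 3)
assert np.squeeze(a).shape == (3,)            # no axis: rank depends on the data
# np.squeeze(a, axis=1) raises ValueError because axis 1 does not have size 1
```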

Leaving any property (like .shape) undefined seems unhealthy. I'd say in this case, use ndim = None and shape is None. If dimensionality is known but exact shape isn't, use None for the unknown dimensions (e.g., shape = (3, None, 100)).

Agreed!

@agarwal-ashish

agarwal-ashish commented Dec 17, 2020

If we insist that rank is always statically known, then shape being None is moot and we can define shape to be a tuple (or tuple-like) object.

Regarding custom objects, if the object represents a union of tuple and None, then it doesn't sound correct to subclass this union from tuple: a subclass of tuple should not satisfy is None, I'd think.

TensorFlow's TensorShape object implements most of the tuple methods but is not subclassed from tuple, since it represents an optional tuple. Instead, it raises an exception in certain methods when the rank is unknown.

@rgommers
Member

Regarding custom objects, if the object represents a union of tuple and None, then it doesn't sound correct to subclass this union from tuple

Agreed. I said subtype rather than subclass (as in structural subtyping). To clarify, I'd like it to be possible for users to use type annotation Tuple or Optional[Tuple] which also works with the custom object, rather than have to use some more complicated construct.

This implies that the custom object must behave tuple-like; you can index it, += it with another tuple, use len on it, etc.
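A sketch of the kind of user code this is meant to keep working, with Optional[int] entries standing in for unknown dimensions:

```python
from typing import Optional, Tuple

def ndim_of(shape: Tuple[Optional[int], ...]) -> int:
    # Only tuple-like behaviour is required: len(), indexing, concatenation
    return len(shape)

s = (3, None, 100)
assert ndim_of(s) == 3
assert s[0] == 3
assert s + (2,) == (3, None, 100, 2)
```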

@rgommers
Member

This might be an argument for requiring an axis argument for squeeze(). In my opinion, that's pretty much always a good idea for library code, anyways. I've found cases where libraries use squeeze (like matplotlib's plt.subplots) to be quite annoying/inconsistent.

I like that idea, I'll open a PR so we can discuss there.

rgommers added a commit that referenced this issue Dec 17, 2020
As suggested by @shoyer in #97 (comment)
This makes it possible to predict resulting rank of output array, which is
otherwise undetermined (see discussion in gh-97).

Using squeeze without specifying the axis in library code often results in
unintuitive behaviour. For the common use case of extracting a scalar from a size-1 2-D array,
this gets a little more verbose (e.g. `np.squeeze(np.array([[1]]), axis=(0, 1))`),
but that's probably a price worth paying for the extra clarity.

Also changes specified behaviour for a given axis not having size 1 to raising
a `ValueError`, which is what NumPy does. This wasn't considered before, and the current
description seems simply incorrect.

Finally, this makes `squeeze` the exact inverse of `expand_dims`, which is probably
a good thing.
@jakirkham
Member

The main reason Dask chose nan is that it naturally poisons anything it touches. So if something depends on adding two lengths, 2 + nan, one gets nan as a result. Thus an array produced from an array that has unknown lengths easily inherits them. As a result, there doesn't need to be a bunch of try...except code or checks for None; things largely just work.

@shoyer
Contributor Author

shoyer commented Mar 19, 2021

Conversely, the danger of NaN is that it doesn't quite have the right semantics for array sizes, e.g., it's always False in comparisons rather than producing another "unknown" value. The ideal solution for minimal boilerplate might be a new object to denote a missing value, like pandas.NA.
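The trade-off between the two markers is easy to see with plain Python values:

```python
import math

unknown = math.nan

# nan poisons arithmetic, so unknown sizes propagate automatically...
assert math.isnan(2 + unknown)

# ...but comparisons are always False, which can silently hide unknowns
assert not (unknown > 0)
assert not (unknown == unknown)

# None fails loudly instead: 2 + None raises a TypeError
try:
    2 + None
except TypeError:
    pass
```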

rgommers added a commit that referenced this issue May 11, 2021
@kgryte kgryte added this to the v2021 milestone Oct 4, 2021
@kgryte
Contributor

kgryte commented Oct 21, 2021

To summarize the above discussion and the discussions during the consortium meetings (e.g., 7 October 2021), the path forward seems to be as follows:

  • if rank is unknown, ndim should be None and shape should be None.
  • if rank is known but dimensions are unknown, ndim should be an int and shape should be a tuple where known shape dimensions should be ints and unknown shape dimensions should be None.
  • if rank is known and dimensions are known, ndim should be an int and shape should be a tuple whose dimensions should be ints.
  • in most cases, shape should be a tuple. For those use cases where a custom object is needed, the custom object must act like a tuple.

We can consider adding a functional shape API that supports returning the (dynamic) shape of an array as an array. This would be similar to TensorFlow's tf.shape and MXNet's shape_array APIs. This API would allow returning the shape of an array as the result of delayed computation. We can push this decision to the 2022 revision of the specification.

The above would not satisfy @jakirkham's desire for behavior which poisons subsequent operations; however, as discussed here and elsewhere, NaN does not seem like the right abstraction (e.g., value equality, introducing floating-point semantics, etc). While shape arithmetic is a valid use case, we should consider ways to ensure that this is not a userland concern. For implementations, shape arithmetic can be necessary for allocation (e.g., concat, tiling, etc), but it would seem doable to mimic NaN poison behavior with None and straightforward helper functions.
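A hypothetical helper sketch of how None could mimic nan's poisoning behaviour for shape arithmetic (the function name add_dims is illustrative, not part of any library):

```python
from typing import Optional

def add_dims(a: Optional[int], b: Optional[int]) -> Optional[int]:
    """Hypothetical helper: propagate unknown (None) dimensions like nan would."""
    if a is None or b is None:
        return None
    return a + b

# e.g., computing the concatenated length along an axis
assert add_dims(2, 3) == 5
assert add_dims(2, None) is None
```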

A custom object as suggested by @shoyer would be doable; however, this is an abstraction which does not currently exist in, nor is currently used by, array libraries (although it does exist in dataframe libraries, such as pandas) and would be specific to each implementing array library. The advantage of None and NaN is that they exist independently of any one array library.

Accordingly, I think the lowest common denominator is requiring shape return a tuple (or something tuple-like) and using None for unknown dimensions. While not perfect, this seems to me the most straightforward path atm.

@kgryte
Contributor

kgryte commented Nov 4, 2021

Based on today's call (4 November 2021), we decided to require that x.shape return a tuple (meaning the array rank must be known when the shape attribute is accessed), while still supporting unknown dimensions. The need for checking the number of dimensions is relatively common (e.g., handling special cases for size 0, 1, and 2 arrays), while checking for explicit dimension sizes is less so. So requiring x.ndim to be statically known, while allowing individual dimension sizes to be unknown, seems like a reasonable balance.

For the possibility that @rgommers mentioned of explicitly squeezing a data dependent shape, array libraries could choose to error in this instance or explicitly evaluate.
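Under this decision the common rank check stays cheap and always works, even when individual entries of shape are None (a sketch; describe_rank is illustrative, not a spec API):

```python
def describe_rank(shape):
    # shape is guaranteed to be a tuple, so len() never needs synchronization,
    # even if some entries are None (unknown dimensions)
    ndim = len(shape)
    return {0: "scalar", 1: "vector", 2: "matrix"}.get(ndim, "nd-array")

assert describe_rank(()) == "scalar"
assert describe_rank((3, None)) == "matrix"
assert describe_rank((3, None, 100)) == "nd-array"
```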
