Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Making xarray math lazy #2298

Open
shoyer opened this issue Jul 18, 2018 · 7 comments
Open

Making xarray math lazy #2298

shoyer opened this issue Jul 18, 2018 · 7 comments

Comments

@shoyer
Copy link
Member

shoyer commented Jul 18, 2018

At SciPy, I had the realization that it would be relatively straightforward to make element-wise math between xarray objects lazy. This would let us support lazy coordinate arrays, a feature that has quite a few use-cases, e.g., for both geoscience and astronomy.

The trick would be to write a lazy array class that holds an element-wise vectorized function and passes indexers on to its arguments. I haven't thought too hard about this yet for vectorized indexing, but it could be quite efficient for outer indexing. I have some prototype code but no tests yet.

The question is how to hook this into xarray operations. In particular, supposing that the inputs to a function do no hold dask arrays:

  • Should we try to make every element-wise operation with vectorized functions (ufuncs) lazy by default? This might have negative performance implications and would be a little tricky to implement with xarray's current code, since we still implement binary operations like + with separate logic from apply_ufunc.
  • Should we make every element-wise operation that explicitly uses apply_ufunc() lazy by default?
  • Or should we only make element-wise operations lazy with apply_ufunc() if you use some special flag, e.g., apply_ufunc(..., lazy=True)?

I am leaning towards the last option for now but would welcome other opinions.

@fujiisoup
Copy link
Member

This sounds interesting.
I am curious what the practical difference from dask is.
Does it mean some maths are lazy by default (without any external library)?

@shoyer
Copy link
Member Author

shoyer commented Jul 18, 2018

The main practical difference is that it allows us to reliably guarantee that expressions like f(x, y)[i] always get evaluated like f(x[i], y[i]). Dask doesn't have this optimization yet (dask/dask#746), so indexing operations still compute the function f() on each block of an array. This issue provides full context from the xarray side: #1725

The typical example is spatially referenced imagery, e.g., a 2D satellite photo of the surface of the Earth with 2D latitude/longitude coordinates associated with each point. It would be very expensive to store full latitude and longitude arrays, but fortunately they can usually be computed cheaply from row and column indices.

Ideally, this logic would live outside xarray. But it's important enough to some xarray users (especially geoscience + astronomy) and we have enough related functionality (e.g., for lazy and explicit indexing) that it probably makes sense to add it.

@fujiisoup
Copy link
Member

Thanks, @shoyer

we have enough related functionality (e.g., for lazy and explicit indexing)

Agreed.
Actually, it sounds very fun to code the lazy arithmetics.

Ideally, this logic would live outside xarray.

Yes, I concerned about this.
We have discussed to support more kinds of array-likes (e.g. dask, sparse, cupy) in #1938,
and I thought the lazy array can be (ideally) one of them.

But in practice, it should take a long time to realize the any-array-like support and it might be a good idea to natively support the lazy mathematics for now.
If we are heading to any-array-like support, I think that the implementation of the lazy array should be as isolated from xarray core logic as possible so that we can move smoothly to the any-array-like support in the future.

Therefore, personally, I'd like to see this lazy math by implementing a lazy array.
The API I thought of is .to_lazy() which converts the backend to the lazy array,
as similar to that .chunk() converts the backend to dask array.

@shoyer
Copy link
Member Author

shoyer commented Jul 20, 2018

Therefore, personally, I'd like to see this lazy math by implementing a lazy array.
The API I thought of is .to_lazy() which converts the backend to the lazy array,
as similar to that .chunk() converts the backend to dask array.

This is not a bad idea, but the version of lazy arithmetic that I have been contemplating (see #2302) is not yet complete. For example, it doesn't have any way to represent a lazy aggregation.

@mrocklin
Copy link
Contributor

Two thoughts:

  1. We can push some of this into Dask with Fuse array elementwise operations at graph build time dask/dask#2538
  2. The full lazy ndarray solution would be a good application of the __array_function__ protocol

@shoyer
Copy link
Member Author

shoyer commented Jul 20, 2018

Indeed, I really like the look of dask/dask#2538 and its implementation in dask/dask#2608. It doesn't solve the indexing optimization yet but that could be pretty straightforward to add -- especially once we add a notion of explicit indexing types (basic vs outer vs vectorized) directly into dask.

@max-sixty
Copy link
Collaborator

Any thoughts on the current status on this?

@dcherian dcherian added topic-lazy array and removed topic-arrays related to flexible array support labels Apr 19, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

6 participants