Support lazy concatenation *without dask* #6588

Closed
rabernat opened this issue May 10, 2022 · 2 comments

@rabernat
Contributor

Is your feature request related to a problem?

Right now, if I want to concatenate multiple datasets (e.g. as in open_mfdataset), I have two options:

  • Eagerly load the data as numpy arrays ➡️ xarray will dispatch to np.concatenate
  • Chunk each dataset ➡️ xarray will dispatch to dask.array.concatenate

In pseudocode:

import xarray as xr

ds1 = xr.open_dataset("some_big_lazy_source_1.nc")
ds2 = xr.open_dataset("some_big_lazy_source_2.nc")
item1 = ds1.foo[0, 0, 0]  # lazily access a single item
ds = xr.concat([ds1.chunk(), ds2.chunk()], dim="time")  # only way to lazily concat
# accessing the same item now triggers loading of all of ds1
item1 = ds.foo[0, 0, 0]
# yes, I could use different chunks, but the point is that I should not have to
# arbitrarily choose chunks to make this work

However, I am increasingly encountering scenarios where I would like to lazily concatenate datasets (without loading them into memory), but also without the requirement of using dask. This would be useful, for example, for creating composite datasets that point back to an OPeNDAP server, preserving the possibility of granular lazy access to any array element without the requirement of arbitrary chunking at an intermediate stage.

Describe the solution you'd like

I propose to extend our LazilyIndexedArray classes to support simple concatenation and stacking. The result of applying concat to such arrays will be a new LazilyIndexedArray that wraps the underlying arrays into a single object.

The main difficulty in implementing this will probably be with indexing: the concatenated array will need to understand how to map global indexes to the underlying individual array indexes. That is a little tricky but eminently solvable.
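
To make the index-mapping idea concrete, here is a rough, dependency-free sketch of how a global index along the concatenation axis could be translated into a (source, local index) pair using cumulative offsets. Everything in it is hypothetical: LazyConcatArray, locate, split_slice and RecordingSource are illustrative names, not existing xarray classes or methods, and the sketch only handles integer indexes on axis 0. It is meant to show the bookkeeping, not to suggest what a real implementation inside xarray's indexing machinery would look like.

```python
from bisect import bisect_right
from itertools import accumulate


class LazyConcatArray:
    """Hypothetical sketch of lazy concatenation along axis 0 (not xarray API)."""

    def __init__(self, arrays):
        self.arrays = list(arrays)
        lengths = [a.shape[0] for a in self.arrays]
        # offsets[k] is the global index where arrays[k] starts;
        # offsets[-1] is the total length along axis 0.
        self.offsets = [0, *accumulate(lengths)]
        self.shape = (self.offsets[-1], *self.arrays[0].shape[1:])

    def locate(self, i):
        """Map a global index to (source number, local index)."""
        n = self.shape[0]
        if i < 0:
            i += n
        if not 0 <= i < n:
            raise IndexError(f"index {i} out of bounds for axis 0 with size {n}")
        k = bisect_right(self.offsets, i) - 1
        return k, i - self.offsets[k]

    def split_slice(self, start, stop):
        """Map the global range [start, stop) to (source number, local slice) pairs."""
        parts = []
        for k in range(len(self.arrays)):
            lo, hi = self.offsets[k], self.offsets[k + 1]
            a, b = max(start, lo), min(stop, hi)
            if a < b:
                parts.append((k, slice(a - lo, b - lo)))
        return parts

    def __getitem__(self, key):
        if not isinstance(key, tuple):
            key = (key,)
        first, rest = key[0], key[1:]
        if not isinstance(first, int):
            raise NotImplementedError("sketch only handles integer indexes on axis 0")
        k, local = self.locate(first)
        # Only the one source that owns this index is touched.
        return self.arrays[k][(local, *rest)]


class RecordingSource:
    """Stand-in for a lazy backend array; records which elements get read."""

    def __init__(self, values):
        self.values = list(values)
        self.shape = (len(self.values),)
        self.reads = []

    def __getitem__(self, key):
        (i,) = key
        self.reads.append(i)
        return self.values[i]


src1 = RecordingSource([10, 11, 12])
src2 = RecordingSource([20, 21, 22, 23])
combined = LazyConcatArray([src1, src2])
item = combined[4]  # global index 4 is local index 1 of src2
```

In this toy example, reading combined[4] touches only one element of src2 and nothing in src1, which is the granular access the proposal is after. Slices that span several sources are the trickier part: split_slice only plans the per-source reads, and a real implementation would still need to decide how to stitch the pieces together and how to handle steps, negative strides and the other indexer types.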

Describe alternatives you've considered

The alternative is to structure your code in a way that avoids needing to lazily concatenate arrays. That is what we do now. It is not optimal.

Additional context

No response

@dcherian
Contributor

Thanks @rabernat. Closing as duplicate of #4628

@rabernat
Contributor Author

Oops sorry for the duplicate issue! 🤦
