
Lazy label-based .isel() using 1-D Boolean array, followed by .load() is very slow #6413

Closed
raj-magesh opened this issue Mar 26, 2022 · 1 comment
Labels: needs triage (issue that has not been reviewed by an xarray team member)

Comments

@raj-magesh

What is your issue?

Info about my dataset

I have a large (~20 GB, ~27,000 x ~300,000, int16) netCDF-4 file written to disk incrementally along the first (unlimited) dimension without using dask (using code adapted from this comment). The DataArray stored in this file also has ~50 coordinates along the first dimension and ~300 coordinates along the second dimension.
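
For concreteness, the write pattern looks roughly like this (a minimal sketch using the netCDF4 library directly, scaled down, with hypothetical names first_dim/second_dim; the actual code is in the linked comment):

import netCDF4
import numpy as np

# toy sizes; the real file is ~27,000 x ~300,000 int16
n_cols, batch_rows, n_batches = 3_000, 100, 5

with netCDF4.Dataset("data.nc", "w") as ds:
    ds.createDimension("first_dim", None)    # unlimited: grows as rows are appended
    ds.createDimension("second_dim", n_cols)
    var = ds.createVariable("data", "i2", ("first_dim", "second_dim"))
    for i in range(n_batches):
        batch = np.zeros((batch_rows, n_cols), dtype="int16")  # stand-in data
        var[i * batch_rows : (i + 1) * batch_rows, :] = batch  # append along first_dim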

Trying to load a subset of the data into memory

I have a 1-D Boolean mask my_mask (with ~15,000 True values) along the second dimension of the array that I'd like to use to index it. When I do the following, the operation is very slow (I haven't seen it complete):

import xarray as xr
x = xr.open_dataarray(path_to_file)   # lazy: no data read yet
x = x.isel({"second_dim": my_mask})   # lazy Boolean selection
x = x.load()                          # hangs here / extremely slow

However, I can load the entire array and then index (this is slow-ish, but works):

import xarray as xr
x = xr.load_dataarray(path_to_file)   # eager: reads the full ~20 GB into memory
x = x.isel({"second_dim": my_mask})   # then index in memory

Is this vectorized indexing?

I'm not sure whether this is expected behavior: according to the Tip here in the User Guide, indexing is slow when using vectorized indexing, which I took to mean indexing along multiple dimensions at once (outer indexing, in numpy parlance). Is indexing with a 1-D Boolean mask (or, equivalently, a 1-D integer array) also slow?
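
One workaround I've been considering (a sketch, assuming the True entries of my_mask span a reasonably narrow range): read a single contiguous bounding slice, which backends handle efficiently, and apply the mask in memory afterwards.

import numpy as np
import xarray as xr

idx = np.flatnonzero(my_mask)      # integer positions of the True entries
lo, hi = idx[0], idx[-1] + 1       # contiguous range covering all of them

x = xr.open_dataarray(path_to_file)
block = x.isel({"second_dim": slice(lo, hi)}).load()  # one contiguous read
subset = block.isel({"second_dim": idx - lo})         # fancy-index in memory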

What to do for larger datasets that don't fit in RAM?

Right now, I can load and then isel because my array fits in RAM. I have other datasets that don't fit in RAM: how would you recommend I load a subset of such data from disk?

In the event that I have to use dask, I will be writing along the first dimension (and hence probably chunking along that dimension) and reading along the second: is that going to be efficient (or at least more efficient than whatever xarray is doing sans dask)? A rough sketch of what I have in mind follows.
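
The chunk size and the first_dim name below are placeholders, not measured recommendations:

import xarray as xr

# chunk along the dimension the file was written in; each chunk then spans
# the full second dimension, so the Boolean mask only slices within chunks
x = xr.open_dataarray(path_to_file, chunks={"first_dim": 1_000})
x = x.isel({"second_dim": my_mask})
x = x.compute()   # materialize only the ~15,000 selected columns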

@raj-magesh added the needs triage label on Mar 26, 2022
@max-sixty (Collaborator)

(trying to clear up old issues, keep us below 1K)

This needs an MCVE, and it may be better suited as a usage question in Discussions.
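
Something along these lines, scaled down from the file described above (names and sizes are placeholders), would let others time the slow path:

import numpy as np
import xarray as xr

# small synthetic analogue of the ~20 GB file
data = xr.DataArray(
    np.zeros((2_000, 30_000), dtype="int16"),
    dims=("first_dim", "second_dim"),
    name="data",
)
data.to_netcdf("demo.nc")

mask = np.random.default_rng(0).random(30_000) < 0.05  # ~1,500 True values

lazy = xr.open_dataarray("demo.nc")
subset = lazy.isel({"second_dim": mask}).load()  # time this vs. load-then-isel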

@max-sixty closed this as not planned (won't fix, can't repro, duplicate, stale) on Nov 6, 2023