
Lazy label-based .isel() using 1-D Boolean array, followed by .load() is very slow #6413

Closed
raj-magesh opened this issue Mar 26, 2022 · 1 comment
Labels: needs triage (issue that has not been reviewed by an xarray team member)

Comments

@raj-magesh

What is your issue?

Info about my dataset

I have a large (~20 GB, ~27,000 x ~300,000, int16) netCDF-4 file written to disk incrementally along the first (unlimited) dimension without using dask (using code adapted from this comment). The DataArray stored in this file also has ~50 coordinates along the first dimension and ~300 coordinates along the second dimension.
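
For concreteness, the write pattern looks roughly like this (a minimal sketch using the netCDF4 library directly, scaled down, with hypothetical names first_dim/second_dim; the actual code is in the linked comment):

import netCDF4
import numpy as np

# toy sizes; the real file is ~27,000 x ~300,000 int16
n_cols, batch_rows, n_batches = 3_000, 100, 5

with netCDF4.Dataset("data.nc", "w") as ds:
    ds.createDimension("first_dim", None)    # unlimited: grows as rows are appended
    ds.createDimension("second_dim", n_cols)
    var = ds.createVariable("data", "i2", ("first_dim", "second_dim"))
    for i in range(n_batches):
        batch = np.zeros((batch_rows, n_cols), dtype="int16")  # stand-in data
        var[i * batch_rows : (i + 1) * batch_rows, :] = batch  # append along first_dim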

Trying to load a subset of the data into memory

I have a 1-D Boolean mask my_mask (with ~15,000 True values) along the second dimension of the array that I'd like to use to index it. When I do the following, the operation is very slow (I haven't seen it complete):

import xarray as xr
x = xr.open_dataarray(path_to_file)   # lazy: no data read yet
x = x.isel({"second_dim": my_mask})   # lazy Boolean selection
x = x.load()                          # hangs here / extremely slow

However, I can load the entire array and then index (this is slow-ish, but works):

import xarray as xr
x = xr.load_dataarray(path_to_file)   # eager: reads the full ~20 GB into memory
x = x.isel({"second_dim": my_mask})   # then index in memory

Is this vectorized indexing?

I'm not sure whether this is expected behavior: according to the Tip here in the User Guide, indexing is slow when using vectorized indexing, which I took to mean indexing along multiple dimensions at once (outer indexing, in numpy parlance). Is indexing with a 1-D Boolean mask (or, equivalently, a 1-D integer array) also slow?
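
One workaround I've been considering (a sketch, assuming the True entries of my_mask span a reasonably narrow range): read a single contiguous bounding slice, which backends handle efficiently, and apply the mask in memory afterwards.

import numpy as np
import xarray as xr

idx = np.flatnonzero(my_mask)      # integer positions of the True entries
lo, hi = idx[0], idx[-1] + 1       # contiguous range covering all of them

x = xr.open_dataarray(path_to_file)
block = x.isel({"second_dim": slice(lo, hi)}).load()  # one contiguous read
subset = block.isel({"second_dim": idx - lo})         # fancy-index in memory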

What to do for larger datasets that don't fit in RAM?

Right now, I can load and then isel because my array fits in RAM. I have other datasets that don't fit in RAM: how would you recommend I load a subset of such data from disk?

In the event that I have to use dask, I will be writing along the first dimension (and hence probably chunking along that dimension) and reading along the second: is that going to be efficient (or at least more efficient than whatever xarray is doing sans dask)? A rough sketch of what I have in mind follows.
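
The chunk size and the first_dim name below are placeholders, not measured recommendations:

import xarray as xr

# chunk along the dimension the file was written in; each chunk then spans
# the full second dimension, so the Boolean mask only slices within chunks
x = xr.open_dataarray(path_to_file, chunks={"first_dim": 1_000})
x = x.isel({"second_dim": my_mask})
x = x.compute()   # materialize only the ~15,000 selected columns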

@raj-magesh added the needs triage label on Mar 26, 2022
@max-sixty (Collaborator)

(trying to clear up old issues, keep us below 1K)

This needs an MCVE, and it may be better suited as a usage question in Discussions.
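
Something along these lines, scaled down from the file described above (names and sizes are placeholders), would let others time the slow path:

import numpy as np
import xarray as xr

# small synthetic analogue of the ~20 GB file
data = xr.DataArray(
    np.zeros((2_000, 30_000), dtype="int16"),
    dims=("first_dim", "second_dim"),
    name="data",
)
data.to_netcdf("demo.nc")

mask = np.random.default_rng(0).random(30_000) < 0.05  # ~1,500 True values

lazy = xr.open_dataarray("demo.nc")
subset = lazy.isel({"second_dim": mask}).load()  # time this vs. load-then-isel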

@max-sixty closed this as not planned (won't fix, can't repro, duplicate, stale) on Nov 6, 2023