Info about my dataset

I have a large (~20 GB, ~27,000 x ~300,000, `int16`) netCDF-4 file written to disk incrementally along the first (unlimited) dimension without using `dask` (using code adapted from this comment). The `DataArray` stored in this file also has ~50 coordinates along the first dimension and ~300 coordinates along the second dimension.
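Roughly, the write pattern looks like this (a minimal sketch with placeholder file/variable/dimension names, not my exact code):

```python
import netCDF4
import numpy as np

with netCDF4.Dataset("data.nc", "w") as nc:
    nc.createDimension("sample", None)        # unlimited first dimension
    nc.createDimension("feature", 300_000)    # fixed second dimension
    var = nc.createVariable("data", "i2", ("sample", "feature"))  # int16
    for start in range(0, 27_000, 1_000):
        # stand-in for the real batches produced incrementally
        batch = np.zeros((1_000, 300_000), dtype="int16")
        var[start:start + 1_000, :] = batch   # append along the unlimited dim
```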
Trying to load a subset of the data into memory
I have a 1D Boolean mask `my_mask` (with ~15,000 `True` values) along the second dimension of the array that I'd like to use to index my array. When I do the following, the operation is very slow (I haven't seen it complete):
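(A sketch, with `feature` standing in for the actual name of the second dimension:)

```python
import xarray as xr

da = xr.open_dataset("data.nc")["data"]   # lazily opened, backed by the file
subset = da.isel(feature=my_mask).load()  # very slow, haven't seen it complete
```

However, I can load the entire array and then index (this is slow-ish, but works):

```python
da = xr.open_dataset("data.nc")["data"]
subset = da.load().isel(feature=my_mask)  # load everything, then index in memory
```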
Is this vectorized indexing?

I'm not sure if this is expected behavior: according to the Tip here in the User Guide, indexing is slow when using vectorized indexing, which I assumed to mean indexing along multiple dimensions (outer indexing, in `numpy` parlance). Is indexing using a 1D Boolean mask (or equivalently a 1D integer array) also slow?
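For concreteness, the equivalent integer-array form would be something like (again with `feature` as a placeholder dimension name):

```python
import numpy as np

idx = np.flatnonzero(my_mask)  # ~15,000 integer positions
subset = da.isel(feature=idx)  # 1D outer indexing along a single dimension
```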
What to do for larger datasets that don't fit in RAM?
Right now, I can `load` and then `isel` because my array fits in RAM. I have other datasets that don't fit in RAM: how would you recommend I load a subset of such data from disk?
In the event that I have to use `dask`, I will be writing along the first dimension (and hence chunking along that dimension, probably) and reading along the second dimension: is that going to be efficient (or at least more efficient than whatever `xarray` is doing sans `dask`)?
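Concretely, I imagine something like this (a sketch; the chunk size is a guess, not a tuned value, and the dimension names are placeholders):

```python
import xarray as xr

# chunk along the first (write) dimension...
da = xr.open_dataset("data.nc", chunks={"sample": 1_000})["data"]
# ...then read a subset along the second dimension
subset = da.isel(feature=my_mask).compute()
```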