Description
Minimal Complete Verifiable Example: Consider the following:
```python
import numpy as np
import xarray as xr

def print_da(da):
    print(da.values)
    return da

arr = xr.DataArray(np.arange(20), dims=['time'])

arr.rolling(time=4, min_periods=4) \
   .construct('window', stride=4) \
   .groupby('time') \
   .apply(print_da)
```
What you expected to happen:
```
[0. 1. 2. 3.]
[4. 5. 6. 7.]
[ 8. 9. 10. 11.]
[12. 13. 14. 15.]
[16. 17. 18. 19.]
```
What happened:
```
[nan nan nan 0.]
[1. 2. 3. 4.]
[5. 6. 7. 8.]
[ 9. 10. 11. 12.]
[13. 14. 15. 16.]
```
Calling `construct` on a rolling window with a `stride` parameter produces unexpected results. After reading through the documentation, I now understand that `rolling.construct` mostly ignores the `min_periods` argument, so it may be fair to call this the expected output. That said, it is currently impossible to achieve the windowing result I want in xarray without reconstructing the whole array in memory, which is wasteful and, in some cases, inflates the array beyond my machine's memory capacity.
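For reference, the windowing I am after is easy to express with a plain Python loop (a minimal sketch; `iter_strided_windows` is a hypothetical helper I wrote for illustration, not an xarray API), although it forfeits xarray's vectorized rolling machinery:

```python
import numpy as np
import xarray as xr

arr = xr.DataArray(np.arange(20), dims=['time'])  # same data as above

def iter_strided_windows(da, dim, window, stride):
    # Basic slicing returns views of the underlying buffer, so no
    # window is ever copied or padded with NaNs.
    n = da.sizes[dim]
    for start in range(0, n - window + 1, stride):
        yield da.isel({dim: slice(start, start + window)})

for win in iter_strided_windows(arr, 'time', window=4, stride=4):
    print(win.values)
# [0 1 2 3]
# [4 5 6 7]
# [ 8  9 10 11]
# [12 13 14 15]
# [16 17 18 19]
```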
Anything else we need to know?:
My use case is a strided FFT computation along time-series data (a spectrogram), which is one of the proposed use cases of the `rolling.construct` method. I want a rolling, constant-size window with strided overlap that sweeps across the data with no NaNs, yielding however many full windows fit (`min_periods=None`, i.e. the window size). I can get close to my desired result with:
```python
arr.rolling(time=4) \
   .construct('window', stride=4) \
   .dropna('time') \
   .groupby('time') \
   .apply(print_da)
```
but this gives me the (expected) result:
```
[1. 2. 3. 4.]
[5. 6. 7. 8.]
[ 9. 10. 11. 12.]
[13. 14. 15. 16.]
```
Note how the first sample of the data is dropped. My synthesized data has a phase property that requires the first window to consist of the first n samples of the data; in short, I need that first sample. I can get my desired result by not using the `stride` keyword argument:
```python
arr.rolling(time=4) \
   .construct('window') \
   .dropna('time') \
   .isel(time=slice(None, None, 4)) \
   .groupby('time') \
   .apply(print_da)
```
But this implementation inflates the in-memory representation of the data and is, for obvious reasons, very slow. On my actual time-series dataset, this approach raises a memory error after inflating a 200 MB dataset to several tens of gigabytes.
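For comparison, the same windows can be built as a zero-copy view at the NumPy level with `numpy.lib.stride_tricks.sliding_window_view` (a sketch only; that function requires NumPy >= 1.20, newer than the 1.18.5 in my environment below, and coordinates would have to be rebuilt by hand):

```python
import numpy as np
import xarray as xr

arr = xr.DataArray(np.arange(20), dims=['time'])  # same data as above

# Every length-4 window as a strided view (no copy), then subsampled
# to every 4th window (still a view).
views = np.lib.stride_tricks.sliding_window_view(arr.values, 4)[::4]
windowed = xr.DataArray(views, dims=['time', 'window'])
print(windowed.values)
# [[ 0  1  2  3]
#  [ 4  5  6  7]
#  [ 8  9 10 11]
#  [12 13 14 15]
#  [16 17 18 19]]
```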
There are other ways to achieve this result, but none of them met my needs. One possibility is `groupby_bins`, but that requires specifying the number of bins along the axis; I care less about the number of bins than that they all have the same size and a consistent stride, which is exactly the functionality that pointed me to `rolling`. I could also use `rolling.reduce` followed by a `dropna`, but that requires rewriting my chunked analysis method to operate only on the raw values of the array, without access to the coordinates associated with each chunk. I find the `rolling.reduce` methodology quite counterintuitive and would prefer a `rolling.apply` method instead, but that is a separate feature request.
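To illustrate that limitation, here is roughly what the `rolling.reduce` route looks like (a sketch; `window_sum` is just an example reduction I made up, and it receives a bare array plus an `axis` argument rather than a DataArray with `time` coordinates):

```python
import numpy as np
import xarray as xr

arr = xr.DataArray(np.arange(20), dims=['time'])  # same data as above

# The callback sees only raw window values plus an axis argument; the
# 'time' coordinates of each window are not available here.
def window_sum(values, axis):
    return values.sum(axis=axis)

result = arr.rolling(time=4, min_periods=4) \
            .reduce(window_sum) \
            .dropna('time') \
            .isel(time=slice(None, None, 4))
print(result.values)  # [ 6. 22. 38. 54. 70.]
```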
Environment:
Output of `xr.show_versions()`:

```
INSTALLED VERSIONS
------------------
commit: None
python: 3.7.6 | packaged by conda-forge | (default, Jun 1 2020, 18:57:50) [GCC 7.5.0]
python-bits: 64
OS: Linux
OS-release: 4.15.0-115-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: en_US.UTF-8
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8
libhdf5: 1.10.6
libnetcdf: None

xarray: 0.15.1
pandas: 1.0.5
numpy: 1.18.5
scipy: 1.4.1
netCDF4: None
pydap: None
h5netcdf: 0.8.1
h5py: 2.10.0
Nio: None
zarr: None
cftime: None
nc_time_axis: None
PseudoNetCDF: None
rasterio: None
cfgrib: None
iris: None
bottleneck: None
dask: 2.19.0
distributed: None
matplotlib: 3.2.2
cartopy: None
seaborn: 0.10.1
numbagg: None
setuptools: 47.3.1.post20200616
pip: 20.1.1
conda: 4.8.3
pytest: None
IPython: 7.15.0
sphinx: 3.1.1
```