
open_mfdataset: skip loading for indexes and coordinates from all but the first file #2039

Open
crusaderky opened this issue Apr 5, 2018 · 1 comment

@crusaderky (Contributor)

This is a follow-up from #1521.

When invoking open_mfdataset, the user very frequently knows in advance that all coords that aren't
on the concat_dim are already aligned, and may be willing to blindly trust that assumption in exchange for a huge performance boost.

My production data: 200x NetCDF files on a not very performant NFS file system, concatenated on the "scenario" dimension:

xarray.open_mfdataset('cube.*.nc', engine='h5netcdf', concat_dim='scenario')

<xarray.Dataset>
Dimensions:      (attribute: 1, fx_id: 40, instr_id: 10765, scenario: 500001, timestep: 1)
Coordinates:
  * attribute    (attribute) object 'THEO/Value'
    currency     (instr_id) object 'ZAR' 'EUR' 'EUR' 'EUR' 'EUR' 'EUR' 'GBP' ...
  * fx_id        (fx_id) object 'GBP' 'USD' 'EUR' 'JPY' 'ARS' 'AUD' 'BRL' ...
  * instr_id     (instr_id) object 'S01626556_ZAE000204921' '537805_1275' ...
  * timestep     (timestep) datetime64[ns] 2016-12-31
    type         (instr_id) object 'American' 'Bond Future' 'Bond Future' ...
  * scenario     (scenario) object 'Base Scenario' 'SSMC_1' 'SSMC_2' ...
Data variables:
    FX           (fx_id, timestep, scenario) float64 dask.array<shape=(40, 1, 500001), chunksize=(40, 1, 2501)>
    instruments  (instr_id, attribute, timestep, scenario) float64 dask.array<shape=(10765, 1, 1, 500001), chunksize=(10765, 1, 1, 2501)>

CPU times: user 19.6 s, sys: 981 ms, total: 20.6 s
Wall time: 24.4 s

If I skip loading and comparing the non-index coords from all 200 files:

xarray.open_mfdataset('cube.*.nc', engine='h5netcdf', concat_dim='scenario', coords='all')

<xarray.Dataset>
Dimensions:      (attribute: 1, fx_id: 40, instr_id: 10765, scenario: 500001, timestep: 1)
Coordinates:
  * attribute    (attribute) object 'THEO/Value'
  * fx_id        (fx_id) object 'GBP' 'USD' 'EUR' 'JPY' 'ARS' 'AUD' 'BRL' ...
  * instr_id     (instr_id) object 'S01626556_ZAE000204921' '537805_1275' ...
  * timestep     (timestep) datetime64[ns] 2016-12-31
    currency     (scenario, instr_id) object dask.array<shape=(500001, 10765), chunksize=(2501, 10765)>
  * scenario     (scenario) object 'Base Scenario' 'SSMC_1' 'SSMC_2' ...
    type         (scenario, instr_id) object dask.array<shape=(500001, 10765), chunksize=(2501, 10765)>
Data variables:
    FX           (fx_id, timestep, scenario) float64 dask.array<shape=(40, 1, 500001), chunksize=(40, 1, 2501)>
    instruments  (instr_id, attribute, timestep, scenario) float64 dask.array<shape=(10765, 1, 1, 500001), chunksize=(10765, 1, 1, 2501)>

CPU times: user 12.7 s, sys: 305 ms, total: 13 s
Wall time: 14.8 s

If I also skip loading and comparing the index coords from all 200 files:

cube = xarray.open_mfdataset(sh.resolve_env(f'{dynamic}/mtf/{cubename}/nc/cube.*.nc'), engine='h5netcdf',
                             concat_dim='scenario', 
                             drop_variables=['attribute', 'fx_id', 'instr_id', 'timestep', 'currency', 'type'])

<xarray.Dataset>
Dimensions:      (attribute: 1, fx_id: 40, instr_id: 10765, scenario: 500001, timestep: 1)
Coordinates:
  * scenario     (scenario) object 'Base Scenario' 'SSMC_1' 'SSMC_2' ...
Dimensions without coordinates: attribute, fx_id, instr_id, timestep
Data variables:
    FX           (fx_id, timestep, scenario) float64 dask.array<shape=(40, 1, 500001), chunksize=(40, 1, 2501)>
    instruments  (instr_id, attribute, timestep, scenario) float64 dask.array<shape=(10765, 1, 1, 500001), chunksize=(10765, 1, 1, 2501)>

CPU times: user 7.31 s, sys: 61 ms, total: 7.37 s
Wall time: 9.05 s

Proposed design

Add a new optional parameter to open_mfdataset, assume_aligned=None.
It accepts a list of variable names or "all", and requires concat_dim to be explicitly set.
It causes open_mfdataset to use the first occurrence of every such variable and blindly skip loading the subsequent ones.

Algorithm

  1. Perform the first invocation of the underlying open_dataset as it happens now.
  2. If assume_aligned is not None: for each subsequent NetCDF file, figure out which variables would need to be aligned and compared (as opposed to concatenated), and add them to a drop_variables list.
  3. If assume_aligned != "all": intersect the list with the trusted names (drop_variables &= assume_aligned).
  4. Pass the increasingly long drop_variables list to the underlying open_dataset.
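The accumulation in steps 2-4 can be sketched in pure Python, without xarray. Here `mfdataset_drop_plan`, the per-file `{variable: dims}` mapping, and the `assume_aligned` parameter are all hypothetical stand-ins for the proposal, not existing xarray API:

```python
def mfdataset_drop_plan(file_vars, concat_dim, assume_aligned):
    """Return, per file, the set of variables that could be skipped on load.

    file_vars: one {var_name: dims_tuple} mapping per file, in open order.
    assume_aligned: None, "all", or an iterable of trusted variable names.
    """
    plans = [set()]  # the first file is always loaded in full
    if assume_aligned is None:
        plans.extend(set() for _ in file_vars[1:])
        return plans
    drop = set()
    for vars_ in file_vars[1:]:
        # Variables not varying along concat_dim would be aligned/compared
        # rather than concatenated, so they are the skip candidates.
        candidates = {name for name, dims in vars_.items()
                      if concat_dim not in dims}
        if assume_aligned != "all":
            candidates &= set(assume_aligned)
        drop |= candidates
        plans.append(set(drop))
    return plans
```

Each entry of the returned plan would be forwarded as drop_variables to the underlying open_dataset call for that file.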
rabernat commented Apr 5, 2018

I agree it would be great to have this feature.

There has already been lots of discussion of this on #1385 and #1823. I tried and failed to implement something similar in #1413. I recommend reviewing those threads before jumping into this.

@dcherian dcherian mentioned this issue Aug 1, 2019
@dcherian dcherian changed the title open_mfdataset to blindly trust alignment open_mfdataset: skip loading for indexes and coordinates from all but the first file Sep 16, 2019