
open_mfdataset: skip loading for indexes and coordinates from all but the first file #2039

Open
crusaderky opened this issue Apr 5, 2018 · 1 comment

@crusaderky (Contributor)

This is a follow-up from #1521.

When invoking open_mfdataset, the user very frequently knows in advance that all coords that aren't
on the concat_dim are already aligned, and may be willing to blindly trust that assumption in exchange for a huge performance boost.

My production data: 200x NetCDF files on a not very performant NFS file system, concatenated on the "scenario" dimension:

xarray.open_mfdataset('cube.*.nc', engine='h5netcdf', concat_dim='scenario')

<xarray.Dataset>
Dimensions:      (attribute: 1, fx_id: 40, instr_id: 10765, scenario: 500001, timestep: 1)
Coordinates:
  * attribute    (attribute) object 'THEO/Value'
    currency     (instr_id) object 'ZAR' 'EUR' 'EUR' 'EUR' 'EUR' 'EUR' 'GBP' ...
  * fx_id        (fx_id) object 'GBP' 'USD' 'EUR' 'JPY' 'ARS' 'AUD' 'BRL' ...
  * instr_id     (instr_id) object 'S01626556_ZAE000204921' '537805_1275' ...
  * timestep     (timestep) datetime64[ns] 2016-12-31
    type         (instr_id) object 'American' 'Bond Future' 'Bond Future' ...
  * scenario     (scenario) object 'Base Scenario' 'SSMC_1' 'SSMC_2' ...
Data variables:
    FX           (fx_id, timestep, scenario) float64 dask.array<shape=(40, 1, 500001), chunksize=(40, 1, 2501)>
    instruments  (instr_id, attribute, timestep, scenario) float64 dask.array<shape=(10765, 1, 1, 500001), chunksize=(10765, 1, 1, 2501)>

CPU times: user 19.6 s, sys: 981 ms, total: 20.6 s
Wall time: 24.4 s

If I skip loading and comparing the non-index coords from all 200 files:

xarray.open_mfdataset('cube.*.nc', engine='h5netcdf', concat_dim='scenario', coords='all')

<xarray.Dataset>
Dimensions:      (attribute: 1, fx_id: 40, instr_id: 10765, scenario: 500001, timestep: 1)
Coordinates:
  * attribute    (attribute) object 'THEO/Value'
  * fx_id        (fx_id) object 'GBP' 'USD' 'EUR' 'JPY' 'ARS' 'AUD' 'BRL' ...
  * instr_id     (instr_id) object 'S01626556_ZAE000204921' '537805_1275' ...
  * timestep     (timestep) datetime64[ns] 2016-12-31
    currency     (scenario, instr_id) object dask.array<shape=(500001, 10765), chunksize=(2501, 10765)>
  * scenario     (scenario) object 'Base Scenario' 'SSMC_1' 'SSMC_2' ...
    type         (scenario, instr_id) object dask.array<shape=(500001, 10765), chunksize=(2501, 10765)>
Data variables:
    FX           (fx_id, timestep, scenario) float64 dask.array<shape=(40, 1, 500001), chunksize=(40, 1, 2501)>
    instruments  (instr_id, attribute, timestep, scenario) float64 dask.array<shape=(10765, 1, 1, 500001), chunksize=(10765, 1, 1, 2501)>

CPU times: user 12.7 s, sys: 305 ms, total: 13 s
Wall time: 14.8 s

If I also skip loading and comparing the index coords from all 200 files:

cube = xarray.open_mfdataset(sh.resolve_env(f'{dynamic}/mtf/{cubename}/nc/cube.*.nc'), engine='h5netcdf',
                             concat_dim='scenario', 
                             drop_variables=['attribute', 'fx_id', 'instr_id', 'timestep', 'currency', 'type'])

<xarray.Dataset>
Dimensions:      (attribute: 1, fx_id: 40, instr_id: 10765, scenario: 500001, timestep: 1)
Coordinates:
  * scenario     (scenario) object 'Base Scenario' 'SSMC_1' 'SSMC_2' ...
Dimensions without coordinates: attribute, fx_id, instr_id, timestep
Data variables:
    FX           (fx_id, timestep, scenario) float64 dask.array<shape=(40, 1, 500001), chunksize=(40, 1, 2501)>
    instruments  (instr_id, attribute, timestep, scenario) float64 dask.array<shape=(10765, 1, 1, 500001), chunksize=(10765, 1, 1, 2501)>

CPU times: user 7.31 s, sys: 61 ms, total: 7.37 s
Wall time: 9.05 s

Proposed design

Add a new optional parameter to open_mfdataset, assume_aligned=None.
It accepts a list of variable names or "all", and requires concat_dim to be explicitly set.
It causes open_mfdataset to use the first occurrence of every such variable and blindly skip loading the subsequent ones.

Algorithm

  1. Perform the first invocation of the underlying open_dataset as it happens now.
  2. If assume_aligned is not None: for each subsequent NetCDF file, figure out which variables would need to be aligned and compared (as opposed to concatenated), and add them to a drop_variables list.
  3. If assume_aligned != "all": intersect the list with the trusted names (drop_variables &= assume_aligned).
  4. Pass the increasingly long drop_variables list to the underlying open_dataset.
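The accumulation in steps 2-4 can be sketched in pure Python, without xarray. Here `mfdataset_drop_plan`, the per-file `{variable: dims}` mapping, and the `assume_aligned` parameter are all hypothetical stand-ins for the proposal, not existing xarray API:

```python
def mfdataset_drop_plan(file_vars, concat_dim, assume_aligned):
    """Return, per file, the set of variables that could be skipped on load.

    file_vars: one {var_name: dims_tuple} mapping per file, in open order.
    assume_aligned: None, "all", or an iterable of trusted variable names.
    """
    plans = [set()]  # the first file is always loaded in full
    if assume_aligned is None:
        plans.extend(set() for _ in file_vars[1:])
        return plans
    drop = set()
    for vars_ in file_vars[1:]:
        # Variables not varying along concat_dim would be aligned/compared
        # rather than concatenated, so they are the skip candidates.
        candidates = {name for name, dims in vars_.items()
                      if concat_dim not in dims}
        if assume_aligned != "all":
            candidates &= set(assume_aligned)
        drop |= candidates
        plans.append(set(drop))
    return plans
```

Each entry of the returned plan would be forwarded as drop_variables to the underlying open_dataset call for that file.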
rabernat commented Apr 5, 2018

I agree it would be great to have this feature.

There has already been lots of discussion of this on #1385 and #1823. I tried and failed to implement something similar in #1413. I recommend reviewing those threads before jumping into this.

@dcherian dcherian mentioned this issue Aug 1, 2019
@dcherian dcherian changed the title open_mfdataset to blindly trust alignment open_mfdataset: skip loading for indexes and coordinates from all but the first file Sep 16, 2019