API for N-dimensional combine #2616

Merged: 111 commits, Jun 25, 2019

Changes from 59 commits

Commits
88ee12a
concatenates along a single dimension
TomNicholas Nov 5, 2018
1aaa075
Wrote function to find correct tile_IDs from nested list of datasets
TomNicholas Nov 6, 2018
dbb371d
Wrote function to check that combined_tile_ids structure is valid
TomNicholas Nov 7, 2018
cc4d743
Added test of 2d-concatenation
TomNicholas Nov 7, 2018
d2fc7e7
Tests now check that dataset ordering is correct
TomNicholas Nov 8, 2018
e3f3699
Test concatenation along a new dimension
TomNicholas Nov 8, 2018
55bf685
Started generalising auto_combine to N-D by integrating the N-D conca…
TomNicholas Nov 9, 2018
845206c
All unit tests now passing
TomNicholas Nov 9, 2018
fb66626
Merge branch 'real_master' into feature/nd_combine
TomNicholas Nov 10, 2018
f4e9aad
Fixed a failing test which I didn't notice because I don't have pseud…
TomNicholas Nov 10, 2018
00004a1
Began updating open_mfdataset to handle N-D input
TomNicholas Nov 14, 2018
b41e374
Refactored to remove duplicate logic in open_mfdataset & auto_combine
TomNicholas Nov 14, 2018
8672a79
Implemented Shoyer's suggestion in #2553 to rewrite the recursive nest…
TomNicholas Nov 14, 2018
4f56b24
--amend
TomNicholas Nov 14, 2018
4cfaf2e
Now raises ValueError if input not ordered correctly before concatena…
TomNicholas Nov 14, 2018
9fd1413
Added some more prototype tests defining desired behaviour more clearly
TomNicholas Nov 22, 2018
8ad0121
Now raises informative errors on invalid forms of input
TomNicholas Nov 24, 2018
4b2c544
Refactoring to also merge along each dimension
TomNicholas Nov 25, 2018
3d0061e
Refactored to literally just apply the old auto_combine along each di…
TomNicholas Nov 25, 2018
60c93ba
Added unit tests for open_mfdataset
TomNicholas Nov 26, 2018
1824538
Removed TODOs
TomNicholas Nov 26, 2018
d380815
Removed format strings
TomNicholas Nov 30, 2018
c4bb8d0
test_get_new_tile_ids now doesn't assume dicts are ordered
TomNicholas Nov 30, 2018
6b7f889
Fixed failing tests on python3.5 caused by accidentally assuming dict…
TomNicholas Nov 30, 2018
58a3648
Test for getting new tile id
TomNicholas Nov 30, 2018
a12a34a
Fixed itertoolz import so that it's compatible with older versions
TomNicholas Nov 30, 2018
ada1f4a
Increased test coverage
TomNicholas Dec 1, 2018
ef0a30e
Added toolz as an explicit dependency to pass tests on python2.7
TomNicholas Dec 1, 2018
3be70bc
Updated 'what's new'
TomNicholas Dec 1, 2018
f266bc3
No longer attempts to shortcut all concatenation at once if concat_di…
TomNicholas Dec 1, 2018
cf49c2b
Merge branch 'master' into feature/nd_combine
TomNicholas Dec 1, 2018
878e1f9
Rewrote using itertools.groupby instead of toolz.itertoolz.groupby to…
TomNicholas Dec 1, 2018
7dea14f
Merged changes from master
TomNicholas Dec 1, 2018
e6f25a3
Fixed erroneous removal of utils import
TomNicholas Dec 1, 2018
f856485
Updated docstrings to include an example of multidimensional concaten…
TomNicholas Dec 2, 2018
6305d83
Clarified auto_combine docstring for N-D behaviour
TomNicholas Dec 5, 2018
ce59da1
Added unit test for nested list of Datasets with different variables
TomNicholas Dec 10, 2018
9fb34cf
Minor spelling and pep8 fixes
TomNicholas Dec 10, 2018
83dedb3
Started working on a new api with both auto_combine and manual_combine
TomNicholas Dec 11, 2018
de199a0
Merged master
TomNicholas Dec 17, 2018
3e64a83
Wrote basic function to infer concatenation order from coords.
TomNicholas Jan 3, 2019
963c794
Attempt at finalised version of public-facing API.
TomNicholas Jan 4, 2019
1a66530
No longer uses entire old auto_combine internally, only concat or merge
TomNicholas Jan 4, 2019
38d265e
Merged v0.11.1 and v0.11.2 changes
TomNicholas Jan 4, 2019
7525b23
Updated what's new
TomNicholas Jan 4, 2019
92e120a
Removed unneeded addition to what's new for old release
TomNicholas Jan 4, 2019
13a7f75
Fixed incomplete merge in docstring for open_mfdataset
TomNicholas Jan 4, 2019
b76e681
Tests for manual combine passing
TomNicholas Jan 6, 2019
c09df8b
Tests for auto_combine now passing
TomNicholas Jan 6, 2019
953d572
xfailed weird behaviour with manual_combine trying to determine conca…
TomNicholas Jan 6, 2019
b7bf1ad
Add auto_combine and manual_combine to API page of docs
TomNicholas Jan 6, 2019
855d819
Tests now passing for open_mfdataset
TomNicholas Jan 6, 2019
de7965e
Attempted to merge master in, but #2648 has stumped me
TomNicholas Jan 6, 2019
bfcb4e3
Completed merge so that #2648 is respected, and added tests.
TomNicholas Jan 7, 2019
eb053cc
Separated the tests for concat and both combines
TomNicholas Jan 7, 2019
97e508c
Some PEP8 fixes
TomNicholas Jan 7, 2019
410b138
Pre-empting a test which will fail with opening uamiv format
TomNicholas Jan 7, 2019
02b6d05
Satisfy pep8speaks bot
TomNicholas Jan 7, 2019
0d6f13a
Python 3.5 compatible after changing some error string formatting
TomNicholas Jan 7, 2019
18e0074
Order coords using pandas.Index objects
TomNicholas Jan 7, 2019
67f11f3
Fixed performance bug from GH #2662
TomNicholas Jan 15, 2019
3b843f5
Removed ToDos about natural sorting of string coords
TomNicholas Jan 23, 2019
540d3d4
Merged master into branch
TomNicholas Jan 23, 2019
bb98d54
Generalized auto_combine to handle monotonically-decreasing coords too
TomNicholas Jan 24, 2019
e3f7523
Added more examples to docstring for manual_combine
TomNicholas Jan 28, 2019
fc36b74
Merged master - includes py2 deprecation
TomNicholas Jan 28, 2019
d96595e
Added note about globbing aspect of open_mfdataset
TomNicholas Jan 28, 2019
79f09c0
Removed auto-inferring of concatenation dimension in manual_combine
TomNicholas Jan 28, 2019
e32adb3
Added example to docstring for auto_combine
TomNicholas Jan 28, 2019
da4d605
Minor correction to docstring
TomNicholas Jan 28, 2019
c4fe22c
Another very minor docstring correction
TomNicholas Jan 28, 2019
66b4c4f
Added test to guard against issue #2777
TomNicholas Feb 27, 2019
90f0c1d
Started deprecation cycle for auto_combine
TomNicholas Mar 2, 2019
0990dd4
Fully reverted open_mfdataset tests
TomNicholas Mar 3, 2019
d6277be
Updated what's new to match deprecation cycle
TomNicholas Mar 3, 2019
b81e77a
Merge branch 'real_master' into feature/nd_combine_new_api
TomNicholas Mar 3, 2019
bf7d549
Reverted uamiv test
TomNicholas Mar 3, 2019
f00770f
Removed dependency on itertools
TomNicholas Mar 3, 2019
c7c1746
Deprecation tests fixed
TomNicholas Mar 3, 2019
f6192ca
Satisfy pycodestyle
TomNicholas Mar 3, 2019
88f089e
Started deprecation cycle of auto_combine
TomNicholas Mar 18, 2019
2849559
merged changes from master for v0.12
TomNicholas Mar 18, 2019
535bc31
Added specific error for edge case combine_manual can't handle
TomNicholas Mar 18, 2019
5d818e0
Check that global coordinates are monotonic
TomNicholas Mar 18, 2019
42cd05d
Highlighted weird behaviour when concatenating with no data variables
TomNicholas Mar 18, 2019
8a83814
Added test for impossible-to-auto-combine coordinates
TomNicholas Mar 18, 2019
e4acbdc
Removed unneeded test
TomNicholas Mar 18, 2019
8e767e2
Satisfy linter
TomNicholas Mar 18, 2019
3d04112
Added airspeedvelocity benchmark for combining functions
TomNicholas Mar 18, 2019
06ecef6
Benchmark will take longer now
TomNicholas Mar 18, 2019
513764f
Updated version numbers in deprecation warnings to fit with recent re…
TomNicholas Mar 18, 2019
13364ff
Updated api docs for new function names
TomNicholas May 18, 2019
ddfc6dd
Fixed docs build failure
TomNicholas May 18, 2019
e471a42
Revert "Fixed docs build failure"
TomNicholas May 19, 2019
2d5b90f
Updated documentation with section explaining new functions
TomNicholas May 19, 2019
8cbf5e1
Merged master
TomNicholas May 19, 2019
9ead34e
Suppressed deprecation warnings in test suite
TomNicholas May 20, 2019
fab3586
Resolved ToDo by pointing to issue with concat, see #2975
TomNicholas May 20, 2019
9d5e29f
Various docs fixes
TomNicholas May 20, 2019
9a33ac6
Merged master, resolving conflicts with #2964
TomNicholas May 28, 2019
ae7b811
Slightly renamed tests to match new name of tested function
TomNicholas May 28, 2019
f4fc03d
Included minor suggestions from shoyer
TomNicholas May 28, 2019
917ebee
Removed trailing whitespace
TomNicholas May 28, 2019
1e537ba
Simplified error message for case combine_manual can't handle
TomNicholas May 29, 2019
7d6845b
Removed filter for deprecation warnings, and added test for if user d…
TomNicholas May 29, 2019
5083471
Simple fixes suggested by shoyer
TomNicholas Jun 21, 2019
4cc70ae
Change deprecation warning behaviour
TomNicholas Jun 21, 2019
537c405
Merged in recent changes to master
TomNicholas Jun 21, 2019
2f54127
Merge branch 'master' into feature/nd_combine_new_api
dcherian Jun 25, 2019
357531f
linting
TomNicholas Jun 25, 2019
e006875
Merge branch 'feature/nd_combine_new_api' of https://github.com/TomNi…
TomNicholas Jun 25, 2019
2 changes: 2 additions & 0 deletions doc/api.rst
@@ -19,6 +19,8 @@ Top-level functions
   broadcast
   concat
   merge
   auto_combine
   manual_combine
   where
   set_options
   full_like
17 changes: 17 additions & 0 deletions doc/whats-new.rst
@@ -25,6 +25,23 @@ Breaking changes
  Python 3 only. (:issue:`1876`).
  By `Joe Hamman <https://github.com/jhamman>`_.


- Combining datasets along N dimensions:

  - ``open_mfdataset`` and ``auto_combine`` can now combine datasets along
    any number of dimensions, instead of just a one-dimensional list of
    datasets.

    If the datasets have monotonic global dimension coordinates then the new
    ``auto_combine`` should be used. If not, then the new ``manual_combine``
    will accept the datasets as a nested list-of-lists, and combine them by
    applying a series of concat and merge operations.

    Breaking because some lists that were previously valid inputs to
    ``open_mfdataset`` and ``auto_combine`` may no longer be valid, and
    should now be combined explicitly using ``manual_combine`` instead.
    (:issue:`2159`) By `Tom Nicholas <http://github.com/TomNicholas>`_.
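
  A minimal sketch of the two functions this entry describes (the filenames
  are hypothetical, and the call signatures are assumed from this entry and
  the docstrings elsewhere in this PR)::

      import xarray as xr

      # Four files laid out in a 2x2 grid along dimensions 'x' and 'y'
      grid = [[xr.open_dataset('x0y0.nc'), xr.open_dataset('x0y1.nc')],
              [xr.open_dataset('x1y0.nc'), xr.open_dataset('x1y1.nc')]]

      # manual_combine trusts the order of the nested list, outer dim first
      combined = xr.manual_combine(grid, concat_dim=['x', 'y'])

      # auto_combine instead orders the datasets by inspecting their
      # (monotonic) dimension coordinates, so a flat list suffices
      combined = xr.auto_combine([ds for row in grid for ds in row])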


Enhancements
~~~~~~~~~~~~

3 changes: 2 additions & 1 deletion xarray/__init__.py
@@ -9,7 +9,8 @@

from .core.alignment import align, broadcast, broadcast_arrays
from .core.common import full_like, zeros_like, ones_like
from .core.concat import concat
from .core.combine import auto_combine, manual_combine
from .core.computation import apply_ufunc, dot, where
from .core.extensions import (register_dataarray_accessor,
                              register_dataset_accessor)
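
From a user's perspective the public names all remain importable from the
top-level namespace; a quick check (assuming this branch is installed):

    import xarray as xr

    # concat now lives in xarray.core.concat, and manual_combine is new,
    # but all three are re-exported at the top level
    print(xr.concat, xr.auto_combine, xr.manual_combine)
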
119 changes: 64 additions & 55 deletions xarray/backends/api.py
@@ -8,12 +8,13 @@

import numpy as np

from .. import Dataset, DataArray, backends, conventions
from ..core import indexing
from .. import auto_combine
from ..core.combine import (_manual_combine, _CONCAT_DIM_DEFAULT,
                            _infer_concat_order_from_positions)
from ..core.pycompat import basestring, path_type
from ..core.utils import (close_on_error, is_grib_path, is_remote_uri)
from .common import ArrayWriter
from .locks import _get_scheduler

@@ -487,35 +488,42 @@ def close(self):
def open_mfdataset(paths, chunks=None, concat_dim=_CONCAT_DIM_DEFAULT,
                   compat='no_conflicts', preprocess=None, engine=None,
                   lock=None, data_vars='all', coords='different',
                   combine='auto', autoclose=None, parallel=False, **kwargs):
"""Open multiple files as a single dataset.

If combine='auto' then the function `auto_combine` is used to combine the
datasets into one before returning the result, and if combine='manual' then
`manual_combine` is used. The filepaths must be structured according to
which combining function is used, the details of which are given in the
documentation for ``auto_combine`` and ``manual_combine``.
Requires dask to be installed. See documentation for details on dask [1].
Attributes from the first dataset file are used for the combined dataset.

Parameters
----------
paths : str or sequence
        Either a string glob in the form "path/to/my/files/*.nc" or an
        explicit list of files to open. Paths can be given as strings or as
        pathlib Paths. If concatenation along more than one dimension is
        desired, then ``paths`` must be a nested list-of-lists (see
        ``manual_combine`` for details).
    chunks : int or dict, optional
        Dictionary with keys given by dimension names and values given by
        chunk sizes. In general, these should divide the dimensions of each
        dataset. If int, chunk each dimension by ``chunks``. By default,
        chunks will be chosen to load entire input files into memory at once.
        This has a major impact on performance: please see the full
        documentation for more details [2].
    concat_dim : str, or list of str, DataArray, Index or None, optional
        Dimensions to concatenate files along. You only need to provide this
        argument if any of the dimensions along which you want to concatenate
        is not a dimension in the original datasets, e.g., if you want to
        stack a collection of 2D arrays along a third dimension. By default,
        xarray attempts to infer this argument by examining component files.
        Set ``concat_dim=[..., None, ...]`` explicitly to disable
        concatenation along a particular dimension.
    compat : {'identical', 'equals', 'broadcast_equals',
              'no_conflicts'}, optional
        String indicating how to compare variables of the same name for
        potential conflicts when merging:
        * 'broadcast_equals': all values must be equal when variables are
@@ -542,20 +550,18 @@ def open_mfdataset(paths, chunks=None, concat_dim=_CONCAT_DIM_DEFAULT,
        active dask scheduler.
    data_vars : {'minimal', 'different', 'all' or list of str}, optional
        These data variables will be concatenated together:
        * 'minimal': Only data variables in which the dimension already
          appears are included.
        * 'different': Data variables which are not equal (ignoring
          attributes) across all datasets are also concatenated (as well as
          all for which dimension already appears). Beware: this option may
          load the data payload of data variables into memory if they are not
          already loaded.
        * 'all': All data variables will be concatenated.
        * list of str: The listed data variables will be concatenated, in
          addition to the 'minimal' data variables.
    coords : {'minimal', 'different', 'all' or list of str}, optional
        These coordinate variables will be concatenated together:
        * 'minimal': Only coordinates in which the dimension already appears
          are included.
        * 'different': Coordinates which are not equal (ignoring attributes)
@@ -570,6 +576,9 @@ def open_mfdataset(paths, chunks=None, concat_dim=_CONCAT_DIM_DEFAULT,
    parallel : bool, optional
        If True, the open and preprocess steps of this function will be
        performed in parallel using ``dask.delayed``. Default is False.
    combine : {'auto', 'manual'}, optional
        Whether ``xarray.auto_combine`` or ``xarray.manual_combine`` is used
        to combine all the data. Default is 'auto'.
    **kwargs : optional
        Additional arguments passed on to :py:func:`xarray.open_dataset`.

@@ -580,6 +589,7 @@ def open_mfdataset(paths, chunks=None, concat_dim=_CONCAT_DIM_DEFAULT,
    See Also
    --------
    auto_combine
    manual_combine
    open_dataset

    References
@@ -601,22 +611,15 @@ def open_mfdataset(paths, chunks=None, concat_dim=_CONCAT_DIM_DEFAULT,
    if not paths:
        raise IOError('no files to open')

    # If combine='auto' then this is unnecessary, but quick.
    # If combine='manual' then this creates a flat list which is easier to
    # iterate over, while saving the originally-supplied structure as "ids"
    if combine == 'manual':
        if concat_dim is not _CONCAT_DIM_DEFAULT:
            if isinstance(concat_dim, (str, DataArray)) or concat_dim is None:
                concat_dim = [concat_dim]
    combined_ids_paths, concat_dims = _infer_concat_order_from_positions(
        paths, concat_dim)
    ids, paths = (
        list(combined_ids_paths.keys()), list(combined_ids_paths.values()))

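For orientation, a conceptual sketch of the "ids" bookkeeping referred to
above (a simplified stand-in written for this review, not the actual helper
in xarray.core.combine):

    def infer_positions(nested):
        """Map each element of a 2-level nested list to a (row, col) tile ID."""
        return {(i, j): ds
                for i, row in enumerate(nested)
                for j, ds in enumerate(row)}

    print(infer_positions([['a', 'b'], ['c', 'd']]))
    # {(0, 0): 'a', (0, 1): 'b', (1, 0): 'c', (1, 1): 'd'}
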
@@ -644,18 +647,24 @@ def open_mfdataset(paths, chunks=None, concat_dim=_CONCAT_DIM_DEFAULT,
    # the underlying datasets will still be stored as dask arrays
    datasets, file_objs = dask.compute(datasets, file_objs)

    # Combine all datasets, closing them in case of a ValueError
    try:
        if combine == 'auto':
            # Will redo ordering from coordinates, ignoring how they were
            # ordered previously
            if concat_dim is not _CONCAT_DIM_DEFAULT:
                raise ValueError("Cannot specify dimensions to concatenate "
                                 "along when auto-combining")

            combined = auto_combine(datasets, compat=compat,
                                    data_vars=data_vars, coords=coords)
        else:
            # Combine the nested list by successive concat and merge
            # operations along each dimension, using the structure given
            # by "ids"
            combined = _manual_combine(datasets, concat_dims=concat_dim,
                                       compat=compat, data_vars=data_vars,
                                       coords=coords, ids=ids)
    except ValueError:
        for ds in datasets:
            ds.close()
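
Putting the new open_mfdataset signature together, a hedged usage sketch
(the file layout is hypothetical; parameter names are as in the docstring
above):

    import xarray as xr

    # 2D grid of files: outer list along 't', inner lists along 'x'.
    # Per the docstring, concat_dim=[None, 'x'] would instead disable
    # concatenation along the outer dimension.
    paths = [['t0_x0.nc', 't0_x1.nc'],
             ['t1_x0.nc', 't1_x1.nc']]
    ds = xr.open_mfdataset(paths, combine='manual', concat_dim=['t', 'x'])

    # With monotonic dimension coordinates, a flat glob plus the default
    # combine='auto' infers the order instead (note that passing concat_dim
    # together with combine='auto' raises a ValueError)
    ds = xr.open_mfdataset('all_files_*.nc', combine='auto')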