Concatenate across multiple dimensions with open_mfdataset #2159
Comments
Thanks for opening up this issue. This would be very helpful for the forecasting community as well, where we usually concatenate along Start time and Lead time dimensions. |
Since you linked to my SO answer, I will add that I think it is quite possible for us to develop this functionality in xarray. My view is that it will take a concerted effort by an interested developer to come up with an approach to do this but it is possible. Also, I seem to remember seeing this topic before in our issue tracker but I'm not finding it now. |
I agree with @jhamman that it would take effort from an interested developer to do this but in principle it's quite doable. I think our logic in auto_combine (which powers open_mfdataset) could probably be extended to handle concatenation across multiple dimensions. The main implementation would need to look at coordinates along concatenated dimensions to break the operation into multiple calls |
Another suggestion: as one of the obvious uses for this is in collecting the output from parallelized simulations, which always have ghost cells around the domain each processor computes on, would it be worth adding an option to throw those away as the mf dataset is loaded? Or is that a task better dealt with by slicing the resultant array after the fact? |
@TomNicholas I think you could use the existing preprocess argument to open_mfdataset for this. |
@shoyer At the risk of going off on a tangent - I think that only works if the number of guard cells you want to remove can be determined from the data in the dataset you're loading, because preprocess doesn't accept any further arguments. For example, say you want to remove all ghost cells except the ones at the edge of your simulation domain. If there's no information in each dataset which marks it as a dataset containing a simulation boundary region, then the preprocess function can't know to treat it differently without further arguments. I might be wrong though? |
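One possible way to hand extra arguments to a preprocess callable is to bind them in advance with functools.partial. The sketch below assumes every file carries the same number of ghost cells on both sides of one dimension; trim_ghost_cells, dim and n_ghost are names made up for illustration, and the per-file case described above (trimming only interior boundaries) would still need some marker inside each dataset.

```python
from functools import partial

import numpy as np
import xarray as xr


def trim_ghost_cells(ds, dim, n_ghost):
    """Drop n_ghost cells from both ends of ds along dim (hypothetical helper)."""
    return ds.isel({dim: slice(n_ghost, ds.sizes[dim] - n_ghost)})


# Small in-memory stand-in for the contents of one output file
ds = xr.Dataset({"T": ("x", np.arange(10.0))}, coords={"x": np.arange(10)})

# Bind the extra arguments so the callable takes a single dataset,
# which is the form a preprocess function is expected to have
preprocess = partial(trim_ghost_cells, dim="x", n_ghost=2)
trimmed = preprocess(ds)
```

The bound callable could then be passed as preprocess=preprocess to xr.open_mfdataset.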
Just wanted to add the same request;)
I also do not understand what the real complexity of implementing it is. |
@aluhamaa I don't think you're missing anything here. I agree that it would be pretty straightforward, it just would take a bit of work. |
👍 for this feature |
I just had the exact same problem, and while I didn't yet have time to dig into the source code of https://gist.github.com/jnhansen/fa474a536201561653f60ea33045f4e2 Maybe it's helpful to some of you. Note that I make the following assumptions (which are reasonable for my use case):
|
Thanks @jnhansen ! I actually ended up writing my own, much lower level, version of this using the netcdf library. The reason I did that was because I was finding it hard to work out how to merge multiple datasets, then write the data out to a new netcdf file in chunks - I kept accidentally loading the entire merged dataset into memory at once. This might just be because I wasn't using the dask integration properly though. Have you tried using your function to merge netcdf files, then write out a single file which is larger than RAM? Is that even possible in xarray? |
Yes,
ds = auto_merge('*.nc')
ds.to_netcdf('larger_than_memory.nc')
I tested it on a ~25 GB dataset (on a machine with less memory than that). Note: |
I've been looking through the functions The current behaviour isn't completely explicit, and I would like to check my understanding with a few questions:
grouped = itertoolz.groupby(lambda ds: tuple(sorted(ds.data_vars)), datasets).values() will only organise the datasets into groups according to the set of data variables they have; it doesn't order the datasets within each group according to the values in the dimension coordinates? We can show this because this (new) testcase fails:
@requires_dask
def test_auto_combine_along_coords(self):
    # drop the third dimension to keep things relatively understandable
    data = create_test_data()
    for k in list(data.variables):
        if 'dim3' in data[k].dims:
            del data[k]
    data_split1 = data.isel(dim2=slice(4))
    data_split2 = data.isel(dim2=slice(4, None))
    split_data = [data_split2, data_split1]  # Deliberately arrange datasets in wrong order
    assert_identical(data, auto_combine(split_data, 'dim2'))
with output
concatenated = [_auto_concat(ds, dim=dim, data_vars=data_vars, coords=coords) for ds in grouped]
Also,
# User specifies how they split up their domain
domain_decomposition_structure = how_was_this_parallelized('output.*.nc')
# Feeds this info into open_mfdataset
full_domain = xr.open_mfdataset('output.*.nc', positions=domain_decomposition_structure)
This approach would be much less general but would dodge the issue of writing generalized N-D auto-concatenation logic. Final point - this common use case also has the added complexity of having ghost or guard cells around every dataset, which should be thrown away. Clearly some user input is required here ( |
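The ordering problem in the failing test can be reproduced in one dimension with plain xr.concat. The sketch below is only an illustration on made-up data: it assumes each piece has a monotonically increasing index along "x" and that pieces do not overlap, so sorting by the first index value is enough to recover the order.

```python
import numpy as np
import xarray as xr

full = xr.Dataset({"T": ("x", np.arange(8.0))}, coords={"x": np.arange(8)})

# Two pieces, deliberately listed in the wrong order
pieces = [full.isel(x=slice(4, None)), full.isel(x=slice(4))]

# Concatenating in the order given keeps the wrong order
naive = xr.concat(pieces, dim="x")

# Ordering the pieces by the first value of their "x" index fixes it
ordered = sorted(pieces, key=lambda ds: ds.indexes["x"][0])
combined = xr.concat(ordered, dim="x")
```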
@TomNicholas I think your analysis is correct here. I suspect that in most cases we could figure out how to tile datasets by looking at 1D coordinates along each dimension (e.g., indexes for each dataset), e.g., to find a "chunk id" along each concatenated dimension. These could be used to build something like a NumPy object array of xarray.Dataset/DataArray objects, which could split up into a bunch of 1D calls to I would rather avoid using the
We could potentially just encourage using the existing |
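A rough sketch of the "chunk id" idea: rank each dataset along every dimension by the first value of its index. This is not xarray's implementation; infer_tile_ids is a hypothetical helper, and it assumes monotonically increasing, non-overlapping indexes and tiles that line up on a regular grid.

```python
import numpy as np
import xarray as xr


def infer_tile_ids(datasets, dims):
    """Map a tuple of integer positions (one per dim) to each dataset."""
    tile_ids = [() for _ in datasets]
    for dim in dims:
        # First index value of every dataset along this dimension
        starts = [ds.indexes[dim][0] for ds in datasets]
        unique_starts = sorted(set(starts))
        # Position of a dataset = rank of its start among the distinct starts
        tile_ids = [
            tile_id + (unique_starts.index(start),)
            for tile_id, start in zip(tile_ids, starts)
        ]
    return dict(zip(tile_ids, datasets))


full = xr.Dataset(
    {"T": (("x", "y"), np.arange(24.0).reshape(4, 6))},
    coords={"x": np.arange(4), "y": np.arange(6)},
)
# Four tiles of a 2 x 2 decomposition, in scrambled order
pieces = [
    full.isel(x=slice(2, None), y=slice(3, None)),
    full.isel(x=slice(2), y=slice(3)),
    full.isel(x=slice(2, None), y=slice(3)),
    full.isel(x=slice(2), y=slice(3, None)),
]
tiles = infer_tile_ids(pieces, ["x", "y"])
```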
I started having a go at writing the second half of this - the "n-dimensional-concatenation" function which would accept a grid of xarray.Dataset/DataArray objects (assumed to be in the correct order along all dimensions), and return a single merged dataset. However, I think there's an issue with using
My plan was to call
from numpy import apply_along_axis
from xarray import concat

def concat_nd(obj_grid, concat_dims=None):
    """
    Concatenates a structured ndarray of xarray Datasets along multiple dimensions.

    Parameters
    ----------
    obj_grid : numpy array of Dataset and DataArray objects
        N-dimensional numpy object array containing xarray objects in the shape they
        are to be concatenated. Each object is expected to
        consist of variables and coordinates with matching shapes except for
        along the concatenated dimension.
    concat_dims : list of str or DataArray or pandas.Index
        Names of the dimensions to concatenate along. Each dimension in this argument
        is passed on to :py:func:`xarray.concat` along with the dataset objects.
        Should therefore be a list of valid dimension arguments to xarray.concat().

    Returns
    -------
    combined : xarray.Dataset
    """
    # Combine datasets along one dimension at a time
    # Start with last axis and finish with axis=0
    for axis in reversed(range(obj_grid.ndim)):
        obj_grid = apply_along_axis(concat, axis, arr=obj_grid, dim=concat_dims[axis])
    # Grid should now only contain one xarray object
    return obj_grid.item()
However, testing this code with
def test_concat_1d(self):
    data = create_test_data()
    split_data = [data.isel(dim1=slice(3)), data.isel(dim1=slice(3, None))]
    # Will explain why I'm forced to create ndarray like this shortly
    split_data_grid = np.empty(shape=(2,), dtype=object)
    split_data_grid[0] = split_data[0]
    split_data_grid[1] = split_data[1]
    reconstructed = concat_nd(split_data_grid, ['dim1'])
    xrt.assert_identical(data, reconstructed)
throws an error from within
I think this is because even just the idea of having a ndarray containing xarray datasets seems to cause problems - if I do it with a single item then xarray thinks I'm trying to convert the Dataset into a numpy array and throws the same error:
data = create_test_data()
data_grid = np.array(data, dtype='object')
and if I do it with multiple items then numpy will dive down and extract the variables in the dataset instead of just storing a reference to the dataset:
data = create_test_data()
split_data = [data.isel(dim1=slice(3)), data.isel(dim1=slice(3, None))]
split_data_grid = np.array(split_data, dtype='object')
print(split_data_grid)
returns
when I expected something more like
(This is why I had to create an empty array and then fill it afterwards in my example test further up.) Is this the intended behaviour of xarray? Does this mean I can't use numpy arrays of xarray objects at all for this problem? If so then what structure do you think I should use instead (list of lists etc.)? |
NumPy's handling of object arrays is unfortunately inconsistent. So maybe it isn't the best idea to use NumPy arrays for this. Python's built-in list/dict might be better choices here. Something like:
def concat_nd(datasets):
    # find the set of dimensions across which to possibly merge
    # could possibly use OrderedSet here:
    # https://github.com/pydata/xarray/blob/v0.10.8/xarray/core/utils.py#L401
    all_dims = set(dim for ds in datasets for dim in ds.dims)
    # Create a map from each dimension to a tuple giving the size of each
    # dimension on an input dataset. Not all collections of datasets have consistent
    # sizes along each dimension, but the ones we can automatically concatenate do.
    # I recommend researching how "chunks" work in dask.array:
    # http://dask.pydata.org/en/latest/array-design.html
    # http://dask.pydata.org/en/latest/array-chunks.html
    chunks = {dim: ... for dim in all_dims}
    # find the sorted, de-duplicated union of all indexes along those dimensions
    # np.unique followed by wrapping with pd.Index()
    # might work OK for the "union" function here
    combined_indexes = {dim: union([ds.indexes[dim] for ds in datasets]) for dim in all_dims}
    # create a map mapping from "tile id" to dataset
    # get_indexes() should use pandas.Index.get_indexer to lookup ds.indexes[dim]
    # in the combined index, e.g., of type Dict[Tuple[int, ...], xarray.Dataset]
    indexes_to_dataset = {get_indexes(ds, chunks, combined_indexes): ds for ds in datasets}
    # call concat() in a loop to construct the combined dataset |
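The last step, calling concat() in a loop, could look roughly like the sketch below. It assumes the tile ids are already known and form a complete grid; combine_tiles is a hypothetical name and the data is made up for illustration. Each pass concatenates along the last remaining dimension, so the tile ids lose one entry per pass until a single dataset is left.

```python
import numpy as np
import xarray as xr


def combine_tiles(tiles, concat_dims):
    """Combine a dict mapping tile-id tuples to datasets, one dimension at a time."""
    tiles = dict(tiles)
    for dim in reversed(concat_dims):
        # Group tiles that differ only in their last position
        grouped = {}
        for tile_id in sorted(tiles):
            grouped.setdefault(tile_id[:-1], []).append(tiles[tile_id])
        # One 1D concat per group; the resulting ids are one entry shorter
        tiles = {key: xr.concat(group, dim=dim) for key, group in grouped.items()}
    (combined,) = tiles.values()
    return combined


full = xr.Dataset(
    {"T": (("x", "y"), np.arange(24.0).reshape(4, 6))},
    coords={"x": np.arange(4), "y": np.arange(6)},
)
# 2 x 2 decomposition keyed by (position along x, position along y)
tiles = {
    (i, j): full.isel(x=slice(2 * i, 2 * i + 2), y=slice(3 * j, 3 * j + 3))
    for i in range(2)
    for j in range(2)
}
combined = combine_tiles(tiles, ["x", "y"])
```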
Thanks @shoyer for the description of how this should be done properly. In the meantime however, I thought I would describe how I solved the problem in my last comment. My method works but you probably wouldn't want to use it in xarray itself because it's pretty "hacky". To avoid the issue of numpy reading the
data = create_test_data()
data_grid = np.array([{'key': data}], dtype='object')
With this then creating something which will concatenate the numpy grid-like array of (dicts holding) datasets is quick:
from xarray import concat
import numpy as np

def _concat_nd(obj_grid, concat_dims=None, data_vars=None, **kwargs):
    # Combine datasets along one dimension at a time,
    # Have to start with last axis and finish with axis=0 otherwise axes will disappear before the loop reaches them
    for axis in reversed(range(obj_grid.ndim)):
        obj_grid = np.apply_along_axis(_concat_dicts, axis, arr=obj_grid,
                                       dim=concat_dims[axis], data_vars=data_vars[axis], **kwargs)
    # Grid should now only contain one dict which contains the concatenated xarray object
    return obj_grid.item()['key']

def _concat_dicts(dict_objs, dim, data_vars, **kwargs):
    objs = [dict_obj['key'] for dict_obj in dict_objs]
    return {'key': concat(objs, dim, data_vars, **kwargs)}
In case anyone is interested then this is how I've (hopefully temporarily) solved the N-D concatenation problem in the case of my data. |
I was thinking about the general solution to this problem again and wanted to clarify some things. Currently I think that any general multi-dimensional version of the
This approach would then be backwards-compatible, accommodate users whose data does not have monotonic indexes (they would just have to arrange their datasets into the correct order themselves first), while still doing the obviously correct thing in unambiguous cases. However this would mean that users wanting to do a multi-dimensional Also I'm assuming we are not going to provide functionality to handle uneven sub-lists, e.g. Edit: I've just realised that there is a lot of related discussion in #2039, #1385, & #1823. I suppose what I'm suggesting here is essentially the N-D generalisation of the approach discussed in those issues, namely an extra argument |
@TomNicholas I agree with your steps 1/2/3 for My concern with a single
Currently we always do (2) and never do (1). We definitely want an option to disable (2) for speed, and also want an option to support (1) (what you propose here). But these are distinct use cases -- we probably want to support all permutations of 1/2.
I'm not sure we need to support this yet -- it would be enough to have keyword argument for falling back to the existing behavior that only supports 1D concatenation in the order provided.
Agreed, not important unless someone really wants/needs it. |
This is fine though right? We can do all of this, because it should compartmentalise fairly easily shouldn't it? You end up with logic like:
def auto_combine(ds_sequence, infer_order_from_coords=True, check_alignment=True):
    if check_alignment:
        # Check alignment along non-concatenated dimensions (your (2))
    if infer_order_from_coords:
        # Use coordinates to determine tile_ID for each dataset in N-D (your (1))
    else:
        # Determine tile_IDs by structure of input in N-D (i.e. ordering in list-of-lists)
    # Join everything together
    return _concat_nd(tile_IDs, ds_sequence)
We don't need to, but I don't think it would be that hard (if the structure above is feasible), and I think it's a common use case. Also there's an argument for putting in special effort to generalize this function as much as possible, because it lowers the barrier to entry for xarray for new users. Though perhaps I'm just biased because it happens to be my use case... Also if we know what form the tile_IDs should take then I can write the |
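For the branch that determines tile IDs from the structure of the input, a small recursive walk over the list-of-lists is enough. The sketch below uses strings as stand-ins for datasets; tile_ids_from_nested_list is a hypothetical name and only handles plain Python lists.

```python
def tile_ids_from_nested_list(entry, position=()):
    """Yield (tile_id, obj) pairs for every leaf of a nested list."""
    if isinstance(entry, list):
        for i, item in enumerate(entry):
            # Each level of nesting adds one entry to the tile id
            yield from tile_ids_from_nested_list(item, position + (i,))
    else:
        yield position, entry


# Strings stand in for datasets arranged on a 2 x 3 grid
grid = [["ds_00", "ds_01", "ds_02"], ["ds_10", "ds_11", "ds_12"]]
tile_ids = dict(tile_ids_from_nested_list(grid))
```

The resulting dict has the same shape as the tile-id map built from coordinates, so both branches could feed the same N-D concatenation step.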
Yes, this seems totally fine to me.
Sure, no opposition from me if you want to do it! 👍 |
@shoyer see my PR trying to implement this (#2553). Inputting a list of lists into
|
Closed by #2616 |
I'm running xarray v0.12.1, released on June 5 of this year, which should include @TomNicholas's fix, merged back in December of last year. However, the original MWE still gives the unwanted result with the repeated coordinates. |
I can confirm that this issue persists in v0.12.3 as well. |
Hi @ewquon I think you need to specify |
Thanks @dcherian. As you suggested, I ended up using v0.12.3 and |
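For readers landing here later, a small illustration of multi-dimensional combining on in-memory data, assuming an xarray release in which xr.combine_nested and xr.combine_by_coords are available. The data is made up, and the snippet does not try to reproduce the exact call discussed above.

```python
import numpy as np
import xarray as xr

full = xr.Dataset(
    {"T": (("x", "y"), np.arange(24.0).reshape(4, 6))},
    coords={"x": np.arange(4), "y": np.arange(6)},
)
# Four tiles of a 2 x 2 decomposition
a = full.isel(x=slice(2), y=slice(3))
b = full.isel(x=slice(2), y=slice(3, None))
c = full.isel(x=slice(2, None), y=slice(3))
d = full.isel(x=slice(2, None), y=slice(3, None))

# Order given explicitly by the nesting: outer list along "x", inner along "y"
nested = xr.combine_nested([[a, b], [c, d]], concat_dim=["x", "y"])

# Order inferred from the coordinate values, so the input order does not matter
by_coords = xr.combine_by_coords([d, a, c, b])
```

The first form corresponds to supplying the order by hand, the second to inferring it from coordinates, which are the two cases distinguished earlier in this thread.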
Code Sample
Problem description
Currently xr.open_mfdataset will detect a single common dimension and concatenate Datasets along that dimension. However, a common use case is a set of NetCDF files which have two or more common dimensions that need to be concatenated along simultaneously (for example, collecting the output of any large-scale simulation which parallelizes in more than one dimension simultaneously). For the behaviour of xr.open_mfdataset to be n-dimensional, it should automatically recognise and concatenate along all common dimensions.
Expected Output
Current output of xr.open_mfdataset()