Add DataArray.pad, Dataset.pad, Variable.pad #3596
Conversation
…to dask.array.pad
…instead of list of tuples
Hello @mark-boer! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found: There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻 Comment last updated at 2020-03-18 21:58:02 UTC
Thanks @mark-boer . This is a great start. I see this is your first PR. Welcome to xarray!
I've looked through this quickly but haven't tested the behaviour yet. So here are some quick comments
The lint errors can be fixed by running black. See https://xarray.pydata.org/en/stable/contributing.html#code-formatting
I moved the custom dask pad method into dask_array_compat to ensure backwards compatibility with dask versions that do not have dask.array.pad yet. We could choose to drop this support if we wanted to.
Yes, our minimum dask version is 1.2 and pad appeared in dask 0.18.1, so we can safely assume that dask.array.pad exists.
I'm still in doubt about the function signature: numpy and dask use optional kwargs, but that kind of interferes with the **pad_width_kwargs. I chose the signature that I thought looked least awkward.
This signature looks fine to me.
The default behaviour of pad with mode=constant pads with NaN's, converting the array to float in the process. This goes against the default behaviour of numpy and dask.
Let's have the user explicitly specify constant_values in this case? i.e. if mode == "constant" and constant_values is None, then raise an error.
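For illustration, a minimal sketch of the guard being proposed here (hypothetical code, not the final implementation — the merged PR ends up promoting integer dtypes to float instead of raising):

def _check_pad_args(mode, constant_values):
    # Hypothetical helper sketching the suggestion above: reject an
    # unspecified fill value rather than casting silently.
    if mode == "constant" and constant_values is None:
        raise ValueError(
            "constant_values must be specified when mode='constant'"
        )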
How should the coordinates of a DataArray be padded? I chose default padding except for modes "edge", "reflect", "symmetric", "wrap".
Hmm... it may make sense to linearly extrapolate coordinate values. This will need some thinking.
How should we handle inconsistencies between numpy.pad and dask.array.pad? It turns out there are a couple.
I think in general we want to replicate numpy if it's not too hard. Though upstream fixes would be best.
Dataset.pad is coming up.
dtype, fill_value = dtypes.maybe_promote(self.dtype)
# change default behaviour of pad with mode constant
if mode == "constant" and constant_values is None:
    dtype, constant_values = dtypes.maybe_promote(self.dtype)
raise an error here instead of automatically casting?
We need some kind of default behaviour; there is currently no way to add the fill_value for the coords of a DataArray. This behaviour was already implemented by the variable.pad_with_fill_value method as it existed in xarray, so I thought I would leave it as is.
We could choose to pad with 0s as numpy does, but I don't think I would prefer this.
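For reference, a quick illustration of what the automatic promotion in the diff above does (dtypes.maybe_promote is an internal xarray helper, so treat this as a sketch rather than public API):

import numpy as np
from xarray.core import dtypes

# maybe_promote returns a dtype that can represent missing values,
# together with the matching fill value: int64 -> (float64, nan).
promoted, fill_value = dtypes.maybe_promote(np.dtype("int64"))
print(promoted, fill_value)  # float64 nan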
Let's have constant_values=np.nan in the signature then, and mention that ints will be cast to floats by default?
Also, this seems counter-intuitive:
>>> da.pad(x=(1,1))
<xarray.DataArray (x: 4, y: 4)>
array([[nan, nan, nan, nan],
[ 0., 1., 2., 3.],
[10., 11., 12., 13.],
[nan, nan, nan, nan]])
Coordinates:
* x (x) float64 nan 0.0 1.0 nan
* y (y) int64 10 20 30 40
z (x) float64 nan 100.0 200.0 nan
>>> da.pad(x=(1,1), constant_values=np.nan)
<xarray.DataArray (x: 4, y: 4)>
array([[-9223372036854775808, -9223372036854775808, -9223372036854775808,
-9223372036854775808],
[ 0, 1, 2,
3],
[ 10, 11, 12,
13],
[-9223372036854775808, -9223372036854775808, -9223372036854775808,
-9223372036854775808]])
Coordinates:
* x (x) float64 nan 0.0 1.0 nan
* y (y) int64 10 20 30 40
z (x) float64 nan 100.0 200.0 nan
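A note on the -9223372036854775808 values above (my addition for context): numpy cannot represent NaN in an integer array, so casting NaN to int64 silently yields INT64_MIN:

import numpy as np

# Casting NaN to an integer dtype is undefined behaviour; on most
# platforms numpy produces the smallest representable int64.
print(np.array([np.nan]).astype(np.int64))  # [-9223372036854775808]
print(np.iinfo(np.int64).min)               # -9223372036854775808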
FYI I hit similar issues, though less complex in #2470
I should probably add the check for dtypes.NA as well:
if mode == "constant" and (constant_values is None or constant_values is dtypes.NA)
Hi, @mark-boer. Thanks for your contribution :)
…parametrize in test_variable.test_pad
It seems like we have some value mismatches on
The reason that
…s, and add an additional check if the shape of output
Hi everyone, happy new year 🎉 I feel like this PR is slowly getting into a decent shape. But I could use your input on a couple of subjects: How should we handle the default padding? We could choose to pad with 0s as numpy does, or we could force the user to choose a constant_value, but I don't think I would prefer this. How should the coordinates of a DataArray/Dataset be padded? Personally I think more often than not it makes sense, but padding with NaN's should also work fine. dask_array_compat: I used a couple of
Hi @dcherian and @fujiisoup, apart from the issues I raised in my last comment, I think this PR is close to done. Thx in advance
Thanks @mark-boer. This is looking great and seems quite close.
""" | ||
pad_width = either_dict_or_kwargs(pad_width, pad_width_kwargs, "pad") | ||
|
||
if mode in ("edge", "reflect", "symmetric", "wrap"): |
Perhaps we should expose this as a user-facing option coords_mode? This would control padding for both dimensional and non-dimensional coordinate variables.
da = xr.DataArray([[0,1,2,3], [0,1,2,3]],
dims=["x", "y"],
coords={"x": [0,1], "y": [10, 20, 30, 40]})
padded = da.pad(x=(1,1))
padded
<xarray.DataArray (x: 4, y: 4)>
array([[nan, nan, nan, nan],
[ 0., 1., 2., 3.],
[ 0., 1., 2., 3.],
[nan, nan, nan, nan]])
Coordinates:
* x (x) float64 nan 0.0 1.0 nan
* y (y) int64 10 20 30 40
Filling in the NaNs in x using linear extrapolation is surprisingly hard:
padded["x"] = padded.x.drop_vars("x").interpolate_na("x", fill_value="extrapolate")
This makes me think we should support coords_mode="extrapolate"? It seems the most sensible default option for dimensional or index coordinates such as x or y in that example.
The logic could be reused to support mode="extrapolate" too.
EDIT: I guess extrapolate could be implemented using linear_ramp; we just need to figure out what the end values are.
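A rough sketch of that idea, assuming an evenly spaced coordinate (extrapolate_coord is a hypothetical helper, not part of the PR):

import numpy as np

def extrapolate_coord(coord, before, after):
    # Extend an evenly spaced 1-D coordinate by linear extrapolation,
    # built on numpy's linear_ramp mode with computed end values.
    step = coord[1] - coord[0]  # assumes even spacing
    return np.pad(
        coord,
        (before, after),
        mode="linear_ramp",
        end_values=(coord[0] - before * step, coord[-1] + after * step),
    )

print(extrapolate_coord(np.array([0.0, 1.0]), 1, 1))  # [-1.  0.  1.  2.]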
If the points are roughly linear one could use reflect with reflect_type='odd'. That will more or less do an extrapolation.
I could see something like coords_mode being useful, but I'm not sure how much use this would get.
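For example (my illustration): on perfectly linear data, odd reflection coincides with linear extrapolation:

import numpy as np

x = np.array([0.0, 1.0, 2.0, 3.0])
# Odd reflection mirrors values through the edge point, so linearly
# increasing data keeps increasing past both edges.
print(np.pad(x, (2, 2), mode="reflect", reflect_type="odd"))
# [-2. -1.  0.  1.  2.  3.  4.  5.]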
xarray/tests/test_variable.py
"mode",
[
    "mean",
    pytest.param("median", marks=pytest.mark.xfail),
Same as above.
1. Add warning category 2. Use variable for pad arguments when testing 3. Add example.
I pushed some minor changes. I think this is ready to go in. The big outstanding issue is what to do about dimension coordinates or indexes. Currently this PR treats all variables in
I am thinking that we want to use linear extrapolation for IndexVariables by default and apply the same padding mode to all other variables, the reasoning being that IndexVariables with NaNs are hard to deal with and it's hard to fill them in:
z (x) float64 nan 100.0 200.0 nan
>>> da.pad(x=1, constant_values=np.nan)
<xarray.DataArray (x: 4, y: 4)>
array([[-9223372036854775808, -9223372036854775808, -9223372036854775808,
Is this intended?
Personally I don't like it, but it is the way that most xarray functions handle this. Although I should add a check for xarray.dtypes.NA. We have discussed adding xarray.dtypes.NA as the default value for the constant_values keyword, but unfortunately this kind of interferes with the fact that you cannot specify constant_values when you set the mode to anything other than "constant". So if you have any ideas, I'm all ears.
>>> da = xr.DataArray(np.arange(9).reshape(3,3), dims=("x", "y"))
>>> da.shift(x=1, fill_value=np.nan)
array([[-9223372036854775808, -9223372036854775808, -9223372036854775808],
[ 0, 1, 2],
[ 3, 4, 5]])
Dimensions without coordinates: x, y
>>> da.rolling(x=3).construct("new_axis", stride=3, fill_value=np.nan)
<xarray.DataArray (x: 1, y: 3, new_axis: 3)>
array([[[-9223372036854775808, -9223372036854775808, 0],
[-9223372036854775808, -9223372036854775808, 1],
[-9223372036854775808, -9223372036854775808, 2]]])
Dimensions without coordinates: x, y, new_axis
Yes. I think it currently works well now, but we should probably have a different example: passing np.nan is a mistake, and instead users can rely on the default and it'll coerce to float with dtypes.NA implied.
Is that right?
Yes. That looks to me like a reasonable default.
This looks excellent @mark-boer, thank you! I will try and have a proper look through today (but don't wait for me).
I do agree it can be confusing, but it is not unique in xarray: Dataset.shift only shifts data_vars. Personally I think that extrapolating data coordinates without specifically setting a keyword or flag could also be confusing; I occasionally have data in my coords. Both defaults could be really useful, I'm still in a bit of doubt. Are there any other people that might want to weigh in?
I agree, I wouldn't have expected coords to be included given existing behavior. People can switch coords <> data_vars as needed, so there's an escape hatch. Edit: But it's more awkward for indexes than non-index coords. The index becomes less useful with non-unique values, and generally indexes don't have nulls. I'm not sure what the other options would be: to some extent it's the intersection of
Hmm, I don't really see a solution. What do you suggest?
:) I think we need to extrapolate indexes by default. It seems like the most sensible option.
Sorry, obviously that is a solution ;-). But I do have some concerns: in some instances extrapolating all coords can lead to some unwanted behaviour. Would you suggest we only interpolate the indexes? How would we handle unsorted indexes? How would we extrapolate all the different kinds of indexes, like a MultiIndex or CategoricalIndex?
I agree; I can't see easy solutions to these. If there are evenly spaced indexes (e.g. dates, grids), then it's easy to know what to do. But there are plenty of times it's difficult, if not intractable. One option to merge something useful without resolving these questions is to return Variables only and label this method experimental. I think this is immediately useful for the cases where these difficult questions don't need to be answered. Of course if there are good answers to the questions, then even better!
Extremely good points @mark-boer. I propose we merge and open an issue to decide what to do with IndexVariables.
Agree! We could add an "Experimental" label and then worry less about future changes.
I have a few comments, other than that this looks good to me
Merging. I've added an experimental warning to the docstrings, and we can discuss the IndexVariable situation here: #3868. Thanks @mark-boer, this is a major contribution for your first PR!
Thanks @mark-boer! Must be one of the largest first contributions... +1 re merge + experimental warning; maybe we should do this more often. Cheers @dcherian
Wow, didn't look for a week. Very happy 🎉! Thank you @dcherian and @max-sixty for all your input.
I was inspired by the hacktoberfest last year. Took a little bit over a month though 😛
* upstream/master:
  add spacing in the versions section of the issue report (pydata#3876)
  map_blocks: allow user function to add new unindexed dimension. (pydata#3817)
  Delete associated indexes when deleting coordinate variables. (pydata#3840)
  remove macos build while waiting for libwebp fix (pydata#3875)
  Fix html repr on non-str keys (pydata#3870)
  Allow ellipsis to be used in stack (pydata#3826)
  Improve where docstring (pydata#3836)
  Add DataArray.pad, Dataset.pad, Variable.pad (pydata#3596)
  Fix some warnings (pydata#3864)
  Feature/weighted (pydata#2922)
  Fix recombination in groupby when changing size along the grouped dimension (pydata#3807)
  Blacken the doctest code in docstrings (pydata#3857)
  Fix multi-index with categorical values. (pydata#3860)
  Fix interp bug when indexer shares coordinates with array (pydata#3758)
  Fix alignment with join="override" when some dims are unindexed (pydata#3839)
Hello all,
This is my first PR to a pydata project. This pull request is still very much a work in progress and I could really use your input on a couple of things.
- Passes black . && mypy . && flake8
- Fully documented, including whats-new.rst for all changes and api.rst for new API
- I'm still in doubt about the function signature: numpy and dask use optional kwargs, but that kind of interferes with the **pad_width_kwargs. I chose the signature that I thought looked least awkward.
- The default behaviour of pad with mode=constant pads with NaN's, converting the array to float in the process. This goes against the default behaviour of numpy and dask.
Dataset.pad is coming up.
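To make the two calling conventions concrete (my example, based on the pad signature as merged):

import numpy as np
import xarray as xr

da = xr.DataArray(np.arange(4), dims="x")

# Pad widths can be passed as a dict...
padded = da.pad(pad_width={"x": (1, 2)}, mode="constant")
# ...or as keyword arguments via **pad_width_kwargs.
padded = da.pad(x=(1, 2), mode="constant")
print(padded)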