Implement interp for interpolating between chunks of data (dask) #4155
On my computer it passes pytest:
|
Thanks for this contribution @pums974! We appreciate your patience in awaiting a review of your PR. |
No problem, we are all very busy. But thanks for your message. |
Hi @pums974 Thanks for sending the PR. A few comments; da.interp(y=[0, -1, 2]) I'm feeling that the basic algorithm, such as |
Also in my local environment, it gives The full stack trace is
|
Thanks. As for implementing this in dask, you may be right; it probably belongs there. As for unsorted destinations, that's something I didn't think about. |
OK. In missing.py, we can call this function. |
Hmm, OK, but I don't see how it would work if all points are between chunks (see my second example). |
Maybe we can support sequential interpolation only at this moment. res = data.interp(x=np.linspace(0, 1), y=0.5) can be interpreted as res = data.interp(x=np.linspace(0, 1)).interp(y=0.5) which might not be too difficult. |
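For linear interpolation on a rectilinear grid, the equivalence suggested in the comment above can be checked numerically. The following is a minimal sketch in plain NumPy, not xarray's implementation; the helper names `interp_along_axis0`, `sequential`, and `bilinear`, as well as the sample data, are assumptions made for illustration.

```python
import numpy as np

# Rectilinear grid and sample data (illustrative values only).
x = np.array([0.0, 0.5, 1.0])
y = np.array([0.0, 1.0, 2.0, 3.0])
data = np.sin(x)[:, None] + np.cos(y)[None, :]  # shape (len(x), len(y))

new_x = np.linspace(0, 1, 5)
new_y = 0.5


def interp_along_axis0(values, src, dst):
    """Linearly interpolate each column of `values` from `src` to `dst`."""
    return np.stack([np.interp(dst, src, col) for col in values.T], axis=1)


def sequential(data, x, y, new_x, new_y):
    """Two 1-D passes: first along x, then along y."""
    along_x = interp_along_axis0(data, x, new_x)  # shape (len(new_x), len(y))
    return np.array([np.interp(new_y, y, row) for row in along_x])


def bilinear(data, x, y, new_x, new_y):
    """Joint bilinear interpolation at the points (new_x[k], new_y)."""
    out = np.empty(len(new_x))
    j = np.clip(np.searchsorted(y, new_y, side="right") - 1, 0, len(y) - 2)
    ty = (new_y - y[j]) / (y[j + 1] - y[j])
    for k, xv in enumerate(new_x):
        i = np.clip(np.searchsorted(x, xv, side="right") - 1, 0, len(x) - 2)
        tx = (xv - x[i]) / (x[i + 1] - x[i])
        out[k] = (
            data[i, j] * (1 - tx) * (1 - ty)
            + data[i + 1, j] * tx * (1 - ty)
            + data[i, j + 1] * (1 - tx) * ty
            + data[i + 1, j + 1] * tx * ty
        )
    return out


res_joint = bilinear(data, x, y, new_x, new_y)
res_seq = sequential(data, x, y, new_x, new_y)
print(np.allclose(res_joint, res_seq))
```

The check only covers the linear case; other interpolation methods are not guaranteed to decompose this way.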
ok, but what about res = data.interp(y=0.5) |
I mean, in this case you have to interpolate in another direction. You cannot consider having a 1d function. |
This PR looks good to me. |
Thanks @pums974 :) |
You're welcome :) |
Hi |
@cyhsu |
@fujiisoup Thanks for letting me know. But I am still unable to do this even though I have updated xarray via "conda update xarray". |
@cyhsu Yes, because it is not yet released. |
@fujiisoup Thanks for the response. I have not yet updated my xarray package to this beta version, so I hope you can answer an additional question for me. For interpolation, which way is faster: (a) chunk the dataset and then interpolate, or (b) chunk the list of interpolation points and then interpolate? a.
b.
x = xr.DataArray(data = da.from_array(np.linspace(0,1), chunks=2), dims='x') |
@cyhsu I can answer this question. For best performance, you should chunk the input array along the non-interpolated dimensions and chunk the destination, i.e.:
|
@pums974 Then what happens if the input array is chunked along the interpolated dimension? |
If the input array is chunked in the interpolated dimension, the chunks will be merged during the interpolation. This may induce a large memory cost at some point, but I do not know how to avoid it... |
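The idea behind this advice can be sketched in one dimension: the destination is split into chunks, and each chunk is interpolated against the whole input along the interpolated dimension. This is a plain NumPy illustration under that assumption, not the code from this PR; `interp_by_destination_chunks` is a hypothetical helper name.

```python
import numpy as np


def interp_by_destination_chunks(src_x, src_values, dst_x, chunk_size):
    """Interpolate one destination chunk at a time.

    Every chunk sees the full source arrays, which is why chunks of the
    source along the interpolated dimension have to be merged first.
    """
    pieces = []
    for start in range(0, len(dst_x), chunk_size):
        dst_chunk = dst_x[start:start + chunk_size]
        pieces.append(np.interp(dst_chunk, src_x, src_values))
    return np.concatenate(pieces)


src_x = np.linspace(0, 1, 50)
src_values = src_x ** 2
dst_x = np.linspace(0, 1, 23)

chunked = interp_by_destination_chunks(src_x, src_values, dst_x, chunk_size=5)
whole = np.interp(dst_x, src_x, src_values)
print(np.allclose(chunked, whole))
```

Each destination chunk is independent of the others, so the loop body is the part that could run as a separate task.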
Does this answer your question? |
Gotcha! Yes, it does. If I have many points in lat, lon, depth, and time, I had better chunk my input arrays at this stage to speed things up. I asked because I thought interpolating a chunked input array should be faster than interpolating an unchunked one, but in my test case it is not. Please see the attached. The results I show here are that the parallel version is way slower than the normal case. |
In your case, each task (20 000²) will have the entire input (100²), and interpolate a few points (5²). Maybe the overhead comes from duplicating the input array 20 000² times; maybe it comes from the fact that you are doing 20 000² small interpolations instead of one big interpolation. |
I forgot to take into account that the interpolations are orthogonal. So plenty of room for overhead... |
And I forgot to take into account that your interpolation only needs 48² points of the input array, so the input array will be reduced at the start of the process (you can replace every 100 by 48 in my previous answers). |
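Taking the figures quoted in this exchange at face value (20 000² tasks, 5² points per task, and an input of 100² points, later revised to 48²), the scale of the overhead can be estimated with a few lines of arithmetic. This is a rough sketch that ignores the orthogonality remark above; the function and variable names are illustrative.

```python
# Figures quoted in the discussion above; everything else is derived arithmetic.
n_tasks = 20_000 ** 2      # number of tasks
points_per_task = 5 ** 2   # destination points interpolated by each task


def overhead(input_side):
    """Return (input elements per output point, total input elements over all tasks)."""
    input_per_task = input_side ** 2
    ratio = input_per_task / points_per_task
    total_input = n_tasks * input_per_task
    return ratio, total_input


for side in (100, 48):  # first estimate, then the revised input size
    ratio, total_input = overhead(side)
    print(f"side={side}: {ratio:.2f} input elements per output point, "
          f"{total_input:.3e} input elements in total")
```

Under these assumptions each task receives far more input than it produces output, which is consistent with the suspicion that per-task overhead dominates.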
@max-sixty Is there a timeline on when we can expect this feature in a stable release? Is it scheduled for the next minor release and to be made available on |
In a project of mine I need to interpolate a dask-based xarray object between chunks of data.
When using the current official interp function (xarray v0.15.1), the code fails with
NotImplementedError: Chunking along the dimension to be interpolated (0) is not yet supported.
but succeeds with this version.
I also want to point out that my version does not work with "advanced interpolation" (as shown in the xarray documentation).
Also, my version cannot be used to make interpolate_na work with chunked data.
isort -rc . && black . && mypy . && flake8
whats-new.rst for all changes and api.rst for new API