Optimize rolling dask #311
Conversation
Pull Request Test Coverage Report for Build 1318
💛 - Coveralls
So, it finally wasn't that hard. I added a function that takes care of rolling operations other than sum and mean. It acts the same way as the conventional rolling: it creates a new dimension and copies the data so that the rolling op can be applied along that new dim. But it still does this chunk-wise and lazily, thus avoiding a MemoryError on huge datasets.
Also, this can accept any function, not only the numpy ones.
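The window-construction idea can be sketched in plain NumPy. This is a simplified, non-dask, non-lazy analogue of what is described above; `rolling_apply` is an illustrative name, not the PR's actual code:

```python
import numpy as np

def rolling_apply(a, win, func):
    # Materialize an (n - win + 1, win) view of `a` along a new window
    # axis, then reduce with an arbitrary function along that axis.
    windows = np.lib.stride_tricks.sliding_window_view(a, win)
    return func(windows, axis=-1)
```

Because the windows are a strided view, no data is actually copied here; the dask version would apply the same reduction per chunk, sharing overlap between neighbouring chunks.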
Questions: Can this be somehow ported to xarray?
Yes and no, as far as I understand xarray's code. The whole "rolling" module is completely free of any explicit call to dask, and I don't really see any elegant way to add this code. Also, there is some time spent in overhead that I just skip. What exactly does this overhead do, and is it useful? I do not know.
This might indirectly be a solution to #184, where I could never completely understand the snowballing memory problems. I will try to see if this helps once merged to master.
I think we could minimally create an issue on the xarray github describing the problems we are experiencing for long time-series and present @aulemahal's solution / workaround as being more efficient. They might decide to find a way to integrate it themselves?
I've read about similar memory issues happening when bottleneck is not installed. Could you confirm this is not the issue here?
See pydata/xarray#3332; it might be a place where you could describe what you've done.
Messing around with various reduction options, I see that using 'std' results in very small diffs between xr and utils.rolling(); not sure why...
It might be prudent to include a chunked version of tests for functions that rely on rolling (or, indirectly, runlength) in the indicator test suite. E.g. tests on .nc files in test_precip.py, test_temperature.py, etc. currently do not chunk the test datasets before calculating indices (example from test_precip).
Maybe this is overkill, as we already test rolling generally in test_utils.py... opinions?
I think you're right, but I would like this to be parameterized. We don't want to duplicate all tests; we need to use pytest fixtures to parameterize test input so that each test runs with both chunked and un-chunked datasets. This goes back to #52... I suggest we raise the priority on this.
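A hedged sketch of such a parameterized fixture (the fixture name, chunk sizes, and test body are illustrative placeholders, not the suite's actual code):

```python
import pytest

# Run every dependent test twice: once un-chunked, once chunked.
# The chunking scheme {"time": 10} is an illustrative placeholder.
@pytest.fixture(params=[None, {"time": 10}], ids=["unchunked", "chunked"])
def chunks(request):
    return request.param

def test_rolling_indicator(chunks):
    # In the real suite, the test dataset would be opened and chunked
    # here (e.g. ds = ds.chunk(chunks) when chunks is not None), then
    # the indicator computed and compared against the un-chunked result.
    assert chunks is None or isinstance(chunks, dict)
```

With this pattern each indicator test gets both variants for free, instead of duplicating test functions.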
@tlogan2000 I wasn't able to replicate your problem... are you sure? @huard I confirm bottleneck is installed in my environment. Also, AFAIK, bottleneck is not used by xarray when rolling on a dask array. I'll modify the tests with parametrization in mind.
So, I realize (thanks @tlogan2000) that my solution was float-dependent, which isn't good. Fixed it, but this makes me wonder if other edge cases are not covered in my code, so I decided to make the function private. Like: if ever xarray fixes the problem, or we find a more convenient solution, I don't want this to be a breaking change.

```python
r = da.rolling(dim=win)
a = r.construct('roll').reduce(np.func, dim='roll', allow_lazy=True)
a = a.where(r._counts() >= win)
```

where the last line is to take care of NaN values and edges. These changes were really useful for my own script generating indices for the Ouranos website, but I would understand if it isn't appropriate for xclim, as it's more an xarray problem.
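The `_counts()` masking above can be mimicked in plain NumPy. This is a hedged sketch of the idea only (not xclim's or xarray's code; `rolling_mean_strict` is a hypothetical name):

```python
import numpy as np

def rolling_mean_strict(a, win):
    # Reduce over a constructed window axis, then set the result to NaN
    # wherever a window holds fewer than `win` valid (non-NaN) values,
    # which also masks the incomplete windows at the edges.
    w = np.lib.stride_tricks.sliding_window_view(a, win)
    counts = np.sum(~np.isnan(w), axis=-1)
    out = np.nanmean(w, axis=-1)
    return np.where(counts >= win, out, np.nan)
```

This matches xarray's strict behaviour, where any window touching a NaN yields NaN.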
Oh! And I added tests following recommendations from #52. |
I have done quite a bit of testing with this now and it seems to work quite well for dealing with the memory issues we were having.
@aulemahal Did some preliminary benchmarking comparing this to xr.rolling() and xr.rolling(allow_lazy=True), and it seems to be more efficient. However, I think it might be worthwhile to do a more thorough benchmarking of all xclim functions requiring .rolling() in order to get a fuller picture of the benefits in terms of speed / memory use. If the overall benefit is marginal compared to xr.rolling(allow_lazy=True), then we will have to consider whether it is worthwhile to maintain a custom utility long term.
I see the py3.5 build is failing, but we will be dropping 3.5 shortly anyhow. Black and flake8 failed as well, but should be an easy fix.
So, I fixed that black thing.
I suggest we merge this for now.
Pull Request Checklist:
After merging to master and before closing the branch:

Create a new util function `rolling` to optimize the rolling sums and means over dask arrays. Sum and mean are implemented, as they both use the same `numpy.cumsum` trick. Other rolling operations (min, max, std) could be implemented later on.

The function does the computation lazily, which xarray's `rolling()` didn't. It defaults to xarray's version if the data is not in a dask array, as it uses a special dask function: `dask.Array.map_overlap()`. This function is faster than xarray's even for datasets with only one chunk along the rolled dimension. And it solves MemoryErrors when using huge datasets.
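A minimal sketch of how `map_overlap` enables a chunk-wise rolling sum: each chunk receives `win - 1` ghost points from its neighbour, so a window never needs the whole array in memory. This is illustrative only, not the PR's implementation, and edges here are zero-padded rather than NaN-masked:

```python
import numpy as np
import dask.array as dsk

def rolling_sum_dask(x, win):
    # Per-chunk moving sum via the cumsum trick, shape-preserving so
    # that map_overlap can trim the ghost zones afterwards.
    def block_sum(a):
        c = np.cumsum(np.insert(a, 0, 0.0))
        out = c[win:] - c[:-win]
        # Pad the front so the output has the same length as the input.
        return np.concatenate([np.full(win - 1, np.nan), out])

    # depth=win-1 shares just enough points between neighbouring chunks;
    # boundary=0 zero-pads the very edges of the array (a simplification).
    return x.map_overlap(block_sum, depth=win - 1, boundary=0, dtype=x.dtype)
```

The result stays lazy until `.compute()` is called, which is what avoids the MemoryError on huge datasets.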
I did some Jedi tricks to get the same behaviour as xarray's version when NaNs are encountered. Without the second `np.cumsum`, we can reduce the computation time by 30-50%, but NaNs will be treated as zeros.

Does this PR introduce a breaking change?
No
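A minimal NumPy sketch of the `cumsum` trick, including the second cumsum that counts valid points so NaN-containing windows can be masked rather than treated as zeros (illustrative only, not the PR's actual code):

```python
import numpy as np

def moving_sum(a, win):
    # First cumsum: rolling sum in O(n), with NaNs temporarily zeroed.
    c = np.cumsum(np.insert(np.nan_to_num(a), 0, 0.0))
    s = c[win:] - c[:-win]
    # Second cumsum: count valid (non-NaN) points per window, so that
    # incomplete windows can be masked. Skipping this pass saves time,
    # at the cost of silently treating NaNs as zeros.
    k = np.cumsum(np.insert((~np.isnan(a)).astype(int), 0, 0))
    n = k[win:] - k[:-win]
    return np.where(n == win, s, np.nan)
```

Dropping the second cumsum (and the final `where`) is the 30-50% speed-up mentioned above.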
Other information:
I called the function `xc.utils.rolling`, but in comparison to `xr.DataArray.rolling`, it actually does the sum or mean. So: same name, different behaviour.