
[WIP] Fix problem with wrong chunksizes when using rolling_window on dask.array #2532

Closed

Conversation

@cchwala (Contributor) commented Oct 31, 2018

Short summary

The two rolling-window functions for dask.array, `rolling_window` and `dask_rolling_wrapper`, will be fixed to preserve dask.array chunksizes.

Long summary

The specific initial problem with chunksizes and interpolate_na() in #2514 is caused by the padding done in `rolling_window`:

```python
if pad_size > 0:
    if pad_size < depth[axis]:
        # overlapping requires each chunk larger than depth. If pad_size is
        # smaller than the depth, we enlarge this and truncate it later.
        drop_size = depth[axis] - pad_size
        pad_size = depth[axis]
    shape = list(a.shape)
    shape[axis] = pad_size
    chunks = list(a.chunks)
    chunks[axis] = (pad_size, )
    fill_array = da.full(shape, fill_value, dtype=a.dtype, chunks=chunks)
    a = da.concatenate([fill_array, a], axis=axis)
```

which prepends a small fill array, carrying its own small chunk, to the initial array.
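
To illustrate, here is a minimal sketch (not code from the PR, with made-up sizes) of how concatenating a small fill array onto a regularly chunked array produces an extra small chunk:

```python
import dask.array as da

a = da.ones(100, chunks=25)             # chunks: ((25, 25, 25, 25),)
fill = da.full((5,), 0.0, chunks=(5,))  # small pad array, one 5-element chunk

padded = da.concatenate([fill, a])
print(padded.chunks)  # ((5, 25, 25, 25, 25),) -- an extra small chunk appears
```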

There is another related problem, where DataArray.rolling() changes the size and distribution of dask.array chunks. It stems from this code:

```python
def dask_rolling_wrapper(moving_func, a, window, min_count=None, axis=-1):
```

For some (historical) reason there are these two rolling-window functions for dask. Both need to be fixed to preserve the chunksizes of a dask.array in all cases.
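
To make the symptom concrete, here is a hypothetical reproducer in the spirit of the linked issues; the exact chunk output depends on the xarray/dask versions current at the time of this PR:

```python
import numpy as np
import xarray as xr

arr = xr.DataArray(np.arange(100.0), dims="x").chunk({"x": 25})
print(arr.data.chunks)     # ((25, 25, 25, 25),)

rolled = arr.rolling(x=5, min_periods=1).mean()
print(rolled.data.chunks)  # reported to come back with altered chunk sizes
```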

@pep8speaks commented:

Hello @cchwala! Thanks for submitting the PR.

@fujiisoup (Member) left a comment:

Hi, @cchwala
Thanks for sending a PR.

Some tests are failing, though.

I am not very sure how rechunking the whole array affects the performance. Do you have any idea?

```python
rechunk_chunks = list(a.chunks)
rechunk_chunks[axis] = rechunk_chunks[axis] + (pad_size,)
a = da.concatenate([fill_array, a], axis=axis).rechunk(
    {axis: rechunk_chunks[axis]})
```

@fujiisoup (Member) commented on this code:

I think this rechunks the whole of `a`. Doesn't that affect the performance?

@cchwala (Contributor, Author) replied:

That was also my concern, but I have not had time to check yet. I will do so to answer this. I did not find a way to "merge" two chunks, which would probably be the best option with regard to performance.
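
For illustration only (my own sketch, not code from this PR): one conceivable way to merge just the pad chunk into its neighbour is to rechunk with a specification that differs from the input chunks only at the boundary, assuming dask then only needs extra tasks where chunk boundaries actually change:

```python
import dask.array as da

a = da.ones(100, chunks=25)
pad = da.zeros(5, chunks=5)

padded = da.concatenate([pad, a])  # chunks: ((5, 25, 25, 25, 25),)

# Fold the pad into the first data chunk; all other chunks stay untouched.
merged = (pad.chunks[0][0] + a.chunks[0][0],) + a.chunks[0][1:]
padded = padded.rechunk({0: merged})
print(padded.chunks)               # ((30, 25, 25, 25),)
```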

@cchwala (Contributor, Author) commented Nov 1, 2018

Yes, tests are still failing. The PR is WIP. I just wanted to open it now so that the discussion happens here instead of in the issues.

I will work on fixing the code to pass all current tests. I will also check how the rechunking affects performance.
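
A rough way to gauge that overhead (again my own sketch, not from the PR) is to compare the task-graph sizes with and without the rechunk step:

```python
import dask.array as da

a = da.ones(1_000_000, chunks=10_000)  # 100 chunks
pad = da.zeros(50, chunks=50)

plain = da.concatenate([pad, a])
merged = (pad.chunks[0][0] + a.chunks[0][0],) + a.chunks[0][1:]
rechunked = plain.rechunk({0: merged})

print(len(plain.__dask_graph__()))      # number of tasks without the rechunk
print(len(rechunked.__dask_graph__()))  # number of tasks with the rechunk step
```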


Successfully merging this pull request may close these issues:

- DataArray.rolling() does not preserve chunksizes in some cases
- interpolate_na with limit argument changes size of chunks