Avoid in-memory broadcasting when converting to_dask_dataframe #7472
xarray/core/dataset.py
dask_array = var.set_dims(ordered_dims).chunk(self.chunks).data
series = dd.from_array(dask_array.reshape(-1), columns=[name])
dask_array_raveled = ravel_chunks(dask_array)
Unfortunately we can't do this, at least not by default.
We could ask dask to add this behaviour as an opt-in kwarg for dask.dataframe.from_array
How come?
If we go back to using .reshape(-1) or .ravel() we will continue getting this warning:
PerformanceWarning: Reshaping is producing a large chunk. To accept the large
chunk and silence this warning, set the option
with dask.config.set(**{'array.slicing.split_large_chunks': False}):
array.reshape(shape)
To avoid creating the large chunks, set the option
with dask.config.set(**{'array.slicing.split_large_chunks': True}):
array.reshape(shape)
Explictly passing ``limit`` to ``reshape`` will also silence this warning
array.reshape(shape, limit='128 MiB')
exec_fun(compile(ast_code, filename, 'exec'), ns_globals, ns_locals)
reshape/ravel have an implied order. With this change, the ordering of rows in the output dataframe depends on the chunking of the input array, which would be confusing as default behaviour.
I think the warning is fine. Users can override it with the dask context manager, as suggested in the warning.
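To make the ordering concern concrete, here is a small NumPy-only sketch. The helper `ravel_by_chunks` is hypothetical: it is not the `ravel_chunks` from this PR, only an assumed stand-in that flattens each block in turn, to show how the row order could come to depend on the chunk shape.

```python
import numpy as np


def ravel_by_chunks(arr, chunk_shape):
    """Hypothetical helper: flatten a 2-D array one block at a time.

    This is an assumed illustration, not the PR's implementation.
    """
    rows, cols = chunk_shape
    pieces = []
    for i in range(0, arr.shape[0], rows):
        for j in range(0, arr.shape[1], cols):
            # Each block is flattened in C order, then appended.
            pieces.append(arr[i:i + rows, j:j + cols].ravel())
    return np.concatenate(pieces)


arr = np.arange(16).reshape(4, 4)

c_order = arr.reshape(-1)              # the order reshape/ravel imply
by_2x2 = ravel_by_chunks(arr, (2, 2))  # four 2x2 blocks
by_4x4 = ravel_by_chunks(arr, (4, 4))  # a single block

print(c_order.tolist())
print(by_2x2.tolist())
print(by_4x4.tolist())
```

With a single block the result matches `reshape(-1)`, but with 2x2 blocks the same values come out in a different order, so the same data chunked differently would yield differently ordered rows.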
Yup, it's a big improvement. If you're dying to add it someplace, polyfit would be a good candidate (and very impactful PR).
Co-authored-by: Deepak Cherian <dcherian@users.noreply.github.com>
I like these kinds of improvements :)
[Comparison output with ravel_chunks and with reshape not preserved]
Thanks. Great PR!
Turns out that there's a call to .set_dims that forces a broadcast on the numpy coordinates.
whats-new.rst
Debugging script: