Improve to_dask_dataframe performance #7844
Conversation
Illviljan commented May 15, 2023 (edited)
- ds.chunks loops over all the variables; compute it once instead of once per variable.
- It is faster to create the meta DataFrame once than to let dask infer it 2000 times, once per column (a sketch of both ideas follows this list).
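A minimal sketch of both optimizations, modeled on the PR's approach but not a verbatim copy of xarray's implementation; the toy Dataset, loop, and sizes here are illustrative:

```python
import dask.dataframe as dd
import numpy as np
import pandas as pd
import xarray as xr

# Toy Dataset standing in for one with ~2000 variables.
ds = xr.Dataset(
    {f"var{i}": ("x", np.arange(10, dtype="f8")) for i in range(5)}
).chunk({"x": 5})

ds_chunks = ds.chunks  # computed once; Dataset.chunks loops over every variable
df_meta = pd.DataFrame()  # one reusable meta instead of per-column inference

series_list = []
for name, da in ds.data_vars.items():
    # Broadcast to the Dataset's full dims, then flatten to 1-D.
    flat = da.variable.set_dims(dict(ds.sizes)).chunk(ds_chunks).data.reshape(-1)
    series_list.append(dd.from_dask_array(flat, columns=name, meta=df_meta))

df = dd.concat(series_list, axis=1)
```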
xarray/core/dataset.py (Outdated)

```diff
@@ -6422,8 +6429,13 @@ def to_dask_dataframe(
            if not is_duck_dask_array(var._data):
                var = var.chunk()

            dask_array = var.set_dims(ordered_dims).chunk(self.chunks).data
            series = dd.from_array(dask_array.reshape(-1), columns=[name])
            if has_many_dims:
```
Is this really that impactful? Can we optimize set_dims instead?
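For context on the question above: Variable.set_dims broadcasts a variable to a target set of dimensions, which is how to_dask_dataframe expands each variable to the full dimension space before flattening. A small illustration with made-up sizes:

```python
import numpy as np
import xarray as xr

v = xr.Variable(("x",), np.arange(3))
# Broadcast against a new length-2 "y" dimension:
expanded = v.set_dims({"x": 3, "y": 2})
print(expanded.dims, expanded.shape)  # ('x', 'y') (3, 2)
```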
I think I'll save the has_many_dims paths for a future PR. I think it might introduce bugs if we don't consistently chunk with the same shape.
* Improve to_dask_dataframe performance
* Add ASV test
* Update pandas.py
* Update dataset.py
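Regarding "Add ASV test": a hedged sketch of what such a benchmark can look like. The class name, variable count, and chunking below are illustrative, not necessarily what the PR added to xarray's asv_bench suite:

```python
import numpy as np
import xarray as xr

class ToDaskDataFrame:
    def setup(self):
        # Many small variables stress the per-variable overhead this PR
        # removes (repeated Dataset.chunks lookups and per-column meta
        # inference in dask).
        self.ds = xr.Dataset(
            {f"var{i}": ("x", np.arange(1_000)) for i in range(250)}
        ).chunk({"x": 100})

    def time_to_dask_dataframe(self):
        self.ds.to_dask_dataframe()
```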