Improve to_dask_dataframe performance #7844
Conversation
Illviljan commented May 15, 2023 (edited)
- ds.chunks loops over all the variables; compute it once instead of once per variable.
- It is faster to create the meta DataFrame once than to let dask infer it 2000 times, once per column (a sketch of both ideas follows this list).
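A minimal sketch of both optimizations, modeled on the PR's approach but not a verbatim copy of xarray's implementation; the toy Dataset, loop, and sizes here are illustrative:

```python
import dask.dataframe as dd
import numpy as np
import pandas as pd
import xarray as xr

# Toy Dataset standing in for one with ~2000 variables.
ds = xr.Dataset(
    {f"var{i}": ("x", np.arange(10, dtype="f8")) for i in range(5)}
).chunk({"x": 5})

ds_chunks = ds.chunks  # computed once; Dataset.chunks loops over every variable
df_meta = pd.DataFrame()  # one reusable meta instead of per-column inference

series_list = []
for name, da in ds.data_vars.items():
    # Broadcast to the Dataset's full dims, then flatten to 1-D.
    flat = da.variable.set_dims(dict(ds.sizes)).chunk(ds_chunks).data.reshape(-1)
    series_list.append(dd.from_dask_array(flat, columns=name, meta=df_meta))

df = dd.concat(series_list, axis=1)
```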
xarray/core/dataset.py (Outdated)

```diff
@@ -6422,8 +6429,13 @@ def to_dask_dataframe(
            if not is_duck_dask_array(var._data):
                var = var.chunk()

            dask_array = var.set_dims(ordered_dims).chunk(self.chunks).data
            series = dd.from_array(dask_array.reshape(-1), columns=[name])
            if has_many_dims:
```
Is this really that impactful? Can we optimize set_dims instead?
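For context on the question above: Variable.set_dims broadcasts a variable to a target set of dimensions, which is how to_dask_dataframe expands each variable to the full dimension space before flattening. A small illustration with made-up sizes:

```python
import numpy as np
import xarray as xr

v = xr.Variable(("x",), np.arange(3))
# Broadcast against a new length-2 "y" dimension:
expanded = v.set_dims({"x": 3, "y": 2})
print(expanded.dims, expanded.shape)  # ('x', 'y') (3, 2)
```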
I think I'll save the has_many_dims paths for a future PR. I think it might introduce bugs if we don't consistently chunk with the same shape.
* Improve to_dask_dataframe performance
* Add ASV test
* Update pandas.py
* Update dataset.py
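Regarding "Add ASV test": a hedged sketch of what such a benchmark can look like. The class name, variable count, and chunking below are illustrative, not necessarily what the PR added to xarray's asv_bench suite:

```python
import numpy as np
import xarray as xr

class ToDaskDataFrame:
    def setup(self):
        # Many small variables stress the per-variable overhead this PR
        # removes (repeated Dataset.chunks lookups and per-column meta
        # inference in dask).
        self.ds = xr.Dataset(
            {f"var{i}": ("x", np.arange(1_000)) for i in range(250)}
        ).chunk({"x": 100})

    def time_to_dask_dataframe(self):
        self.ds.to_dask_dataframe()
```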