Removes empty partitions after dropping rows and splitting datasets #2328
Conversation
Thanks for making this change - looks good to me!
ludwig/data/dataframe/dask.py (Outdated)

@@ -129,7 +149,24 @@ def to_ray_dataset(self, df):
         return from_dask(df)

     def from_ray_dataset(self, dataset) -> dd.DataFrame:
-        return dataset.to_dask()
+        """Custom Ray to Dask conversion implementation to pass in meta during dd.DataFrame creation."""
@geoffreyangus should we revert to `dataset.to_dask()` if the empty partitions issue doesn't stem from a need to pass in meta?
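(For context, a minimal runnable sketch of the built-in round trip in question, using Ray Datasets' public `from_dask`/`to_dask` API; this is illustrative only, not the PR's code.)

```python
import dask.dataframe as dd
import pandas as pd
import ray

ray.init(ignore_reinit_error=True)

# Dask -> Ray -> Dask round trip via the built-in conversion that
# from_ray_dataset was reverted back to.
ddf = dd.from_pandas(pd.DataFrame({"a": [1, 2, 3]}), npartitions=2)
ds = ray.data.from_dask(ddf)  # Dask DataFrame -> Ray Dataset
ddf2 = ds.to_dask()           # Ray Dataset -> Dask DataFrame
```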
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yup, reverted!
LGTM!
This PR addresses two separate issues: #2324 and #2308.
The issues are addressed by culling empty partitions from the Dask DataFrame at two points: (1) after dropping rows with NaNs (part of the DROP_ROWS missing value strategy) and (2) after splitting the dataset into train/val/test.
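As a rough sketch of the culling technique (the helper name and details here are illustrative, not necessarily the PR's exact implementation):

```python
import dask.dataframe as dd
import pandas as pd

def drop_empty_partitions(df: dd.DataFrame) -> dd.DataFrame:
    # Compute the number of rows in each partition, then keep only the
    # non-empty ones via the .partitions accessor.
    lengths = df.map_partitions(len).compute()
    nonempty = [i for i, n in enumerate(lengths) if n > 0]
    if len(nonempty) == df.npartitions:
        return df  # nothing to cull
    return df.partitions[nonempty]

# Example: dropping NaN rows can leave a partition completely empty.
pdf = pd.DataFrame({"a": [1.0, 2.0, None, None]})
ddf = dd.from_pandas(pdf, npartitions=2).dropna()  # second partition is now empty
ddf = drop_empty_partitions(ddf)
```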
In order to maintain/increase performance, we add a `persist` call at the end of `build_dataset`, which makes it relatively inexpensive to compute the length of partitions repeatedly downstream.
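A minimal sketch of why persisting helps here (illustrative; assumes plain Dask, not Ludwig's actual `build_dataset` internals):

```python
import dask.dataframe as dd
import pandas as pd

ddf = dd.from_pandas(pd.DataFrame({"a": range(100)}), npartitions=10)

# Without persist, each length computation below would re-execute the
# upstream task graph from scratch.
ddf = ddf.persist()  # materialize partitions in (cluster) memory

# Repeated per-partition length checks are now cheap: they read the
# already-computed partitions instead of recomputing them.
for _ in range(3):
    lengths = ddf.map_partitions(len).compute()
```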