Use Ray Datasets to read binary files in parallel #2241

tgaddair · 2022-07-07T16:04:18Z

No description provided.

github-actions · 2022-07-07T16:55:19Z

Unit Test Results

6 files ±0, 6 suites ±0, 2h 44m 56s ⏱️ (-11m 57s)
2 948 tests +1: 2 899 ✔️ +2, 49 💤 ±0, 0 ❌ -1
8 844 runs +3: 8 661 ✔️ +4, 183 💤 ±0, 0 ❌ -1

Results for commit e8b0160. ± Comparison against base commit dc047cd.

♻️ This comment has been updated with latest results.


geoffreyangus · 2022-08-04T20:57:50Z

This PR improves the way path-specified Image/Audio features are loaded during preprocessing. Prior to this change, Image/Audio paths were loaded and placed directly into their source partitions, allowing partitions to balloon in size and causing undue memory pressure. The change implemented here allows the Ray backend to create new partitions when reading paths. The results of such a change are promising across a variety of benchmarking datasets:

| Dataset | master duration (secs) | fast-im-read duration (secs) |
| --- | --- | --- |
| Tabular: Criteo (100MB) | 56.88 | 56.37 |
| Tabular: Criteo (1GB) | 289.9 | 292.55 |
| Image: iSpy2 (~5k rows) | 450.62 | 95.22 |
| Image+Text: Twitter Bots (~43k rows) | 1568.34 | 291.38 |
| Audio: respiratory (~7k rows) | N/A (crashed after ~1,980) | 308.75 |
| Image+Text: H&M Shopping (~95k rows) | N/A | 4503.92 |

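
To illustrate the idea of reading paths into fresh partitions and joining the loaded data back by index, here is a minimal pure-Python sketch. It is not the Ludwig or Ray implementation: partitions are modeled as plain lists of `(index, value)` tuples, and `reset_global_index`, `read_paths`, `align_by_index`, and the `read_fn` stand-in are hypothetical names chosen for this example.

```python
def reset_global_index(partitions):
    """Replace per-partition indices with one running, globally unique index."""
    result, next_index = [], 0
    for part in partitions:
        result.append([(next_index + i, value) for i, (_, value) in enumerate(part)])
        next_index += len(part)
    return result


def read_paths(indexed_paths, num_partitions, read_fn):
    """Spread (index, path) rows over new partitions and load each path."""
    rows = [row for part in indexed_paths for row in part]
    new_parts = [rows[i::num_partitions] for i in range(num_partitions)]
    return [[(idx, read_fn(path)) for idx, path in part] for part in new_parts]


def align_by_index(left_parts, right_parts):
    """Join two differently partitioned columns on the global index."""
    right = {idx: value for part in right_parts for idx, value in part}
    return [[(idx, value, right[idx]) for idx, value in part] for part in left_parts]


# Two source partitions whose local indices collide (both start at 0).
paths = [[(0, "a.png"), (1, "b.png")], [(0, "c.png")]]
paths = reset_global_index(paths)

# Hypothetical reader: a real one would return the file's bytes.
images = read_paths(paths, num_partitions=3, read_fn=lambda p: f"<bytes of {p}>")
aligned = align_by_index(paths, images)
```

Without the index reset, both source partitions would contain a row with index 0, and the lookup in `align_by_index` could not tell them apart.
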
A few callouts:

If training on Image/Audio features on a Ray/Dask backend, a globally unique index is now explicitly required in order to re-align the partitions across various features. Under certain conditions, preprocessing.build_dataset will automatically reset the indices to ensure they are globally unique.

The number of new partitions created during Image/Audio path preprocessing will always be at least min(len(dataset), 200), where 200 is the Ray default number of parallel readers. This number will increase to ensure that the estimated size of the resulting partitions is less than 50MB.

An issue with Modin was discovered during implementation. A small workaround on the Ray backend will be reverted once the issue is resolved.
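
The partition-count rule in the second callout can be sketched as below. This is an assumption-laden illustration, not the code from this PR: the function name `num_read_partitions`, the constants, the reading of "50MB" as 50 * 1024 * 1024 bytes, and the fallback to one partition for an empty dataset are all choices made for the example.

```python
DEFAULT_PARALLELISM = 200               # Ray default number of parallel readers, per the callout
MAX_PARTITION_BYTES = 50 * 1024 * 1024  # assumed interpretation of "50MB"


def num_read_partitions(num_rows, estimated_total_bytes):
    """Return a partition count that respects both the row floor and the size cap."""
    # Lower bound from the callout: min(len(dataset), 200).
    floor = min(num_rows, DEFAULT_PARALLELISM)
    # Smallest count that keeps the average partition strictly under the cap.
    by_size = estimated_total_bytes // MAX_PARTITION_BYTES + 1
    # Assumption: never return zero partitions, even for an empty dataset.
    return max(floor, by_size, 1)
```

For example, under these assumptions a 5,000-row dataset estimated at 1 GiB keeps the floor of 200 partitions, while a 95,000-row dataset estimated at 20 GiB is bumped to 410 partitions so that each stays under the cap.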

^^resurfacing this

ShreyaR

Wow, this was a pretty substantial change with a bunch of optimizations throughout image reading! Thanks a lot for getting this in 🎉

arnavgarg1

Amazing work @geoffreyangus! LGTM 🚢

Use Ray Datasets to read binary files in parallel

9411a71

tgaddair requested review from geoffreyangus and arnavgarg1 July 7, 2022 16:04

geoffreyangus and others added 26 commits July 14, 2022 10:46

Merge branch 'master' into fast-im-read

f2fad56

working version (without NaNs)

0b894b0

Works with NaNs

c137cd9

remove TODO item

f5b75dd

[pre-commit.ci] auto fixes from pre-commit.com hooks

e59f268


wip– cannot find files on remote filesystems

1c424af

hack that handles NaNs

f4c9177

starting work on custom datasource starting from file data source

e3d928e

NaN handling using custom data source

0c72668

http handling

d824093

flakiness in roc

521367b

[pre-commit.ci] auto fixes from pre-commit.com hooks

437ffd7


revert audio feature

85f1c4c

Merge branch 'fast-im-read' of https://github.com/ludwig-ai/ludwig into fast-im-read

47c254d

normalize NaNs to None

1e854c1

wip: tests flaky

1674fbf

Merge branch 'master' into fast-im-read

fee99da

fixed flakiness by removing forced nan in generate data

7957705

cleanup

728ecfe

cleanup

b4f2f97

cleanup from benchmarking

ae4026f

[pre-commit.ci] auto fixes from pre-commit.com hooks

f240421


add ray test for pre-loaded numpy images

b6551af

Merge branch 'fast-im-read' of https://github.com/ludwig-ai/ludwig into fast-im-read

4d4a60a

[pre-commit.ci] auto fixes from pre-commit.com hooks

493cb65


added modin support

c9405c5

geoffreyangus force-pushed the fast-im-read branch from 18c5c50 to 00ccfea Compare July 29, 2022 21:23

geoffreyangus added 23 commits July 29, 2022 16:05

add persist to save time

a13316b

add NoneType check to split

883d008

Merge branch 'remove-empty-partitions' into fast-im-read

06d1731

unpin torch

c03bf4f

Merge branch 'pin-ray-nightly' into remove-empty-partitions

289e974

move persist call to this PR

459255a

Merge branch 'remove-empty-partitions' into fast-im-read

b538f78

Merge branch 'master' of https://github.com/ludwig-ai/ludwig

825fe72

fix merge conflicts

523a9af

Merge branch 'master' of https://github.com/ludwig-ai/ludwig

f9d715e

Merge branch 'master' into remove-empty-partitions

d77ee41

revert to to_dask()

aa221b0

Merge branch 'remove-empty-partitions' into fast-im-read

f578bf3

Merge branch 'master' of https://github.com/ludwig-ai/ludwig

724d202

Merge branch 'master' into remove-empty-partitions

fb53d72

Merge branch 'remove-empty-partitions' into fast-im-read

4d90905

reverted custom to_dask and isolated ray into DaskEngine methods

d079969

Merge branch 'remove-empty-partitions' into fast-im-read

ced5310

Merge branch 'master' of https://github.com/ludwig-ai/ludwig

6efa162

Merge branch 'master' into remove-empty-partitions

5cd4d49

Merge branch 'remove-empty-partitions' into fast-im-read

b572f1c

merge with master

e9774d1

Merge branch 'master' into fast-im-read

e8b0160

geoffreyangus requested a review from ShreyaR August 4, 2022 19:40

ShreyaR approved these changes Aug 4, 2022


arnavgarg1 approved these changes Aug 5, 2022


geoffreyangus merged commit 9cd95af into master Aug 8, 2022

geoffreyangus deleted the fast-im-read branch August 8, 2022 15:39
