Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use Ray Datasets to read binary files in parallel #2241

Merged
merged 112 commits into from
Aug 8, 2022
Merged

Conversation

tgaddair
Copy link
Collaborator

@tgaddair tgaddair commented Jul 7, 2022

No description provided.

@github-actions
Copy link

github-actions bot commented Jul 7, 2022

Unit Test Results

       6 files  ±0         6 suites  ±0   2h 44m 56s ⏱️ - 11m 57s
2 948 tests +1  2 899 ✔️ +2    49 💤 ±0  0  - 1 
8 844 runs  +3  8 661 ✔️ +4  183 💤 ±0  0  - 1 

Results for commit e8b0160. ± Comparison against base commit dc047cd.

♻️ This comment has been updated with latest results.

@geoffreyangus geoffreyangus requested a review from ShreyaR August 4, 2022 19:40
@geoffreyangus
Copy link
Contributor

This PR improves the way path-specified Image/Audio features are loaded during preprocessing. Prior to this change, Image/Audio paths were loaded and placed directly into their source partitions, allowing partitions to balloon in size and causing undue memory pressure. The change implemented here allows the Ray backend to create new partitions when reading paths. The results of such a change are promising across a variety of benchmarking datasets:

Dataset Branch duration (secs)
master fast-im-read
Tabular: Criteo (100MB) 56.88 56.37
Tabular: Criteo (1GB) 289.9 292.55
Image: iSpy2 (~5k rows) 450.62 95.22
Image+Text: Twitter Bots (~43k rows) 1568.34 291.38
Audio: respiratory (~7k rows) N/A (crashed after ~1,980) 308.75
Image+Text: H&M Shopping (~95k rows) N/A 4503.92
A few callouts:

  • If training on Image/Audio features on a Ray/Dask backend, a globally unique index is now explicitly required in order to re-align the partitions across various features. Under certain conditions, preprocessing.build_dataset will automatically reset the indices to ensure they are globally unique.
  • The number of new partitions created during Image/Audio path preprocessing will always be at least min(len(dataset), 200), where 200 is the Ray default number of parallel readers. This number will increase to ensure that the estimated size of the resulting partitions is less than 50MB.
  • An issue with Modin was discovered during implementation. A small workaround on the Ray backend will be reverted once the issue is resolved.

^^resurfacing this

Copy link
Contributor

@ShreyaR ShreyaR left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wow, this was a pretty substantial change with a bunch of optimizations throughout image reading! Thanks a lot for getting this in 🎉

Copy link
Contributor

@arnavgarg1 arnavgarg1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Amazing work @geoffreyangus! LGTM 🚢

@geoffreyangus geoffreyangus merged commit 9cd95af into master Aug 8, 2022
@geoffreyangus geoffreyangus deleted the fast-im-read branch August 8, 2022 15:39
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants