Implement DistributedReadingService #727
Conversation
@ejguan has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
Just to be clear, does the first version assume 1 process per rank?
Question about the distributed design in general:
- Are we expecting users to always use something like `torch.multiprocessing.spawn` to start distributed training, and will that properly start/clean up all the processes? (See the sketch after this list.)
- To what extent is the optimization from "Second prototype to do pre-sharding work in single process" (#555) compatible with this? Maybe it can work for every node?
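For concreteness, a minimal sketch of the spawn-per-rank pattern being asked about, assuming the `DataLoader2`/`DistributedReadingService` API from this PR (the backend, port, and dataset are placeholders):

```python
import os

import torch.distributed as dist
import torch.multiprocessing as mp
from torchdata.dataloader2 import DataLoader2, DistributedReadingService
from torchdata.datapipes.iter import IterableWrapper


def _worker(rank: int, world_size: int) -> None:
    # Each spawned process owns exactly one rank.
    os.environ["MASTER_ADDR"] = "localhost"
    os.environ["MASTER_PORT"] = "29500"
    dist.init_process_group("gloo", rank=rank, world_size=world_size)

    # sharding_filter lets the reading service split elements across ranks.
    dp = IterableWrapper(range(100)).sharding_filter()
    dl = DataLoader2(dp, reading_service=DistributedReadingService())
    for item in dl:
        pass  # training step would go here

    dl.shutdown()
    dist.destroy_process_group()


if __name__ == "__main__":
    world_size = 2
    mp.spawn(_worker, args=(world_size,), nprocs=world_size, join=True)
```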
Yeah. I think our next step is to support a mixed reading service (distributed reading service + multiprocessing reading service); see the sketch below.
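For reference, a rough sketch of what that mixed setup might look like. This is not part of this PR and assumes the `SequentialReadingService` API that later torchdata releases expose for chaining reading services:

```python
from torchdata.dataloader2 import (
    DataLoader2,
    DistributedReadingService,
    MultiProcessingReadingService,
    SequentialReadingService,  # assumed: added in a later torchdata release
)
from torchdata.datapipes.iter import IterableWrapper

# Shard across ranks first, then fan out to worker processes within each rank.
dp = IterableWrapper(range(1000)).sharding_filter()
rs = SequentialReadingService(
    DistributedReadingService(),
    MultiProcessingReadingService(num_workers=2),
)
dl = DataLoader2(dp, reading_service=rs)
```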
Nope. I can add a different test for elastic training.
This PR should be ready to review. cc: @NivekT @VitalyFedyunin
Actually, there is one thing left: docs and a tutorial for DistributedReadingService. I will do a separate PR to document DataLoader2 and DistributedReadingService.
I will wait until pytorch/pytorch#83741 is landed and released into the nightly build, because this PR will also use the updated API.
The DataPipe tests are failing because the nightly binaries for macOS haven't been updated yet.
LGTM!
Nit: IIUC, the difference between this and `_test_distributed_rs` is DL1 vs DL2?
- If so, we could potentially remove the duplicate code (not urgent; see the sketch after this list).
- Alternatively, we can just label them `_test_distributed_dl` and `_test_distributed_dataloader` (and `elastic_dl` / `elastic_training`) so it is obvious at first glance that the two are mostly the same but test different versions of the DataLoader.
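A rough sketch of the dedup idea (helper names are hypothetical, not the actual test names in this PR):

```python
from torch.utils.data import DataLoader
from torchdata.dataloader2 import DataLoader2, DistributedReadingService
from torchdata.datapipes.iter import IterableWrapper


def _make_datapipe():
    # Shared pipeline used by both variants (illustrative).
    return IterableWrapper(range(64)).sharding_filter()


def _build_dataloader(dp, use_dl2: bool):
    if use_dl2:
        # DL2 path: sharding across ranks handled by DistributedReadingService.
        return DataLoader2(dp, reading_service=DistributedReadingService())
    # DL1 path: DataLoader applies sharding based on the initialized process group.
    return DataLoader(dp, batch_size=None, num_workers=0)


def _test_distributed(world_size: int, use_dl2: bool) -> None:
    dl = _build_dataloader(_make_datapipe(), use_dl2)
    items = list(dl)
    assert len(items) > 0  # a real test would check per-rank shard contents
```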
Fixed
I will land this PR once PyTorch nightly is updated.
Add DistributedReadingService.
Add tests for both `DataLoader2` and `DataLoader`.
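For reference, a minimal sketch of the `DataLoader` (DL1) path that the new tests also exercise (the `DataLoader2` path is sketched earlier in the thread). This assumes each rank's process group is already initialized, e.g. by `torchrun` or `torch.multiprocessing.spawn`, and a recent PyTorch nightly that applies sharding to IterDataPipes from the initialized process group:

```python
import torch.distributed as dist
from torch.utils.data import DataLoader
from torchdata.datapipes.iter import IterableWrapper

# Assumes this rank's process group has already been initialized.
assert dist.is_initialized()

dp = IterableWrapper(range(100)).sharding_filter()

# With a recent PyTorch nightly, DataLoader shards IterDataPipes based on the
# initialized process group, so each rank iterates over a disjoint shard.
dl = DataLoader(dp, batch_size=None, num_workers=0)
items = list(dl)
```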