Implement DistributedReadingService #727
Conversation
@ejguan has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
Just to be clear, does the first version assume 1 process per rank?
Question about the distributed design in general:
- Are we expecting users to always use something like `torch.multiprocessing.spawn` to start distributed training, and will that properly start/clean up all the processes? (See the sketch after this list.)
- To what extent is the optimization from "Second prototype to do pre-sharding work in single process" (#555) compatible with this? Maybe it can work for every node?
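For concreteness, a minimal sketch of the spawn-per-rank pattern being asked about, assuming the `DataLoader2`/`DistributedReadingService` API from this PR (the backend, port, and dataset are placeholders):

```python
import os

import torch.distributed as dist
import torch.multiprocessing as mp
from torchdata.dataloader2 import DataLoader2, DistributedReadingService
from torchdata.datapipes.iter import IterableWrapper


def _worker(rank: int, world_size: int) -> None:
    # Each spawned process owns exactly one rank.
    os.environ["MASTER_ADDR"] = "localhost"
    os.environ["MASTER_PORT"] = "29500"
    dist.init_process_group("gloo", rank=rank, world_size=world_size)

    # sharding_filter lets the reading service split elements across ranks.
    dp = IterableWrapper(range(100)).sharding_filter()
    dl = DataLoader2(dp, reading_service=DistributedReadingService())
    for item in dl:
        pass  # training step would go here

    dl.shutdown()
    dist.destroy_process_group()


if __name__ == "__main__":
    world_size = 2
    mp.spawn(_worker, args=(world_size,), nprocs=world_size, join=True)
```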
Yeah. I think our next step is to support a mixed reading service (distributed reading service + multiprocessing reading service); see the sketch below.
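For reference, a rough sketch of what that mixed setup might look like. This is not part of this PR and assumes the `SequentialReadingService` API that later torchdata releases expose for chaining reading services:

```python
from torchdata.dataloader2 import (
    DataLoader2,
    DistributedReadingService,
    MultiProcessingReadingService,
    SequentialReadingService,  # assumed: added in a later torchdata release
)
from torchdata.datapipes.iter import IterableWrapper

# Shard across ranks first, then fan out to worker processes within each rank.
dp = IterableWrapper(range(1000)).sharding_filter()
rs = SequentialReadingService(
    DistributedReadingService(),
    MultiProcessingReadingService(num_workers=2),
)
dl = DataLoader2(dp, reading_service=rs)
```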
Nope. I can add a different test for elastic training.
This PR should be ready to review. cc: @NivekT @VitalyFedyunin
Actually, there is one thing left: docs and a tutorial for DistributedReadingService. I will do a separate PR to document DataLoader2 and DistributedReadingService.
I will wait until pytorch/pytorch#83741 is landed and released into the nightly build, because this PR will also use the updated API.
The DataPipe tests are failing because the nightly binaries for macOS haven't been updated yet.
LGTM!
Nit: IIUC, the difference between this and `_test_distributed_rs` is DL1 vs DL2?
- If so, we could potentially remove the duplicate code (not urgent; see the sketch after this list).
- Alternatively, we can just label them `_test_distributed_dl` and `_test_distributed_dataloader` (and `elastic_dl` / `elastic_training`) so it is obvious at first glance that the two are mostly the same but test different versions of the DataLoader.
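A rough sketch of the dedup idea (helper names are hypothetical, not the actual test names in this PR):

```python
from torch.utils.data import DataLoader
from torchdata.dataloader2 import DataLoader2, DistributedReadingService
from torchdata.datapipes.iter import IterableWrapper


def _make_datapipe():
    # Shared pipeline used by both variants (illustrative).
    return IterableWrapper(range(64)).sharding_filter()


def _build_dataloader(dp, use_dl2: bool):
    if use_dl2:
        # DL2 path: sharding across ranks handled by DistributedReadingService.
        return DataLoader2(dp, reading_service=DistributedReadingService())
    # DL1 path: DataLoader applies sharding based on the initialized process group.
    return DataLoader(dp, batch_size=None, num_workers=0)


def _test_distributed(world_size: int, use_dl2: bool) -> None:
    dl = _build_dataloader(_make_datapipe(), use_dl2)
    items = list(dl)
    assert len(items) > 0  # a real test would check per-rank shard contents
```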
Fixed
I will land this PR once PyTorch nightly is updated.
Add DistributedReadingService.
Add tests for both `DataLoader2` and `DataLoader`.
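For reference, a minimal sketch of the `DataLoader` (DL1) path that the new tests also exercise (the `DataLoader2` path is sketched earlier in the thread). This assumes each rank's process group is already initialized, e.g. by `torchrun` or `torch.multiprocessing.spawn`, and a recent PyTorch nightly that applies sharding to IterDataPipes from the initialized process group:

```python
import torch.distributed as dist
from torch.utils.data import DataLoader
from torchdata.datapipes.iter import IterableWrapper

# Assumes this rank's process group has already been initialized.
assert dist.is_initialized()

dp = IterableWrapper(range(100)).sharding_filter()

# With a recent PyTorch nightly, DataLoader shards IterDataPipes based on the
# initialized process group, so each rank iterates over a disjoint shard.
dl = DataLoader(dp, batch_size=None, num_workers=0)
items = list(dl)
```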