Enable SequentialReadingService to support MP + Distributed #985
Conversation
@ejguan has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
@@ -92,7 +92,7 @@ def _create_datapipe_queue_loop(source_datapipe, req_queue, res_queue, blocking_
     return pipe_type.DataPipeBehindQueues(
         source_datapipe,
         protocol_type(req_queue, res_queue),
-        blocking_request_get=True,
+        blocking_request_get=blocking_request_get,
This was a bug that would block the dispatching process, since multiple loops run in the same process and we need to make sure the loops don't block each other.
It should also fix the problem for DDP + MPRS with fullsync.
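To make the failure mode concrete, here is a minimal sketch (hypothetical names, not the torchdata internals) of cooperative loops sharing one process, where a hard-coded blocking get would stall everything::

    import queue

    def serve_loop(req_queue, res_queue, blocking_request_get=False):
        # Hypothetical stand-in for DataPipeBehindQueues: a cooperative loop
        # driven by next() alongside sibling loops in the same process.
        while True:
            try:
                # With blocking_request_get=True, an empty queue stalls this
                # loop forever -- and, because loops are scheduled
                # cooperatively, every other loop in the process too.
                req = req_queue.get(block=blocking_request_get)
            except queue.Empty:
                yield  # hand control back so sibling loops can make progress
                continue
            res_queue.put(req)  # echo; real code would run a DataPipe step
            yield

    # Two loops sharing one process: with blocking gets, driving the first
    # (empty) loop would hang before the second ever runs.
    q1, r1, q2, r2 = queue.Queue(), queue.Queue(), queue.Queue(), queue.Queue()
    loops = [serve_loop(q1, r1), serve_loop(q2, r2)]
    q2.put("ping")
    for _ in range(2):
        for loop in loops:
            next(loop)
    print(r2.get_nowait())  # -> ping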
And, to bulletproof this, I have changed both the distributed and non-distributed tests to run against a non-balanced data shard to guard the dispatching use cases.
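For a concrete picture of what a non-balanced shard means here (a toy sketch in plain Python, not the test code)::

    # 5 items sharded round-robin over 2 ranks -> rank 0 gets one extra item.
    items = list(range(5))
    world_size = 2
    shards = [items[rank::world_size] for rank in range(world_size)]
    print(shards)  # [[0, 2, 4], [1, 3]]

    # fullsync-style behavior: every rank stops after the shortest shard
    # length, so no rank blocks forever on a collective the others never join.
    common = min(len(shard) for shard in shards)
    balanced = [shard[:common] for shard in shards]
    print(balanced)  # [[0, 2], [1, 3]]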
LGTM.
Multiprocessing + Distributed
------------------------------

``SequentialReadingService`` can be used to combine both ``ReadingServices`` together to achieve multiprocessing and distributed training at the same time.
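For context, the combined usage looks roughly like this (a sketch against the ``DataLoader2`` API; class names such as ``MultiProcessingReadingService`` may differ across torchdata versions)::

    from torchdata.dataloader2 import (
        DataLoader2,
        DistributedReadingService,
        MultiProcessingReadingService,
        SequentialReadingService,
    )
    from torchdata.datapipes.iter import IterableWrapper

    datapipe = IterableWrapper(range(64)).shuffle().sharding_filter()

    mp_rs = MultiProcessingReadingService(num_workers=4)
    dist_rs = DistributedReadingService()
    # Distributed first, then multiprocessing: data is sharded across ranks
    # before each rank's graph is split across its worker processes.
    rs = SequentialReadingService(dist_rs, mp_rs)

    dl = DataLoader2(datapipe, reading_service=rs)
    for d in dl:
        ...  # model(d)
    dl.shutdown()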
Do we expect that, in the future, it could also be used in OSS to chain a disaggregated reading service with last-mile "on-trainer" Python transformations? :)
I guess so, once AIStoreRS or RayRS is provided.
@@ -236,13 +223,13 @@ def initialize(self, datapipe: DataPipe) -> DataPipe:

        # Launch dispatching process for the lowest common ancestor of non-replicable DataPipes
        graph = traverse_dps(datapipe)
-       non_replicable_dp = find_lca_round_robin_sharding_dp(graph)
-       if non_replicable_dp is not None:
+       dispatching_dp = find_lca_round_robin_sharding_dp(graph)
Thanks, it's now clearer which part of the DataPipe this refers to ~
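As a side note, the "lowest common ancestor" idea here is the standard tree LCA; a generic sketch (a hypothetical ``Node`` with parent pointers, not torchdata's graph format)::

    from dataclasses import dataclass
    from typing import Optional

    @dataclass
    class Node:
        name: str
        parent: Optional["Node"] = None

    def path_from_root(node: Node) -> list:
        """Return the chain of nodes from the root down to `node`."""
        path = []
        while node is not None:
            path.append(node)
            node = node.parent
        return path[::-1]

    def lowest_common_ancestor(nodes: list) -> Optional[Node]:
        """Deepest node that is an ancestor of (or equal to) every input."""
        paths = [path_from_root(n) for n in nodes]
        lca = None
        for level in zip(*paths):
            if all(n is level[0] for n in level):
                lca = level[0]
            else:
                break
        return lca

    # root -> a -> {b, c}: the LCA of b and c is a, where a single
    # dispatching process could serve both non-replicable branches.
    root = Node("root")
    a = Node("a", root)
    b, c = Node("b", a), Node("c", a)
    print(lowest_common_ancestor([b, c]).name)  # -> a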
    def initialize(self, datapipe: DataPipe) -> DataPipe:
        r"""
        ``PrototypeMultiProcessingReadingService`` finds information about sharding,
        separates the graph into multiple pieces, reconnects them using queues, and
        creates subprocesses.
        """
        if dist.is_available() and dist.is_initialized():
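For readers skimming: the guard above is the usual "distributed-aware but optional" pattern; a sketch of what it typically gates (illustrative, not the PR's branch body)::

    import torch.distributed as dist

    def get_rank_and_world_size() -> tuple:
        # Only query the process group when one actually exists; otherwise
        # fall back to single-process defaults so the same code runs anywhere.
        if dist.is_available() and dist.is_initialized():
            return dist.get_rank(), dist.get_world_size()
        return 0, 1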
Nice :)
        pass


def _launch_distributed_training(world_size, *args, fn):
I asked ChatGPT what this function does:
The program is a function _launch_distributed_training that launches a distributed training process. The function takes in parameters world_size, *args, and fn. The environment variable MASTER_ADDR is set to TEST_MASTER_ADDR, and the environment variable MASTER_PORT is set to a value returned from a call to the _get_open_port function. The function creates a multiprocessing context using the spawn method, and creates a queue q using the context.
The function then creates world_size processes using the Process method of the context and starts each process. The target of each process is the function fn, and the arguments for each process are rank, world_size, q, and *args. The function stores the created processes in a list ps.
The function then uses a while loop to get data from the queue q and append it to a list res. The loop breaks when a TerminateSignal is received from the queue. After the loop, the function joins all processes in the ps list. Finally, the function returns the res list.
Seems quite correct? ~
What happens if we give the description back to ChatGPT and ask it to write the code~~ "write a function that (quote~)"
Haha, writing those comments would take me more time than writing code.
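For the curious, a reconstruction purely from the description above (a sketch; ``TEST_MASTER_ADDR``, ``_get_open_port``, and ``TerminateSignal`` are taken from the description and stubbed here, not verified against the repo)::

    import multiprocessing as mp
    import os
    import socket

    TEST_MASTER_ADDR = "127.0.0.1"  # assumption: a loopback test address

    class TerminateSignal:
        """Sentinel a worker puts on the queue when it is done (assumed)."""

    def _get_open_port() -> str:
        # Assumed helper: bind to port 0 and let the OS pick a free port.
        with socket.socket() as s:
            s.bind(("", 0))
            return str(s.getsockname()[1])

    def _launch_distributed_training(world_size, *args, fn):
        os.environ["MASTER_ADDR"] = TEST_MASTER_ADDR
        os.environ["MASTER_PORT"] = _get_open_port()
        ctx = mp.get_context("spawn")
        q = ctx.Queue()
        ps = []
        for rank in range(world_size):
            p = ctx.Process(target=fn, args=(rank, world_size, q, *args))
            p.start()
            ps.append(p)
        res = []
        while True:
            data = q.get()
            if isinstance(data, TerminateSignal):
                break
            res.append(data)
        for p in ps:
            p.join()
        return res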
LGTM!
@@ -50,3 +50,21 @@ Distributed

    for d in dl:
        model(d)
    dl.shutdown()

Multiprocessing + Distributed
Non-blocking, but it would be nice to add this to our Colab example as well!
Sounds reasonable to me.
Fixes #911
Changes

- Fix ``blocking_request_get`` not sent to worker process
- Enable ``SequentialReadingService`` to combine both Distributed and MP ReadingService
- Add tests for ``SequentialReadingService``
- Add documentation for ``SequentialReadingService``