[Nodes] Add Prebatch setting to ParallelMapper #1417

Open
Wants to merge 1 commit into base: andrewkh/unbatcher
Conversation

@andrewkho (Contributor) commented Dec 26, 2024

When ParallelMapper is used for very cheap operations, the overhead of sending items over queues can quickly add up. Prebatching amortizes that per-item overhead across a batch, making this a useful parameter to be able to tune.

Fixes #1415

A few notes about the implementation:

  • I chose to compose three nodes (Batcher, ParallelMapper, Unbatcher) into one to implement this (see the sketch after this list). This is the first time we're composing BaseNodes with other BaseNodes, and it will require us to figure out graph traversal for these composed nodes (see footnote).
  • This required _ParallelMapperIter to implement BaseNode; however, getting reset to work correctly is going to be a bigger problem. For now I created an intermediate class with essentially the current implementation of ParallelMapper, which lets us use torchdata.nodes composition to get things working easily.
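
For reference, here is a minimal sketch of that composition. The exact constructor signatures (e.g. Batcher's drop_last, Unbatcher's argument) are assumptions for illustration, not necessarily the code in this PR:

import torchdata.nodes as tn

def prebatched_parallel_map(source, map_fn, prebatch, num_workers):
    # Group single items into lists of size `prebatch`, so each queue
    # transfer carries `prebatch` items instead of one.
    batched = tn.Batcher(source, batch_size=prebatch, drop_last=False)
    # Apply map_fn to every element of each batch inside the workers.
    mapped = tn.ParallelMapper(
        batched,
        map_fn=lambda batch: [map_fn(x) for x in batch],
        num_workers=num_workers,
        method="thread",
    )
    # Flatten the mapped batches back into a stream of single items.
    return tn.Unbatcher(mapped)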

Test Plan:

  • Unit tests
  • Ran a simple script to test this, output:
python examples/nodes/test_prebatch.py
[9999400009, 9999600004, 9999800001]
baseline: dt=3.0651697060093284s
[9999400009, 9999600004, 9999800001]
prebatch=16: dt=0.454918147996068s
[9999400009, 9999600004, 9999800001]
prebatch=256: dt=0.13740589004009962s
[9999400009, 9999600004, 9999800001]
prebatch=1024: dt=0.22711888700723648s

test script:

import time
import torchdata.nodes as tn


def run(prebatch):
    # Square 100k integers with 8 worker threads; prebatch=None runs the
    # un-prebatched baseline.
    node = tn.IterableWrapper(range(100000))
    node = tn.ParallelMapper(node, map_fn=lambda x: x**2, prebatch=prebatch, method="thread", num_workers=8)
    loader = tn.Loader(node)
    x = list(loader)
    print(x[-3:])


if __name__ == "__main__":
    t0 = time.perf_counter()
    run(None)
    dt = time.perf_counter() - t0
    print(f"baseline: {dt=}s")

    for prebatch in (16, 256, 1024):
        t0 = time.perf_counter()
        run(prebatch)
        dt = time.perf_counter() - t0
        print(f"{prebatch=}: {dt=}s")

Footnote: Example of where this is a problem: in the ParallelMapper case here, traversing the DAG with reflection (e.g. walking instance.__dict__ and checking for BaseNode instances) would generate two sinks for the source, since self.source points to it and self._it would eventually point to it as well. One way we could handle this is with an optional get_source/get_parent method on BaseNode, which returns the instance where graph traversal should begin; in this case it would return self._it, not self.source.
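
A rough sketch of what that optional hook could look like (the method name, the default implementation, and returning self._it are assumptions drawn from the description above, not a settled API):

class BaseNode:
    def get_parent(self):
        # Default: graph traversal proceeds through the declared source.
        return getattr(self, "source", None)

class ParallelMapper(BaseNode):
    def get_parent(self):
        # Begin traversal at the head of the composed pipeline (_it)
        # rather than self.source, so the source is visited only once.
        return self._it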

@facebook-github-bot added the CLA Signed label Dec 26, 2024

pytorch-bot bot commented Dec 26, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/data/1417

Note: Links to docs will display an error until the docs builds have been completed.

❌ 1 New Failure

As of commit 8a9ba5b with merge base 62092dd:

NEW FAILURE - The following job has failed:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@@ -272,6 +281,77 @@ def _shutdown(self):
t.join(timeout=QUEUE_TIMEOUT * 5)


class _ParallelMapperImpl(BaseNode[T]):
"""This class implements _ParallelMapperIter as a BaseNode, allowing it

A contributor commented:

Nit: This class implements _ParallelMapperIter and _InlineMapperIter as a BaseNode, ....
