-
Notifications
You must be signed in to change notification settings - Fork 7k
Description
Description
When using Ray Data's map_batches operation in a pipeline with multiple stages, if two or more consecutive map_batches have identical resource requirements (e.g., both request num_cpus=1), Ray may fuse these stages into a single task. This fusion can be problematic when the stages have very different characteristics—such as one being I/O intensive and the other compute intensive—because:
- Both tasks get scheduled together, leading to inefficient resource usage.
- Autoscaling and scheduling cannot independently optimize for the distinct needs of each stage.
- One slow stage (e.g., I/O) can bottleneck the overall pipeline, even if the next stage could be run in parallel.
Suppose you have:
Stage 1: I/O intensive, originally set to num_cpus=1
Stage 2: Compute intensive, also set to num_cpus=1
Ray may fuse these into a single task, causing the above issues.
Workaround
To prevent fusion and allow each stage to be scheduled and autoscaled independently, assign different resource requirements to each map_batches stage. For example:
- Set the I/O intensive stage to num_cpus=0.5
- Set the compute intensive stage to num_cpus=1.0
This ensures Ray treats each stage as a separate task, enabling better autoscaling and parallelism.
Request
- Feature: Provide a way to explicitly disable task fusion for map_batches, or document this fusion behavior and the recommended workaround.
- Documentation: Clearly explain in the docs how resource requirements affect task fusion, and how to avoid unintended fusion when building multi-stage pipelines with heterogeneous workloads.
- Long Shot - Use
time(completion of map_batches per row)of task to understand which stages can be fused
Use case
No response