[Data] Allow disabling Task Fusion / Documenting how to avoid it #54433

@praateekmahajan

Description

When a Ray Data pipeline chains multiple map_batches stages and two or more consecutive stages have identical resource requirements (e.g., both request num_cpus=1), Ray may fuse those stages into a single task. This fusion can be problematic when the stages have very different characteristics, such as one being I/O intensive and the other compute intensive, because:

  1. Both stages run inside the same task, leading to inefficient resource usage.
  2. Autoscaling and scheduling cannot independently optimize for the distinct needs of each stage.
  3. One slow stage (e.g., I/O) can bottleneck the overall pipeline, even if the next stage could run in parallel.

Suppose you have:

Stage 1: I/O intensive, originally set to num_cpus=1
Stage 2: Compute intensive, also set to num_cpus=1

Ray may fuse these into a single task, causing the above issues.

Workaround

To prevent fusion and allow each stage to be scheduled and autoscaled independently, assign different resource requirements to each map_batches stage. For example:

  1. Set the I/O intensive stage to num_cpus=0.5
  2. Set the compute intensive stage to num_cpus=1.0

This ensures Ray treats each stage as a separate task, enabling better autoscaling and parallelism.
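
A minimal sketch of the workaround described above, for illustration only. The stage functions `load_stage` and `compute_stage`, the helper `build_pipeline`, and the `IO_RESOURCES` / `COMPUTE_RESOURCES` names are hypothetical; only `map_batches` and its `num_cpus` argument come from the issue. Whether the stages actually stay unfused may depend on the Ray version, so it is worth checking the execution stats.

```python
from typing import Any, Dict

# Hypothetical stage functions; names and bodies are placeholders.
def load_stage(batch: Dict[str, Any]) -> Dict[str, Any]:
    """Stand-in for the I/O-intensive stage (e.g. fetching remote files)."""
    return batch


def compute_stage(batch: Dict[str, Any]) -> Dict[str, Any]:
    """Stand-in for the compute-intensive stage: adds a squared column."""
    squared = [int(x) ** 2 for x in batch["id"]]
    return {"id": batch["id"], "squared": squared}


# Different CPU requests per stage, mirroring the workaround above.
IO_RESOURCES = {"num_cpus": 0.5}
COMPUTE_RESOURCES = {"num_cpus": 1.0}


def build_pipeline(ds):
    """Chain the two stages with distinct resource requests.

    `ds` is expected to be a ray.data.Dataset; any object exposing a
    compatible map_batches method works as well.
    """
    ds = ds.map_batches(load_stage, **IO_RESOURCES)
    ds = ds.map_batches(compute_stage, **COMPUTE_RESOURCES)
    return ds


def main() -> None:
    # Imported lazily so the module can be loaded without Ray installed.
    import ray

    # ray.data.range(n) yields rows with a single "id" column.
    ds = build_pipeline(ray.data.range(1000)).materialize()
    # The stats list the operators that ran, which is one way to check
    # whether the two stages were kept separate.
    print(ds.stats())
```

Calling `main()` on a machine with Ray installed runs the pipeline and prints the execution stats. Keeping the resource requests in named dictionaries makes it obvious at a glance that the two stages intentionally differ.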

Request

  1. Feature: Provide a way to explicitly disable task fusion for map_batches, or document this fusion behavior and the recommended workaround.
  2. Documentation: Clearly explain in the docs how resource requirements affect task fusion, and how to avoid unintended fusion when building multi-stage pipelines with heterogeneous workloads.
  3. Long shot: Use the measured per-row completion time of each map_batches stage to decide which stages can be fused.

Use case

No response

Metadata

Labels

P2 (Important issue, but not time-critical), data (Ray Data-related issues), docs (An issue or change related to documentation), enhancement (Request for new feature and/or capability), performance
