[Data] Support Exact Batch Size Enforcement in map_batches to Enable Collate Functions in Ray Data Pipeline #58837

@xinyuangui2

Description

In Ray Data, collate_fn currently has to run on the iterator (consumer) side because map_batches does not guarantee that each batch passed to the function contains exactly batch_size rows. As a result, users cannot reliably perform collation inside the Ray Data pipeline itself.

Because of this limitation, all collation logic must be handled on the iterator side, which reduces scalability and prevents fully utilizing Ray Data’s distributed data processing capabilities.

Desired Behavior

Add a map_batches mode that guarantees exact batch sizes, i.e.:
• Each batch passed to the function contains exactly batch_size rows
• Only the last batch may contain fewer than batch_size rows

With this guarantee, the collate_fn can be moved fully into the Ray Data pipeline.
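
To make the requested guarantee concrete, here is a minimal sketch in plain Python. It is not Ray Data's implementation and does not use any Ray API. The names `Block`, `exact_batches`, and the toy `collate_fn` are hypothetical, and a block is modeled as a simple list of rows. Blocks of uneven size are buffered and re-sliced so that every batch handed to the collate function has exactly batch_size rows, except possibly the last.

```python
from typing import Dict, Iterable, Iterator, List

# Hypothetical stand-in for a Ray Data block: a plain list of row dicts.
Block = List[dict]


def exact_batches(blocks: Iterable[Block], batch_size: int) -> Iterator[Block]:
    """Yield batches of exactly batch_size rows; only the final one may be short."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    buffer: Block = []
    for block in blocks:
        buffer.extend(block)
        # Emit as many full batches as the buffer currently holds.
        while len(buffer) >= batch_size:
            yield buffer[:batch_size]
            buffer = buffer[batch_size:]
    # Leftover rows form the single, possibly smaller, last batch.
    if buffer:
        yield buffer


def collate_fn(batch: Block) -> Dict[str, list]:
    """Toy collate step: turn a list of rows into a dict of columns."""
    return {key: [row[key] for row in batch] for key in batch[0]}


# Three input blocks holding 3, 1, and 6 rows (10 rows in total).
blocks = [
    [{"x": i} for i in range(start, stop)]
    for start, stop in [(0, 3), (3, 4), (4, 10)]
]
collated = [collate_fn(b) for b in exact_batches(blocks, batch_size=4)]
sizes = [len(c["x"]) for c in collated]
print(sizes)
```

With a batch size of 4, the ten input rows are regrouped across block boundaries, so the collate function never sees a short batch in the middle of the stream. In a distributed setting the same carry-over has to happen across blocks that may live on different workers, which is presumably where the real implementation cost of this request lies.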

Use case

  • Collation can be CPU-heavy; distributing it across Ray workers greatly improves scalability.
  • Allows moving logic from the less scalable iterator layer into the Ray Data pipeline.

Metadata

Labels

data, enhancement, performance, triage, usability
