[Data] Support Exact Batch Size Enforcement in map_batches to Enable Collate Functions in Ray Data Pipeline

### Description

In Ray Data, `collate_fn` currently has to run on the iterator (consumer) side because `map_batches` does not guarantee that every batch passed to the user function contains exactly `batch_size` rows. As a result, users cannot reliably perform collation inside the Ray Data pipeline itself.

Because of this limitation, all collation logic must be handled on the iterator side, which reduces scalability and prevents fully utilizing Ray Data’s distributed data processing capabilities.

### Desired Behavior

Add a mode to `map_batches` that guarantees exact batch sizes, i.e.:

* Each batch passed to the function contains exactly `batch_size` rows.
* Only the last batch may contain fewer than `batch_size` rows.

With this guarantee, the `collate_fn` can be moved fully into the Ray Data pipeline.
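
To illustrate the requested semantics, here is a minimal pure-Python sketch. It does not use Ray and is not Ray Data's implementation: `batch_per_block` and `batch_exact` are hypothetical helper names, and modelling the current behavior as "batches never span block boundaries" is a simplifying assumption made only for this example.

```python
from typing import Iterable, Iterator, List


def batch_per_block(blocks: Iterable[List[int]], batch_size: int) -> Iterator[List[int]]:
    """Simplified model (assumption): batches never cross block boundaries,
    so the tail of every block can produce an undersized batch."""
    for block in blocks:
        for start in range(0, len(block), batch_size):
            yield block[start:start + batch_size]


def batch_exact(blocks: Iterable[List[int]], batch_size: int) -> Iterator[List[int]]:
    """Requested behavior: buffer rows across blocks so that every batch has
    exactly batch_size rows, except possibly the last one."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    buffer: List[int] = []
    for block in blocks:
        buffer.extend(block)
        # Emit as many full batches as the buffer currently holds.
        while len(buffer) >= batch_size:
            yield buffer[:batch_size]
            buffer = buffer[batch_size:]
    # Leftover rows form the single, possibly smaller, final batch.
    if buffer:
        yield buffer


# Three blocks holding 5, 3, and 2 rows (10 rows in total).
blocks = [[0, 1, 2, 3, 4], [5, 6, 7], [8, 9]]

per_block_sizes = [len(b) for b in batch_per_block(blocks, 4)]
exact_batches = list(batch_exact(blocks, 4))
exact_sizes = [len(b) for b in exact_batches]

print(per_block_sizes)  # [4, 1, 3, 2]
print(exact_sizes)      # [4, 4, 2]
```

In the per-block model a `collate_fn` would receive batches of 1, 3, and 2 rows in the middle of the stream. With exact batching, only the final batch is short, which is the guarantee this issue asks for.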

### Use case

* Collation can be CPU-heavy; distributing it across Ray workers greatly improves scalability.
* Allows moving logic from the less scalable iterator layer into the Ray Data pipeline.

[Data] Support Exact Batch Size Enforcement in map_batches to Enable Collate Functions in Ray Data Pipeline #58837
