Closed
Labels
data, enhancement, performance, triage, usability
Description
In Ray Data, collate_fn currently has to run on the iterator (consumer) side, because map_batches does not guarantee that each batch passed to the function contains exactly batch_size rows. As a result, users cannot reliably perform collation inside the Ray Data pipeline itself.
Because of this limitation, all collation logic must be handled on the iterator side, which reduces scalability and prevents fully utilizing Ray Data’s distributed data processing capabilities.
Desired Behavior
Add a way for map_batches to guarantee exact batch sizes, i.e.:
• Each batch passed to the function contains exactly batch_size rows
• Only the final batch may be smaller than batch_size
With this guarantee, the collate_fn can be moved fully into the Ray Data pipeline.
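To make the requested semantics concrete, here is a minimal pure-Python sketch of the re-batching behavior described above. It is not Ray Data code and does not reflect how Ray Data is implemented; `rebatch_exact` is a hypothetical helper and the uneven input blocks are made up for illustration.

```python
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def rebatch_exact(blocks: Iterable[List[T]], batch_size: int) -> Iterator[List[T]]:
    """Yield batches of exactly batch_size rows; only the final batch may be smaller.

    Hypothetical helper for illustration only, not part of Ray Data.
    """
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    buffer: List[T] = []
    for block in blocks:
        buffer.extend(block)
        # Emit as many full batches as the buffer currently holds.
        while len(buffer) >= batch_size:
            yield buffer[:batch_size]
            buffer = buffer[batch_size:]
    # Whatever remains after the last block forms the (possibly short) final batch.
    if buffer:
        yield buffer


# Made-up upstream blocks of uneven size (3, 5, and 2 rows).
blocks = [[0, 1, 2], [3, 4, 5, 6, 7], [8, 9]]
batches = list(rebatch_exact(blocks, batch_size=4))
sizes = [len(b) for b in batches]
print(sizes)  # [4, 4, 2]
```

With batches shaped like this, a collate function that assumes a fixed batch dimension would only need to handle one short batch at the very end.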
Use case
- Collation can be CPU-heavy; distributing it across Ray workers greatly improves scalability.
- Allows moving logic from the less scalable iterator layer into the Ray Data pipeline.