Conversation

@xinyuangui2
Contributor

@xinyuangui2 xinyuangui2 commented Nov 26, 2025

Add instructions on using

  • ds.repartition(target_num_rows=batch_size).map_batches(collate_fn, batch_size=batch_size)
  • ds.map_batches(collate_fn, batch_size=batch_size).repartition(target_num_rows=batch_size)

to scale out the collate function inside Ray Data.
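
A minimal sketch of the pattern (placeholder collate_fn and batch_size; the repartition keyword follows the calls as written above and may differ across Ray versions):

import ray

batch_size = 32

def collate_fn(batch):
    # Placeholder batch-level preprocessing, e.g. tokenization or padding.
    return batch

ds = ray.data.range(1000)

# Align block row counts with the training batch size, then run the collate
# function inside Ray Data so it scales out across the cluster.
ds = ds.repartition(target_num_rows=batch_size).map_batches(
    collate_fn, batch_size=batch_size
)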

Docs for #58837

xinyuangui2 and others added 7 commits November 17, 2025 16:47
The GIL makes checking `self._serialize_cache is not None` atomic, so we don't need a lock.

Signed-off-by: Xinyuan <43737116+xinyuangui2@users.noreply.github.com>
Signed-off-by: xgui <xgui@anyscale.com>
Signed-off-by: xgui <xgui@anyscale.com>
@xinyuangui2 xinyuangui2 requested review from a team as code owners November 26, 2025 03:14
Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request adds a new user guide on an important performance optimization: moving the collate_fn from Ray Train workers to Ray Data. The documentation is comprehensive and well-structured, with a clear explanation of the problem, solution, and a complete runnable example.

I've identified a few areas for improvement in the provided code examples:

  • A recurring typo in a variable name.
  • An inefficient and likely incorrect tensor deserialization method in the utility class.
  • An overly complex function for mock data generation that could be simplified for better readability.

These changes will improve the clarity and correctness of the example code for users.

@ray-gardener ray-gardener bot added the docs (An issue or change related to documentation), train (Ray Train Related Issue), and data (Ray Data-related issues) labels Nov 26, 2025
xinyuangui2 and others added 2 commits November 26, 2025 10:08
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Signed-off-by: Xinyuan <43737116+xinyuangui2@users.noreply.github.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Signed-off-by: Xinyuan <43737116+xinyuangui2@users.noreply.github.com>
xinyuangui2 and others added 5 commits November 26, 2025 12:12
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Signed-off-by: Xinyuan <43737116+xinyuangui2@users.noreply.github.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Signed-off-by: Xinyuan <43737116+xinyuangui2@users.noreply.github.com>
Signed-off-by: xgui <xgui@anyscale.com>
Signed-off-by: xgui <xgui@anyscale.com>
Signed-off-by: xgui <xgui@anyscale.com>
@xinyuangui2 xinyuangui2 changed the title [Train] [Data] Collate fn doc [Train] [Data] Collate_fn_to_ray_data doc Nov 26, 2025
user-guides/fault-tolerance
user-guides/monitor-your-application
user-guides/reproducibility
user-guides/move-collate-to-data
Contributor


this should probably go into the training ingest section?

https://docs.ray.io/en/latest/train/user-guides/data-loading-preprocessing.html

Contributor Author


@xinyuangui2 xinyuangui2 changed the title [Train] [Data] Collate_fn_to_ray_data doc [Train] [Data][Doc] Collate_fn_to_ray_data doc Nov 26, 2025
Comment on lines 179 to 211
.. testcode::
    :skipif: True

    from dataclasses import dataclass
    from typing import Dict, List, Tuple, Union
    import torch
    from ray import cloudpickle as pickle
    import pyarrow as pa

    # (dtype, shape, offset)
    FEATURE_TYPE = Tuple[torch.dtype, torch.Size, int]
    TORCH_BYTE_ELEMENT_TYPE = torch.uint8

    def _create_binary_array_from_buffer(buffer: bytes) -> pa.BinaryArray:
        """Zero-copy create a binary array from a buffer."""
        data_buffer = pa.py_buffer(buffer)
        return pa.Array.from_buffers(
            pa.binary(),
            1,
            [
                None,
                pa.array([0, data_buffer.size], type=pa.int32()).buffers()[1],
                data_buffer,
            ],
        )

    @dataclass
    class _Metadata:
        features: Dict[str, List[FEATURE_TYPE]]
        total_buffer_size: int

    @dataclass
    class _TensorBatch:
Contributor


we don't plan to provide these out of the box?

Contributor Author

@xinyuangui2 xinyuangui2 Dec 3, 2025


I moved it to the advanced section, as users might have their own way of doing this.

xinyuangui2 and others added 2 commits December 3, 2025 10:10
@xinyuangui2 xinyuangui2 changed the title [Train] [Data][Doc] Collate_fn_to_ray_data doc [Train] [Data][Doc] Scaling out expensive collation functions doc Dec 3, 2025
Signed-off-by: xgui <xgui@anyscale.com>
Comment on lines 199 to 212
class CollateFnRayData(ArrowBatchCollateFn):
    def __init__(self):
        self.tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

    def __call__(self, batch: pa.Table) -> Dict[str, np.ndarray]:
        results = self.tokenizer(
            batch["text"].to_pylist(),
            truncation=True,
            padding="longest",
            return_tensors="np",
        )
        results["labels"] = np.array(batch["label"])
        return results

Contributor


asking a couple of questions while I rewrite this: do you have to inherit from ArrowBatchCollateFn? What does it do?

Contributor Author


This tells the iterator that this function receives a pyarrow.Table as input.

@DeveloperAPI
class ArrowBatchCollateFn(CollateFn["pyarrow.Table"]):
    """Collate function that takes pyarrow.Table as the input batch type.

    Arrow tables with chunked arrays can be efficiently transferred to GPUs without
    combining the chunks with the `arrow_batch_to_tensors` utility function.
    See `DefaultCollateFn` for example.
    """

Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, but how is the iterator aware of this if you move it into the map_batches operator?
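
For context, a hedged sketch (not from the PR) of how a subclass like CollateFnRayData could be wired into map_batches; the batch_format, concurrency, and helper dataset below are assumptions for illustration:

batch_size = 32

ds = create_mock_ray_text_dataset()  # helper defined later in this guide
ds = ds.repartition(target_num_rows=batch_size).map_batches(
    CollateFnRayData,        # a class: Ray Data runs it as long-lived actors
    batch_size=batch_size,
    batch_format="pyarrow",  # deliver each batch as a pyarrow.Table (assumption)
    concurrency=4,           # number of actor workers; an assumption
)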

Comment on lines 36 to 94

import random
import string
import ray

def random_text(length: int) -> str:
    """Generate random text of specified length."""
    if length <= 0:
        return ""

    if length <= 3:
        return "".join(random.choices(string.ascii_lowercase, k=length))

    words = []
    current_length = 0

    while current_length < length:
        remaining = length - current_length

        if remaining <= 4:
            word_length = remaining
            word = "".join(random.choices(string.ascii_lowercase, k=word_length))
            words.append(word)
            break
        else:
            max_word_length = min(10, remaining - 1)
            if max_word_length >= 3:
                word_length = random.randint(3, max_word_length)
            else:
                word_length = remaining
            word = "".join(random.choices(string.ascii_lowercase, k=word_length))
            words.append(word)
            current_length += len(word) + 1

    text = " ".join(words)
    return text[:length]

def random_label() -> int:
    """Pick a random label."""
    labels = [0, 1, 2, 3, 4, 5, 6, 7]
    return random.choice(labels)

def create_mock_ray_text_dataset(dataset_size: int = 96, min_len: int = 5, max_len: int = 100):
    """Create a mock Ray dataset with random text and labels."""
    numbers = random.choices(range(min_len, max_len + 1), k=dataset_size)
    ray_dataset = ray.data.from_items(numbers)

    def map_to_text_and_label(item):
        length = item['item']
        text = random_text(length)
        label = random_label()
        return {
            "length": length,
            "text": text,
            "label": label,
        }

    text_dataset = ray_dataset.map(map_to_text_and_label)
    return text_dataset
Contributor


can we just hide this as a utility that the users can look at, instead of displaying it in the docs?
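
If it's moved into doc_code, the user-facing snippet could shrink to something like this sketch (module name hypothetical):

# Hypothetical doc_code module holding the mock-dataset helper shown above.
from mock_text_dataset_util import create_mock_ray_text_dataset

ds = create_mock_ray_text_dataset(dataset_size=96, min_len=5, max_len=100)
print(ds.take(1))  # one row: {"length": ..., "text": ..., "label": ...}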

Comment on lines 305 to 337
.. testcode::
    :skipif: True

    from dataclasses import dataclass
    from typing import Dict, List, Tuple, Union
    import torch
    from ray import cloudpickle as pickle
    import pyarrow as pa

    # (dtype, shape, offset)
    FEATURE_TYPE = Tuple[torch.dtype, torch.Size, int]
    TORCH_BYTE_ELEMENT_TYPE = torch.uint8

    def _create_binary_array_from_buffer(buffer: bytes) -> pa.BinaryArray:
        """Zero-copy create a binary array from a buffer."""
        data_buffer = pa.py_buffer(buffer)
        return pa.Array.from_buffers(
            pa.binary(),
            1,
            [
                None,
                pa.array([0, data_buffer.size], type=pa.int32()).buffers()[1],
                data_buffer,
            ],
        )

    @dataclass
    class _Metadata:
        features: Dict[str, List[FEATURE_TYPE]]
        total_buffer_size: int

    @dataclass
    class _TensorBatch:
Contributor


this entire section, can we just hide this as a utility that the users can look at, instead of displaying it in the docs? like just link to it

Contributor


you should be able to put it in doc_code

Contributor Author


OK, I've hidden it now.

richardliaw and others added 8 commits December 8, 2025 18:47
Signed-off-by: Richard Liaw <rliaw@berkeley.edu>
Signed-off-by: xgui <xgui@anyscale.com>
Signed-off-by: xgui <xgui@anyscale.com>
…ate-fn-doc

Signed-off-by: Richard Liaw <rliaw@berkeley.edu>
Signed-off-by: Richard Liaw <rliaw@berkeley.edu>
Signed-off-by: Richard Liaw <rliaw@berkeley.edu>
Signed-off-by: Richard Liaw <rliaw@berkeley.edu>
Signed-off-by: xgui <xgui@anyscale.com>
@richardliaw
Contributor

Two last things to do (for rliaw):

  1. There are some bullets that don't render properly, for example: "Create a custom collate function that runs in Ray Data and use ray.data.Dataset.map_batches() to scale it out. 3. Use ray.data.Dataset.repartition() to ensure the batch size alignment."
  2. I want to better integrate this into the training ingest user guide.

Signed-off-by: Richard Liaw <rliaw@berkeley.edu>
Signed-off-by: Richard Liaw <rliaw@berkeley.edu>
@richardliaw richardliaw merged commit 4e4fd20 into ray-project:master Dec 11, 2025
6 checks passed

Labels

  • data: Ray Data-related issues
  • docs: An issue or change related to documentation
  • go: add ONLY when ready to merge, run all tests
  • train: Ray Train Related Issue

2 participants