Conversation

@xinyuangui2
Contributor

@xinyuangui2 xinyuangui2 commented Nov 26, 2025

Add instructions on using

  • ds.repartition(target_num_rows=batch_size).map_batches(collate_fn, batch_size=batch_size)
  • ds.map_batches(collate_fn, batch_size=batch_size).repartition(target_num_rows=batch_size)

to scale out the collate function inside Ray Data.
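
A minimal sketch of the pattern (placeholder collate_fn and batch_size; the repartition keyword follows the calls as written above and may differ across Ray versions):

import ray

batch_size = 32

def collate_fn(batch):
    # Placeholder batch-level preprocessing, e.g. tokenization or padding.
    return batch

ds = ray.data.range(1000)

# Align block row counts with the training batch size, then run the collate
# function inside Ray Data so it scales out across the cluster.
ds = ds.repartition(target_num_rows=batch_size).map_batches(
    collate_fn, batch_size=batch_size
)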

Docs for #58837

xinyuangui2 and others added 7 commits November 17, 2025 16:47
The GIL makes checking `self._serialize_cache is not None` atomic, so we don't need a lock.

Signed-off-by: Xinyuan <43737116+xinyuangui2@users.noreply.github.com>
Signed-off-by: xgui <xgui@anyscale.com>
Signed-off-by: xgui <xgui@anyscale.com>
@xinyuangui2 xinyuangui2 requested review from a team as code owners November 26, 2025 03:14
Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request adds a new user guide on an important performance optimization: moving the collate_fn from Ray Train workers to Ray Data. The documentation is comprehensive and well-structured, with a clear explanation of the problem, solution, and a complete runnable example.

I've identified a few areas for improvement in the provided code examples:

  • A recurring typo in a variable name.
  • An inefficient and likely incorrect tensor deserialization method in the utility class.
  • An overly complex function for mock data generation that could be simplified for better readability.

These changes will improve the clarity and correctness of the example code for users.

@ray-gardener ray-gardener bot added the docs (An issue or change related to documentation), train (Ray Train Related Issue), and data (Ray Data-related issues) labels Nov 26, 2025
xinyuangui2 and others added 2 commits November 26, 2025 10:08
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Signed-off-by: Xinyuan <43737116+xinyuangui2@users.noreply.github.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Signed-off-by: Xinyuan <43737116+xinyuangui2@users.noreply.github.com>
xinyuangui2 and others added 5 commits November 26, 2025 12:12
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Signed-off-by: Xinyuan <43737116+xinyuangui2@users.noreply.github.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Signed-off-by: Xinyuan <43737116+xinyuangui2@users.noreply.github.com>
Signed-off-by: xgui <xgui@anyscale.com>
Signed-off-by: xgui <xgui@anyscale.com>
Signed-off-by: xgui <xgui@anyscale.com>
@xinyuangui2 xinyuangui2 changed the title [Train] [Data] Collate fn doc [Train] [Data] Collate_fn_to_ray_data doc Nov 26, 2025
user-guides/fault-tolerance
user-guides/monitor-your-application
user-guides/reproducibility
user-guides/move-collate-to-data
Contributor


this should probably go into the training ingest section?

https://docs.ray.io/en/latest/train/user-guides/data-loading-preprocessing.html

Contributor Author


@xinyuangui2 xinyuangui2 changed the title [Train] [Data] Collate_fn_to_ray_data doc [Train] [Data][Doc] Collate_fn_to_ray_data doc Nov 26, 2025
Comment on lines 179 to 211
.. testcode::
    :skipif: True

    from dataclasses import dataclass
    from typing import Dict, List, Tuple, Union
    import torch
    from ray import cloudpickle as pickle
    import pyarrow as pa

    # (dtype, shape, offset)
    FEATURE_TYPE = Tuple[torch.dtype, torch.Size, int]
    TORCH_BYTE_ELEMENT_TYPE = torch.uint8

    def _create_binary_array_from_buffer(buffer: bytes) -> pa.BinaryArray:
        """Zero-copy create a binary array from a buffer."""
        data_buffer = pa.py_buffer(buffer)
        return pa.Array.from_buffers(
            pa.binary(),
            1,
            [
                None,
                pa.array([0, data_buffer.size], type=pa.int32()).buffers()[1],
                data_buffer,
            ],
        )

    @dataclass
    class _Metadata:
        features: Dict[str, List[FEATURE_TYPE]]
        total_buffer_size: int

    @dataclass
    class _TensorBatch:
Contributor


we don't plan to provide these out of the box?

Contributor Author

@xinyuangui2 xinyuangui2 Dec 3, 2025


I moved it to the advanced section, as users might have their own way of doing this.

xinyuangui2 and others added 2 commits December 3, 2025 10:10
@xinyuangui2 xinyuangui2 changed the title [Train] [Data][Doc] Collate_fn_to_ray_data doc [Train] [Data][Doc] Scaling out expensive collation functions doc Dec 3, 2025
Signed-off-by: xgui <xgui@anyscale.com>
Comment on lines 199 to 212
class CollateFnRayData(ArrowBatchCollateFn):
    def __init__(self):
        self.tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

    def __call__(self, batch: pa.Table) -> Dict[str, np.ndarray]:
        results = self.tokenizer(
            batch["text"].to_pylist(),
            truncation=True,
            padding="longest",
            return_tensors="np",
        )
        results["labels"] = np.array(batch["label"])
        return results

Contributor


asking a couple of questions while I rewrite this: do you have to inherit from ArrowBatchCollateFn? What does it do?

Contributor Author


This tells the iterator that this function receives a pyarrow.Table as input.

@DeveloperAPI
class ArrowBatchCollateFn(CollateFn["pyarrow.Table"]):
    """Collate function that takes pyarrow.Table as the input batch type.

    Arrow tables with chunked arrays can be efficiently transferred to GPUs without
    combining the chunks with the `arrow_batch_to_tensors` utility function.
    See `DefaultCollateFn` for example.
    """

Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, but how is the iterator aware of this if you move it into the map_batches operator?
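
For context, a hedged sketch (not from the PR) of how a subclass like CollateFnRayData could be wired into map_batches; the batch_format, concurrency, and helper dataset below are assumptions for illustration:

batch_size = 32

ds = create_mock_ray_text_dataset()  # helper defined later in this guide
ds = ds.repartition(target_num_rows=batch_size).map_batches(
    CollateFnRayData,        # a class: Ray Data runs it as long-lived actors
    batch_size=batch_size,
    batch_format="pyarrow",  # deliver each batch as a pyarrow.Table (assumption)
    concurrency=4,           # number of actor workers; an assumption
)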

Comment on lines 36 to 94

import random
import string
import ray

def random_text(length: int) -> str:
    """Generate random text of specified length."""
    if length <= 0:
        return ""

    if length <= 3:
        return "".join(random.choices(string.ascii_lowercase, k=length))

    words = []
    current_length = 0

    while current_length < length:
        remaining = length - current_length

        if remaining <= 4:
            word_length = remaining
            word = "".join(random.choices(string.ascii_lowercase, k=word_length))
            words.append(word)
            break
        else:
            max_word_length = min(10, remaining - 1)
            if max_word_length >= 3:
                word_length = random.randint(3, max_word_length)
            else:
                word_length = remaining
            word = "".join(random.choices(string.ascii_lowercase, k=word_length))
            words.append(word)
            current_length += len(word) + 1

    text = " ".join(words)
    return text[:length]

def random_label() -> int:
    """Pick a random label."""
    labels = [0, 1, 2, 3, 4, 5, 6, 7]
    return random.choice(labels)

def create_mock_ray_text_dataset(dataset_size: int = 96, min_len: int = 5, max_len: int = 100):
    """Create a mock Ray dataset with random text and labels."""
    numbers = random.choices(range(min_len, max_len + 1), k=dataset_size)
    ray_dataset = ray.data.from_items(numbers)

    def map_to_text_and_label(item):
        length = item['item']
        text = random_text(length)
        label = random_label()
        return {
            "length": length,
            "text": text,
            "label": label,
        }

    text_dataset = ray_dataset.map(map_to_text_and_label)
    return text_dataset
Contributor


can we just hide this as a utility that the users can look at, instead of displaying it in the docs?
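
If it's moved into doc_code, the user-facing snippet could shrink to something like this sketch (module name hypothetical):

# Hypothetical doc_code module holding the mock-dataset helper shown above.
from mock_text_dataset_util import create_mock_ray_text_dataset

ds = create_mock_ray_text_dataset(dataset_size=96, min_len=5, max_len=100)
print(ds.take(1))  # one row: {"length": ..., "text": ..., "label": ...}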

Comment on lines 305 to 337
.. testcode::
    :skipif: True

    from dataclasses import dataclass
    from typing import Dict, List, Tuple, Union
    import torch
    from ray import cloudpickle as pickle
    import pyarrow as pa

    # (dtype, shape, offset)
    FEATURE_TYPE = Tuple[torch.dtype, torch.Size, int]
    TORCH_BYTE_ELEMENT_TYPE = torch.uint8

    def _create_binary_array_from_buffer(buffer: bytes) -> pa.BinaryArray:
        """Zero-copy create a binary array from a buffer."""
        data_buffer = pa.py_buffer(buffer)
        return pa.Array.from_buffers(
            pa.binary(),
            1,
            [
                None,
                pa.array([0, data_buffer.size], type=pa.int32()).buffers()[1],
                data_buffer,
            ],
        )

    @dataclass
    class _Metadata:
        features: Dict[str, List[FEATURE_TYPE]]
        total_buffer_size: int

    @dataclass
    class _TensorBatch:
Contributor


this entire section, can we just hide this as a utility that the users can look at, instead of displaying it in the docs? like just link to it

Contributor


you should be able to put it in doc_code

Contributor Author


OK, I've hidden it now.

richardliaw and others added 8 commits December 8, 2025 18:47
Signed-off-by: Richard Liaw <rliaw@berkeley.edu>
Signed-off-by: xgui <xgui@anyscale.com>
Signed-off-by: xgui <xgui@anyscale.com>
…ate-fn-doc

Signed-off-by: Richard Liaw <rliaw@berkeley.edu>
Signed-off-by: Richard Liaw <rliaw@berkeley.edu>
Signed-off-by: Richard Liaw <rliaw@berkeley.edu>
Signed-off-by: Richard Liaw <rliaw@berkeley.edu>
Signed-off-by: xgui <xgui@anyscale.com>
@richardliaw
Contributor

Two last things to do (for rliaw):

  1. There are some bullets that don't render properly, for example: "Create a custom collate function that runs in Ray Data and use ray.data.Dataset.map_batches() to scale it out. 3. Use ray.data.Dataset.repartition() to ensure the batch size alignment."
  2. I want to better integrate this into the training ingest user guide.

Signed-off-by: Richard Liaw <rliaw@berkeley.edu>
Signed-off-by: Richard Liaw <rliaw@berkeley.edu>
@richardliaw richardliaw merged commit 4e4fd20 into ray-project:master Dec 11, 2025
6 checks passed

Labels

  • data: Ray Data-related issues
  • docs: An issue or change related to documentation
  • go: add ONLY when ready to merge, run all tests
  • train: Ray Train Related Issue

2 participants