[Data] Add configurable batching for resolve_block_refs to speed up iter_batches #58467
Conversation
Signed-off-by: YoussefEssDS <oyoussefesseddiq@gmail.com>
Code Review
This pull request introduces batching for resolve_block_refs to improve the performance of iter_batches by reducing the number of ray.get() calls. The batch size is made configurable through a new DataContext setting. The implementation is sound and includes a good test case to verify the batching behavior. I have one suggestion to improve code conciseness by using yield from.
```python
for block_ref in block_ref_iter:
    pending.append(block_ref)
    if len(pending) >= batch_size:
        for block in _resolve_pending():
            yield block

for block in _resolve_pending():
    yield block
```
The logic for yielding blocks from _resolve_pending is duplicated. You can simplify this by using yield from to make the code more concise and avoid repetition.
Suggested change:

```diff
 for block_ref in block_ref_iter:
     pending.append(block_ref)
     if len(pending) >= batch_size:
-        for block in _resolve_pending():
-            yield block
-for block in _resolve_pending():
-    yield block
+        yield from _resolve_pending()
+yield from _resolve_pending()
```
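The two variants are behaviorally identical. A minimal standalone sketch of the batched-resolution pattern, with a stub in place of the real `_resolve_pending` (which would issue a single `ray.get()` per batch):

```python
from typing import Iterator, List


def resolve_in_batches(refs: Iterator[int], batch_size: int) -> Iterator[str]:
    """Illustrative stand-in: buffers refs and resolves each batch at once.

    In the real code, the buffered batch would be resolved with a single
    ray.get() call; here we just format the values so the sketch runs
    without Ray installed.
    """
    pending: List[int] = []

    def _resolve_pending() -> Iterator[str]:
        # One "bulk fetch" per batch, then drain the buffer.
        resolved = [f"block-{r}" for r in pending]
        pending.clear()
        yield from resolved

    for ref in refs:
        pending.append(ref)
        if len(pending) >= batch_size:
            yield from _resolve_pending()
    # Flush any leftover refs smaller than a full batch.
    yield from _resolve_pending()


print(list(resolve_in_batches(iter(range(5)), batch_size=2)))
# -> ['block-0', 'block-1', 'block-2', 'block-3', 'block-4']
```

Note the trailing `yield from` after the loop: without it, any refs left in a partial final batch would be silently dropped.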
```diff
 if batch_size is None or current_window_size < num_rows_to_prefetch:
     try:
-        next_ref_bundle = get_next_ref_bundle()
+        next_ref_bundle = next(ref_bundles)
```
Bug: RefBundle Retrieval Observability Gap
The removal of the get_next_ref_bundle() helper function eliminates tracking of stats.iter_get_ref_bundles_s timing metrics. The direct calls to next(ref_bundles) at lines 371 and 384 no longer wrap the operation with the stats timer, causing loss of observability for RefBundle retrieval time which was previously tracked and reported in iteration statistics.
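One way to restore the metric is to keep a thin wrapper that times the `next()` call. The sketch below uses a hypothetical stats object; the real Ray Data iteration-stats structure differs, but the timing pattern is the same:

```python
import time
from contextlib import contextmanager
from typing import Iterator


class IterStats:
    """Hypothetical stand-in for the iteration stats object."""

    def __init__(self) -> None:
        self.iter_get_ref_bundles_s = 0.0

    @contextmanager
    def ref_bundle_timer(self):
        # Accumulate wall time spent retrieving RefBundles.
        start = time.perf_counter()
        try:
            yield
        finally:
            self.iter_get_ref_bundles_s += time.perf_counter() - start


def get_next_ref_bundle(ref_bundles: Iterator[str], stats: IterStats) -> str:
    # Re-wraps the raw next() so retrieval time stays observable.
    with stats.ref_bundle_timer():
        return next(ref_bundles)


stats = IterStats()
bundles = iter(["bundle-a", "bundle-b"])
print(get_next_ref_bundle(bundles, stats))  # bundle-a
print(stats.iter_get_ref_bundles_s >= 0.0)  # True
```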
Bug: Phantom Constant Breaks Imports
The constant DEFAULT_ACTOR_MAX_TASKS_IN_FLIGHT_TO_MAX_CONCURRENCY_FACTOR is removed but it's still imported and used in actor_pool_map_operator.py and test_operators.py. This will cause an ImportError when those modules try to import this constant from ray.data.context.
python/ray/data/context.py, lines 217 to 221 (commit 17d88de):

```python
)
# Enable per node metrics reporting for Ray Data, disabled by default.
DEFAULT_ENABLE_PER_NODE_METRICS = bool(
    int(os.environ.get("RAY_DATA_PER_NODE_METRICS", "0"))
)
```
srinathk10 left a comment:
@YoussefEssDS The motivation for the changes looks good. Please address the review comments.
Also, w.r.t. your micro-benchmark, please add the results as a comment here describing your test setup.
```python
self._eager_free = clear_block_after_read and ctx.eager_free
max_get_blocks_batch_size = max(1, (prefetch_batches or 0) + 1)
self._block_get_batch_size = min(
    ctx.iter_get_block_batch_size, max_get_blocks_batch_size
)
```
Bug: Overly Conservative Batching Limits Performance
The calculation of _block_get_batch_size overly restricts batching by limiting it to prefetch_batches + 1 blocks. With default settings (prefetch_batches=1, iter_get_block_batch_size=32), this results in batching only 2 blocks at a time instead of the configured 32, significantly reducing the performance benefit. The formula max(1, (prefetch_batches or 0) + 1) creates a cap that's too conservative since prefetch_batches measures batches (not blocks), and their relationship varies with block size. This causes the configured iter_get_block_batch_size to be silently overridden in most cases.
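To make the concern concrete, here is the quoted formula in isolation (a sketch of the arithmetic only, not the actual code path):

```python
from typing import Optional


def effective_batch_size(
    iter_get_block_batch_size: int, prefetch_batches: Optional[int]
) -> int:
    # Mirrors the formula quoted above: the cap is prefetch_batches + 1 blocks,
    # so the configured batch size is silently clamped.
    cap = max(1, (prefetch_batches or 0) + 1)
    return min(iter_get_block_batch_size, cap)


# Defaults (prefetch_batches=1, iter_get_block_batch_size=32):
# only 2 blocks end up in each ray.get() call.
print(effective_batch_size(32, 1))     # 2
print(effective_batch_size(32, 31))    # 32
print(effective_batch_size(32, None))  # 1
```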
Relaxing that cap breaks the backpressure tests, as it forces the materialization of more blocks than the configured prefetch size.
Hi @srinathk10, thanks for the review. I ran the micro-benchmark on a Ryzen 9 7950X / 64 GB RAM machine (Ubuntu 22.04, Python 3.12). Before the batching change: mean 3.82 s (p50 3.83 s, min 3.78 s, max 3.86 s) = 1.31 M rows/s over 4,883 batches. The net improvement is ~5% in end-to-end batch iteration throughput with prefetch set to 32. Both runs used the same script and dataset parameters on the same machine.
Train release tests: https://buildkite.com/ray-project/release/builds/67245
srinathk10 left a comment:
LGTM
```diff
-    clear_block_after_read and DataContext.get_current().eager_free
+ctx = DataContext.get_current()
+self._eager_free = clear_block_after_read and ctx.eager_free
+max_get_blocks_batch_size = max(1, (prefetch_batches or 0) + 1)
```
prefetch_batches is the number of batches to prefetch, not blocks.
The actual number of blocks to prefetch is calculated in BlockPrefetcher.
We can add a method to let it report the number of blocks being prefetched.
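A sketch of the suggested reporting hook. The class and method names here are hypothetical illustrations, not Ray's actual `BlockPrefetcher` API:

```python
from typing import List, Sequence


class SketchPrefetcher:
    """Toy prefetcher that tracks how many block refs are in flight."""

    def __init__(self) -> None:
        self._in_flight: List[str] = []

    def prefetch_blocks(self, block_refs: Sequence[str]) -> None:
        # In the real prefetcher this would kick off async fetches.
        self._in_flight = list(block_refs)

    def num_blocks_prefetched(self) -> int:
        # The reporting method proposed in the review: lets callers size
        # their ray.get() batches from the actual number of prefetched
        # blocks, rather than inferring it from a batch count.
        return len(self._in_flight)


p = SketchPrefetcher()
p.prefetch_blocks(["ref-1", "ref-2", "ref-3"])
print(p.num_blocks_prefetched())  # 3
```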
```python
hits += current_hit
misses += current_miss
unknowns += current_unknown
ctx = ray.data.context.DataContext.get_current()
```
Pass in the correct context object; avoid using the global one.
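The distinction, sketched with a toy context class (the real `DataContext` carries many more fields and a process-wide singleton):

```python
class ToyContext:
    """Minimal stand-in for a per-dataset configuration object."""

    def __init__(self, batch_size: int) -> None:
        self.iter_get_block_batch_size = batch_size


# Process-wide default, analogous to DataContext.get_current().
_GLOBAL_CTX = ToyContext(32)


def batch_size_from_global() -> int:
    # Anti-pattern flagged above: silently reads global state, so a
    # dataset-level override is ignored.
    return _GLOBAL_CTX.iter_get_block_batch_size


def batch_size_from_ctx(ctx: ToyContext) -> int:
    # Preferred: the caller passes the dataset's own context explicitly.
    return ctx.iter_get_block_batch_size


per_dataset_ctx = ToyContext(8)
print(batch_size_from_global())            # 32 (override lost)
print(batch_size_from_ctx(per_dataset_ctx))  # 8
```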
Hi @raulchen, is this what you had in mind? Any further suggestions?
Hi @raulchen, PTAL. Thanks!
```python
block_ref_iter: Iterator[ObjectRef[Block]],
stats: Optional[DatasetStats] = None,
max_get_batch_size: Optional[Union[int, Callable[[], int]]] = None,
ctx: Optional["DataContext"] = None,
```
can we make this mandatory?
```python
self._eager_free = (
    clear_block_after_read and DataContext.get_current().eager_free
)
self._ctx = DataContext.get_current()
```
ideally this ctx should be passed from Dataset._context.
but since it's an existing issue, you can leave a TODO here if it requires a massive change.
@raulchen PTAL. Thanks!
Hi @raulchen, just bumping this. Can you check whether any further changes are needed? Thanks!
Hi @bveeramani, can we get this over the line? It's approved by the reviewers. Thanks!
@YoussefEssDS merged. Thank you for the contribution!
Description

This PR will:

- Batch block resolution in `resolve_block_refs()` so `iter_batches()` issues one `ray.get()` per chunk of block refs instead of per ref. The chunk size is configurable using the new `DataContext.iter_get_block_batch_size` knob.
- Add a test that proves that `resolve_block_refs()` actually batches the `ray.get()` calls.

Related issues

Raised by @amogkam in `python/ray/data/_internal/block_batching/util.py`

Additional information

Simple benchmark available: https://gist.github.com/YoussefEssDS/40de959a42a19334b8dac8bd217c319b

Signed-off-by: YoussefEssDS <oyoussefesseddiq@gmail.com>