[Data] Revisiting `OpResourceAllocator` to make data flow explicit #57788

alexeykudinkin · 2025-10-16T06:22:20Z

Description

This change primarily reworks the `OpResourceAllocator` APIs to make data flow explicit: the inputs each method requires are now exposed as parameters of the API.

Additionally:

1. Abstracting common methods inside the `OpResourceAllocator` base class.
2. Adding allocation to the progress bar in verbose mode, logging budgets & allocations.
3. Adding the byte-size of all enqueued blocks to the progress bar.
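For readers skimming the description, here is a minimal sketch of what "making data flow explicit" could look like. It is not the actual Ray code: `Usage`, `ImplicitAllocator`, `ExplicitAllocator`, `FakeManager`, and `TOTAL_CPU` are hypothetical names; only the keyword `task_resource_usage` and the attribute `_op_usages` are borrowed from the diff in this PR.

```python
from dataclasses import dataclass
from typing import Dict

TOTAL_CPU = 8.0  # hypothetical cluster-wide CPU limit


@dataclass(frozen=True)
class Usage:
    """Hypothetical per-operator usage record (not Ray's real resource type)."""

    cpu: float


class ImplicitAllocator:
    """Before: reads its inputs from the owning manager's private state."""

    def __init__(self, manager):
        self._manager = manager

    def remaining_cpu(self) -> float:
        # Hidden dependency: the caller cannot tell which state is consulted.
        return TOTAL_CPU - sum(u.cpu for u in self._manager._op_usages.values())


class ExplicitAllocator:
    """After: everything the method needs is passed in as a parameter."""

    def remaining_cpu(self, *, task_resource_usage: Dict[str, Usage]) -> float:
        return TOTAL_CPU - sum(u.cpu for u in task_resource_usage.values())


class FakeManager:
    """Hypothetical stand-in for the resource manager."""

    def __init__(self):
        self._op_usages = {"read": Usage(cpu=1.0), "map": Usage(cpu=2.0)}


manager = FakeManager()
implicit = ImplicitAllocator(manager).remaining_cpu()
explicit = ExplicitAllocator().remaining_cpu(task_resource_usage=manager._op_usages)
```

Both variants compute the same value; the explicit one can be tested with a plain dict and no manager at all.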

Related issues

Types of change

Checklist

Does this PR introduce breaking changes?

- [ ] Yes ⚠️
- [ ] No

Testing:

- [ ] Added/updated tests for my changes
- [ ] Tested the changes manually
- [ ] This PR is not tested ❌ (please explain why)

Code Quality:

- [ ] Signed off every commit (git commit -s)
- [ ] Ran pre-commit hooks (setup guide)

Documentation:

- [ ] Updated documentation (if applicable) (contribution guide)
- [ ] Added new APIs to doc/source/ (if applicable)

Additional context

gemini-code-assist · 2025-10-16T06:22:25Z

Warning

You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!

python/ray/data/_internal/execution/backpressure_policy/resource_budget_backpressure_policy.py

bveeramani · 2025-10-20T17:56:59Z

python/ray/data/_internal/execution/operators/actor_pool_map_operator.py

        return self._actor_pool.get_actor_info()

+    def get_max_concurrency_limit(self) -> Optional[int]:
+        return self._actor_pool.max_size() * self._actor_pool.max_actor_concurrency()


Out of scope for this PR since this is an existing issue, but if self._actor_pool.max_size() is float("inf"), I think we'd probably want to return None rather than float("inf") for consistency with the return type

Good call

Looked through the code, and we need to clean this up holistically (since we define max_size as int)
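A minimal sketch of the suggestion above, assuming the pool size may be `float("inf")` for an unbounded pool. The free function `max_concurrency_limit` and its parameters are hypothetical and only mirror the shape of `get_max_concurrency_limit`; this is not the fix that was applied.

```python
import math
from typing import Optional, Union


def max_concurrency_limit(
    max_size: Union[int, float], max_actor_concurrency: int
) -> Optional[int]:
    """Return the concurrency cap, or None when the pool is unbounded."""
    # An infinite pool size has no meaningful integer cap, so report None
    # instead of leaking float("inf") through an Optional[int] return type.
    if math.isinf(max_size):
        return None
    return int(max_size) * max_actor_concurrency


bounded = max_concurrency_limit(3, 4)
unbounded = max_concurrency_limit(float("inf"), 4)
```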

bveeramani · 2025-10-20T17:58:18Z

python/ray/data/_internal/execution/interfaces/op_runtime_metrics.py

        5000.0,
    ]
-    task_completion_time: float = metric_field(
+    task_completion_time_s: float = metric_field(


Do we need to update test_stats.py and the dashboard code after renaming these metrics?

Yep, will do

bveeramani · 2025-10-20T18:06:41Z

python/ray/data/_internal/execution/streaming_executor_state.py

            else 0
        )

        return self._pending_dispatch_input_bundles_count() + internal_queue_size


If we change the internal queue size to represent blocks rather than bundles, then total_enqueued_input_bundles will return incorrect values, and DownstreamCapacityBackpressurePolicy will break.

I think even if we update total_enqueued_input_bundles to represent blocks, we'd still need to update the DownstreamCapacityBackpressurePolicy logic:

ray/python/ray/data/_internal/execution/backpressure_policy/downstream_capacity_backpressure_policy.py

Lines 76 to 79 in 3287523

    avg_inputs_per_task = (
        output_dependency.metrics.num_task_inputs_processed
        / max(output_dependency.metrics.num_tasks_finished, 1)
    )

Yeah, we need to fix that across the board
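To illustrate why the units matter here, a small sketch. Only the `avg_inputs_per_task` expression comes from the snippet quoted above; `estimated_pending_tasks`, the numbers, and the way the average is consumed are assumptions made for illustration, not the policy's actual logic.

```python
def avg_inputs_per_task(num_task_inputs_processed: int, num_tasks_finished: int) -> float:
    """Same expression as the quoted snippet, on plain numbers."""
    return num_task_inputs_processed / max(num_tasks_finished, 1)


def estimated_pending_tasks(queued_inputs: int, avg_inputs: float) -> float:
    """Hypothetical downstream estimate: queued inputs divided by inputs per task."""
    return queued_inputs / avg_inputs if avg_inputs > 0 else 0.0


# Assumed scenario: the metric is counted in unit X, 20 inputs over 10 tasks.
avg = avg_inputs_per_task(num_task_inputs_processed=20, num_tasks_finished=10)

# The queue holds 8 bundles, each with 4 blocks (so 32 blocks).
queued_bundles = 8
queued_blocks = queued_bundles * 4

same_unit = estimated_pending_tasks(queued_bundles, avg)
mixed_units = estimated_pending_tasks(queued_blocks, avg)
```

With these made-up numbers the estimate is off by the blocks-per-bundle factor as soon as the queue size and the per-task average are counted in different units.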

bveeramani · 2025-10-20T18:07:30Z

python/ray/data/_internal/execution/operators/base_physical_operator.py

-        """Returns Operator's internal queue size"""
+        """Returns Operator's internal queue size (in blocks)"""
+        ...


What are we hoping to achieve by changing the unit of internal_queue_size from bundles to blocks?

I just realized that we're assuming that every bundle holds just 1 block, which is not enforced
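A minimal sketch of the difference, assuming a bundle is simply a group of one or more blocks. `FakeBundle` is a hypothetical stand-in and not Ray's `RefBundle`.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class FakeBundle:
    """Hypothetical bundle: a list of block sizes in bytes."""

    block_sizes_bytes: List[int]

    def num_blocks(self) -> int:
        return len(self.block_sizes_bytes)

    def size_bytes(self) -> int:
        return sum(self.block_sizes_bytes)


# One single-block bundle and one three-block bundle.
queue = [FakeBundle([100]), FakeBundle([50, 50, 25])]

num_bundles = len(queue)
num_blocks = sum(b.num_blocks() for b in queue)
total_bytes = sum(b.size_bytes() for b in queue)
```

The bundle count and the block count agree only when every bundle holds exactly one block, which is the assumption called out above.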

python/ray/data/_internal/execution/streaming_executor_state.py

python/ray/data/_internal/execution/resource_manager.py

bveeramani · 2025-10-20T18:22:09Z

python/ray/data/_internal/execution/resource_manager.py

+    def __init__(self, topology: "Topology"):
+        self._topology = topology
+        self._idle_detector = self.IdleDetector()
+        self._ticker = 0


I know this is updated in update_budgets, but is it used anywhere else?

Are subclasses required to increment this? If so, I think this should be an explicit part of the interface

I missed cleaning this up
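For context on the reviewer's point: if a counter like this were kept, one way to make it an explicit part of the interface is to increment it in the base class and have subclasses override a hook. This is only a sketch of that alternative with hypothetical names (`BaseAllocator`, `_do_update_budgets`, `EvenSplitAllocator`); the author's reply indicates the field was leftover code to be removed.

```python
from abc import ABC, abstractmethod
from typing import Dict


class BaseAllocator(ABC):
    """Hypothetical base class that owns the tick counter."""

    def __init__(self) -> None:
        self._ticker = 0

    def update_budgets(self, limits: Dict[str, float]) -> Dict[str, float]:
        # The base class increments the counter, so subclasses cannot forget to.
        self._ticker += 1
        return self._do_update_budgets(limits)

    @abstractmethod
    def _do_update_budgets(self, limits: Dict[str, float]) -> Dict[str, float]:
        """Subclass hook: compute budgets from the given limits."""


class EvenSplitAllocator(BaseAllocator):
    """Hypothetical subclass: gives every operator half of its limit."""

    def _do_update_budgets(self, limits: Dict[str, float]) -> Dict[str, float]:
        return {op: limit / 2 for op, limit in limits.items()}


allocator = EvenSplitAllocator()
budgets = allocator.update_budgets({"map": 4.0})
allocator.update_budgets({"map": 4.0})
ticks = allocator._ticker
```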

bveeramani · 2025-10-20T18:30:24Z

python/ray/data/_internal/execution/resource_manager.py

+    @abstractmethod
+    def can_submit_new_task(self, op: PhysicalOperator) -> bool:
+        """Return whether the given operator can submit a new task."""
+        ...


What's the motivation for copying this from the backpressure policy interface to here? Would the implementation ever be non-trivial?

If the implementation of this method is always going to be like below, it might be better to remove the method to make the OpResourceAllocator interface deeper and simpler

def can_submit_new_task(self, op): return op.incremental_resource_usage().satisfies_limit(budget)

The idea here is that the logic for deciding whether a task can be scheduled should live with the resource allocator (it will be more complicated than the one you referred to above)
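A minimal sketch of the shape being discussed: the allocator owns the scheduling decision, and the trivial implementation is just the budget check from the comment above. `Resources`, `Allocator`, and `BudgetAllocator` are hypothetical stand-ins, and treating a missing budget as "unrestricted" is an assumption of this sketch.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class Resources:
    """Hypothetical resource vector."""

    cpu: float = 0.0
    object_store_memory: int = 0

    def satisfies_limit(self, limit: "Resources") -> bool:
        return (
            self.cpu <= limit.cpu
            and self.object_store_memory <= limit.object_store_memory
        )


class Allocator(ABC):
    @abstractmethod
    def can_submit_new_task(self, op_name: str, incremental_usage: Resources) -> bool:
        """Return whether the given operator can submit a new task."""


class BudgetAllocator(Allocator):
    def __init__(self, budgets: Dict[str, Resources]) -> None:
        self._budgets = budgets

    def can_submit_new_task(self, op_name: str, incremental_usage: Resources) -> bool:
        budget = self._budgets.get(op_name)
        if budget is None:
            # Assumption of this sketch: no budget means no restriction.
            return True
        return incremental_usage.satisfies_limit(budget)


allocator = BudgetAllocator({"map": Resources(cpu=2.0, object_store_memory=1024)})
fits = allocator.can_submit_new_task("map", Resources(cpu=1.0, object_store_memory=512))
too_big = allocator.can_submit_new_task("map", Resources(cpu=4.0))
```

Keeping the method on the allocator leaves room for the decision to grow beyond a single comparison without touching callers.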

bveeramani

Looks reasonable to me.

Let's merge #58030 first to minimize size of the diff, and then merge this one?

bveeramani · 2025-10-24T22:20:32Z

python/ray/tests/test_runtime_env_working_dir.py

    @ray.remote
    def test_import():
        import file_module
+


cursor · 2025-10-27T20:40:22Z

python/ray/data/_internal/execution/resource_manager.py

+            op,
+            task_resource_usage=self._op_usages,
+            output_object_store_usage=self._mem_op_outputs,
+        )


Bug: Inconsistent Return Types in Resource Management

Type mismatch bug: ResourceManager.max_task_output_bytes_to_read() declares return type as int but calls self._op_resource_allocator.max_task_output_bytes_to_read() which returns Optional[int]. The abstract method in OpResourceAllocator and its implementation in ReservationOpResourceAllocator can return None, but the wrapper method signature promises to always return int. This will cause runtime type errors when None is returned but an int is expected by callers.
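A small sketch of the mismatch described above and one way to keep the wrapper honest. `FakeAllocator`, `FakeResourceManager`, and `bytes_to_read` are hypothetical; only the method name `max_task_output_bytes_to_read` comes from the report, and treating `None` as "no limit" is an assumption of this sketch.

```python
from typing import Optional


class FakeAllocator:
    """Hypothetical allocator whose limit may be absent."""

    def __init__(self, limit: Optional[int]) -> None:
        self._limit = limit

    def max_task_output_bytes_to_read(self, op: str) -> Optional[int]:
        return self._limit


class FakeResourceManager:
    """Hypothetical wrapper that propagates Optional[int] instead of promising int."""

    def __init__(self, allocator: FakeAllocator) -> None:
        self._allocator = allocator

    def max_task_output_bytes_to_read(self, op: str) -> Optional[int]:
        return self._allocator.max_task_output_bytes_to_read(op)


def bytes_to_read(limit: Optional[int], available: int) -> int:
    """Caller-side handling: None is treated as 'no limit' in this sketch."""
    return available if limit is None else min(limit, available)


unlimited = FakeResourceManager(FakeAllocator(None)).max_task_output_bytes_to_read("map")
capped = FakeResourceManager(FakeAllocator(100)).max_task_output_bytes_to_read("map")
```

Note that Python does not enforce return annotations at runtime, so a mismatch like this surfaces only when a caller does arithmetic or comparisons on the unexpected `None`, or when a type checker flags it.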

alexeykudinkin requested a review from a team as a code owner October 16, 2025 06:22

alexeykudinkin added the go add ONLY when ready to merge, run all tests label Oct 16, 2025

ray-gardener bot added the data Ray Data-related issues label Oct 16, 2025

alexeykudinkin changed the title ~~[WIP][Data] Cleaning up OpResourceAllocator APIs~~ [Data] Cleaning up OpResourceAllocator APIs Oct 17, 2025

alexeykudinkin changed the title ~~[Data] Cleaning up OpResourceAllocator APIs~~ [Data] Revisiting OpResourceAllocator to make data flow explicit Oct 17, 2025

bveeramani reviewed Oct 20, 2025

alexeykudinkin force-pushed the ak/res-mngr-clup branch from e271ecb to e763d0c Compare October 22, 2025 23:15

alexeykudinkin force-pushed the ak/res-mngr-clup branch from 5b23ecb to 9d757b7 Compare October 23, 2025 05:02

alexeykudinkin requested a review from a team as a code owner October 23, 2025 05:02

alexeykudinkin changed the base branch from master to ak/bndl-blk-fix October 23, 2025 05:20

alexeykudinkin force-pushed the ak/bndl-blk-fix branch from bb87078 to ea982b3 Compare October 24, 2025 06:48

bveeramani approved these changes Oct 24, 2025

python/ray/tests/test_runtime_env_working_dir.py (outdated)

    @ray.remote
    def test_import():
        import file_module

bveeramani · Oct 24, 2025

Unrelated?

alexeykudinkin force-pushed the ak/bndl-blk-fix branch from 5379f9c to e8ab2c8 Compare October 27, 2025 05:05

alexeykudinkin force-pushed the ak/res-mngr-clup branch from b7ec22e to 7881742 Compare October 27, 2025 05:12

alexeykudinkin force-pushed the ak/bndl-blk-fix branch from bad2cd7 to 65e7295 Compare October 27, 2025 18:07

alexeykudinkin force-pushed the ak/res-mngr-clup branch from c42f2c3 to 0b1eb1b Compare October 27, 2025 18:38

alexeykudinkin force-pushed the ak/bndl-blk-fix branch from 65e7295 to 7165108 Compare October 27, 2025 19:14

alexeykudinkin force-pushed the ak/res-mngr-clup branch from 0b1eb1b to 6d9160a Compare October 27, 2025 19:15

alexeykudinkin deleted the branch ray-project:master October 27, 2025 20:31

alexeykudinkin closed this Oct 27, 2025

alexeykudinkin reopened this Oct 27, 2025

alexeykudinkin changed the base branch from ak/bndl-blk-fix to master October 27, 2025 20:37

alexeykudinkin added 3 commits October 27, 2025 13:38

Cleaning up OpResourceAllocator

cefafa9

Signed-off-by: Alexey Kudinkin <ak@anyscale.com> # Conflicts: # python/ray/data/_internal/execution/streaming_executor_state.py Signed-off-by: Alexey Kudinkin <ak@anyscale.com>

Tidying up

2160465

Signed-off-by: Alexey Kudinkin <ak@anyscale.com>

Updated refs

e579fdf

Signed-off-by: Alexey Kudinkin <ak@anyscale.com>

alexeykudinkin added 4 commits October 27, 2025 13:38

Fixed tests;

508411e

Signed-off-by: Alexey Kudinkin <ak@anyscale.com>

Fixed tests

7aa6b18

Signed-off-by: Alexey Kudinkin <ak@anyscale.com>

Fixing more tests

af3845a

Signed-off-by: Alexey Kudinkin <ak@anyscale.com>

lint

eeb01a7

Signed-off-by: Alexey Kudinkin <ak@anyscale.com> # Conflicts: # python/ray/data/tests/test_autoscaler.py Signed-off-by: Alexey Kudinkin <ak@anyscale.com>

alexeykudinkin force-pushed the ak/res-mngr-clup branch from 6d9160a to eeb01a7 Compare October 27, 2025 20:38

cursor bot reviewed Oct 27, 2025

alexeykudinkin merged commit 95b011f into ray-project:master Oct 27, 2025
6 checks passed
