-
Notifications
You must be signed in to change notification settings - Fork 7k
[Data] Revisiting OpResourceAllocator to make data flow explicit
#57788
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Data] Revisiting OpResourceAllocator to make data flow explicit
#57788
Conversation
|
Warning You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again! |
OpResourceAllocator APIsOpResourceAllocator APIs
OpResourceAllocator APIsOpResourceAllocator to make data flow explicit
python/ray/data/_internal/execution/backpressure_policy/resource_budget_backpressure_policy.py
Outdated
Show resolved
Hide resolved
| return self._actor_pool.get_actor_info() | ||
|
|
||
| def get_max_concurrency_limit(self) -> Optional[int]: | ||
| return self._actor_pool.max_size() * self._actor_pool.max_actor_concurrency() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Out of scope for this PR since this is an existing issue, but if self._actor_pool.max_size() is float("inf"), I think we'd probably want to return None rather than float("inf") for consistency with the return type
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good call
- Looked t/h the code and we need to holistically clean this up (since we define max_size as int)
| 5000.0, | ||
| ] | ||
| task_completion_time: float = metric_field( | ||
| task_completion_time_s: float = metric_field( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need to update test_stats.py and the dashboard code after renaming these metrics?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep, will do
| else 0 | ||
| ) | ||
|
|
||
| return self._pending_dispatch_input_bundles_count() + internal_queue_size |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we change the internal queue size to represent blocks rather than bundles, then total_enqueued_input_bundles will return incorrect values, and DownstreamCapacityBackpressurePolicy will break.
I think even if we update total_enqueued_input_bundles to represent blocks, we'd still need to update the DownstreamCapacityBackpressurePolicy logic:
Lines 76 to 79 in 3287523
| avg_inputs_per_task = ( | |
| output_dependency.metrics.num_task_inputs_processed | |
| / max(output_dependency.metrics.num_tasks_finished, 1) | |
| ) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, we need to fix that across the board
| """Returns Operator's internal queue size""" | ||
| """Returns Operator's internal queue size (in blocks)""" | ||
| ... |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What are we hoping to achieve by changing the unit of internal_queue_size from bundles to blocks?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I just realized that we're assuming that every bundle holds just 1 block, which is not enforced
python/ray/data/_internal/execution/streaming_executor_state.py
Outdated
Show resolved
Hide resolved
| def __init__(self, topology: "Topology"): | ||
| self._topology = topology | ||
| self._idle_detector = self.IdleDetector() | ||
| self._ticker = 0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I know this is updated in update_budgets, but is it used anywhere else?
Are subclasses required to increment this? If so, I think this should be an explicit part of the interface
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Missed to clean up
| @abstractmethod | ||
| def can_submit_new_task(self, op: PhysicalOperator) -> bool: | ||
| """Return whether the given operator can submit a new task.""" | ||
| ... |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's the motivation for copying this from the backpressure policy interface to here? Would the implementation ever be non-trivial?
If the implementation of this method is always going to be like below, it might be better to remove the method to make the OpResourceAllocator interface deeper and simpler
def can_submit_new_task(self, op):
return op.incremental_resource_usage().satisfies_limit(budget)There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Idea here is that the logic whether task can be scheduled should live w/ Resource Allocator (it will be more complicated than the one you referred above)
e271ecb to
e763d0c
Compare
5b23ecb to
9d757b7
Compare
bb87078 to
ea982b3
Compare
bveeramani
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks reasonable to me.
Let's merge #58030 first to minimize size of the diff, and then merge this one?
| @ray.remote | ||
| def test_import(): | ||
| import file_module | ||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unrelated?
5379f9c to
e8ab2c8
Compare
b7ec22e to
7881742
Compare
bad2cd7 to
65e7295
Compare
c42f2c3 to
0b1eb1b
Compare
65e7295 to
7165108
Compare
0b1eb1b to
6d9160a
Compare
Signed-off-by: Alexey Kudinkin <ak@anyscale.com> # Conflicts: # python/ray/data/_internal/execution/streaming_executor_state.py Signed-off-by: Alexey Kudinkin <ak@anyscale.com>
Signed-off-by: Alexey Kudinkin <ak@anyscale.com>
Signed-off-by: Alexey Kudinkin <ak@anyscale.com>
Signed-off-by: Alexey Kudinkin <ak@anyscale.com>
Signed-off-by: Alexey Kudinkin <ak@anyscale.com>
Signed-off-by: Alexey Kudinkin <ak@anyscale.com>
6d9160a to
eeb01a7
Compare
| op, | ||
| task_resource_usage=self._op_usages, | ||
| output_object_store_usage=self._mem_op_outputs, | ||
| ) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Bug: Inconsistent Return Types in Resource Management
Type mismatch bug: ResourceManager.max_task_output_bytes_to_read() declares return type as int but calls self._op_resource_allocator.max_task_output_bytes_to_read() which returns Optional[int]. The abstract method in OpResourceAllocator and its implementation in ReservationOpResourceAllocator can return None, but the wrapper method signature promises to always return int. This will cause runtime type errors when None is returned but an int is expected by callers.
…ay-project#57788) <!-- Thank you for contributing to Ray! 🚀 --> <!-- Please review https://github.com/ray-project/ray/blob/master/CONTRIBUTING.rst before opening a pull request. --> <!-- 💡 Tip: Mark as draft if you want early feedback, or ready for review when it's complete --> ## Description This change primarily converts `OpResourceAllocator` APIs to make data flow explicit by exposing required params in the APIs. Additionally: 1. Abstracting common methods inside `OpResourceAllocator` base-class. 2. Adding allocation to progress bar in verbose mode logging budgets & allocations. 3. Adding byte-size of all enqueued blocks to the progress bar ## Related issues <!-- Link related issues: "Fixes ray-project#1234", "Closes ray-project#1234", or "Related to ray-project#1234" --> ## Types of change - [ ] Bug fix 🐛 - [ ] New feature ✨ - [ ] Enhancement 🚀 - [ ] Code refactoring 🔧 - [ ] Documentation update 📖 - [ ] Chore 🧹 - [ ] Style 🎨 ## Checklist **Does this PR introduce breaking changes?** - [ ] Yes⚠️ - [ ] No <!-- If yes, describe what breaks and how users should migrate --> **Testing:** - [ ] Added/updated tests for my changes - [ ] Tested the changes manually - [ ] This PR is not tested ❌ _(please explain why)_ **Code Quality:** - [ ] Signed off every commit (`git commit -s`) - [ ] Ran pre-commit hooks ([setup guide](https://docs.ray.io/en/latest/ray-contribute/getting-involved.html#lint-and-formatting)) **Documentation:** - [ ] Updated documentation (if applicable) ([contribution guide](https://docs.ray.io/en/latest/ray-contribute/docs.html)) - [ ] Added new APIs to `doc/source/` (if applicable) ## Additional context <!-- Optional: Add screenshots, examples, performance impact, breaking change details --> --------- Signed-off-by: Alexey Kudinkin <ak@anyscale.com>
…ay-project#57788) <!-- Thank you for contributing to Ray! 🚀 --> <!-- Please review https://github.com/ray-project/ray/blob/master/CONTRIBUTING.rst before opening a pull request. --> <!-- 💡 Tip: Mark as draft if you want early feedback, or ready for review when it's complete --> ## Description This change primarily converts `OpResourceAllocator` APIs to make data flow explicit by exposing required params in the APIs. Additionally: 1. Abstracting common methods inside `OpResourceAllocator` base-class. 2. Adding allocation to progress bar in verbose mode logging budgets & allocations. 3. Adding byte-size of all enqueued blocks to the progress bar ## Related issues <!-- Link related issues: "Fixes ray-project#1234", "Closes ray-project#1234", or "Related to ray-project#1234" --> ## Types of change - [ ] Bug fix 🐛 - [ ] New feature ✨ - [ ] Enhancement 🚀 - [ ] Code refactoring 🔧 - [ ] Documentation update 📖 - [ ] Chore 🧹 - [ ] Style 🎨 ## Checklist **Does this PR introduce breaking changes?** - [ ] Yes⚠️ - [ ] No <!-- If yes, describe what breaks and how users should migrate --> **Testing:** - [ ] Added/updated tests for my changes - [ ] Tested the changes manually - [ ] This PR is not tested ❌ _(please explain why)_ **Code Quality:** - [ ] Signed off every commit (`git commit -s`) - [ ] Ran pre-commit hooks ([setup guide](https://docs.ray.io/en/latest/ray-contribute/getting-involved.html#lint-and-formatting)) **Documentation:** - [ ] Updated documentation (if applicable) ([contribution guide](https://docs.ray.io/en/latest/ray-contribute/docs.html)) - [ ] Added new APIs to `doc/source/` (if applicable) ## Additional context <!-- Optional: Add screenshots, examples, performance impact, breaking change details --> --------- Signed-off-by: Alexey Kudinkin <ak@anyscale.com> Signed-off-by: Aydin Abiar <aydin@anyscale.com>
Description
This change primarily converts
OpResourceAllocatorAPIs to make data flow explicit by exposing required params in the APIs.Additionally:
OpResourceAllocatorbase-class.Related issues
Types of change
Checklist
Does this PR introduce breaking changes?
Testing:
Code Quality:
git commit -s)Documentation:
doc/source/(if applicable)Additional context