[Data] ConcurrencyCapBackpressurePolicy - Handle internal output queue buildup #57996
Conversation
[Data] ConcurrencyCapBackpressurePolicy - Handle internal output queue buildup

Signed-off-by: Srinath Krishnamachari <srinath.krishnamachari@anyscale.com>
Code Review
This pull request introduces a sophisticated adaptive backpressure policy, ConcurrencyCapBackpressurePolicy, to dynamically manage operator concurrency based on queue pressure. This is a significant enhancement, moving from a static concurrency cap to a dynamic one using EWMA for queue level and deviation, an adaptive threshold, and a quantized step controller. The changes also include refactoring ResourceManager to centralize operator eligibility and downstream traversal logic, and adding a new context flag to enable this feature.
The new backpressure logic is complex but well-documented with extensive comments and examples. The accompanying unit tests are very thorough, which is excellent.
I've found one critical bug in the controller logic that makes one of the control branches unreachable, and another minor issue with what appears to be dead code in the threshold update logic. After addressing these points, this will be a solid improvement to Ray Data's execution engine.
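For readers unfamiliar with the control scheme described above, the sketch below illustrates the general idea — an EWMA of the queue level and of its deviation, an adaptive threshold derived from both, and a quantized step adjustment of the concurrency cap. It is a minimal, self-contained illustration only; the class name, parameters, and constants (`QueueBackpressureController`, `alpha`, `k`, `step`) are hypothetical and not the names or values used in this PR.

```python
# Illustrative sketch only; names and constants are not from the PR.
class QueueBackpressureController:
    def __init__(self, alpha=0.2, k=2.0, step=1, min_cap=1, max_cap=64):
        self.alpha = alpha            # EWMA smoothing factor
        self.k = k                    # deviation multiplier for the adaptive threshold
        self.step = step              # quantized cap adjustment per decision
        self.min_cap, self.max_cap = min_cap, max_cap
        self.ewma_level = 0.0         # smoothed queue size
        self.ewma_dev = 0.0           # smoothed absolute deviation from the level

    def update(self, queue_size: float, current_cap: int) -> int:
        # Update the EWMA of the queue level and of its deviation.
        self.ewma_level += self.alpha * (queue_size - self.ewma_level)
        self.ewma_dev += self.alpha * (abs(queue_size - self.ewma_level) - self.ewma_dev)

        # Adaptive threshold: tolerate the smoothed level plus k deviations.
        threshold = self.ewma_level + self.k * self.ewma_dev

        # Quantized step control: shrink the cap when the queue is above the
        # threshold; grow it back when the queue is comfortably below it.
        if queue_size > threshold:
            new_cap = current_cap - self.step
        elif queue_size < 0.5 * threshold:
            new_cap = current_cap + self.step
        else:
            new_cap = current_cap
        return max(self.min_cap, min(self.max_cap, new_cap))
```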
```python
return self.get_mem_op_internal(op) + self.get_op_outputs_usage_with_downstream(
    op
)
```
Instead, expose `_get_op_internal_object_store_usage`.
```python
        else:
            yield from self.get_downstream_eligible_ops(next_op)

    def get_op_outputs_usage_with_downstream(self, op: PhysicalOperator) -> float:
```
Suggested change:

```diff
-    def get_op_outputs_usage_with_downstream(self, op: PhysicalOperator) -> float:
+    def _get_op_outputs_object_store_usage_with_downstream(self, op: PhysicalOperator) -> float:
```
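Taken together with the earlier comment, the two suggestions point toward `ResourceManager` exposing the internal and output object store usages under separate, object-store-specific names and composing them at the call site. The sketch below is purely illustrative: the bookkeeping attributes (`_mem_op_internal`, `_mem_op_outputs`, `_downstream`) are hypothetical stand-ins for whatever state the PR actually maintains.

```python
from collections import defaultdict


class ResourceManager:
    """Illustrative only: attribute names are hypothetical, not the PR's code."""

    def __init__(self):
        self._mem_op_internal = defaultdict(float)  # op -> bytes held by the op's internal state
        self._mem_op_outputs = defaultdict(float)   # op -> bytes sitting in the op's output queue
        self._downstream = defaultdict(list)        # op -> eligible downstream ops

    def get_downstream_eligible_ops(self, op):
        yield from self._downstream[op]

    def _get_op_internal_object_store_usage(self, op) -> float:
        # Object store bytes held by the operator's own internal state.
        return self._mem_op_internal[op]

    def _get_op_outputs_object_store_usage_with_downstream(self, op) -> float:
        # Bytes in this operator's outputs plus those of eligible downstream operators.
        return self._mem_op_outputs[op] + sum(
            self._mem_op_outputs[d] for d in self.get_downstream_eligible_ops(op)
        )
```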
Signed-off-by: Srinath Krishnamachari <srinath.krishnamachari@anyscale.com>
Training Benchmark results
[Data] ConcurrencyCapBackpressurePolicy - Handle internal output queue buildup

Description

[Data] ConcurrencyCapBackpressurePolicy - Handle internal output queue buildup

Issue
- When there is internal output queue buildup, specifically when `preserve_order` is set, we don't limit task concurrency in the streaming executor and just honor the static concurrency cap.
- When the concurrency cap is unlimited, we keep queuing more Blocks into the internal output queue, leading to spill and a steep spill curve.

Solution
In ConcurrencyCapBackpressurePolicy, detect internal output queue buildup and then limit the concurrency of the tasks (an illustrative usage sketch follows below).
- Keep the internal output queue history and detect trends in percentage and size in GBs. Based on the trends, increase or decrease the concurrency cap.
- Given that queue-based buffering is needed for `preserve_order`, allow an adaptive queuing threshold. This still results in spill, but it flattens out the spill curve and avoids runaway buffering-queue growth.
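For context, the scenario this PR targets looks roughly like the following. The `preserve_order` setting below is the standard Ray Data execution option; the flag that enables the adaptive backpressure policy is introduced by this PR and its exact name is not reproduced here, so the commented-out flag is a hypothetical placeholder.

```python
import ray

ctx = ray.data.DataContext.get_current()
# preserve_order forces the executor to buffer out-of-order task outputs in the
# operator's internal output queue, which is the buildup this PR addresses.
ctx.execution_options.preserve_order = True

# Hypothetical placeholder: the actual context flag that enables the adaptive
# concurrency-cap backpressure policy is defined in this PR.
# ctx.enable_adaptive_concurrency_cap_backpressure = True

ds = ray.data.range(100_000).map(lambda row: {"value": row["id"] * 2})
ds.materialize()
```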
Related issues
Additional information