[Data] Mock hanging task check in test_hanging_detector_detects_issues to prevent flaky #58630
Conversation
Signed-off-by: machichima <nary12321@gmail.com>
Code Review
This pull request effectively addresses a flaky test by mocking the hanging task detection logic, which is a great improvement. The extraction of _is_task_hanging is a clean way to enable this mocking. I have two suggestions: one is to remove a leftover debug print statement, and the other is a minor style improvement in the test mock implementation for better clarity.
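As a rough illustration of why extracting a small predicate helps testing, consider this hypothetical sketch. It is not the actual Ray code — the class, method names, and threshold are all illustrative — but it shows the seam that an extracted helper like `_is_task_hanging` creates for mocking:

```python
from unittest.mock import patch


class HangingDetector:
    """Toy stand-in for the real detector, to show the mocking pattern."""

    def __init__(self, threshold_s=60.0):
        self._threshold_s = threshold_s

    def _is_task_hanging(self, elapsed_s):
        # Extracted predicate: a single seam that tests can patch.
        return elapsed_s > self._threshold_s

    def detect(self, elapsed_by_task):
        # Report every task whose elapsed time the predicate flags.
        return [t for t, e in elapsed_by_task.items() if self._is_task_hanging(e)]


detector = HangingDetector()
# Force the predicate to report hanging, regardless of real timings.
with patch.object(HangingDetector, "_is_task_hanging", return_value=True):
    print(detector.detect({1: 0.1, 2: 0.2}))  # -> [1, 2]
```

With the predicate patched, the test no longer depends on wall-clock timing, which is exactly what removes the flakiness.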
python/ray/data/_internal/issue_detection/detectors/hanging_detector.py
@owenowenisme @bveeramani PTAL. Thank you~
```diff
 for task_idx, state_value in op_state_values.items():
     curr_time = time.perf_counter() - state_value.start_time_hanging
-    if op_task_stats.count() > self._op_task_stats_min_count:
+    if op_task_stats.count() >= self._op_task_stats_min_count:
```
I think using `>=` here is more intuitive: when we see `min_count`, we're more likely to read the check as `count >= min_count`.
Yeah, good catch!
Force-pushed from 82ab8fc to af1039d.
```python
def mock_state_constructor(**kwargs):
    # Set start_time_hanging to 0 for the task with task_idx == 1 to make
    # time.perf_counter() - state_value.start_time_hanging large.
    if kwargs.get("task_idx") == 1:
        kwargs["start_time_hanging"] = 0.0
    # Call the real class with the modified kwargs.
    return RealHangingExecutionState(**kwargs)


with patch(
    "ray.data._internal.issue_detection.detectors.hanging_detector.HangingExecutionState"
) as mock_state_cls:
    mock_state_cls.side_effect = mock_state_constructor
```
It seems hacky to modify the internal `start_time_hanging` logic; let's just mock the detector and make this a unit test.
For making it into a unit test, should we move it into python/ray/data/tests/unit/?
And for mocking the detector, do you mean mocking the whole `HangingExecutionIssueDetector`? I thought we wanted to test whether `HangingExecutionIssueDetector.detect` works correctly here.
> For making it into a unit test, should we move it into python/ray/data/tests/unit/?

I think it's fine to leave the test here.

We just want to see that the issue is produced when the condition is met, so we don't need to actually execute the dataset.
This will deflake the test and make it deterministic.
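One common way to get the determinism discussed here is to inject the clock instead of sleeping or rewriting start times. The following is a purely illustrative sketch with invented names (not Ray's actual detector API), assuming a detector that accepts a clock callable:

```python
import time


class ToyHangingDetector:
    """Illustrative detector with an injectable clock for deterministic tests."""

    def __init__(self, clock=time.perf_counter, threshold_s=30.0):
        self._clock = clock
        self._threshold_s = threshold_s
        self._start_time = {}

    def record_start(self, task_idx):
        self._start_time[task_idx] = self._clock()

    def detect(self):
        # A task is "hanging" once it has run at least threshold_s seconds.
        now = self._clock()
        return [t for t, s in self._start_time.items() if now - s >= self._threshold_s]


# A fake clock makes the hanging condition reproducible without sleeping.
fake_now = [0.0]
detector = ToyHangingDetector(clock=lambda: fake_now[0])
detector.record_start(1)   # task 1 starts at t=0
fake_now[0] = 90.0
detector.record_start(2)   # task 2 starts at t=90
fake_now[0] = 100.0        # task 1 has run 100s, task 2 only 10s
print(detector.detect())   # -> [1]
```

Because the test fully controls the clock, it exercises the detection condition without any real waiting, so there is no timing window left to flake.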
On second thought, I think the better way is to hang a specific task instead of modifying its start time.
We should do something like this:
```python
# Create a pipeline with many small blocks to ensure concurrent tasks
def sleep_task(x):
    if x["id"] == 2:
        time.sleep(1.0)
    return x
```
...and remove this part:
```python
def mock_state_constructor(**kwargs):
    # set start_time_hanging to 0 for task with task_idx == 1 to make
    # time.perf_counter() - state_value.start_time_hanging large
    if kwargs.get("task_idx") == 1:
        kwargs["start_time_hanging"] = 0.0
    # Call the real class with kwargs modified
    return RealHangingExecutionState(**kwargs)


with patch(
    "ray.data._internal.issue_detection.detectors.hanging_detector.HangingExecutionState"
) as mock_state_cls:
    mock_state_cls.side_effect = mock_state_constructor
```
Hi @bveeramani
Hey @machichima, thanks for following up! I think this test is hard to improve (right now) because the abstractions aren't testable. I think we should address this by opening three PRs that do the following:

1. Remove the constructor from `IssueDetector`. Right now, the issue detector base class forces every subclass to use a specific constructor signature that includes the complex `StreamingExecutor` object.
2. Trim the dependencies for each detector. This keeps the dependency surface small and makes the class easier to reason about and test. Each detector should declare only what it truly needs rather than inheriting a one-size-fits-all set of constructor arguments.
3. Rewrite the hanging-detection test to directly test a detector instance. Once the constructor is simplified, tests can instantiate detectors directly.

I've sketched out a solution here: #58770 -- what do you think?
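A minimal sketch of what that refactor shape could look like. All names here are hypothetical (e.g. `executor.op_start_times` is invented for illustration and is not Ray's real API); the point is the `from_executor()` classmethod plus a narrow constructor:

```python
class IssueDetector:
    """Base class with no forced constructor signature."""

    @classmethod
    def from_executor(cls, executor):
        # Each subclass pulls out only the state it actually needs.
        raise NotImplementedError

    def detect(self, now):
        raise NotImplementedError


class HangingIssueDetector(IssueDetector):
    def __init__(self, start_times, threshold_s=30.0):
        # Narrow dependencies: a plain dict and a float, easy to fake in tests.
        self._start_times = start_times
        self._threshold_s = threshold_s

    @classmethod
    def from_executor(cls, executor):
        # executor.op_start_times is an invented attribute for this sketch.
        return cls(start_times=executor.op_start_times)

    def detect(self, now):
        return [t for t, s in self._start_times.items() if now - s >= self._threshold_s]


# Tests can now build the detector from plain data -- no executor mock needed.
detector = HangingIssueDetector(start_times={1: 0.0, 2: 95.0})
print(detector.detect(now=100.0))  # -> [1]
```

Production code would still call `from_executor()`, while tests bypass it entirely, which is what makes the detection logic unit-testable.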
Hi @bveeramani,
## Description

Based on the comment here: #58630 (comment)

The current `IssueDetector` base class requires all its subclasses to take the `StreamingExecutor` as an argument, which makes the classes hard to mock and test because we have to mock all of `StreamingExecutor`. In this PR, we do the following:

1. Remove the constructor in the `IssueDetector` base class and add `from_executor()` to set up the class based on the executor
2. Refactor the subclasses of `IssueDetector` to use this format

## Related issues

Related to #58562

---

Signed-off-by: machichima <nary12321@gmail.com>
## Description

Mock `HangingExecutionState` to set `start_hanging_time` to 0 for one of the calls.

## Related issues
Closes #58562
## Additional information

While this PR fixes the flaky test, I used the following script to run the test 20 times and ensure all runs passed:
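The script itself is not included in the description. A hypothetical version of such a repeat-runner might look like the following, where `TEST_CMD` is a placeholder for the real pytest invocation of `test_hanging_detector_detects_issues`:

```shell
#!/usr/bin/env bash
# Hypothetical repeat-runner sketch; TEST_CMD is a placeholder for the
# actual pytest command (not taken from the PR).
TEST_CMD=${TEST_CMD:-true}
failures=0
for i in $(seq 1 20); do
  if ! $TEST_CMD; then
    failures=$((failures + 1))
    echo "run $i failed"
  fi
done
echo "completed 20 runs, $failures failures"
```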