[Data] Mock hanging task check in test_hanging_detector_detects_issues to prevent flaky #58630
```diff
@@ -104,36 +104,49 @@ def test_hanging_detector_detects_issues(
 ):
     """Test hanging detector adaptive thresholds with real Ray Data pipelines and extreme configurations."""
 
-    ctx = DataContext.get_current()
-    # Configure hanging detector with extreme std_factor values
-    ctx.issue_detectors_config.hanging_detector_config = (
-        HangingExecutionIssueDetectorConfig(
-            op_task_stats_min_count=1,
-            op_task_stats_std_factor=1,
-            detection_time_interval_s=0,
-        )
-    )
+    from ray.data._internal.issue_detection.detectors.hanging_detector import (
+        HangingExecutionState as RealHangingExecutionState,
+    )
 
-    # Create a pipeline with many small blocks to ensure concurrent tasks
-    def sleep_task(x):
-        if x["id"] == 2:
-            # Issue detection is based on the mean + stdev. One of the tasks must take
-            # awhile, so doing it just for one of the rows.
-            time.sleep(1)
-        return x
+    def mock_state_constructor(**kwargs):
+        # Set start_time_hanging to 0 for the task with task_idx == 1, so that
+        # time.perf_counter() - state_value.start_time_hanging is large.
+        if kwargs.get("task_idx") == 1:
+            kwargs["start_time_hanging"] = 0.0
+        # Call the real class with the modified kwargs.
+        return RealHangingExecutionState(**kwargs)
 
+    with patch(
+        "ray.data._internal.issue_detection.detectors.hanging_detector.HangingExecutionState"
+    ) as mock_state_cls:
+        mock_state_cls.side_effect = mock_state_constructor
```
**Comment on lines +111 to +122**

**Member:** This seems hacky to modify internal logic of

**Contributor (Author):** For making it into a unit test, should we move it into

And for mocking the detector, do you mean mock the whole

**Member:** I think it's fine to leave the test here. We just want to see that the issue is produced when the condition is met, so we don't need to really execute the dataset. This will deflake the test and make it deterministic.

**Member:** After second thought, I think the better way is to hang a certain task instead of modifying its start time. Something like this:

```python
# Create a pipeline with many small blocks to ensure concurrent tasks
def sleep_task(x):
    if x["id"] == 2:
        time.sleep(1.0)
    return x
```

...and remove this part:

```python
def mock_state_constructor(**kwargs):
    # set start_time_hanging to 0 for task with task_idx == 1 to make
    # time.perf_counter() - state_value.start_time_hanging large
    if kwargs.get("task_idx") == 1:
        kwargs["start_time_hanging"] = 0.0
    # Call the real class with kwargs modified
    return RealHangingExecutionState(**kwargs)


with patch(
    "ray.data._internal.issue_detection.detectors.hanging_detector.HangingExecutionState"
) as mock_state_cls:
    mock_state_cls.side_effect = mock_state_constructor
```
```diff
-    with caplog.at_level(logging.WARNING):
-        ray.data.range(3, override_num_blocks=3).map(
-            sleep_task, concurrency=1
-        ).materialize()
+        ctx = DataContext.get_current()
+        # Configure hanging detector with extreme std_factor values
+        ctx.issue_detectors_config.hanging_detector_config = (
+            HangingExecutionIssueDetectorConfig(
+                op_task_stats_min_count=1,
+                op_task_stats_std_factor=1,
+                detection_time_interval_s=0,
+            )
+        )
 
-    # Check if hanging detection occurred
-    hanging_detected = (
-        "has been running for" in caplog.text
-        and "longer than the average task duration" in caplog.text
-    )
+        # Create a pipeline with many small blocks to ensure concurrent tasks
+        def sleep_task(x):
+            return x
 
-    assert hanging_detected, caplog.text
+        with caplog.at_level(logging.WARNING):
+            ray.data.range(3, override_num_blocks=3).map(
+                sleep_task, concurrency=1
+            ).materialize()
+
+        # Check if hanging detection occurred
+        hanging_detected = (
+            "has been running for" in caplog.text
+            and "longer than the average task duration" in caplog.text
+        )
+
+        assert hanging_detected, caplog.text
 
 
 @pytest.mark.parametrize(
```
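For context, the config knobs in the diff (`op_task_stats_min_count`, `op_task_stats_std_factor`) and the removed comment "Issue detection is based on the mean + stdev" suggest an adaptive threshold of the form `mean + std_factor * stdev` over completed-task durations. A minimal sketch of that idea follows — assumed semantics for illustration only, not Ray's actual implementation:

```python
# Hedged sketch of an adaptive "hanging" threshold of the kind the test
# configures: flag a running task once its elapsed time exceeds
# mean + std_factor * stdev of completed-task durations, provided at least
# min_count samples exist. Mirrors the PR's config knobs in spirit only.
from statistics import mean, stdev


def is_hanging(durations, elapsed, min_count=1, std_factor=1.0):
    """Return True if `elapsed` exceeds the adaptive threshold."""
    if len(durations) < min_count:
        return False  # not enough completed samples to form a baseline
    spread = stdev(durations) if len(durations) >= 2 else 0.0
    return elapsed > mean(durations) + std_factor * spread


completed = [1.0, 1.2, 0.9, 1.1]  # seconds; threshold is roughly 1.05 + 0.13
print(is_hanging(completed, elapsed=60.0))  # True: far beyond the baseline
print(is_hanging(completed, elapsed=1.0))   # False: within normal range
```

Under this reading, the extreme settings in the test (`min_count=1`, `std_factor=1`, `detection_time_interval_s=0`) mean a single completed sample already establishes a baseline, which is why only three tiny tasks suffice to trip detection.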
**Member:** I think using `>=` here is more intuitive? When we see `min_count`, it's more likely for us to read it as `count >= min_count`.

**Contributor (Author):** Yeah, good catch!
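The off-by-one the reviewers are pointing at is easy to demonstrate. The helpers below are hypothetical, not Ray's code:

```python
# Illustration of the review point: with a threshold named `min_count`,
# readers expect inclusive `count >= min_count`. A strict `>` silently
# requires one extra sample, which matters for the test's
# op_task_stats_min_count=1 configuration.
def enough_samples_strict(count: int, min_count: int) -> bool:
    return count > min_count  # off-by-one: min_count=1 actually needs 2 samples


def enough_samples(count: int, min_count: int) -> bool:
    return count >= min_count  # matches the name: 1 sample suffices


print(enough_samples_strict(1, 1))  # False — surprising, given the name
print(enough_samples(1, 1))         # True
```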