[Data] Mock hanging task check in test_hanging_detector_detects_issues to prevent flaky #58630

machichima · 2025-11-14T11:38:17Z

Description

Mock HangingExecutionState so that start_time_hanging is set to 0 for one of the calls, which makes time.perf_counter() - start_time_hanging large for that task.

Related issues

Additional information

Because this PR fixes a flaky test, I ran the test 20 times with the following script and confirmed that every run passed:

#!/bin/bash

pass=0
fail=0

for i in {1..20}; do
  if python -m pytest python/ray/data/tests/test_issue_detection.py::TestHangingExecutionIssueDetector::test_hanging_detector_detects_issues -xvs > /dev/null 2>&1; then
    ((pass++))
  else
    ((fail++))
  fi
done

echo ""
echo "Passed: $pass, Failed: $fail"
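
The same repeat-and-tally check can be sketched in Python for environments without bash. This is only an illustrative equivalent, not part of the PR; run_repeatedly is a hypothetical helper name, and it treats a zero exit code as a pass, just as the shell loop does.

```python
import subprocess
import sys


def run_repeatedly(cmd, runs):
    """Run `cmd` `runs` times and return (passed, failed) counts.

    A run counts as passed when the process exits with status 0.
    Output is discarded, mirroring `> /dev/null 2>&1` in the shell script.
    """
    passed = 0
    failed = 0
    for _ in range(runs):
        result = subprocess.run(
            cmd,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
        if result.returncode == 0:
            passed += 1
        else:
            failed += 1
    return passed, failed
```

Calling run_repeatedly with [sys.executable, "-m", "pytest", "<test id from the script above>", "-x"] and runs=20 would reproduce the shell loop; any non-zero failed count means the test is still flaky.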

Signed-off-by: machichima <nary12321@gmail.com>

gemini-code-assist

Code Review

This pull request effectively addresses a flaky test by mocking the hanging task detection logic, which is a great improvement. The extraction of _is_task_hanging is a clean way to enable this mocking. I have two suggestions: one is to remove a leftover debug print statement, and the other is a minor style improvement in the test mock implementation for better clarity.

python/ray/data/_internal/issue_detection/detectors/hanging_detector.py

python/ray/data/tests/test_issue_detection.py

Signed-off-by: machichima <nary12321@gmail.com>

machichima · 2025-11-14T11:45:40Z

@owenowenisme @bveeramani PTAL. Thank you~

machichima · 2025-11-14T12:34:16Z

python/ray/data/_internal/issue_detection/detectors/hanging_detector.py

            for task_idx, state_value in op_state_values.items():
                curr_time = time.perf_counter() - state_value.start_time_hanging
-                if op_task_stats.count() > self._op_task_stats_min_count:
+                if op_task_stats.count() >= self._op_task_stats_min_count:


I think using >= here is more intuitive: when we see min_count, we naturally read the condition as count >= min_count.

Yeah, good catch!
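
To make the boundary case concrete, here is a minimal sketch of the two comparisons. The function names and the sample value of min_count are hypothetical and only illustrate the off-by-one difference; this is not the detector's actual code.

```python
def enough_samples_strict(count: int, min_count: int) -> bool:
    """Old behavior: requires strictly more than min_count samples."""
    return count > min_count


def enough_samples_inclusive(count: int, min_count: int) -> bool:
    """New behavior: exactly min_count samples is already enough."""
    return count >= min_count


# Hypothetical threshold, chosen only for illustration.
min_count = 10

# The two versions differ only when count == min_count.
differing_counts = [
    count
    for count in range(0, 21)
    if enough_samples_strict(count, min_count)
    != enough_samples_inclusive(count, min_count)
]
```

With the strict comparison, a parameter named "minimum count" effectively behaves like "minimum count plus one", which is the surprise the review comment points out.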

Signed-off-by: machichima <nary12321@gmail.com>

owenowenisme · 2025-11-17T04:06:43Z

python/ray/data/tests/test_issue_detection.py

+        def mock_state_constructor(**kwargs):
+            # set start_time_hanging to 0 for task with task_idx == 1 to make
+            # time.perf_counter() - state_value.start_time_hanging large
+            if kwargs.get("task_idx") == 1:
+                kwargs["start_time_hanging"] = 0.0
+            # Call the real class with kwargs modified
+            return RealHangingExecutionState(**kwargs)
+
+        with patch(
+            "ray.data._internal.issue_detection.detectors.hanging_detector.HangingExecutionState"
+        ) as mock_state_cls:
+            mock_state_cls.side_effect = mock_state_constructor
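
The patching pattern in this diff can be tried in isolation. The sketch below rests on stated assumptions: State, detector_module, and record_tasks are hypothetical stand-ins that mimic how the detector module looks up HangingExecutionState at call time. Only unittest.mock.patch.object and side_effect are real APIs here.

```python
import time
from dataclasses import dataclass
from types import SimpleNamespace
from unittest.mock import patch


@dataclass
class State:
    """Hypothetical stand-in for HangingExecutionState."""

    task_idx: int
    start_time_hanging: float


# Stand-in for the detector module: the class is looked up by name at call time.
detector_module = SimpleNamespace(State=State)


def record_tasks(task_indices):
    """Create one state per task, as the detector would when tasks start."""
    return [
        detector_module.State(task_idx=i, start_time_hanging=time.perf_counter())
        for i in task_indices
    ]


# Keep a reference to the real class before it is patched.
RealState = State


def fake_constructor(**kwargs):
    # Force task 1 to look as if it started at time 0, so that
    # time.perf_counter() - start_time_hanging becomes large.
    if kwargs.get("task_idx") == 1:
        kwargs["start_time_hanging"] = 0.0
    # Delegate to the real class with the modified kwargs.
    return RealState(**kwargs)


with patch.object(detector_module, "State") as mock_cls:
    mock_cls.side_effect = fake_constructor
    states = record_tasks([0, 1, 2])
```

Because side_effect returns real State instances, the code under test still works with genuine objects; only the constructor arguments for one task are altered, and the original class is restored when the with block exits.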


Modifying the internal start_time_hanging logic seems hacky. Let's just mock the detector and make this a unit test.

To make it a unit test, should we move it into python/ray/data/tests/unit/?

And by "mock the detector", do you mean mocking the whole HangingExecutionIssueDetector? I thought we wanted to test whether HangingExecutionIssueDetector.detect works correctly here.

> For making it into a unit test, should we move it into python/ray/data/tests/unit/?

I think it's fine to leave the test here.

We just want to verify that the issue is produced when the condition is met, so we don't need to actually execute the dataset.

This will deflake the test and make it deterministic.

On second thought, I think the better approach is to make a specific task hang instead of modifying its start time.

We should do something like this:

# Create a pipeline with many small blocks to ensure concurrent tasks
def sleep_task(x):
    if x["id"] == 2:
        time.sleep(1.0)
    return x
...

And remove this part:

def mock_state_constructor(**kwargs):
    # set start_time_hanging to 0 for task with task_idx == 1 to make
    # time.perf_counter() - state_value.start_time_hanging large
    if kwargs.get("task_idx") == 1:
        kwargs["start_time_hanging"] = 0.0
    # Call the real class with kwargs modified
    return RealHangingExecutionState(**kwargs)

with patch(
    "ray.data._internal.issue_detection.detectors.hanging_detector.HangingExecutionState"
) as mock_state_cls:
    mock_state_cls.side_effect = mock_state_constructor

machichima · 2025-11-19T03:17:02Z

Hi @bveeramani,
I would like to ask about the expectations for this test. I'm happy to discuss if you have time.
Thanks!

bveeramani · 2025-11-19T03:53:11Z

> Hi @bveeramani I would like to query about the expectation of this test. I'm happy to discuss if you have time Thanks!

Hey @machichima, thanks for following up!

I think this test is hard to improve (right now) because the abstractions aren't testable.

I think we should address this by opening three PRs that do the following:

1. Remove the constructor from IssueDetector and introduce a factory method like IssueDetector.for_executor.

Right now, the issue detector base class forces every subclass to use a specific constructor signature that includes the complex StreamingExecutor type. This is problematic because different detectors might not need all of the information in the executor, and it makes tests harder to write because you have to mock all of StreamingExecutor (even if you don't need most of its information!)

2. Trim the dependencies for HangingIssueDetector and update its constructor accordingly.

This keeps the dependency surface small and makes the class easier to reason about and test. Each detector should declare only what it truly needs rather than inheriting a one-size-fits-all set of constructor arguments.

3. Rewrite the hanging-detection test to directly test a HangingIssueDetector instance.

Once the constructor is simplified, tests can instantiate HangingIssueDetector directly with minimal, focused inputs. This avoids the brittle/hacky-ness of trying to mock internal implementation details.

I've sketched out a solution here: #58770 -- what do you think?
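
As a rough illustration of the three steps above (and not the code in #58770), the sketch below shows a base class with a factory method, a detector with a trimmed constructor, and a test that drives it with a fake clock. FakeExecutor, HangingDetector, task_started, the clock parameter, and the fixed-threshold rule are all hypothetical simplifications; the real detector's dependencies and hanging criterion may differ.

```python
import time
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class FakeExecutor:
    """Hypothetical stand-in for StreamingExecutor; not Ray's real class."""

    hanging_threshold_s: float


class IssueDetector(ABC):
    """Base class that no longer dictates a constructor signature."""

    @classmethod
    @abstractmethod
    def for_executor(cls, executor: FakeExecutor) -> "IssueDetector":
        """Build a detector, pulling only what it needs from the executor."""

    @abstractmethod
    def detect(self) -> List[int]:
        """Return the indices of tasks considered problematic."""


class HangingDetector(IssueDetector):
    def __init__(
        self,
        threshold_s: float,
        clock: Callable[[], float] = time.perf_counter,
    ):
        # The constructor declares only what this detector truly needs.
        self._threshold_s = threshold_s
        self._clock = clock
        self._start_times: Dict[int, float] = {}

    @classmethod
    def for_executor(cls, executor: FakeExecutor) -> "HangingDetector":
        return cls(threshold_s=executor.hanging_threshold_s)

    def task_started(self, task_idx: int) -> None:
        self._start_times[task_idx] = self._clock()

    def detect(self) -> List[int]:
        now = self._clock()
        return [
            idx
            for idx, start in self._start_times.items()
            if now - start >= self._threshold_s
        ]


# Deterministic test: instantiate the detector directly with a fake clock.
fake_now = [0.0]
detector = HangingDetector(threshold_s=5.0, clock=lambda: fake_now[0])
detector.task_started(0)  # starts at t=0
fake_now[0] = 4.0
detector.task_started(1)  # starts at t=4
fake_now[0] = 6.0
hanging = detector.detect()  # task 0 has run 6s, task 1 has run 2s
```

Injecting the clock removes the dependence on wall-clock timing, so the test needs neither sleeps nor patches of internal state, while for_executor keeps production construction a one-liner.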

machichima · 2025-11-19T12:14:50Z

Hi @bveeramani,
Thank you for the detailed guidance! It makes sense to me.
I'll give it a go 🙏

## Description

Based on the comment here: #58630 (comment)

The current `IssueDetector` base class requires all of its subclasses to take `StreamingExecutor` as an argument, which makes the classes hard to mock and test because we have to mock all of `StreamingExecutor`.

In this PR, we did the following:

1. Remove the constructor from the `IssueDetector` base class and add `from_executor()` to set up the class based on the executor
2. Refactor the subclasses of `IssueDetector` to use this format

## Related issues

Related to #58562

## Additional information

> Optional: Add implementation details, API changes, usage examples, screenshots, etc.

---------

Signed-off-by: machichima <nary12321@gmail.com>

machichima added 2 commits November 14, 2025 19:30

refactor: extract check for task hanging for mocking

d1e155a

Signed-off-by: machichima <nary12321@gmail.com>

test: mock hanging check

6640d2b

Signed-off-by: machichima <nary12321@gmail.com>

machichima requested a review from a team as a code owner November 14, 2025 11:38

gemini-code-assist bot reviewed Nov 14, 2025

cursor bot reviewed Nov 14, 2025

machichima commented Nov 14, 2025

machichima and others added 2 commits November 14, 2025 19:43

fix: remove print and use nonlocal

2f900f8

Signed-off-by: machichima <nary12321@gmail.com>

Merge branch 'master' into 58562-hanging-detector-flaky

5b645ac

machichima commented Nov 14, 2025

test: mock HangingExecutionState instead

af1039d

Signed-off-by: machichima <nary12321@gmail.com>

machichima force-pushed the 58562-hanging-detector-flaky branch from 82ab8fc to af1039d Compare November 14, 2025 12:34

ray-gardener bot added the data (Ray Data-related issues) and community-contribution (Contributed by the community) labels Nov 14, 2025

machichima added 2 commits November 14, 2025 15:41

Merge branch 'master' into 58562-hanging-detector-flaky

70c3ae6

Merge branch 'master' into 58562-hanging-detector-flaky

901a6fd

owenowenisme reviewed Nov 17, 2025

machichima mentioned this pull request Nov 20, 2025

[Data] Remove constructor from IssueDetector base class #58852

Merged

machichima mentioned this pull request Nov 28, 2025

[Data][Test] make test_hanging_detector_detects_issues deterministic #59060

Open
