[Data] Remove constructor from `IssueDetector` base class #58852

machichima · 2025-11-20T13:37:03Z

Description

Based on the comment here: #58630 (comment)

The current `IssueDetector` base class requires every subclass to take a `StreamingExecutor` as a constructor argument. This makes the classes hard to mock and test, because a test has to mock the entire `StreamingExecutor`.

This PR does the following:

1. Removes the constructor from the `IssueDetector` base class and adds `from_executor()` to set up a detector from the executor.
2. Refactors the subclasses of `IssueDetector` to follow this pattern.
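
The pattern described above can be sketched roughly as follows. This is a minimal illustration, not the Ray implementation: `FakeExecutor` and `EmptyPipelineDetector` are hypothetical names invented for the example, and issues are modelled as plain strings. Only the `from_executor()` classmethod and the idea of a constructor-free base class come from the PR description.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import List


@dataclass
class FakeExecutor:
    """Hypothetical stand-in for StreamingExecutor (not the real class)."""

    dataset_id: str
    operators: List[str]


class IssueDetector(ABC):
    # No __init__ here: each subclass declares exactly the inputs it needs.

    @classmethod
    @abstractmethod
    def from_executor(cls, executor: FakeExecutor) -> "IssueDetector":
        """Build a detector by reading the required fields off an executor."""

    @abstractmethod
    def detect(self) -> List[str]:
        """Return issue descriptions (plain strings in this sketch)."""


class EmptyPipelineDetector(IssueDetector):
    """Hypothetical detector that flags a dataset with no operators."""

    def __init__(self, dataset_id: str, operators: List[str]):
        self._dataset_id = dataset_id
        self._operators = operators

    @classmethod
    def from_executor(cls, executor: FakeExecutor) -> "EmptyPipelineDetector":
        # Production code path: unpack the executor once, then forget it.
        return cls(dataset_id=executor.dataset_id, operators=list(executor.operators))

    def detect(self) -> List[str]:
        if not self._operators:
            return [f"dataset {self._dataset_id} has no operators"]
        return []


# A test can call the constructor directly, with no executor to mock.
direct = EmptyPipelineDetector(dataset_id="ds", operators=[])
from_factory = EmptyPipelineDetector.from_executor(FakeExecutor("ds", ["Map"]))
```

With this shape, the executor is touched only inside `from_executor()`, so unit tests can pass small, focused inputs to the constructor.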

Related issues

Related to #58562

Additional information

Signed-off-by: machichima <nary12321@gmail.com>

gemini-code-assist

Code Review

This pull request refactors the IssueDetector base class and its subclasses to improve testability by decoupling them from StreamingExecutor. The core change is the removal of the constructor dependency on StreamingExecutor and the introduction of a from_executor class method factory pattern. HangingExecutionIssueDetector has been fully refactored to take its dependencies directly in its constructor, making it easy to instantiate in tests. Other detectors like HashShuffleAggregatorIssueDetector and HighMemoryIssueDetector have been adapted to the new factory pattern incrementally, which is a reasonable approach. The refactoring in HangingExecutionIssueDetector also simplifies the task processing logic and includes a fix for cleaning up hanging task state by ensuring _hanging_op_tasks is correctly updated when tasks are no longer active. Overall, the changes are well-executed and improve the codebase's design.

bveeramani · 2025-11-21T01:41:48Z

Hey @machichima, do you want a review on this?

machichima · 2025-11-21T01:50:56Z

> Hey @machichima, do you want a review on this?

Yes, thank you so much! I just fixed the lint error. This PR covers points 1 and 2 mentioned in #58630 (comment).

bveeramani · 2025-11-21T19:04:26Z

python/ray/data/_internal/issue_detection/detectors/hanging_detector.py

+        ctx = executor._data_context
+        return cls(
+            dataset_id=executor._dataset_id,
+            operators=lambda: list(executor._topology.keys())


What happens if we pass the list of operators directly here (rather than a callable)?

Agree, I think we should either pass a list directly or rename this parameter. From the name, it isn't obvious that this is a callable; maybe something like get_operators_fn?

I pass a lambda instead of a list because of this issue: #58852 (comment)
I will rename this argument to make it clearer!

Renamed it to get_operators_fn in e0564fe

I don't think Cursor is correct here. The topology can't change at runtime.

Since it's simpler and also safe to just pass in a list of operators, I think we should do that instead
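
To make the trade-off in this thread concrete, here is a small sketch of the two options. The `topology` dict and operator names are made up for illustration. The two approaches only diverge if the mapping is mutated after the detector is built, which, per the comment above, does not happen for the executor topology.

```python
from typing import Dict, List

# Hypothetical stand-in for executor._topology (keys are operators).
topology: Dict[str, object] = {"ReadOp": object(), "MapOp": object()}

# Option A: pass a list. The detector sees a snapshot taken at construction.
operators_snapshot: List[str] = list(topology.keys())


# Option B: pass a callable. The detector re-reads the mapping on every call.
def get_operators_fn() -> List[str]:
    return list(topology.keys())


# Simulate a (hypothetical) late mutation to show where the options differ.
topology["LateOp"] = object()

snapshot_after = operators_snapshot
callable_after = get_operators_fn()
```

If the topology is fixed for the lifetime of the executor, both options return the same operators, and the plain list is the simpler interface.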

bveeramani · 2025-11-21T19:23:53Z

python/ray/data/tests/test_issue_detection.py


        executor = StreamingExecutor(ctx)
-        detector = HangingExecutionIssueDetector(executor, ctx)
+        detector = HangingExecutionIssueDetector.from_executor(executor)


If we directly test using the constructor, we can simplify this test by removing lines 50 and 48.

Suggested change

-        detector = HangingExecutionIssueDetector.from_executor(executor)
+        detector = HangingExecutionIssueDetector(dataset_id="id", ops=[], config=config)

Fixed in b63591f

bveeramani · 2025-11-21T19:28:37Z

python/ray/data/_internal/issue_detection/detectors/hanging_detector.py


        return issues

    def detect(self) -> List[Issue]:


@omatthew98 could you help me verify the correctness of this diff?

Yeah it lgtm.

omatthew98

Overall looks good. I think we should decide whether to allow references to the streaming executor in the detectors (it seems like we shouldn't), and align on that across all the detectors. If it doesn't make sense to keep the reference in the hanging detector, I'm not sure why it would for the other detectors.

omatthew98 · 2025-11-21T22:30:09Z

python/ray/data/_internal/issue_detection/issue_detector.py

-    def __init__(self, executor: "StreamingExecutor", ctx: "DataContext"):
-        self._executor = executor
-        self._ctx = ctx
+    @classmethod


Let's include the implementation from hash_shuffle_detector.py / high_memory_detector.py here?

Although if we are removing the streaming executor reference from HangingExecutionIssueDetector, should we just remove it from all of the implementations of IssueDetector?

Yes! I removed the streaming executor reference from the other subclasses of IssueDetector:

Removed from HighMemoryIssueDetector: 89242fd

Removed from HashShuffleAggregatorIssueDetector: 6ac3247

Could you please elaborate on what you mean by including the implementation from hash_shuffle_detector.py / high_memory_detector.py?

omatthew98 · 2025-11-21T22:33:26Z

python/ray/data/_internal/issue_detection/detectors/hanging_detector.py

+        ctx = executor._data_context
+        return cls(
+            dataset_id=executor._dataset_id,
+            operators=lambda: list(executor._topology.keys())


Agree, I think we should either pass a list directly or rename this parameter. From the name, it isn't obvious that this is a callable; maybe something like get_operators_fn?

omatthew98 · 2025-11-21T22:39:27Z

python/ray/data/_internal/issue_detection/detectors/hanging_detector.py

            op_task_stats_map[operator.id] = op_metrics._op_task_duration_stats
            self._op_id_to_name[operator.id] = operator.name
-            if op_state._finished:
+            if operator.execution_finished():


Not sure if there are minor differences between op_state._finished and operator.execution_finished(), but probably better to use the function here.

omatthew98 · 2025-11-21T22:54:31Z

python/ray/data/_internal/issue_detection/detectors/hanging_detector.py

            for task_idx, state_value in op_state_values.items():
                curr_time = time.perf_counter() - state_value.start_time_hanging
-                if op_task_stats.count() > self._op_task_stats_min_count:
+                if op_task_stats.count() >= self._op_task_stats_min_count:


OOC was there a reason to add equality here? Just because semantically it makes sense?

Yes, I think >= is more intuitive: when we see min_count, we're more likely to read it as count >= min_count.
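
The only behavioral difference is at the boundary, when the sample count equals the minimum. A tiny sketch with hypothetical helper names (these functions are not part of the detector):

```python
def enough_samples_strict(count: int, min_count: int) -> bool:
    """Old comparison: requires strictly more than min_count samples."""
    return count > min_count


def enough_samples_inclusive(count: int, min_count: int) -> bool:
    """New comparison: min_count samples are already enough."""
    return count >= min_count


# With min_count = 10, the two checks disagree only at count == 10.
at_boundary_strict = enough_samples_strict(10, 10)
at_boundary_inclusive = enough_samples_inclusive(10, 10)
```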

omatthew98 · 2025-11-21T22:54:46Z

python/ray/data/_internal/issue_detection/detectors/hanging_detector.py


        return issues

    def detect(self) -> List[Issue]:


Yeah it lgtm.

machichima · 2025-11-22T03:46:31Z

python/ray/data/_internal/issue_detection/detectors/hash_shuffle_detector.py

+class HashShuffleAggregatorIssueDetectorConfig:
+    """Configuration for HashShuffleAggregatorIssueDetector."""
+    detection_time_interval_s: float = 30.0
+    min_wait_time_s: float = 300.0


Added a new config dataclass so that we don't need to pass ctx to the constructor.
This config class is passed to python/ray/data/_internal/issue_detection/issue_detector_configuration.py, and the values are copied from ctx into this config in python/ray/data/context.py
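
As a rough sketch of what this buys: a detector that receives a small config object can be constructed in a test without any DataContext. The class names below carry a `Sketch` suffix because they are illustrative only. The two field names and defaults come from the diff above, while the detector body and `should_emit_warning` logic are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class HashShuffleDetectorConfigSketch:
    """Mirrors the two fields shown in the diff; not the real class."""

    detection_time_interval_s: float = 30.0
    min_wait_time_s: float = 300.0


class AggregatorWaitDetectorSketch:
    """Hypothetical detector that depends only on its config."""

    def __init__(self, config: HashShuffleDetectorConfigSketch):
        self._config = config

    def detection_time_interval_s(self) -> float:
        return self._config.detection_time_interval_s

    def should_emit_warning(self, waited_s: float) -> bool:
        # Assumed rule for illustration: warn once the wait reaches the minimum.
        return waited_s >= self._config.min_wait_time_s


# A test can shrink the thresholds without touching any global context.
detector = AggregatorWaitDetectorSketch(
    HashShuffleDetectorConfigSketch(detection_time_interval_s=1.0, min_wait_time_s=5.0)
)
```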

machichima · 2025-11-22T03:57:46Z

Hi @bveeramani @omatthew98 ,
I fixed the review comments. PTAL
Thank you!

bveeramani

Overall LGTM!

Could you update the implementation to pass in a list of operators directly (Cursor's review was wrong), and then I'll approve the PR?

bveeramani · 2025-11-24T18:05:05Z

python/ray/data/_internal/issue_detection/detectors/high_memory_detector.py

+            # Track if new operators are added after initialization
+            if op not in self._initial_memory_requests:
+                self._initial_memory_requests[op] = (
+                    op._get_dynamic_ray_remote_args().get("memory") or 0
+                )
+


If we pass in the list of operators directly, I don't think we need to do this

Yes! Let me remove it!

cursor · 2025-11-25T10:29:22Z

python/ray/data/context.py

+        )
+        self.issue_detectors_config.hash_shuffle_detector_config.min_wait_time_s = (
+            self.min_hash_shuffle_aggregator_wait_time_in_s
+        )


Bug: Config field changes after initialization are silently ignored

The sync of hash_shuffle_aggregator_health_warning_interval_s and min_hash_shuffle_aggregator_wait_time_in_s to hash_shuffle_detector_config only happens once during __post_init__. Previously, the detector read these values dynamically from self._ctx on each call to detection_time_interval_s() and _should_emit_warning(). Now, if a user modifies these fields after DataContext creation (e.g., ctx.hash_shuffle_aggregator_health_warning_interval_s = 60), the changes are silently ignored and the detector continues using the values captured at initialization. This is a behavioral regression for users who configure these fields after getting the current context.

I think this is fine.

Oh thanks! I was just thinking about this.
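
The behavior flagged in the bot's report can be reproduced in a few lines. This is a simplified model, not the real DataContext: `ContextSketch` and `DetectorConfigSketch` are hypothetical, and only the `min_hash_shuffle_aggregator_wait_time_in_s` to `min_wait_time_s` copy in `__post_init__` is taken from the diff.

```python
from dataclasses import dataclass, field


@dataclass
class DetectorConfigSketch:
    min_wait_time_s: float = 300.0


@dataclass
class ContextSketch:
    min_hash_shuffle_aggregator_wait_time_in_s: float = 300.0
    detector_config: DetectorConfigSketch = field(default_factory=DetectorConfigSketch)

    def __post_init__(self) -> None:
        # One-time copy: later writes to the context field are not propagated.
        self.detector_config.min_wait_time_s = (
            self.min_hash_shuffle_aggregator_wait_time_in_s
        )


ctx = ContextSketch(min_hash_shuffle_aggregator_wait_time_in_s=120.0)
value_at_init = ctx.detector_config.min_wait_time_s

# Changing the context field afterwards leaves the detector config untouched.
ctx.min_hash_shuffle_aggregator_wait_time_in_s = 60.0
value_after_change = ctx.detector_config.min_wait_time_s
```

In this model, a user who wants the new value to take effect would set it on the detector config directly; whether that applies to the real DataContext is not confirmed by the thread.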

bveeramani · 2025-11-25T10:43:05Z

python/ray/data/tests/test_issue_detection.py

+if TYPE_CHECKING:
+    pass
+


bveeramani · 2025-11-25T10:43:32Z

ty for the contribution! Just lemme know when CI is passing and I'll merge the PR

bveeramani · 2025-11-26T03:19:29Z

@machichima PR has been merged!

Rewrite the hanging-detection test to directly test a HangingIssueDetector instance.
Once the constructor is simplified, tests can instantiate HangingIssueDetector directly with minimal, focused inputs. This avoids the brittle/hacky-ness of trying to mock internal implementation details.

Now that we've refactored the constructor, would you be down to open a follow-up PR and rewrite the test?

refactor: remove constructor for IssueDetector

940fd32

Signed-off-by: machichima <nary12321@gmail.com>

machichima requested a review from a team as a code owner November 20, 2025 13:37

gemini-code-assist bot reviewed Nov 20, 2025

machichima and others added 2 commits November 20, 2025 22:09

refactor: lint

1885c47

Signed-off-by: machichima <nary12321@gmail.com>

Merge branch 'master' into 58562-remove-issue-detector-constructor

41c6499

cursor bot reviewed Nov 20, 2025

ray-gardener bot added data Ray Data-related issues community-contribution Contributed by the community labels Nov 20, 2025

machichima and others added 3 commits November 21, 2025 08:36

refactor: use callable for dynamic operator access in HangingExecutionIssueDetector

9ff4b52

Signed-off-by: machichima <nary12321@gmail.com>

Merge branch 'master' into 58562-remove-issue-detector-constructor

d04fd8e

fix: use >= for min count

55bf81b

Signed-off-by: machichima <nary12321@gmail.com>

refactor: lint

651e08f

Signed-off-by: machichima <nary12321@gmail.com>

Merge branch 'master' into 58562-remove-issue-detector-constructor

89aa796

cursor bot reviewed Nov 21, 2025

gvspraveen requested a review from bveeramani November 21, 2025 18:42

bveeramani reviewed Nov 21, 2025

omatthew98 reviewed Nov 21, 2025

refactor: rename operators args make it clearer

e0564fe

Signed-off-by: machichima <nary12321@gmail.com>

machichima force-pushed the 58562-remove-issue-detector-constructor branch from 6280de0 to e0564fe Compare November 22, 2025 02:24

machichima added 4 commits November 22, 2025 10:26

test: pass to contructor rather than use from_executor

b63591f

Signed-off-by: machichima <nary12321@gmail.com>

refactor: remove executor from constructor in HighMemoryIssueDetector

89242fd

Signed-off-by: machichima <nary12321@gmail.com>

refactor remove executor from constructor in HashShuffleAggregatorIssueDetector

6ac3247

Signed-off-by: machichima <nary12321@gmail.com>

fix: pass DataContext args to newly added hash shuffle config

c6bf4f6

Signed-off-by: machichima <nary12321@gmail.com>

machichima force-pushed the 58562-remove-issue-detector-constructor branch from 42f6731 to c6bf4f6 Compare November 22, 2025 03:44

cursor bot reviewed Nov 22, 2025

machichima commented Nov 22, 2025

machichima and others added 2 commits November 22, 2025 11:56

fix: add op to _initial_memory_requests if added after init

15ea8a2

Signed-off-by: machichima <nary12321@gmail.com>

Merge branch 'master' into 58562-remove-issue-detector-constructor

8d2a58e

machichima and others added 3 commits November 23, 2025 21:15

refactor: lint

920137d

Signed-off-by: machichima <nary12321@gmail.com>

Merge branch 'master' into 58562-remove-issue-detector-constructor

6893f9b

Merge branch 'master' into 58562-remove-issue-detector-constructor

fd8e89a

bveeramani reviewed Nov 24, 2025

bveeramani self-assigned this Nov 25, 2025

machichima and others added 2 commits November 25, 2025 18:23

fix: operators from callable to list

87ae696

Signed-off-by: machichima <nary12321@gmail.com>

Merge branch 'master' into 58562-remove-issue-detector-constructor

e2c58eb

cursor bot reviewed Nov 25, 2025

refactor: lint

dd2a418

Signed-off-by: machichima <nary12321@gmail.com>

bveeramani approved these changes Nov 25, 2025

refactor: remove if TYPE_CHECKING block

3ba6080

Signed-off-by: machichima <nary12321@gmail.com>

cursor bot reviewed Nov 25, 2025

machichima and others added 3 commits November 25, 2025 18:59

fix: update unused execution_finished function

8cc6412

Signed-off-by: machichima <nary12321@gmail.com>

Merge branch 'master' into 58562-remove-issue-detector-constructor

6bb15ec

Merge branch 'master' into 58562-remove-issue-detector-constructor

da129ac

bveeramani enabled auto-merge (squash) November 25, 2025 19:03

github-actions bot added the go add ONLY when ready to merge, run all tests label Nov 25, 2025

bveeramani merged commit dcf0923 into ray-project:master Nov 25, 2025
7 of 8 checks passed

bveeramani changed the title ~~[Data] remove issue detector constructor~~ [Data] Remove constructor from IssueDetector base class Nov 26, 2025

machichima mentioned this pull request Nov 28, 2025

[Data][Test] make test_hanging_detector_detects_issues deterministic #59060

Open
