Description
Test Name
TestHangingExecutionIssueDetector.test_hanging_detector_detects_issues
Test Location
python/ray/data/tests/test_issue_detection.py:136
Issue Description
The test intermittently fails to detect the intentionally created hanging execution, causing the assertion to fail when checking for the expected warning message in the logs.
Root Cause
The test creates a pipeline in which one task sleeps for 1 second while the others complete quickly, and configures the hanging detector with aggressive settings so that it flags the slow task. However, detection depends on timing and on a statistical threshold (mean + std), which makes it sensitive to:
- System timing variations
- Task scheduling delays
- The exact timing of when the detector runs
- Resource contention on CI machines
The hanging detector may not fire if:
- Tasks complete too quickly relative to detection intervals
- The statistical threshold isn't met due to timing variations
- The detection interval misses the hanging window
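The sensitivity described above can be illustrated with a minimal sketch. It is not the detector's actual code: the function name, the `num_std` parameter, and the sample durations are all hypothetical, and it only assumes a check of the form "running time > mean + num_std * std".

```python
import statistics


def exceeds_threshold(completed_durations, running_for, num_std=1.0):
    """Hypothetical check: is a task running longer than mean + num_std * std?"""
    if len(completed_durations) < 2:
        return False  # too few samples to estimate a spread
    mean = statistics.mean(completed_durations)
    std = statistics.pstdev(completed_durations)
    return running_for > mean + num_std * std


# Quiet machine: the fast tasks finish in about 10 ms, so the slow task stands out.
quiet = [0.010, 0.012, 0.011, 0.009]
# Contended CI machine (made-up numbers): scheduling delays inflate mean and spread.
noisy = [0.05, 0.9, 0.1, 1.4]

# Suppose the detector happens to run 0.8 s into the 1 s task.
print(exceeds_threshold(quiet, 0.8))  # True
print(exceeds_threshold(noisy, 0.8))  # False
```

With the noisy samples the threshold rises above 1 second, so the sleeping task finishes before it could ever be flagged.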
Example Failure
AssertionError: <log output without expected hanging detection messages>
assert False
The test expects to find:
- "has been running for"
- "longer than the average task duration"
But these messages don't appear in the captured logs.
Proposed Fix
Make the test more robust to timing variations. Consider whether this functionality needs an integration test at all, or whether it would be better covered by unit tests that use mocked time.
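A mocked-time unit test could look roughly like the sketch below. `FakeClock` and `ToyHangingDetector` are hypothetical stand-ins written for illustration, not Ray classes, and the message format is an assumption that merely contains the two substrings the test looks for. The point is only that an injected clock removes the dependence on real scheduling.

```python
import statistics


class FakeClock:
    """Manually advanced clock; replaces wall-clock time in the test."""

    def __init__(self):
        self.now = 0.0

    def __call__(self):
        return self.now

    def advance(self, seconds):
        self.now += seconds


class ToyHangingDetector:
    """Hypothetical detector: flags tasks running longer than mean + num_std * std."""

    def __init__(self, clock, num_std=1.0):
        self._clock = clock
        self._num_std = num_std
        self._started = {}    # task_id -> start time
        self._durations = []  # durations of completed tasks

    def task_started(self, task_id):
        self._started[task_id] = self._clock()

    def task_finished(self, task_id):
        self._durations.append(self._clock() - self._started.pop(task_id))

    def detect(self):
        if len(self._durations) < 2:
            return []
        mean = statistics.mean(self._durations)
        threshold = mean + self._num_std * statistics.pstdev(self._durations)
        now = self._clock()
        return [
            f"Task {task_id} has been running for {now - start:.2f}s, "
            f"longer than the average task duration of {mean:.2f}s"
            for task_id, start in self._started.items()
            if now - start > threshold
        ]


def test_detects_hanging_task_with_fake_clock():
    clock = FakeClock()
    detector = ToyHangingDetector(clock)
    for task_id in range(3):  # three fast tasks, 10 ms each
        detector.task_started(task_id)
        clock.advance(0.01)
        detector.task_finished(task_id)
    detector.task_started("slow")
    assert detector.detect() == []  # the slow task has only just started
    clock.advance(1.0)              # simulate the 1 second sleep instantly
    messages = detector.detect()
    assert len(messages) == 1
    assert "has been running for" in messages[0]
    assert "longer than the average task duration" in messages[0]
```

Because time only moves when the test calls `advance`, the outcome does not depend on CI load or on when a background thread happens to run.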
Additional Context
This test is checking infrastructure for detecting hanging executions, which is inherently timing-dependent. The test may need to be marked as a known flaky test or potentially redesigned to be more deterministic.
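If the integration test is kept, one way to reduce (though not eliminate) its timing sensitivity is to poll the captured logs until a deadline rather than asserting once. The helper below is a generic sketch; `wait_for_log` and its parameters are hypothetical names, and `get_log_text` stands for whatever the test uses to read the captured output.

```python
import time


def wait_for_log(get_log_text, expected_substrings, timeout_s=10.0, interval_s=0.1):
    """Poll get_log_text() until every expected substring appears or the timeout expires."""
    deadline = time.monotonic() + timeout_s
    while True:
        text = get_log_text()
        if all(s in text for s in expected_substrings):
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)


EXPECTED = ["has been running for", "longer than the average task duration"]
```

In the test, a call such as `assert wait_for_log(lambda: log_output.getvalue(), EXPECTED)` would replace the one-shot check, where `log_output` is assumed to be the buffer capturing the logger output.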