feat(#105): add fault correlation engine for root-cause analysis#121
Merged
mfaferek93 merged 8 commits intomainfrom Jan 21, 2026
Merged
feat(#105): add fault correlation engine for root-cause analysis#121mfaferek93 merged 8 commits intomainfrom
mfaferek93 merged 8 commits intomainfrom
Conversation
Contributor
There was a problem hiding this comment.
Pull request overview
This PR implements a fault correlation engine for root-cause analysis that reduces noise by identifying hierarchical relationships between faults and grouping related faults into clusters. The feature addresses issue #105 by enabling automatic detection of fault cascades where symptom faults are muted when their root cause is identified.
Changes:
- Adds hierarchical correlation mode to detect root cause → symptom relationships with automatic symptom muting
- Adds auto-cluster mode to group similar faults occurring within time windows
- Extends GetFaults.srv, ClearFault.srv, and FaultEvent.msg with correlation fields (muted faults, clusters, auto-cleared codes)
- Implements YAML-based configuration with wildcard pattern matching
- Integrates correlation engine into fault_manager_node with parameter-based enablement
- Adds comprehensive unit and integration tests for pattern matching, config parsing, and correlation logic
- Updates REST API to expose correlation data via query parameters
- Includes complete documentation and Postman collection updates
Reviewed changes
Copilot reviewed 29 out of 29 changed files in this pull request and generated 6 comments.
Show a summary per file
| File | Description |
|---|---|
| src/ros2_medkit_msgs/msg/MutedFaultInfo.msg | New message type for muted symptom fault metadata |
| src/ros2_medkit_msgs/msg/ClusterInfo.msg | New message type for fault cluster information |
| src/ros2_medkit_msgs/srv/GetFaults.srv | Extended with include_muted/include_clusters flags and response fields |
| src/ros2_medkit_msgs/srv/ClearFault.srv | Extended with auto_cleared_codes response field |
| src/ros2_medkit_msgs/msg/FaultEvent.msg | Extended with auto_cleared_codes for EVENT_CLEARED |
| src/ros2_medkit_msgs/CMakeLists.txt | Registers new message types |
| src/ros2_medkit_fault_manager/include//correlation/.hpp | Core correlation engine headers (types, config parser, pattern matcher, engine) |
| src/ros2_medkit_fault_manager/src/correlation/*.cpp | Core correlation engine implementation |
| src/ros2_medkit_fault_manager/src/fault_manager_node.cpp | Integrates correlation engine into fault processing and service handlers |
| src/ros2_medkit_fault_manager/include/*/fault_manager_node.hpp | Adds correlation engine member and helper methods |
| src/ros2_medkit_fault_manager/CMakeLists.txt | Adds correlation source files and unit test targets |
| src/ros2_medkit_fault_manager/test/test_*.cpp | Unit tests for pattern matcher, config parser, and correlation engine |
| src/ros2_medkit_fault_manager/test/test_integration.test.py | Integration tests for hierarchical and auto-cluster modes |
| src/ros2_medkit_fault_manager/test/test_correlation.yaml | Test correlation configuration |
| src/ros2_medkit_gateway/include/*/fault_manager.hpp | Extended get_faults signature with correlation parameters |
| src/ros2_medkit_gateway/src/fault_manager.cpp | Passes correlation parameters to service and formats response |
| src/ros2_medkit_gateway/src/http/handlers/fault_handlers.cpp | Parses correlation query params and includes correlation data in responses |
| src/ros2_medkit_fault_manager/README.md | Documents correlation configuration and usage |
| docs/tutorials/fault-correlation.rst | Complete tutorial for configuring and using correlation |
| docs/tutorials/index.rst | Adds correlation tutorial to index |
| postman/collections/*.json | Updates API documentation with correlation endpoints |
src/ros2_medkit_fault_manager/src/correlation/correlation_engine.cpp
Outdated
Show resolved
Hide resolved
73ecb79 to
d4f6574
Compare
Implement hierarchical and auto-cluster correlation modes to reduce fault noise by identifying root causes and grouping related faults. Key features: - Hierarchical mode: detect root cause → symptom relationships - Auto-cluster mode: group similar faults within time window - YAML-based configuration with pattern wildcards - Muted faults tracking and auto-clear on root cause resolution - Gateway API support with include_muted/include_clusters params New messages: MutedFaultInfo.msg, ClusterInfo.msg Extended: GetFaults.srv, ClearFault.srv, FaultEvent.msg
Add new requests demonstrating fault correlation query parameters: - GET Faults with Correlation Data (include_muted + include_clusters) - GET Faults with Muted Details Only - GET Faults with Cluster Details Only Update existing endpoint descriptions to document: - muted_count/cluster_count fields in GET /faults response - auto_cleared_codes in DELETE clear fault response - Faults folder description with correlation feature overview
Add comprehensive tutorial covering: - Hierarchical mode (root cause → symptom relationships) - Auto-cluster mode (grouping similar faults) - YAML configuration with patterns and rules - REST API query parameters for correlation data - Complete example configuration - Troubleshooting guide
d4f6574 to
5c9cf0e
Compare
nnarain
reviewed
Jan 20, 2026
- Fix data race in PatternMatcher::get_compiled() by adding mutex protection for the mutable compiled_cache_ member - Add retroactive_mute_codes field to ProcessFaultResult for tracking faults that should have been muted before cluster activation - Populate retroactive_mute_codes when cluster transitions from pending to active (reaches min_count threshold) - Add debug logging for retroactive muting in fault_manager_node - Add unit test AutoClusterRetroactiveMuting to verify behavior
Add 5-second timer to periodically call cleanup_expired() on the correlation engine. This prevents stale pending root causes and pending clusters from accumulating in memory when faults are reported infrequently. Add unit tests: - CleanupExpiredRemovesPendingRootCauses - CleanupExpiredRemovesPendingClusters
Collaborator
Author
- Skip regex compilation for non-wildcard patterns (perf improvement) - Add validation error for min_count=0 in auto-cluster rules - Add TODO documenting known limitation: pending_clusters_ not cleaned up when fault is cleared before cluster reaches min_count
a6b7781 to
e3f2c4d
Compare
bburda
reviewed
Jan 21, 2026
src/ros2_medkit_fault_manager/src/correlation/config_parser.cpp
Outdated
Show resolved
Hide resolved
src/ros2_medkit_fault_manager/src/correlation/correlation_engine.cpp
Outdated
Show resolved
Hide resolved
src/ros2_medkit_fault_manager/include/ros2_medkit_fault_manager/correlation/pattern_matcher.hpp
Show resolved
Hide resolved
- Add inline symptom codes support in YAML config (codes: [...] syntax) - Refactor severity_to_string/severity_rank to shared types.hpp/cpp - Add correlation.cleanup_interval_sec parameter (default 5.0s) - Update TODO(#105) → TODO(#127) for pending_clusters cleanup - Add PlantUML sequence diagrams to fault-correlation tutorial Addresses review comments: - Implement missing inline codes parsing (#1) - Fix orphaned TODO reference (#2) - Consolidate duplicated severity functions (#3) - Parameterize magic number 5s (#4) - Add flow diagrams to docs (#6)
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Pull Request
Summary
Implement hierarchical and auto-cluster correlation modes to reduce fault noise by identifying root causes and grouping related faults.
Key features:
New messages: MutedFaultInfo.msg, ClusterInfo.msg
Extended: GetFaults.srv, ClearFault.srv, FaultEvent.msg
Issue
Link the related issue (required):
Type
Testing
How was this tested / how should reviewers verify it?
Checklist