-
Notifications
You must be signed in to change notification settings - Fork 67
Description
ML C++ components are generating excessive ERROR-level log messages for recoverable conditions. Production log analysis shows 94% of ERROR messages (764 out of 810) are non-critical issues that should be warnings or debug messages. This makes it difficult to identify genuine system failures.
Background
Problem Statement
In a recent log evaluation, we found 810 ERROR-level messages, but only ~46 represent actual system failures. The remaining 764 are:
- Recoverable numerical issues (handled gracefully)
- Data quality problems (late records, missing data)
- Debug information incorrectly logged as errors
- Expected edge cases in statistical computations
Error Categories & Impact
📊 Summary Table
| Category | Count | % | Current | Proposed | Priority |
|---|---|---|---|---|---|
| Out-of-order records | 315 | 39% | ERROR | WARN | 🔴 High |
| Debug context logs | 167 | 21% | ERROR | DEBUG | 🔴 High |
| Prior update failures (NaN) | 82 | 10% | ERROR | WARN | 🟡 Medium |
| Missing bucket counts | 82 | 10% | ERROR | WARN | 🟡 Medium |
| Process cleanup failures | 58 | 7% | ERROR | WARN | 🟢 Low |
| Bad probability bounds | 25 | 3% | ERROR | WARN | 🟢 Low |
| Multimodal split info | 16 | 2% | ERROR | DEBUG | 🟢 Low |
| Memory check timing | 13 | 2% | ERROR | WARN | 🟢 Low |
| Boost.Math errors | 3 | <1% | ERROR | WARN | 🟢 Low |
| Other validation errors | 3 | <1% | ERROR | WARN | 🟢 Low |
Detailed Error Categories with Examples
🔴 Priority 1: Out-of-Order Records (315 occurrences)
File: lib/api/CAnomalyJob.cc:172
Current Error:
ERROR [file_abnormal_file_writes] [autodetect/15730] [CAnomalyJob.cc@172]
Records must be in ascending time order. Record 'host.name,file.path,@timestamp,file.name,.,process.name,user.name
WSAMZN-CHCJT864,\Device\HarddiskVolume2\...,1765951200,...'
time 1765951200 is before bucket time 1766077200
Why it happens: Late-arriving data due to network delays, clock skew, or buffering
Impact: Record is skipped, processing continues normally
Assessment: This is a data quality issue, not a system error
Code Change Example:
// BEFORE (current code around line 172)
if (recordTime < m_CurrentBucketTime) {
LOG_ERROR(<< "Records must be in ascending time order. Record '"
<< record << "' time " << recordTime
<< " is before bucket time " << m_CurrentBucketTime);
return false;
}
// AFTER (proposed change)
if (recordTime < m_CurrentBucketTime) {
LOG_WARN(<< "Skipping out-of-order record: time " << recordTime
<< " is before current bucket " << m_CurrentBucketTime
<< " (record: " << record << ")");
return false;
}🔴 Priority 2: Debug Context Logs (167 occurrences)
File: lib/maths/common/COneOfNPrior.cc:465-466
Current Errors:
ERROR [COneOfNPrior.cc@464] Update failed ( -1.455835745095495e+01 -nan -6.400016527678773e+00 -1.797675157930967e+308 )
ERROR [COneOfNPrior.cc@465] samples = [986724.5, 1002733]
ERROR [COneOfNPrior.cc@466] weights = [[0.25, 1, 0.9322762, 1], [0.25, 1, 0.9073955, 1]]
Why it happens: These are diagnostic dumps that appear after numerical failures
Impact: Informational only - helps debugging
Assessment: Lines 465-466 should be DEBUG level, not ERROR
Code Change Example:
// BEFORE (current code around lines 464-466)
catch (const std::exception& e) {
LOG_ERROR(<< "Update failed " << minusBics);
LOG_ERROR(<< "samples = " << samples);
LOG_ERROR(<< "weights = " << weights);
failed = true;
}
// AFTER (proposed change)
catch (const std::exception& e) {
LOG_ERROR(<< "Update failed: " << e.what() << " minusBics=" << minusBics);
LOG_DEBUG(<< "Context: samples = " << samples);
LOG_DEBUG(<< "Context: weights = " << weights);
failed = true;
}🟡 Priority 3: Prior Update Failures with NaN (82 occurrences)
File: lib/maths/common/COneOfNPrior.cc:464
Current Error:
ERROR [COneOfNPrior.cc@464] Update failed ( -1.455835745095495e+01 -nan -6.400016527678773e+00 -1.797675157930967e+308 )
Why it happens: Numerical overflow/underflow in Bayesian model updates with extreme values
Impact: Update is skipped, model continues with previous state
Assessment: Recoverable condition - should be WARNING
Code Change Example:
// BEFORE
catch (const std::exception& e) {
LOG_ERROR(<< "Update failed " << minusBics);
LOG_ERROR(<< "samples = " << samples);
LOG_ERROR(<< "weights = " << weights);
failed = true;
}
// AFTER
catch (const std::exception& e) {
LOG_WARN(<< "Prior update failed due to numerical instability: "
<< e.what() << ", minusBics=" << minusBics
<< " (using previous model state)");
LOG_DEBUG(<< "Failed update context: samples=" << samples
<< ", weights=" << weights);
failed = true;
}🟡 Priority 4: Missing Bucket Counts (82 occurrences)
File: lib/api/CBucketGatherer.cc:486
Current Error:
ERROR [abnormal_number_file_writes] [autodetect/710] [CBucketGatherer.cc@486]
No counts available at 1766145600, current bucket = [1766149200,1766152800)
Why it happens: Data gaps, sparse data, or timing mismatches
Impact: Bucket is skipped, analysis continues with next bucket
Assessment: Expected with sparse data - should be WARNING
Code Change Example:
// BEFORE
if (!haveCounts) {
LOG_ERROR(<< "No counts available at " << time
<< ", current bucket = [" << bucketStart << "," << bucketEnd << ")");
return false;
}
// AFTER
if (!haveCounts) {
LOG_WARN(<< "Skipping bucket due to missing counts at " << time
<< " (bucket window: [" << bucketStart << "," << bucketEnd
<< "]) - data may be sparse");
return false;
}🟢 Priority 5: Boost.Math Distribution Errors (3 occurrences)
File: lib/maths/common/CXMeansOnline1d.cc:468
Current Error:
ERROR [CXMeansOnline1d.cc@468] Bad category = (0.982161, -2.16579e+19, 0):
Error in function boost::math::normal_distribution<double>::normal_distribution:
Scale parameter is -nan, but must be > 0 !
Why it happens: Negative variance from numerical instability
Impact: Exception caught, winsorization skipped, processing continues
Assessment: Recoverable - should be WARNING
Code Change Example:
// BEFORE (around line 467-469)
} catch (const std::exception& e) {
LOG_ERROR(<< "Bad category = " << category << ": " << e.what());
}
// AFTER (with helper function)
} catch (const std::exception& e) {
// Boost.Math throws std::domain_error for invalid parameters
if (isNumericalError(e)) {
LOG_WARN(<< "Numerical error in category computation (recovered): "
<< category << " - " << e.what());
} else {
LOG_ERROR(<< "Unexpected error in category computation: "
<< category << " - " << e.what());
}
}
// Helper function to add at top of file in anonymous namespace
namespace {
bool isNumericalError(const std::exception& e) {
std::string msg = e.what();
return msg.find("boost::math") != std::string::npos ||
msg.find("Scale parameter") != std::string::npos ||
msg.find("-nan") != std::string::npos ||
msg.find("-inf") != std::string::npos;
}
} // anonymous namespaceImplementation Guide for New Developers
🔍 Step 1: Find the Files
All files are under the repository root. Use these paths:
# API/Job processing
lib/api/CAnomalyJob.cc # Out-of-order records
lib/api/CBucketGatherer.cc # Missing bucket counts
# Math/Statistics
lib/maths/common/COneOfNPrior.cc # Prior updates, debug logs
lib/maths/common/CXMeansOnline1d.cc # Boost.Math errors
lib/maths/common/CMultimodalPrior.cc # Mode split messages
lib/maths/common/CResourceMonitor.cc # Memory stability check
# Controller (process management)
bin/controller/CCommandProcessor.cc # Process kill failures
bin/controller/CDetachedProcessSpawner.cc # Child process checks🔧 Step 2: Understand the Logging Macros
The codebase uses these logging macros (defined in include/core/LogMacros.h):
LOG_TRACE(<< message) // Very verbose, development only
LOG_DEBUG(<< message) // Detailed diagnostic info
LOG_INFO(<< message) // Informational messages
LOG_WARN(<< message) // Warning - unexpected but handled
LOG_ERROR(<< message) // Error - something failed
LOG_FATAL(<< message) // Fatal - about to crash
// Note the << operator - these use C++ stream syntax
LOG_WARN(<< "Counter: " << count << ", Max: " << max);🎯 Step 3: Making Changes
General Pattern:
- Locate the LOG_ERROR statement using grep or search
- Check the context - is the error caught and handled?
- Change LOG_ERROR to LOG_WARN if recoverable
- Improve the message to explain what's happening
- Consider adding LOG_DEBUG for additional context
Example Search Commands:
# Find all LOG_ERROR statements in a file
grep -n "LOG_ERROR" lib/maths/common/COneOfNPrior.cc
# See context around line 464
sed -n '460,470p' lib/maths/common/COneOfNPrior.cc📝 Step 4: Testing Your Changes
After making changes, verify:
# Build the project (from repository root)
cmake -S . -B build
cmake --build build
# Run tests for the component you modified
cd build
ctest -R <test_name>
# For example, after modifying COneOfNPrior.cc:
ctest -R COneOfNPriorTestFiles to Modify (Priority Order)
🔴 High Priority (reduces ~60% of errors)
-
lib/api/CAnomalyJob.cc
- Line ~172: Change "Records must be in ascending time order" to LOG_WARN
- Estimated: 315 fewer errors
-
lib/maths/common/COneOfNPrior.cc
- Lines ~465-466: Change samples/weights dumps to LOG_DEBUG
- Estimated: 167 fewer errors
🟡 Medium Priority (reduces ~20% of errors)
-
lib/maths/common/COneOfNPrior.cc
- Line ~464: Change "Update failed" to LOG_WARN
- Line ~960: Change "Bad probability bounds" to LOG_WARN
- Estimated: 107 fewer errors
-
lib/api/CBucketGatherer.cc
- Line ~486: Change "No counts available" to LOG_WARN
- Estimated: 82 fewer errors
🟢 Lower Priority (reduces ~14% of errors)
-
lib/maths/common/CXMeansOnline1d.cc
- Line ~468: Add helper function, change to LOG_WARN for numerical errors
- Line ~530: Change "Expected 2-split" to LOG_WARN
-
lib/maths/common/CMultimodalPrior.cc
- Lines ~1236, 1454: Change SPLIT messages to LOG_DEBUG
-
bin/controller/CCommandProcessor.cc & CDetachedProcessSpawner.cc
- Lines ~119, ~109: Change process kill failures to LOG_WARN
-
lib/maths/common/CResourceMonitor.cc
- Line ~334: Change memory stability check to LOG_WARN