Skip to content

[ML] Reduce Log Noise by Downgrading Recoverable Errors #2888

@valeriy42

Description

@valeriy42

ML C++ components are generating excessive ERROR-level log messages for recoverable conditions. Production log analysis shows 94% of ERROR messages (764 out of 810) are non-critical issues that should be warnings or debug messages. This makes it difficult to identify genuine system failures.

Background

Problem Statement

In a recent log evaluation, we found 810 ERROR-level messages, but only ~46 represent actual system failures. The remaining 764 are:

  • Recoverable numerical issues (handled gracefully)
  • Data quality problems (late records, missing data)
  • Debug information incorrectly logged as errors
  • Expected edge cases in statistical computations

Error Categories & Impact

📊 Summary Table

Category Count % Current Proposed Priority
Out-of-order records 315 39% ERROR WARN 🔴 High
Debug context logs 167 21% ERROR DEBUG 🔴 High
Prior update failures (NaN) 82 10% ERROR WARN 🟡 Medium
Missing bucket counts 82 10% ERROR WARN 🟡 Medium
Process cleanup failures 58 7% ERROR WARN 🟢 Low
Bad probability bounds 25 3% ERROR WARN 🟢 Low
Multimodal split info 16 2% ERROR DEBUG 🟢 Low
Memory check timing 13 2% ERROR WARN 🟢 Low
Boost.Math errors 3 <1% ERROR WARN 🟢 Low
Other validation errors 3 <1% ERROR WARN 🟢 Low

Detailed Error Categories with Examples

🔴 Priority 1: Out-of-Order Records (315 occurrences)

File: lib/api/CAnomalyJob.cc:172

Current Error:

ERROR [file_abnormal_file_writes] [autodetect/15730] [CAnomalyJob.cc@172] 
Records must be in ascending time order. Record 'host.name,file.path,@timestamp,file.name,.,process.name,user.name
WSAMZN-CHCJT864,\Device\HarddiskVolume2\...,1765951200,...' 
time 1765951200 is before bucket time 1766077200

Why it happens: Late-arriving data due to network delays, clock skew, or buffering
Impact: Record is skipped, processing continues normally
Assessment: This is a data quality issue, not a system error

Code Change Example:

// BEFORE (current code around line 172)
if (recordTime < m_CurrentBucketTime) {
    LOG_ERROR(<< "Records must be in ascending time order. Record '"
              << record << "' time " << recordTime 
              << " is before bucket time " << m_CurrentBucketTime);
    return false;
}

// AFTER (proposed change)
if (recordTime < m_CurrentBucketTime) {
    LOG_WARN(<< "Skipping out-of-order record: time " << recordTime 
             << " is before current bucket " << m_CurrentBucketTime 
             << " (record: " << record << ")");
    return false;
}

🔴 Priority 2: Debug Context Logs (167 occurrences)

File: lib/maths/common/COneOfNPrior.cc:465-466

Current Errors:

ERROR [COneOfNPrior.cc@464] Update failed ( -1.455835745095495e+01 -nan -6.400016527678773e+00 -1.797675157930967e+308 )
ERROR [COneOfNPrior.cc@465] samples = [986724.5, 1002733]
ERROR [COneOfNPrior.cc@466] weights = [[0.25, 1, 0.9322762, 1], [0.25, 1, 0.9073955, 1]]

Why it happens: These are diagnostic dumps that appear after numerical failures
Impact: Informational only - helps debugging
Assessment: Lines 465-466 should be DEBUG level, not ERROR

Code Change Example:

// BEFORE (current code around lines 464-466)
catch (const std::exception& e) {
    LOG_ERROR(<< "Update failed " << minusBics);
    LOG_ERROR(<< "samples = " << samples);
    LOG_ERROR(<< "weights = " << weights);
    failed = true;
}

// AFTER (proposed change)
catch (const std::exception& e) {
    LOG_ERROR(<< "Update failed: " << e.what() << " minusBics=" << minusBics);
    LOG_DEBUG(<< "Context: samples = " << samples);
    LOG_DEBUG(<< "Context: weights = " << weights);
    failed = true;
}

🟡 Priority 3: Prior Update Failures with NaN (82 occurrences)

File: lib/maths/common/COneOfNPrior.cc:464

Current Error:

ERROR [COneOfNPrior.cc@464] Update failed ( -1.455835745095495e+01 -nan -6.400016527678773e+00 -1.797675157930967e+308 )

Why it happens: Numerical overflow/underflow in Bayesian model updates with extreme values
Impact: Update is skipped, model continues with previous state
Assessment: Recoverable condition - should be WARNING

Code Change Example:

// BEFORE
catch (const std::exception& e) {
    LOG_ERROR(<< "Update failed " << minusBics);
    LOG_ERROR(<< "samples = " << samples);
    LOG_ERROR(<< "weights = " << weights);
    failed = true;
}

// AFTER
catch (const std::exception& e) {
    LOG_WARN(<< "Prior update failed due to numerical instability: " 
             << e.what() << ", minusBics=" << minusBics
             << " (using previous model state)");
    LOG_DEBUG(<< "Failed update context: samples=" << samples 
              << ", weights=" << weights);
    failed = true;
}

🟡 Priority 4: Missing Bucket Counts (82 occurrences)

File: lib/api/CBucketGatherer.cc:486

Current Error:

ERROR [abnormal_number_file_writes] [autodetect/710] [CBucketGatherer.cc@486] 
No counts available at 1766145600, current bucket = [1766149200,1766152800)

Why it happens: Data gaps, sparse data, or timing mismatches
Impact: Bucket is skipped, analysis continues with next bucket
Assessment: Expected with sparse data - should be WARNING

Code Change Example:

// BEFORE
if (!haveCounts) {
    LOG_ERROR(<< "No counts available at " << time 
              << ", current bucket = [" << bucketStart << "," << bucketEnd << ")");
    return false;
}

// AFTER  
if (!haveCounts) {
    LOG_WARN(<< "Skipping bucket due to missing counts at " << time 
             << " (bucket window: [" << bucketStart << "," << bucketEnd 
             << "]) - data may be sparse");
    return false;
}

🟢 Priority 5: Boost.Math Distribution Errors (3 occurrences)

File: lib/maths/common/CXMeansOnline1d.cc:468

Current Error:

ERROR [CXMeansOnline1d.cc@468] Bad category = (0.982161, -2.16579e+19, 0): 
Error in function boost::math::normal_distribution<double>::normal_distribution: 
Scale parameter is -nan, but must be > 0 !

Why it happens: Negative variance from numerical instability
Impact: Exception caught, winsorization skipped, processing continues
Assessment: Recoverable - should be WARNING

Code Change Example:

// BEFORE (around line 467-469)
} catch (const std::exception& e) {
    LOG_ERROR(<< "Bad category = " << category << ": " << e.what());
}

// AFTER (with helper function)
} catch (const std::exception& e) {
    // Boost.Math throws std::domain_error for invalid parameters
    if (isNumericalError(e)) {
        LOG_WARN(<< "Numerical error in category computation (recovered): " 
                 << category << " - " << e.what());
    } else {
        LOG_ERROR(<< "Unexpected error in category computation: " 
                  << category << " - " << e.what());
    }
}

// Helper function to add at top of file in anonymous namespace
namespace {
bool isNumericalError(const std::exception& e) {
    std::string msg = e.what();
    return msg.find("boost::math") != std::string::npos ||
           msg.find("Scale parameter") != std::string::npos ||
           msg.find("-nan") != std::string::npos ||
           msg.find("-inf") != std::string::npos;
}
} // anonymous namespace

Implementation Guide for New Developers

🔍 Step 1: Find the Files

All files are under the repository root. Use these paths:

# API/Job processing
lib/api/CAnomalyJob.cc                  # Out-of-order records
lib/api/CBucketGatherer.cc              # Missing bucket counts

# Math/Statistics  
lib/maths/common/COneOfNPrior.cc        # Prior updates, debug logs
lib/maths/common/CXMeansOnline1d.cc     # Boost.Math errors
lib/maths/common/CMultimodalPrior.cc    # Mode split messages
lib/maths/common/CResourceMonitor.cc    # Memory stability check

# Controller (process management)
bin/controller/CCommandProcessor.cc     # Process kill failures
bin/controller/CDetachedProcessSpawner.cc  # Child process checks

🔧 Step 2: Understand the Logging Macros

The codebase uses these logging macros (defined in include/core/LogMacros.h):

LOG_TRACE(<< message)   // Very verbose, development only
LOG_DEBUG(<< message)   // Detailed diagnostic info
LOG_INFO(<< message)    // Informational messages
LOG_WARN(<< message)    // Warning - unexpected but handled
LOG_ERROR(<< message)   // Error - something failed
LOG_FATAL(<< message)   // Fatal - about to crash

// Note the << operator - these use C++ stream syntax
LOG_WARN(<< "Counter: " << count << ", Max: " << max);

🎯 Step 3: Making Changes

General Pattern:

  1. Locate the LOG_ERROR statement using grep or search
  2. Check the context - is the error caught and handled?
  3. Change LOG_ERROR to LOG_WARN if recoverable
  4. Improve the message to explain what's happening
  5. Consider adding LOG_DEBUG for additional context

Example Search Commands:

# Find all LOG_ERROR statements in a file
grep -n "LOG_ERROR" lib/maths/common/COneOfNPrior.cc

# See context around line 464
sed -n '460,470p' lib/maths/common/COneOfNPrior.cc

📝 Step 4: Testing Your Changes

After making changes, verify:

# Build the project (from repository root)
cmake -S . -B build
cmake --build build

# Run tests for the component you modified
cd build
ctest -R <test_name>

# For example, after modifying COneOfNPrior.cc:
ctest -R COneOfNPriorTest

Files to Modify (Priority Order)

🔴 High Priority (reduces ~60% of errors)

  1. lib/api/CAnomalyJob.cc

    • Line ~172: Change "Records must be in ascending time order" to LOG_WARN
    • Estimated: 315 fewer errors
  2. lib/maths/common/COneOfNPrior.cc

    • Lines ~465-466: Change samples/weights dumps to LOG_DEBUG
    • Estimated: 167 fewer errors

🟡 Medium Priority (reduces ~20% of errors)

  1. lib/maths/common/COneOfNPrior.cc

    • Line ~464: Change "Update failed" to LOG_WARN
    • Line ~960: Change "Bad probability bounds" to LOG_WARN
    • Estimated: 107 fewer errors
  2. lib/api/CBucketGatherer.cc

    • Line ~486: Change "No counts available" to LOG_WARN
    • Estimated: 82 fewer errors

🟢 Lower Priority (reduces ~14% of errors)

  1. lib/maths/common/CXMeansOnline1d.cc

    • Line ~468: Add helper function, change to LOG_WARN for numerical errors
    • Line ~530: Change "Expected 2-split" to LOG_WARN
  2. lib/maths/common/CMultimodalPrior.cc

    • Lines ~1236, 1454: Change SPLIT messages to LOG_DEBUG
  3. bin/controller/CCommandProcessor.cc & CDetachedProcessSpawner.cc

    • Lines ~119, ~109: Change process kill failures to LOG_WARN
  4. lib/maths/common/CResourceMonitor.cc

    • Line ~334: Change memory stability check to LOG_WARN

Metadata

Metadata

Assignees

Type

No type

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions