[EPIC] Logging Assessment and Improvements #836

Open
adambuttrick opened this issue Feb 14, 2025 · 0 comments

Background

EZID is experiencing critical disk space issues caused by inefficient logging practices and error handling. The workarounds adopted so far carry operational risks of their own.

Current Problems

System Issues

  • Disk space constantly under pressure due to excessive and errant logging
  • Systemd journal being written to app disk (2-year-old workaround)
  • Current log generation rate: ~1.5GB per day
  • Error volume prevents proper log ingestion into OpenSearch
  • Disk space incidents require emergency pruning by Ashley

Code Issues

  • Widespread use of log.exception() instead of appropriate error levels, resulting in full stack traces being logged for expected conditions (e.g., 404s from link checker)
  • Dynamic user-provided identifiers/links can cause unpredictable log volume growth
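
A minimal sketch of the two code issues above, assuming a link-checker-style call site. The names `record_check_result`, `clip`, and `MAX_LOGGED_VALUE_LEN` are hypothetical and do not come from the EZID codebase. Expected conditions such as a 404 are logged at WARNING without a traceback, and user-provided values are clipped so a single log line has a bounded size.

```python
import logging

log = logging.getLogger("linkcheck_sketch")  # hypothetical logger name

MAX_LOGGED_VALUE_LEN = 200  # hypothetical cap on user-provided text per log line


def clip(value, limit=MAX_LOGGED_VALUE_LEN):
    """Shorten a user-provided value so one log line has a bounded size."""
    text = str(value)
    if len(text) <= limit:
        return text
    return text[:limit] + "...[truncated]"


def record_check_result(url, status_code, error=None):
    """Log one link check outcome and return the level that was used."""
    if error is not None:
        # Unexpected failure: keep the traceback, as log.exception() would.
        log.error("link check failed for %s", clip(url), exc_info=error)
        return logging.ERROR
    if status_code >= 400:
        # Expected condition (e.g. a 404): one line, no stack trace.
        log.warning("link check got HTTP %d for %s", status_code, clip(url))
        return logging.WARNING
    log.debug("link check ok (%d) for %s", status_code, clip(url))
    return logging.DEBUG
```

Whether a 4xx response belongs at WARNING or INFO is a policy decision for the logging standards document; the point of the sketch is only that the traceback is reserved for failures that are not anticipated.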

Technical Debt

  • Cron jobs copying journald logs to app disk
  • 30-day retention policy creating storage pressure
  • Lack of log rotation and compression strategy
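
One possible shape for application-level rotation with compression, using only the standard library `logging.handlers.RotatingFileHandler` and its `rotator` and `namer` hooks. This is a sketch under assumptions: the size limit, backup count, and the helper name `build_rotating_handler` are illustrative, and the issue does not say whether rotation should live in the application or in an external tool such as logrotate.

```python
import gzip
import logging
import logging.handlers
import os
import shutil


def gzip_rotator(source, dest):
    """Compress the rotated file into dest, then remove the uncompressed copy."""
    with open(source, "rb") as src, gzip.open(dest, "wb") as dst:
        shutil.copyfileobj(src, dst)
    os.remove(source)


def gzip_namer(default_name):
    """Give rotated files a .gz suffix, e.g. app.log.1 becomes app.log.1.gz."""
    return default_name + ".gz"


def build_rotating_handler(path, max_bytes=50 * 1024 * 1024, backup_count=7):
    """Return a size-based rotating handler that gzips each rotated file.

    max_bytes and backup_count are illustrative defaults, not EZID settings.
    """
    handler = logging.handlers.RotatingFileHandler(
        path, maxBytes=max_bytes, backupCount=backup_count
    )
    handler.rotator = gzip_rotator
    handler.namer = gzip_namer
    return handler
```

With `backup_count=7`, at most seven compressed files are kept next to the active log, which puts a fixed ceiling on disk use for that log regardless of traffic.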

Short-Term Mitigation

  1. Storage Management
    • Cleared additional log space
    • Suspended data sync crontab temporarily

Long-Term Solutions

Code Improvements

  1. Error Handling Refactor
    • Audit all log.exception() usage across codebase
    • Implement appropriate error level logging
    • Create logging standards documentation
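
The audit step could start from a small script like the following, which uses the standard library `ast` module to list every call of the form `<anything>.exception(...)`. The function names `find_exception_calls` and `audit_tree` are hypothetical. The match is purely syntactic, so it would also flag unrelated methods named `exception`, and each hit still needs a human decision about the right level.

```python
import ast
from pathlib import Path


def find_exception_calls(source, filename="<string>"):
    """Return sorted (filename, line) pairs for each `.exception(...)` call."""
    hits = []
    for node in ast.walk(ast.parse(source, filename=filename)):
        if (
            isinstance(node, ast.Call)
            and isinstance(node.func, ast.Attribute)
            and node.func.attr == "exception"
        ):
            hits.append((filename, node.lineno))
    return sorted(hits)


def audit_tree(root):
    """Scan every .py file under root; files that fail to parse are skipped."""
    hits = []
    for path in sorted(Path(root).rglob("*.py")):
        try:
            text = path.read_text(encoding="utf-8")
            hits.extend(find_exception_calls(text, str(path)))
        except (SyntaxError, UnicodeDecodeError):
            continue
    return hits
```

Calling `audit_tree(".")` from the repository root returns a list that can be turned into a checklist for the refactor.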

Infrastructure Improvements

  1. Log Management

    • Return to standard journald configuration, rotation, and retention policies
    • Review what is logged in var/log/ezid: the ecs_json, request, and trace logs all appear to contain duplicate information
  2. Monitoring and Alerts

    • Create early warning system for log volume issues
    • Set up trend analysis for log growth
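
As a rough illustration of the early warning and trend analysis items, the sketch below measures a log directory, fits a least-squares line through daily size samples, and estimates how many days remain before the disk fills. All function names are hypothetical, and how samples are collected and where alerts are sent are left open.

```python
import os


def directory_size_bytes(root):
    """Sum the sizes of all regular files under root."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            try:
                total += os.path.getsize(os.path.join(dirpath, name))
            except OSError:
                # File rotated or removed while walking; ignore it.
                pass
    return total


def daily_growth_bytes(samples):
    """Least-squares slope of (day_number, size_bytes) samples, in bytes/day."""
    n = len(samples)
    if n < 2:
        return 0.0
    mean_x = sum(day for day, _ in samples) / n
    mean_y = sum(size for _, size in samples) / n
    denom = sum((day - mean_x) ** 2 for day, _ in samples)
    if denom == 0:
        return 0.0
    numer = sum((day - mean_x) * (size - mean_y) for day, size in samples)
    return numer / denom


def days_until_full(free_bytes, growth_per_day):
    """Return the estimated days of headroom, or None if logs are not growing."""
    if growth_per_day <= 0:
        return None
    return free_bytes / growth_per_day
```

At the ~1.5GB per day rate reported above, 15GB of free space gives `days_until_full(15e9, 1.5e9)` of 10 days, so an alert threshold can be expressed in days of headroom rather than raw bytes.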

Success Metrics

  • Reduced daily log volume
  • No disk space incidents
  • All logs ingested into OpenSearch (within reasonable error threshold)
  • Reduced error noise in logging
  • Clear separation between actual errors and expected conditions in codebase
adambuttrick changed the title from "[EPIC]" to "[EPIC] Logging Assessment and Improvement" Feb 14, 2025
adambuttrick changed the title from "[EPIC] Logging Assessment and Improvement" to "[EPIC] Logging Assessment and Improvements" Feb 14, 2025