Background
EZID is running critically low on disk space because of inefficient logging practices and error handling. The workarounds adopted so far introduce operational risk.
Current Problems
System Issues
Disk space constantly under pressure due to excessive and errant logging
Systemd journal being written to app disk (2-year-old workaround)
Current log generation rate: ~1.5GB per day
Error volume prevents proper log ingestion into OpenSearch
Disk space incidents require emergency pruning by Ashley
Code Issues
Widespread use of log.exception() instead of an appropriate log level, so full stack traces are logged for expected conditions (e.g., 404s from link checker)
Dynamic user-provided identifiers/links can cause unpredictable log volume growth
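The two code issues above can be illustrated with a minimal sketch. It is not taken from the EZID codebase: the logger name, the `LinkNotFound` exception, the `record_check` helper, and the 200-character cap are all hypothetical. It shows an expected 404 logged as a single line, an unexpected failure keeping its traceback, and user-provided text clipped before it is logged.

```python
import logging

log = logging.getLogger("linkchecker_sketch")  # hypothetical logger name

MAX_LOGGED_CHARS = 200  # assumed cap on user-provided text in one log line


def clip(value: str, limit: int = MAX_LOGGED_CHARS) -> str:
    """Bound the length of user-provided text before it reaches the log."""
    return value if len(value) <= limit else value[:limit] + "...[truncated]"


class LinkNotFound(Exception):
    """Hypothetical exception for an expected condition: the target returned 404."""


def record_check(url: str, check) -> bool:
    """Run check(url) and log the outcome at a level that matches its severity."""
    try:
        check(url)
        return True
    except LinkNotFound:
        # Expected outcome for a link checker: one line, no stack trace.
        log.warning("link check got 404: %s", clip(url))
        return False
    except Exception:
        # Genuinely unexpected: log.exception() records the full traceback.
        log.exception("link check failed unexpectedly: %s", clip(url))
        return False
```

Whether an expected 404 belongs at WARNING or INFO is a judgment call; the point of the sketch is only that it does not need a traceback.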
Technical Debt
Cron jobs copying journald logs to app disk
30-day retention policy creating storage pressure
Lack of log rotation and compression strategy
Short-Term Mitigation
Storage Management
Cleared additional log space
Suspended data sync crontab temporarily
Long-Term Solutions
Code Improvements
Error Handling Refactor
Audit all log.exception() usage across codebase
Implement appropriate error level logging
Create logging standards documentation
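As a starting point for the audit, here is a rough sketch that lists every call of the form `<something>.exception(...)` in a Python source tree using the standard `ast` module. The function names are hypothetical. The match is deliberately broad, so it will also flag unrelated methods named `exception` (for example on futures), and each hit would still need manual review.

```python
import ast
from pathlib import Path


def find_exception_calls(source: str, filename: str = "<string>"):
    """Return sorted (line, text) pairs for every `<obj>.exception(...)` call."""
    tree = ast.parse(source, filename=filename)
    lines = source.splitlines()
    hits = []
    for node in ast.walk(tree):
        if (
            isinstance(node, ast.Call)
            and isinstance(node.func, ast.Attribute)
            and node.func.attr == "exception"
        ):
            hits.append((node.lineno, lines[node.lineno - 1].strip()))
    return sorted(hits)


def audit_tree(root: str) -> None:
    """Print file:line for each candidate call under root (assumes UTF-8 sources)."""
    for path in sorted(Path(root).rglob("*.py")):
        text = path.read_text(encoding="utf-8")
        for lineno, snippet in find_exception_calls(text, str(path)):
            print(f"{path}:{lineno}: {snippet}")
```

Running `audit_tree(".")` from the repository root would give a checklist of call sites to classify as real errors or expected conditions.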
Infrastructure Improvements
Log Management
Return to standard journald configuration, rotation, and retention policies
Review what is logged in var/log/ezid: the ecs_json, request, and trace logs all seem to contain duplicate info
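If any application-level log files remain after the review, rotation and compression could be handled in Python's standard `logging` module. The following is a sketch under that assumption, not a description of EZID's current configuration. The size limit, backup count, and function names are illustrative. It uses the `rotator` and `namer` hooks of `RotatingFileHandler` to gzip each rotated file.

```python
import gzip
import logging
import logging.handlers
import os
import shutil


def gzip_rotator(source: str, dest: str) -> None:
    """Compress the rotated-out file and remove the uncompressed copy."""
    with open(source, "rb") as src, gzip.open(dest, "wb") as dst:
        shutil.copyfileobj(src, dst)
    os.remove(source)


def gzip_namer(default_name: str) -> str:
    """Give rotated files a .gz suffix, e.g. app.log.1.gz."""
    return default_name + ".gz"


def make_rotating_handler(path, max_bytes=50 * 1024 * 1024, backups=5):
    """Build a size-based rotating handler that keeps `backups` gzipped files."""
    handler = logging.handlers.RotatingFileHandler(
        path, maxBytes=max_bytes, backupCount=backups
    )
    handler.rotator = gzip_rotator
    handler.namer = gzip_namer
    return handler
```

With these illustrative defaults, each log file would be bounded at roughly 50 MB uncompressed plus five compressed backups, which gives a predictable ceiling instead of unbounded growth.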
Monitoring and Alerts
Create early warning system for log volume issues
Set up trend analysis for log growth
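One simple form the early warning could take is a projection of days until the disk fills, based on recent daily samples of log volume. The sketch below is hypothetical: the function names and the 14-day warning threshold are assumptions, and it uses a plain average of day-over-day growth instead of a proper trend model.

```python
from typing import Optional, Sequence


def average_daily_growth(daily_sizes: Sequence[int]) -> float:
    """Mean day-over-day change in bytes across a series of daily size samples."""
    if len(daily_sizes) < 2:
        raise ValueError("need at least two samples")
    return (daily_sizes[-1] - daily_sizes[0]) / (len(daily_sizes) - 1)


def days_until_full(free_bytes: int, growth_per_day: float) -> Optional[float]:
    """Projected days until free space is exhausted; None if usage is not growing."""
    if growth_per_day <= 0:
        return None
    return free_bytes / growth_per_day


def should_alert(free_bytes: int, daily_sizes: Sequence[int], warn_days: float = 14) -> bool:
    """True when the projected time to a full disk drops below warn_days (assumed threshold)."""
    remaining = days_until_full(free_bytes, average_daily_growth(daily_sizes))
    return remaining is not None and remaining < warn_days
```

In practice the free-space figure could come from `shutil.disk_usage(path).free`, with the daily samples recorded by a scheduled job. At the ~1.5GB per day rate noted above, 15 GB of free space would project to about 10 days.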
Success Metrics
Reduced daily log volume
No disk space incidents
All logs ingested into OpenSearch (within reasonable error threshold)
Reduced error noise in logging
Clear separation between actual errors and expected conditions in codebase