Add short-term analytical logging #81
Excerpt from logs for above incident
Impact was about 3 minutes of downtime according to monitoring.
This was referenced Jul 26, 2021
raxod502 added the panic (Things that really need to be dealt with ASAP to keep the lights on) label Jul 27, 2021
raxod502 removed the panic (Things that really need to be dealt with ASAP to keep the lights on) label Aug 21, 2021
Riju has been brought down temporarily a few times over the last week or two, most likely because of people running pathological code. To understand and address the failure conditions, we need visibility into exactly which operations and usage patterns are causing resource exhaustion. We should therefore add analytical logging that records which API operations are called by which clients at which times, with a retention of perhaps a few days. This data will allow me to investigate the root cause of abnormal behavior that shows up in monitoring.
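As a rough illustration of the kind of logging described above, here is a minimal in-memory sketch in Python. It is not Riju's actual implementation: the names `ApiEvent`, `AnalyticsLog`, `record`, and `top_clients` are hypothetical, the client identifier and operation names are placeholders, and the three-day retention is only one reading of "a few days". It also assumes events arrive in roughly chronological order.

```python
import time
from collections import Counter, deque
from dataclasses import dataclass


@dataclass(frozen=True)
class ApiEvent:
    """One logged API call (hypothetical schema)."""
    timestamp: float  # seconds since the epoch
    client: str       # placeholder client identifier, e.g. an IP address
    operation: str    # placeholder API operation name


class AnalyticsLog:
    """Keeps recent API events in memory and drops old ones."""

    def __init__(self, retention_seconds=3 * 24 * 3600):
        # Assumed retention of three days; adjust as needed.
        self.retention_seconds = retention_seconds
        self._events = deque()

    def record(self, client, operation, now=None):
        """Append an event, then discard anything older than the retention window."""
        now = time.time() if now is None else now
        self._events.append(ApiEvent(now, client, operation))
        self._prune(now)

    def _prune(self, now):
        # Events are assumed to be appended in time order, so the oldest
        # ones are always at the left end of the deque.
        cutoff = now - self.retention_seconds
        while self._events and self._events[0].timestamp < cutoff:
            self._events.popleft()

    def top_clients(self, start, end, limit=5):
        """Return the busiest (client, operation) pairs within [start, end]."""
        counts = Counter(
            (event.client, event.operation)
            for event in self._events
            if start <= event.timestamp <= end
        )
        return counts.most_common(limit)
```

When monitoring shows a spike, calling `top_clients` with the start and end times of the spike would list which clients were issuing which operations during that window. A real deployment would presumably persist events somewhere durable rather than hold them in process memory.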