Our alerts #41
jlangy
announced in
Operations
Overview
Alerting for sso-keycloak uses Sysdig for monitoring and alerting, which integrates with Opsgenie for escalating issues. Please use this thread to recommend new alerts or procedures, or to ask questions!
The Sysdig alerts and dashboards are managed exclusively through the Terraform files stored in the sso-sysdig repo. Any update made to them in the Sysdig GUI will be overwritten by the next PR to the main branch of that repo. Further documentation can be found in the sso-sysdig repo.
Team members with access to our private repos can find more information on the team's bcgov-c wiki.
Tools
1. Low-level alerts escalate only through Opsgenie. Higher-priority alerts are also sent by email and Rocket.Chat.
Alerts
Ready Pods
Because the pod count alerts were producing a significant number of false positives, we created a new set of alerts (July 2022). The new alerts measure the difference between the number of desired pods and the number of ready pods. If Sysdig fails on a given Kubernetes node and does not report the status of a pod on it, no alert is triggered, so there is no false positive.
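To illustrate the idea, here is a minimal Python sketch of the desired-versus-ready comparison. It is only a simplified model: the real alerts are defined in the Terraform files in the sso-sysdig repo, and the function name, the `max_missing` parameter, and the use of `None` to represent an unreported value are all assumptions made for this example.

```python
from typing import Optional


def ready_pods_alert(desired: Optional[int],
                     ready: Optional[int],
                     max_missing: int = 0) -> bool:
    """Return True when the (hypothetical) ready-pods alert should fire.

    desired / ready are pod counts; None models a value that was not
    reported, for example because monitoring failed on a node.
    max_missing is an assumed tolerance, not a value from our config.
    """
    if desired is None or ready is None:
        # No data means no difference can be computed, so nothing fires.
        return False
    # Fire only when more pods are missing than the tolerance allows.
    return (desired - ready) > max_missing


# Example: 3 desired, 1 ready fires; an unreported count does not.
print(ready_pods_alert(3, 1))
print(ready_pods_alert(3, None))
```

The point of comparing the two counts, rather than alerting on an absolute pod count, is that a missing report leaves nothing to compare and therefore cannot fire by itself.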
A recent upgrade to Sysdig made pod count alerts prone to false triggers. These alerts were disconnected from Opsgenie, though they still send an alert to Rocket.Chat with the prefix "OLD ALERT". Our instance monitors an rh-sso deployment as well as a Patroni stateful set, and alerts are set up based on the number of available pods.
CPU Usage
CPU has been our best indicator to date of cluster health. We monitor the CPU usage of our rh-sso deployment for spikes.
Available Space
Alerts are set up to notify us when our database volumes are approaching maximum capacity.
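As a rough sketch of what "approaching maximum capacity" means in practice, the check below compares used space against a fraction of the volume size. The function name and the 0.8 threshold are hypothetical, chosen only for illustration; the actual thresholds live in the Terraform files in the sso-sysdig repo.

```python
def volume_near_capacity(used_bytes: int,
                         capacity_bytes: int,
                         threshold: float = 0.8) -> bool:
    """Return True when usage reaches the given fraction of capacity.

    threshold is an assumed example value, not our configured one.
    """
    if capacity_bytes <= 0:
        raise ValueError("capacity_bytes must be positive")
    return used_bytes / capacity_bytes >= threshold


# Example: 9 GiB used on a 10 GiB volume is above the assumed threshold.
GIB = 1024 ** 3
print(volume_near_capacity(9 * GIB, 10 * GIB))
```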
Other
These alerts were set up earlier and may not measure what we want. We are leaving them active for the time being and will turn them off if they start to fire falsely.