The application exposes its metrics in the Prometheus format at 9800/metrics. Port 9800 is named metrics in the Kubernetes Deployment and Service.
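For reference, a minimal sketch of that named port on the Service side (the names here are illustrative, not the actual chart templates):
apiVersion: v1
kind: Service
metadata:
  name: billing-api                # hypothetical Service name, for illustration
  labels:
    app.kubernetes.io/instance: api
spec:
  selector:
    app.kubernetes.io/instance: api
  ports:
    - name: metrics                # the ServiceMonitor will reference this port by name
      port: 9800
      targetPort: 9800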
We're going to define a ServiceMonitor object in the Prometheus Operator values.yaml pointing to the Service. To do that we define the matchLabels with the label of the release, the namespace it lives in, and the endpoint we're going to scrape.
additionalServiceMonitors:
  - name: billing-api
    selector:
      matchLabels:
        app.kubernetes.io/instance: api
    namespaceSelector:
      matchNames:
        - default
    endpoints:
      - port: metrics
        interval: 10s
Querying the raw metrics can be slow and expensive in terms of memory consumption. For this reason it is recommended to generate Prometheus recording rules to pre-calculate this information. By doing that you also speed up the dashboards, and if you decide to go to long-term storage you save tons of space.
Another reason is that queries with regexes are way more expensive. On a large dataset a query like this one can end up with Prometheus running out of memory; to give you some context (I have 350+ Traefik backends [services in Traefik ^v2]):
sum(rate(http_request_duration_seconds_count{job="api-billing", status_code=~"5.."}[1m]))
/ sum(rate(http_request_duration_seconds_count{job="api-billing"}[1m]))
To prevent this we generate a PrometheusRule object, a CRD from the Prometheus Operator that generates these rules in the Prometheus instances. Here is an example of increasing complexity:
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: {{ include "billing.fullname" . }}
  labels:
    app: prometheus-operator
    release: api
    {{- include "billing.labels" . | nindent 4 }}
spec:
  groups:
    - name: billing-api.rules
      rules:
        - expr: sum(increase(transaction_create_success[1m]))
          record: billing:transactions:success:sum_increase_1m
        - expr: sum(increase(transaction_create_success[1h]))
          record: billing:transactions:success:sum_increase_1h
        - expr: sum(rate(http_request_duration_seconds_count{job="api-billing"}[5m]))
          record: billing:traffic:total_rate_5m
        - expr: |-
            sum(rate(http_request_duration_seconds_count{job="api-billing", status_code=~"5.."}[1m]))
            / sum(rate(http_request_duration_seconds_count{job="api-billing"}[1m]))
          record: billing:traffic:server_error_rate_1m
        - expr: |-
            (
              sum(rate(http_request_duration_seconds_bucket{le="0.3", job="api-billing"}[5m]))
              +
              sum(rate(http_request_duration_seconds_bucket{le="1.5", job="api-billing"}[5m]))
            ) / 2 / billing:traffic:total_rate_5m
          record: billing:traffic:apdex_global_5m
To monitor the workers we start a metrics server to expose their metrics, and follow the same approach with recording rules for the worker metrics.
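As an illustration, the scrape configuration for the workers can be another entry under additionalServiceMonitors. The billing-worker name and the worker label are hypothetical, and the sketch assumes the worker Service also exposes a port named metrics:
additionalServiceMonitors:
  # ... the billing-api entry shown earlier ...
  - name: billing-worker                 # hypothetical worker Service monitor
    selector:
      matchLabels:
        app.kubernetes.io/instance: worker   # assumed label on the worker release
    namespaceSelector:
      matchNames:
        - default
    endpoints:
      - port: metrics                    # assumes the worker Service also names its metrics port
        interval: 10s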
To get all this information displayed in Grafana we define a dashboardProvider in the Grafana config of prometheus-operator, and a dashboard with the JSON object. Remember to use the recording rules defined above in the charts.
"targets": [
  {
    "expr": "billing:transactions:success:sum_increase_1m",
    "legendFormat": "Success",
    "refId": "A"
  },
  ...
The best way I found to work with this is to edit the dashboard in the Grafana UI, export it as JSON by clicking save, and put that object in the grafana.yaml config file. By doing this we have the changes applied to the charts tracked in version control.
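A rough sketch of what this can look like in the prometheus-operator values.yaml, assuming the dashboardProviders and dashboards keys of the bundled Grafana chart (provider name and folder are placeholders):
grafana:
  dashboardProviders:
    dashboardproviders.yaml:
      apiVersion: 1
      providers:
        - name: default                  # placeholder provider name
          orgId: 1
          folder: ""
          type: file
          disableDeletion: false
          editable: true
          options:
            path: /var/lib/grafana/dashboards/default
  dashboards:
    default:
      billing-api:
        # paste under "json" the dashboard JSON exported from the Grafana UI
        json: |
          { "title": "Billing API", ... }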
Prometheus Operator comes with VERY important built-in alerts that you can look at on http://localhost:9090/alerts. Some of these critical alerts are:
Throttling:
alert: CPUThrottlingHigh
expr: |-
  sum by(container, pod, namespace) (increase(container_cpu_cfs_throttled_periods_total{container!=""}[5m]))
  / sum by(container, pod, namespace) (increase(container_cpu_cfs_periods_total[5m]))
  > (25 / 100)
for: 15m
labels:
  severity: warning
annotations:
  message: '{{ $value | humanizePercentage }} throttling of CPU in namespace {{ $labels.namespace }} for container {{ $labels.container }} in pod {{ $labels.pod }}.'
  runbook_url: https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-cputhrottlinghigh
KubePodCrashLooping:
alert: KubePodCrashLooping
expr: |-
  rate(kube_pod_container_status_restarts_total{job="kube-state-metrics",namespace=~".*"}[15m])
  * 60 * 5 > 0
for: 15m
labels:
  severity: critical
annotations:
  message: Pod {{ $labels.namespace }}/{{ $labels.pod }} ({{ $labels.container }}) is restarting {{ printf "%.2f" $value }} times / 5 minutes.
  runbook_url: https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubepodcrashlooping
Stuck Jobs:
alert: KubeJobCompletion
expr: |-
  kube_job_spec_completions{job="kube-state-metrics",namespace=~".*"}
  - kube_job_status_succeeded{job="kube-state-metrics",namespace=~".*"}
  > 0
for: 1h
labels:
  severity: warning
annotations:
  message: Job {{ $labels.namespace }}/{{ $labels.job_name }} is taking more than one hour to complete.
  runbook_url: https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubejobcompletion
Because of this you can focus just on the golden signals (a few example queries follow the list below):
- Saturation
  - CPU Usage and Idle CPU.
  - Memory Usage, Limit and Idle. (Idle is important when running containers at scale, as you want to waste as little as possible)
  - Network, RX/TX
- Latency
  - Average response time is just an indicator, don't take it too seriously.
  - Focus on Apdex and percentiles.
- Errors
  - Healthy vs unhealthy Pods
  - Pod restarts
- Traffic
  - Server Error rate
  - Client Error rate
  - Not Found rate
  - Traffic by code, path
  - Req/s or req/m
- Anomaly detection
  - Z-Score
  - Example:
    - Current traffic deviation:
      stddev(rate(http_request_duration_seconds_count{job="api-billing"}[5m]))
    - Previous week traffic deviation:
      stddev(rate(http_request_duration_seconds_count{job="api-billing"}[5m] offset 1w))
    - Max stddev toleration (set Grafana's "Fill below to" option to the Min stddev toleration series for the cool colours):
      stddev(rate(traefik_entrypoint_requests_total{job="traefik-frontend-prometheus"}[5m] offset 1w))*1.2
    - Min stddev toleration:
      stddev(rate(traefik_entrypoint_requests_total{job="traefik-frontend-prometheus"}[5m] offset 1w))*0.8
    - Alert (triggered if traffic deviates by more than 20% up or down vs last week):
      abs(1-stddev(rate(http_request_duration_seconds_count{job="api-billing"}[5m]))/stddev(rate(http_request_duration_seconds_count{job="api-billing"}[5m] offset 1w))) > 0.2
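As promised above, here are a few example queries for these signals. This is just a sketch assuming the cAdvisor and kube-state-metrics metrics shipped with prometheus-operator and the api-billing job used earlier; the namespace is a placeholder and label names can vary between kube-state-metrics versions.
# Saturation: CPU usage per pod (cAdvisor)
sum by (pod) (rate(container_cpu_usage_seconds_total{namespace="default", container!=""}[5m]))
# Saturation: memory working set vs the configured limit
sum by (pod) (container_memory_working_set_bytes{namespace="default", container!=""})
/ sum by (pod) (kube_pod_container_resource_limits{namespace="default", resource="memory"})
# Errors: pod restarts over the last hour (kube-state-metrics)
increase(kube_pod_container_status_restarts_total{namespace="default"}[1h])
# Latency: 95th percentile from the request duration histogram
histogram_quantile(0.95, sum by (le) (rate(http_request_duration_seconds_bucket{job="api-billing"}[5m])))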
There are 2 options to generate alerts in this stack:
- Grafana routing to Alertmanager
- Prometheus rules
It's up to you to decide which one fits you better.
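For the Prometheus rules option, here is a minimal sketch of an alerting rule that could be appended to the PrometheusRule groups defined earlier. The alert name, the 5% threshold and the severity label are assumptions for illustration, not values from the original setup.
- name: billing-api.alerts
  rules:
    - alert: BillingServerErrorRateHigh                  # hypothetical alert name
      expr: billing:traffic:server_error_rate_1m > 0.05  # reuses the recording rule above; 5% is an assumed threshold
      for: 5m
      labels:
        severity: warning                                # Alertmanager routes on labels like this
      annotations:
        message: Server error rate for api-billing has been above 5% for the last 5 minutes.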
One of the most important things when dealing with an alert is the triage system and the visibility it exposes. In complex and/or distributed systems it's very useful to use graphs to display the information.
There's a plugin for Grafana called Diagram Panel where you can define the graph in Mermaid syntax and link it to particular queries. This is also included in the Grafana dashboard.
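As a toy example, a diagram definition in Mermaid syntax could look like the snippet below; the services named here are made up, and the node IDs are what the panel matches against the legends/aliases of its queries to colour each node.
graph LR
  %% hypothetical service graph; node IDs (client, api, worker, db) are matched to query aliases
  client[Clients] --> api[billing-api]
  api --> worker[billing-worker]
  api --> db[(Database)]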
Displaying the information in the correct way is as important as having the information.
- Always think of the next person looking at the chart as if they had no context about what they are looking at.
- Add descriptions to the charts to help provide context about the numbers displayed.
- Avoid logarithmic scales as much as you can. If you can't, use text panels to call it out, so you can have a red alert text on top.
- Absolute numbers are useless 99% of the time.
- Perspective matters. Starting graphs at 0 is important most of the time. Use the stddev or rate to monitor fluctuations.
- Avoid using the metrics as a BI tool, it doesn't always fit.
https://www.investopedia.com/terms/z/zscore.asp
https://devconnected.com/the-definitive-guide-to-prometheus-in-2019/
https://prometheus.io/docs/introduction/overview/
https://prometheus.io/webtools/alerting/routing-tree-editor/
https://about.gitlab.com/blog/2019/07/23/anomaly-detection-using-prometheus/