This repository contains the following definitions for Konflux:
- Prometheus alerting rules (deployed to RHOBS)
- Grafana dashboards (deployed to AppSRE's Grafana)
- Availability exporters
The repository contains Prometheus alert rules files for monitoring Konflux data plane clusters along with their tests.
The different alerting rules in this repository are:
These alert rules are defined to monitor and alert if the konflux_up metric is missing for any of the expected permutations of the service and check labels across different environments.
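As a rough illustration of such a rule, the sketch below uses the PromQL absent() function to fire when one service/check permutation of konflux_up stops being reported. The group name, alert name, "for" duration, severity, and the chosen service and check values are assumptions made for this example; they are not the rules shipped in this repository.

```yaml
# Hypothetical sketch: all names and values below are illustrative only.
groups:
  - name: availability_missing_example
    rules:
      - alert: KonfluxUpMissingExample
        # absent() returns a 1-valued series when no series matches the selector,
        # so this fires when the permutation is not reported at all.
        expr: absent(konflux_up{service="grafana", check="prometheus-appstudio-ds"})
        for: 5m  # assumed grace period before the alert fires
        labels:
          severity: warning  # assumed severity
        annotations:
          summary: "konflux_up is missing for service grafana, check prometheus-appstudio-ds"
```

One rule of this shape per expected permutation keeps each alert specific about which service and check went missing.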
SLO (Service Level Objective) alert rules are rules defined to monitor and alert when a service or system is not meeting its specified service level objectives.
Apply the slo label to alerts directly associated with Service Level Objectives.
These alerts typically indicate issues affecting the performance or reliability of the service.
Using the slo label speeds up incident response by making issues that impact service level objectives quick to identify and address.
Apply slo: "true" under the labels section of any alerting rule.
labels:
  severity: critical
  slo: "true"
SLO alerts should be labeled with severity: critical.
Teams receive updates on alerts relevant to them through Slack notifications, where the team's handle is tagged in the alert message.
Apply the alert_team_handle and team annotations to SLO alerts in order to get notified about them.
Apply the alert_team_handle key to the annotations section of any alerting rule, with the relevant team's Slack group handle.
The format of the Slack handle is: <!subteam^-slack_group_id-> (e.g., <!subteam^S041261DDEW>).
To obtain the Slack group ID, click on the team's group handle, then click the three dots, and select "Copy group ID."
Make sure to also add the team annotation with the name of the relevant team for readability.
annotations:
  summary: "PipelineRunFinish to SnapshotInProgress time exceeded"
  alert_team_handle: <!subteam^S04S21ECL8K>
  team: o11y
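Putting the labels and annotations together, a complete SLO alerting rule might look like the sketch below. The group name, alert name, expression, "for" duration, and summary text are hypothetical and chosen only to make the example self-contained; the slo and severity labels and the alert_team_handle and team annotations follow the conventions described in this README.

```yaml
# Hypothetical sketch: alert name, expression and timings are illustrative only.
groups:
  - name: slo_alert_example
    rules:
      - alert: ExampleServiceUnavailable
        # Assumed expression: fires while the availability metric reports 0.
        expr: konflux_up{service="grafana", check="prometheus-appstudio-ds"} == 0
        for: 10m  # assumed duration
        labels:
          severity: critical  # SLO alerts should be critical
          slo: "true"         # marks the alert as SLO-related
        annotations:
          summary: "Example service has been unavailable for 10 minutes"
          # Quoted so the Slack handle is always read as a plain string.
          alert_team_handle: "<!subteam^S04S21ECL8K>"
          team: o11y
```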
Alert rules for data plane clusters and recording rules are deployed by app-interface to RHOBS, which is also where the metrics are forwarded. To deploy the alert rules and recording rules, app-interface references the location of the rules together with a git reference: a branch name or a commit hash.
It holds separate references to both staging and production RHOBS instances (monitoring Konflux staging and production deployments). For both environments, we maintain the reference to the rules as a commit hash (rather than a branch). This means that any changes to the rules will not take effect until the references are updated.
Steps for updating the rules:
- Merge the necessary changes to this repository - alerts, recording rules and tests.
- The data plane staging environment and the recording rules staging environment in app-interface reference the main branch in the o11y repository and will be automatically updated with the new changes.
- Once merged and ready to be promoted to production, update the data plane production environment and/or the recording rules production environment reference in app-interface to the commit hash of the changes you made.
Refer to the app-interface instructions to learn how to develop AppSRE dashboards for Konflux. This repository serves as versioned storage for the dashboard definitions and nothing more.
Dashboards are automatically deployed to the stage AppSRE Grafana when merged into the main branch.
Deploying to production requires an update of a commit reference in app-interface.
Only a subset of the metrics and labels available within the Konflux clusters is forwarded to RHOBS. If additional metrics or labels are needed, add them by following the steps described in the monitoring stack documentation.
In order to be able to evaluate the overall availability of the Konflux ecosystem, we need to be able to establish the availability of each of its components.
By leveraging the existing Konflux monitoring stack based on Prometheus, we create Prometheus exporters that generate metrics that are scraped by the User Workload Monitoring Prometheus instance and remote-written to RHOBS.
The o11y team provides an example availability exporter that can be used as reference, especially in the case in which the exporter is external to the code it's monitoring.
For more detailed documentation on Availability exporters
Recording rules allow us to precompute frequently needed or computationally expensive expressions and save their result as a new set of time series. Recording rules are the go-to approach for speeding up queries that take too long to return. When other teams want to use their own metric format for exporters, they need to translate it into the desired metric format using a recording rule.
These recording rules should be put in the rhobs/recording folder.
The standard format is a single availability metric, konflux_up, with the labels service and check.
Each time series has the service and check labels, which hold the name of the originating service and the availability check it performed, respectively. The metric konflux_up should return either 0 or 1 based on the availability of the component/service: 1 if the service is up, 0 otherwise.
The recording rule example provided here has the following format:
grafana_ds_up(check=prometheus-appstudio-ds) -> konflux_up(service=grafana, check=prometheus-appstudio-ds)
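A minimal sketch of a Prometheus recording rule that performs this translation is shown below. It assumes the source series grafana_ds_up already carries the check label and that only the service label needs to be added; the group name is hypothetical, and the rule actually stored in the rhobs/recording folder may differ.

```yaml
# Hypothetical sketch: the group name is illustrative only.
groups:
  - name: availability_recording_example
    rules:
      # The result is stored under the standard metric name konflux_up.
      - record: konflux_up
        # Labels of the source series, including check, are carried over.
        expr: grafana_ds_up{check="prometheus-appstudio-ds"}
        labels:
          service: grafana  # adds the label identifying the originating service
```

Adding the service label through the rule's labels field keeps the expression simple. If the source metric used a different label name for the check, PromQL's label_replace() could be used to rename it instead.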
For more detailed documentation on recording rules
- Slack: #forum-konflux-o11y