docs(RHTAPWATCH-1261): updating README #374

# Konflux Observability

This repository contains the following definitions for Konflux:
* Prometheus rules (deployed to RHOBS)
* Grafana dashboards (deployed to AppSRE's Grafana)
* Availability exporters

## Alerting Rules

The repository contains Prometheus alert rules [files](rhobs/alerting) for monitoring
Konflux data plane clusters along with their
[tests](test/promql/tests/data_plane/).
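
As an illustration of how such tests are structured, a minimal promtool-style unit
test might look like the sketch below. The rule file path, input metric, and alert
name are all hypothetical, not actual rules from this repository:

```yaml
# Hypothetical promtool unit test file; paths and names are illustrative only.
rule_files:
  - ../../../rhobs/alerting/example_rules.yaml  # hypothetical rule file

evaluation_interval: 1m

tests:
  - interval: 1m
    # Feed a synthetic series that keeps a pod Pending for an hour.
    input_series:
      - series: 'kube_pod_status_phase{phase="Pending", pod="demo-pod"}'
        values: '1x60'
    alert_rule_test:
      - eval_time: 30m
        alertname: ExamplePodPending  # hypothetical alert name
        exp_alerts:
          - exp_labels:
              severity: warning
              phase: Pending
              pod: demo-pod
```

Tests in this format are run with `promtool test rules <file>`.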

The different alerting rules in this repository are:

### Data Plane Alerts

* [**Alert Rule Unschedulable**](https://gitlab.cee.redhat.com/konflux/docs/sop/-/blob/main/o11y/alert-rule-unschedualablePods.md)

* [**Alert Rule CrashLoopBackOff**](https://gitlab.cee.redhat.com/konflux/docs/sop/-/blob/main/o11y/alert-rule-crashLoopBackOff.md?ref_type=heads)

* [**Alert Rule PodNotReady**](https://gitlab.cee.redhat.com/konflux/docs/sop/-/blob/main/o11y/alert-rule-PodNotReady.md?ref_type=heads)

* [**Alert Rule PersistentVolumeIssues**](https://gitlab.cee.redhat.com/konflux/docs/sop/-/blob/main/o11y/alert-rule-pesistentVolumeIssues.md?ref_type=heads)

* [**Alert Rule QuotaExceeded**](https://gitlab.cee.redhat.com/konflux/docs/sop/-/blob/main/o11y/alert-rule-QuotaExceeded.md)

### SLO Alerts

SLO (Service Level Objective) alert rules are defined to monitor and alert
when a service or system is not meeting its specified service level objectives.
Apply the `slo` label to alerts directly associated with Service Level Objectives.
These alerts typically indicate issues affecting the performance or reliability of the service.

#### Benefits of Using the `slo` Label:

Using the `slo` label facilitates quicker incident response by
promptly identifying and addressing issues that impact service level objectives.

#### How to Apply the `slo` Label:

Apply `slo: "true"` under the `labels` section of any alerting rule.
```yaml
labels:
severity: critical
slo: "true"
```
##### Note

SLO alerts should be labeled with `severity: critical`.
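
Put together, a complete SLO alerting rule might look like the following sketch. The
group name, alert name, metric, and threshold are hypothetical, shown only to place
the `slo` and `severity` labels in context:

```yaml
groups:
  - name: example-slo-rules  # hypothetical group name
    rules:
      - alert: ExampleLatencySLOBreached  # hypothetical alert
        # Hypothetical expression: fires when p95 request latency
        # stays above 10 minutes for 15 minutes.
        expr: |
          histogram_quantile(0.95,
            sum(rate(example_request_duration_seconds_bucket[30m])) by (le)
          ) > 600
        for: 15m
        labels:
          severity: critical  # SLO alerts should be critical
          slo: "true"
        annotations:
          summary: "p95 request latency exceeds the SLO threshold"
```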

### Miscellaneous Alerts

Alerts lacking the `slo: "true"` label are considered non-SLO, or miscellaneous
("misc"), alerts.

Such alerting rules are intended to notify about issues requiring attention that do
not directly affect the Service Level Objectives defined by any service.

#### Availability Metric Alerts

These are non-SLO alerts defined to monitor and alert if the `konflux_up` metric is
missing for any expected permutations of the `service` and `check` labels across
different environments.
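
An availability alert of this kind could be sketched with `absent()`, which fires when
a given `service`/`check` permutation stops reporting. The alert name, duration, and
severity below are assumptions for illustration:

```yaml
- alert: KonfluxUpMissing  # hypothetical alert name
  # Fires when konflux_up is absent for this service/check permutation.
  expr: absent(konflux_up{service="grafana", check="prometheus-appstudio-ds"})
  for: 10m
  labels:
    severity: warning  # assumed severity for a non-SLO alert
  annotations:
    summary: "konflux_up is missing for grafana / prometheus-appstudio-ds"
```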

### Alerts Tagging

Teams receive updates on alerts relevant to them through Slack notifications,
where the team's handle is tagged in the alert message.

Apply the `alert_team_handle` and `team` annotations to SLO alerts in order to get notified about them.

#### How to Apply the `alert_team_handle` Annotation:

Apply the `alert_team_handle` key to the annotations section of any alerting rule,
with the relevant team's Slack group handle.
The format of the Slack handle is: `<!subteam^-slack_group_id->` (e.g.
`<!subteam^S041261DDEW>`).
To obtain the Slack group ID, click on the team's group handle, then click the three
dots, and select "Copy group ID."

Make sure to also add the `team` annotation with the name of the relevant team for readability.
```yaml
annotations:
summary: "PipelineRunFinish to SnapshotInProgress time exceeded"
alert_team_handle: <!subteam^S04S21ECL8K>
team: o11y
```

## Recording Rules

Recording rules allow us to precompute frequently needed or computationally expensive expressions and save their result as a new set of time series. Recording rules are the
go-to approach for speeding up the performance of queries that take too long to return.

Rules located in the [recording rules directory](rhobs/recording/) are deployed to RHOBS
which makes them present in [AppSRE Grafana](https://grafana.app-sre.devshift.net/).

make sure to also add the `team` annotation with the name of the relevant team for readability.
```
annotations:
summary: "PipelineRunFinish to SnapshotInProgress time exceeded"
alert_team_handle: <!subteam^S04S21ECL8K>
team: o11y
```
Rules should be created together with the [unit tests](test/promql/tests/recording/).
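
For instance, a promtool-style unit test for a recording rule that translates an
exporter metric into `konflux_up` might look like this sketch (the rule file path is
illustrative):

```yaml
rule_files:
  - ../../../rhobs/recording/exporter_recording_rules.yml  # illustrative path

evaluation_interval: 1m

tests:
  - interval: 1m
    # Synthetic exporter series reporting the check as up.
    input_series:
      - series: 'grafana_ds_up{check="prometheus-appstudio-ds"}'
        values: '1x10'
    promql_expr_test:
      # The recording rule should produce the standard konflux_up series.
      - expr: konflux_up{service="grafana", check="prometheus-appstudio-ds"}
        eval_time: 5m
        exp_samples:
          - labels: 'konflux_up{service="grafana", check="prometheus-appstudio-ds"}'
            value: 1
```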

## Updating Alert and Recording Rules

Alert rules for data plane clusters and recording rules are deployed by
app-interface to RHOBS, where the metrics are also forwarded. For deploying the
alert rules and recording rules, app-interface references the location of the rules
together with a git reference - branch name or commit hash.

It holds separate references to both staging and production RHOBS instances (monitoring
Konflux staging and production deployments).

The staging environment references the `main` branch of this repo, so rule changes reaching
that branch are automatically deployed to RHOBS.

The production environment keeps the reference to the rules as a commit hash (rather
than a branch). This means that any changes to the rules will not take effect until the
references are updated.

Steps for updating the rules:

1. Merge the necessary changes to this repository - alerts, recording rules, and tests.
2. Verify that the rules are visible as expected in AppSRE Grafana.
3. Once the changes are ready to be promoted to production, update the
[alerting rules production reference](https://gitlab.cee.redhat.com/service/app-interface/-/blob/c5bbcd98175450b4e51ed9e2d41bda394cea0f92/data/services/stonesoup/cicd/saas-rhtap-rules.yaml#L40)
and/or the
[recording rules production reference](https://gitlab.cee.redhat.com/service/app-interface/-/blob/c5bbcd98175450b4e51ed9e2d41bda394cea0f92/data/services/stonesoup/cicd/saas-rhtap-rules.yaml#L54)
in app-interface to the commit hash of the changes you made.

## Grafana Dashboards

https://gitlab.cee.redhat.com/service/app-interface/-/blob/master/docs/app-sre/m
to learn how to develop AppSRE dashboards for Konflux. This repository serves as
versioned storage for the [dashboard definitions](dashboards/) and nothing more.

Dashboards are automatically deployed to [stage](https://grafana.stage.devshift.net)
AppSRE Grafana when merged into the `main` branch.
Deploying to [production](https://grafana.app-sre.devshift.net/) requires an update of a commit
[reference](https://gitlab.cee.redhat.com/service/app-interface/-/blob/b03e4336a3223ec7b90dc9bc69707c9ee0ff9af6/data/services/stonesoup/cicd/saas-stonesoup-dashboards.yml#L37)
in app-interface.

## Adding Metrics and Labels

Only a subset of the metrics and labels available within the Konflux clusters is
forwarded to RHOBS. If additional metrics or labels are needed, add them by following
the steps described in
[Troubleshooting Missing Metrics and Labels](https://gitlab.cee.redhat.com/konflux/docs/sop/-/blob/main/o11y/monitoring/tshoot-missing-metrics.md?ref_type=heads).

## Availability Exporters

In order to be able to evaluate the overall availability of the Konflux ecosystem, we
need to be able to establish the availability of each of its components.

By leveraging the existing [Konflux monitoring stack](https://gitlab.cee.redhat.com/konflux/docs/sop/-/blob/main/o11y/monitoring/monitoring.md?ref_type=heads)
based on Prometheus, we create Prometheus exporters that generate metrics that are
scraped by the User Workload Monitoring Prometheus instance and remote-written to RHOBS.

especially in the case in which the exporter is external to the code it's monitoring.
- [Exporter code](https://github.com/redhat-appstudio/o11y/tree/main/exporters/dsexporter)
- [Exporter and ServiceMonitor Kubernetes resources](https://github.com/redhat-appstudio/o11y/tree/main/config/exporters/monitoring/grafana/base)

For more detailed documentation, see
[Availability exporters](https://gitlab.cee.redhat.com/konflux/docs/sop/-/blob/main/o11y/monitoring/availability_exporters.md?ref_type=heads).

### Availability Exporter Recording Rules

When teams use their own metrics format for exporters, they need to adapt it to the
standard metric format by translating it using recording rules.

These recording rules should be put in the [rhobs/recording folder](rhobs/recording/).

The standard format is a single availability metric `konflux_up` with labels `service`
and `check`. Each time series will have the service and check labels for the name of the
originating service and the availability check it performed, respectively.

The metric konflux_up should return either `0` or `1` based on the availability of the
component/service. If the service is up, the metric should return `1`; otherwise `0`.

The recording rule [example](rhobs/recording/exporter_recording_rules.yml) provided here
has the following format:
```
grafana_ds_up(check=prometheus-appstudio-ds) -> konflux_up(service=grafana, check=prometheus-appstudio-ds)
```
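
A recording rule implementing this translation could be sketched as follows. The group
name is illustrative; the actual rule lives in the file linked above:

```yaml
groups:
  - name: grafana-availability  # illustrative group name
    rules:
      # Re-records grafana_ds_up as the standard konflux_up metric,
      # keeping the existing "check" label and adding "service".
      - record: konflux_up
        expr: grafana_ds_up
        labels:
          service: grafana
```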

See detailed documentation on
[recording rules](https://docs.google.com/document/d/1Y72T10JGuJaeyeNexmS_qTHfDB8uxxq0zERRRSOZegg/edit?usp=sharing).

## Support
