This repository has been archived by the owner on Sep 30, 2024. It is now read-only.

Add alerts for mean db blocked seconds (#22822)
Warn if the mean time a database connection request spends blocked is at least 50ms (0.05s) for 5 minutes; raise a critical alert if it is at least 100ms (0.1s) for 10 minutes.
daxmc99 authored Jul 16, 2021
1 parent 5905c39 commit 7eb956e
Showing 4 changed files with 163 additions and 5 deletions.
144 changes: 144 additions & 0 deletions doc/admin/observability/alert_solutions.md
@@ -842,6 +842,30 @@ To learn more about Sourcegraph's alerting and how to set up alerts, see [our al

<br />

## frontend: mean_blocked_seconds_per_conn_request

<p class="subtitle">mean blocked seconds per conn request</p>

**Descriptions**

- <span class="badge badge-warning">warning</span> frontend: 0.05s+ mean blocked seconds per conn request for 5m0s
- <span class="badge badge-critical">critical</span> frontend: 0.1s+ mean blocked seconds per conn request for 10m0s

**Possible solutions**

- **Silence this alert:** If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:

```json
"observability.silenceAlerts": [
"warning_frontend_mean_blocked_seconds_per_conn_request",
"critical_frontend_mean_blocked_seconds_per_conn_request"
]
```

<sub>*Managed by the [Sourcegraph Core application team](https://about.sourcegraph.com/handbook/engineering/core-application).*</sub>

<br />

## frontend: internal_indexed_search_error_responses

<p class="subtitle">internal indexed search error responses every 5m</p>
@@ -1453,6 +1477,30 @@ To learn more about Sourcegraph's alerting and how to set up alerts, see [our al

<br />

## gitserver: mean_blocked_seconds_per_conn_request

<p class="subtitle">mean blocked seconds per conn request</p>

**Descriptions**

- <span class="badge badge-warning">warning</span> gitserver: 0.05s+ mean blocked seconds per conn request for 5m0s
- <span class="badge badge-critical">critical</span> gitserver: 0.1s+ mean blocked seconds per conn request for 10m0s

**Possible solutions**

- **Silence this alert:** If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:

```json
"observability.silenceAlerts": [
"warning_gitserver_mean_blocked_seconds_per_conn_request",
"critical_gitserver_mean_blocked_seconds_per_conn_request"
]
```

<sub>*Managed by the [Sourcegraph Core application team](https://about.sourcegraph.com/handbook/engineering/core-application).*</sub>

<br />

## gitserver: container_cpu_usage

<p class="subtitle">container cpu usage total (1m average) across all cores by instance</p>
@@ -2402,6 +2450,30 @@ To learn more about Sourcegraph's alerting and how to set up alerts, see [our al

<br />

## precise-code-intel-worker: mean_blocked_seconds_per_conn_request

<p class="subtitle">mean blocked seconds per conn request</p>

**Descriptions**

- <span class="badge badge-warning">warning</span> precise-code-intel-worker: 0.05s+ mean blocked seconds per conn request for 5m0s
- <span class="badge badge-critical">critical</span> precise-code-intel-worker: 0.1s+ mean blocked seconds per conn request for 10m0s

**Possible solutions**

- **Silence this alert:** If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:

```json
"observability.silenceAlerts": [
"warning_precise-code-intel-worker_mean_blocked_seconds_per_conn_request",
"critical_precise-code-intel-worker_mean_blocked_seconds_per_conn_request"
]
```

<sub>*Managed by the [Sourcegraph Core application team](https://about.sourcegraph.com/handbook/engineering/core-application).*</sub>

<br />

## precise-code-intel-worker: frontend_internal_api_error_responses

<p class="subtitle">frontend-internal API error responses every 5m by route</p>
@@ -3210,6 +3282,30 @@ To learn more about Sourcegraph's alerting and how to set up alerts, see [our al

<br />

## worker: mean_blocked_seconds_per_conn_request

<p class="subtitle">mean blocked seconds per conn request</p>

**Descriptions**

- <span class="badge badge-warning">warning</span> worker: 0.05s+ mean blocked seconds per conn request for 5m0s
- <span class="badge badge-critical">critical</span> worker: 0.1s+ mean blocked seconds per conn request for 10m0s

**Possible solutions**

- **Silence this alert:** If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:

```json
"observability.silenceAlerts": [
"warning_worker_mean_blocked_seconds_per_conn_request",
"critical_worker_mean_blocked_seconds_per_conn_request"
]
```

<sub>*Managed by the [Sourcegraph Core application team](https://about.sourcegraph.com/handbook/engineering/core-application).*</sub>

<br />

## worker: frontend_internal_api_error_responses

<p class="subtitle">frontend-internal API error responses every 5m by route</p>
@@ -4163,6 +4259,30 @@ with your code hosts connections or networking issues affecting communication wi

<br />

## repo-updater: mean_blocked_seconds_per_conn_request

<p class="subtitle">mean blocked seconds per conn request</p>

**Descriptions**

- <span class="badge badge-warning">warning</span> repo-updater: 0.05s+ mean blocked seconds per conn request for 5m0s
- <span class="badge badge-critical">critical</span> repo-updater: 0.1s+ mean blocked seconds per conn request for 10m0s

**Possible solutions**

- **Silence this alert:** If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:

```json
"observability.silenceAlerts": [
"warning_repo-updater_mean_blocked_seconds_per_conn_request",
"critical_repo-updater_mean_blocked_seconds_per_conn_request"
]
```

<sub>*Managed by the [Sourcegraph Core application team](https://about.sourcegraph.com/handbook/engineering/core-application).*</sub>

<br />

## repo-updater: container_cpu_usage

<p class="subtitle">container cpu usage total (1m average) across all cores by instance</p>
@@ -5966,6 +6086,30 @@ with your code hosts connections or networking issues affecting communication wi

<br />

## executor-queue: mean_blocked_seconds_per_conn_request

<p class="subtitle">mean blocked seconds per conn request</p>

**Descriptions**

- <span class="badge badge-warning">warning</span> executor-queue: 0.05s+ mean blocked seconds per conn request for 5m0s
- <span class="badge badge-critical">critical</span> executor-queue: 0.1s+ mean blocked seconds per conn request for 10m0s

**Possible solutions**

- **Silence this alert:** If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:

```json
"observability.silenceAlerts": [
"warning_executor-queue_mean_blocked_seconds_per_conn_request",
"critical_executor-queue_mean_blocked_seconds_per_conn_request"
]
```

<sub>*Managed by the [Sourcegraph Core application team](https://about.sourcegraph.com/handbook/engineering/core-application).*</sub>

<br />

## executor-queue: frontend_internal_api_error_responses

<p class="subtitle">frontend-internal API error responses every 5m by route</p>
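
Every silence snippet added to `alert_solutions.md` in this commit follows the same `<level>_<service>_<observable>` pattern. The following is a minimal Go sketch of that naming rule, inferred only from the six generated sections above. The helper name `silenceIDs` is hypothetical and is not part of the Sourcegraph monitoring generator.

```go
package main

import "fmt"

// silenceIDs is a hypothetical helper: it builds the identifiers that the
// generated docs list under "observability.silenceAlerts" for one observable,
// using the <level>_<service>_<observable> pattern seen in this diff.
func silenceIDs(service, observable string) []string {
	levels := []string{"warning", "critical"}
	ids := make([]string, 0, len(levels))
	for _, level := range levels {
		ids = append(ids, fmt.Sprintf("%s_%s_%s", level, service, observable))
	}
	return ids
}

func main() {
	// Print the two identifiers for the repo-updater alert added in this commit.
	for _, id := range silenceIDs("repo-updater", "mean_blocked_seconds_per_conn_request") {
		fmt.Println(id)
	}
}
```

Note that the service name keeps its hyphens (`repo-updater`, `precise-code-intel-worker`) while the observable name uses underscores, so the two parts cannot be normalized the same way.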
12 changes: 12 additions & 0 deletions doc/admin/observability/dashboards.md
@@ -436,6 +436,8 @@ This panel indicates idle.

This panel indicates mean blocked seconds per conn request.

> NOTE: Alerts related to this panel are documented in the [alert solutions reference](./alert_solutions.md#frontend-mean-blocked-seconds-per-conn-request).

<sub>*Managed by the [Sourcegraph Core application team](https://about.sourcegraph.com/handbook/engineering/core-application).*</sub>

<br />
@@ -978,6 +980,8 @@ This panel indicates idle.

This panel indicates mean blocked seconds per conn request.

> NOTE: Alerts related to this panel are documented in the [alert solutions reference](./alert_solutions.md#gitserver-mean-blocked-seconds-per-conn-request).

<sub>*Managed by the [Sourcegraph Core application team](https://about.sourcegraph.com/handbook/engineering/core-application).*</sub>

<br />
@@ -1641,6 +1645,8 @@ This panel indicates idle.

This panel indicates mean blocked seconds per conn request.

> NOTE: Alerts related to this panel are documented in the [alert solutions reference](./alert_solutions.md#precise-code-intel-worker-mean-blocked-seconds-per-conn-request).

<sub>*Managed by the [Sourcegraph Core application team](https://about.sourcegraph.com/handbook/engineering/core-application).*</sub>

<br />
@@ -2171,6 +2177,8 @@ This panel indicates idle.

This panel indicates mean blocked seconds per conn request.

> NOTE: Alerts related to this panel are documented in the [alert solutions reference](./alert_solutions.md#worker-mean-blocked-seconds-per-conn-request).

<sub>*Managed by the [Sourcegraph Core application team](https://about.sourcegraph.com/handbook/engineering/core-application).*</sub>

<br />
@@ -2739,6 +2747,8 @@ This panel indicates idle.

This panel indicates mean blocked seconds per conn request.

> NOTE: Alerts related to this panel are documented in the [alert solutions reference](./alert_solutions.md#repo-updater-mean-blocked-seconds-per-conn-request).

<sub>*Managed by the [Sourcegraph Core application team](https://about.sourcegraph.com/handbook/engineering/core-application).*</sub>

<br />
@@ -3912,6 +3922,8 @@ This panel indicates idle.

This panel indicates mean blocked seconds per conn request.

> NOTE: Alerts related to this panel are documented in the [alert solutions reference](./alert_solutions.md#executor-queue-mean-blocked-seconds-per-conn-request).

<sub>*Managed by the [Sourcegraph Core application team](https://about.sourcegraph.com/handbook/engineering/core-application).*</sub>

<br />
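
The NOTE links added to `dashboards.md` point at anchors derived from the `alert_solutions.md` headings, for example `## gitserver: mean_blocked_seconds_per_conn_request` becomes `#gitserver-mean-blocked-seconds-per-conn-request`. The sketch below reproduces that mapping in Go. The rule (lowercase, drop punctuation, turn spaces and underscores into hyphens) is inferred from the six links in this diff and is an assumption; it is not the docs site's actual slug function, and `anchorFor` is a hypothetical name.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// anchorFor is a hypothetical helper that mimics the heading-to-anchor
// mapping visible in this diff. Letters and digits are kept (lowercased),
// spaces, underscores, and hyphens become hyphens, everything else is dropped.
func anchorFor(heading string) string {
	var b strings.Builder
	for _, r := range strings.ToLower(heading) {
		switch {
		case unicode.IsLetter(r) || unicode.IsDigit(r):
			b.WriteRune(r)
		case r == ' ' || r == '_' || r == '-':
			b.WriteRune('-')
		}
	}
	return b.String()
}

func main() {
	heading := "precise-code-intel-worker: mean_blocked_seconds_per_conn_request"
	fmt.Println("./alert_solutions.md#" + anchorFor(heading))
}
```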
10 changes: 6 additions & 4 deletions monitoring/definitions/shared/dbconns.go
@@ -2,6 +2,7 @@ package shared

 import (
 	"fmt"
+	"time"

 	"github.com/sourcegraph/sourcegraph/monitoring/monitoring"
 )
@@ -61,10 +62,11 @@ func DatabaseConnectionsMonitoring(app string) []monitoring.Row {
 			Description: "mean blocked seconds per conn request",
 			Query: fmt.Sprintf(`sum by (app_name, db_name) (increase(src_pgsql_conns_blocked_seconds{app_name=%q}[5m])) / `+
 				`sum by (app_name, db_name) (increase(src_pgsql_conns_waited_for{app_name=%q}[5m]))`, app, app),
-			Panel:          monitoring.Panel().LegendFormat("dbname={{db_name}}").Unit(monitoring.Seconds),
-			NoAlert:        true,
-			Owner:          monitoring.ObservableOwnerCoreApplication,
-			Interpretation: "none",
+			Panel:             monitoring.Panel().LegendFormat("dbname={{db_name}}").Unit(monitoring.Seconds),
+			Warning:           monitoring.Alert().GreaterOrEqual(0.05, nil).For(5 * time.Minute),
+			Critical:          monitoring.Alert().GreaterOrEqual(0.10, nil).For(10 * time.Minute),
+			Owner:             monitoring.ObservableOwnerCoreApplication,
+			PossibleSolutions: "none",
 		},
 	},
 	{
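
To make the new thresholds concrete: the query divides the 5-minute increase in blocked seconds by the 5-minute increase in connection requests that had to wait, and the alerts compare that mean against 0.05 and 0.10 (the panel unit is seconds, so 50ms and 100ms). The following Go sketch mirrors that arithmetic outside Prometheus. The function names are hypothetical, the zero-denominator handling is a simplification, and the `For(...)` durations (the condition must hold for 5 or 10 minutes before firing) are deliberately not modeled.

```go
package main

import (
	"fmt"
	"math"
	"time"
)

// Thresholds copied from the Warning and Critical definitions in this diff.
const (
	warnThreshold = 0.05 // seconds
	critThreshold = 0.10 // seconds
)

// meanBlockedSeconds mirrors the ratio in the query: blocked-seconds increase
// divided by waited-for increase over the same window. With no waiting
// requests the sketch returns NaN, treated below as "no data" (a simplification).
func meanBlockedSeconds(blockedSecondsIncrease, waitedForIncrease float64) float64 {
	if waitedForIncrease == 0 {
		return math.NaN()
	}
	return blockedSecondsIncrease / waitedForIncrease
}

// level applies the GreaterOrEqual comparisons to a single sample.
// It ignores the For(...) durations, which a real alert also requires.
func level(mean float64) string {
	switch {
	case math.IsNaN(mean):
		return "none"
	case mean >= critThreshold:
		return "critical"
	case mean >= warnThreshold:
		return "warning"
	default:
		return "none"
	}
}

func main() {
	// Express the thresholds as durations to make the unit explicit.
	fmt.Println(time.Duration(warnThreshold*float64(time.Second)),
		time.Duration(critThreshold*float64(time.Second)))

	// Example: 3 seconds of blocking spread over 50 waiting requests.
	mean := meanBlockedSeconds(3, 50)
	fmt.Printf("mean=%.3fs level=%s\n", mean, level(mean))
}
```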
2 changes: 1 addition & 1 deletion monitoring/monitoring/monitoring.go
@@ -50,7 +50,7 @@ func (c *Container) validate() error {
return errors.Errorf("Title must be in Title Case; found \"%s\" want \"%s\"", c.Title, strings.Title(c.Title))
}
if c.Description != withPeriod(c.Description) || c.Description != upperFirst(c.Description) {
-		return errors.Errorf("Description must be sentence starting with an uppercas eletter and ending with period; found \"%s\"", c.Description)
+		return errors.Errorf("Description must be sentence starting with an uppercase letter and ending with period; found \"%s\"", c.Description)
}
for i, g := range c.Groups {
if err := g.validate(); err != nil {
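
For context on the check whose error message is corrected here: a description passes only if applying `withPeriod` and `upperFirst` leaves it unchanged. The bodies of those helpers are not part of this diff, so the versions below are hypothetical stand-ins written to match the error message (start with an uppercase letter, end with a period); the real implementations in `monitoring.go` may differ.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
	"unicode/utf8"
)

// withPeriod is a hypothetical stand-in: it appends a trailing period
// when the string does not already end with one.
func withPeriod(s string) string {
	if !strings.HasSuffix(s, ".") {
		return s + "."
	}
	return s
}

// upperFirst is a hypothetical stand-in: it uppercases the first rune
// and leaves the rest of the string untouched.
func upperFirst(s string) string {
	r, size := utf8.DecodeRuneInString(s)
	if size == 0 {
		return s
	}
	return string(unicode.ToUpper(r)) + s[size:]
}

// validDescription mirrors the condition in validate(): the description
// must be a fixed point of both helpers.
func validDescription(d string) bool {
	return d == withPeriod(d) && d == upperFirst(d)
}

func main() {
	for _, d := range []string{
		"Handles database connections.",
		"handles database connections.",
		"Handles database connections",
	} {
		fmt.Printf("%q valid=%t\n", d, validDescription(d))
	}
}
```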