Update mixin for TempoIngesterFlushes thresholds (#1354)
* Update TempoIngesterFlushes thresholds and include warning

* Update changelog

* Replace tempo_ingester_failed_flushes_total with tempo_ingester_flush_failed_retries_total in mixin

* Adjust alert message wording slightly

* Adjust unhealthy metric and failed duration
zalegrala authored Apr 18, 2022
1 parent df42ba8 commit d4c5a7c
Showing 4 changed files with 36 additions and 7 deletions.
2 changes: 2 additions & 0 deletions CHANGELOG.md
@@ -5,7 +5,9 @@
 * [CHANGE] Updated storage.trace.pool.queue_depth default from 200->10000. [#1345](https://github.com/grafana/tempo/pull/1345) (@joe-elliott)
 * [CHANGE] Update alpine images to 3.15 [#1330](https://github.com/grafana/tempo/pull/1330) (@zalegrala)
 * [CHANGE] Updated flags `-storage.trace.azure.storage-account-name` and `-storage.trace.s3.access_key` to no longer to be considered as secrets [#1356](https://github.com/grafana/tempo/pull/1356) (@simonswine)
+* [CHANGE] Add warning threshold for TempoIngesterFlushes and adjust critical threshold [#1354](https://github.com/grafana/tempo/pull/1354) (@zalegrala)
 * [CHANGE] Include lambda in serverless e2e tests [#1357](https://github.com/grafana/tempo/pull/1357) (@zalegrala)
+* [CHANGE] Replace mixin TempoIngesterFlushes metric to only look at retries [#1354](https://github.com/grafana/tempo/pull/1354) (@zalegrala)
 * [FEATURE]: v2 object encoding added. This encoding adds a start/end timestamp to every record to reduce proto marshalling and increase search speed.
   **BREAKING CHANGE** After this rollout the distributors will use a new API on the ingesters. As such you must rollout all ingesters before rolling the
   distributors. Also, during this period, the ingesters will use considerably more resources and as such should be scaled up (or incoming traffic should be
20 changes: 18 additions & 2 deletions operations/tempo-mixin/alerts.libsonnet
@@ -84,17 +84,33 @@
   },
   {
     // wait 5m for failed flushes to self-heal using retries
-    alert: 'TempoIngesterFlushesFailing',
+    alert: 'TempoIngesterFlushesUnhealthy',
     expr: |||
       sum by (%s) (increase(tempo_ingester_failed_flushes_total{}[1h])) > %s and
       sum by (%s) (increase(tempo_ingester_failed_flushes_total{}[5m])) > 0
     ||| % [$._config.group_by_cluster, $._config.alerts.flushes_per_hour_failed, $._config.group_by_cluster],
     'for': '5m',
     labels: {
+      severity: 'warning',
+    },
+    annotations: {
+      message: 'Greater than %s flush retries have occurred in the past hour.' % $._config.alerts.flushes_per_hour_failed,
+      runbook_url: 'https://github.com/grafana/tempo/tree/main/operations/tempo-mixin/runbook.md#TempoIngesterFlushesFailing',
+    },
+  },
+  {
+    // wait 10m for failed flushes to self-heal using retries
+    alert: 'TempoIngesterFlushesFailing',
+    expr: |||
+      sum by (%s) (increase(tempo_ingester_flush_failed_retries_total{}[1h])) > %s and
+      sum by (%s) (increase(tempo_ingester_flush_failed_retries_total{}[5m])) > 0
+    ||| % [$._config.group_by_cluster, $._config.alerts.flushes_per_hour_failed, $._config.group_by_cluster],
+    'for': '5m',
+    labels: {
       severity: 'critical',
     },
     annotations: {
-      message: 'Greater than %s flushes have failed in the past hour.' % $._config.alerts.flushes_per_hour_failed,
+      message: 'Greater than %s flush retries have failed in the past hour.' % $._config.alerts.flushes_per_hour_failed,
       runbook_url: 'https://github.com/grafana/tempo/tree/main/operations/tempo-mixin/runbook.md#TempoIngesterFlushesFailing',
     },
   },
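For readers less familiar with jsonnet's `%` string formatting used in the expressions above: once the mixin's config is applied, the critical alert's templated expression expands roughly as follows. This is a standalone sketch rather than code from the repository; the local variable names are illustrative, and the values mirror the rendered yamls/alerts.yaml further down (grouping by cluster and namespace, threshold 2).

```jsonnet
// Standalone sketch of how the templated expr above expands; the locals are
// illustrative and the values mirror the rendered yamls/alerts.yaml below.
local groupByCluster = 'cluster, namespace';
local flushesPerHourFailed = 2;

{
  expr: |||
    sum by (%s) (increase(tempo_ingester_flush_failed_retries_total{}[1h])) > %s and
    sum by (%s) (increase(tempo_ingester_flush_failed_retries_total{}[5m])) > 0
  ||| % [groupByCluster, flushesPerHourFailed, groupByCluster],
}
```

Evaluating this with `jsonnet` produces the same expression text that appears in the generated alerts file below.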
7 changes: 4 additions & 3 deletions operations/tempo-mixin/runbook.md
@@ -86,9 +86,10 @@ How it **works**:
 - If flushing fails, the ingester will keep retrying until restarted
 - Blocks that have been flushed successfully will be deleted from the ingester, by default after 15m
-Failed flushes could be caused by any number of different things: bad block, permissions issues, rate limiting, failing backend,...
-Tempo will continue to retry sending the blocks until it succeeds, but at some point your WAL files will start failing to write due
-to out of disk issues.
+Failed flushes could be caused by any number of different things: bad block,
+permissions issues, rate limiting, failing backend, etc. Tempo will continue to
+retry sending the blocks until it succeeds, but at some point your WAL files
+will start failing to write due to out of disk issues.
 Known issue: this can trigger during a rollout of the ingesters, see [tempo#1035](https://github.com/grafana/tempo/issues/1035).
14 changes: 12 additions & 2 deletions operations/tempo-mixin/yamls/alerts.yaml
@@ -52,14 +52,24 @@
     "for": "5m"
     "labels":
       "severity": "critical"
-  - "alert": "TempoIngesterFlushesFailing"
+  - "alert": "TempoIngesterFlushesUnhealthy"
     "annotations":
-      "message": "Greater than 2 flushes have failed in the past hour."
+      "message": "Greater than 2 flush retries have occurred in the past hour."
       "runbook_url": "https://github.com/grafana/tempo/tree/main/operations/tempo-mixin/runbook.md#TempoIngesterFlushesFailing"
     "expr": |
       sum by (cluster, namespace) (increase(tempo_ingester_failed_flushes_total{}[1h])) > 2 and
       sum by (cluster, namespace) (increase(tempo_ingester_failed_flushes_total{}[5m])) > 0
     "for": "5m"
     "labels":
+      "severity": "warning"
+  - "alert": "TempoIngesterFlushesFailing"
+    "annotations":
+      "message": "Greater than 2 flush retries have failed in the past hour."
+      "runbook_url": "https://github.com/grafana/tempo/tree/main/operations/tempo-mixin/runbook.md#TempoIngesterFlushesFailing"
+    "expr": |
+      sum by (cluster, namespace) (increase(tempo_ingester_flush_failed_retries_total{}[1h])) > 2 and
+      sum by (cluster, namespace) (increase(tempo_ingester_flush_failed_retries_total{}[5m])) > 0
+    "for": "5m"
+    "labels":
       "severity": "critical"
   - "alert": "TempoPollsFailing"
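The threshold of 2 in the rendered alerts above comes from the mixin's `_config` (referenced as `$._config.alerts.flushes_per_hour_failed` in alerts.libsonnet). A consumer vendoring the mixin could raise it with an override along these lines; this is a minimal sketch, assuming a `mixin.libsonnet` entry point at the shown import path and an `alerts` object inside `_config`, both of which should be checked against the mixin's config file.

```jsonnet
// Minimal sketch: raising the flush-retry alert threshold when consuming the
// mixin. The import path is an assumption; the field path follows the
// $._config.alerts.flushes_per_hour_failed reference in alerts.libsonnet.
local tempoMixin = import 'tempo-mixin/mixin.libsonnet';

tempoMixin {
  _config+:: {
    alerts+: {
      flushes_per_hour_failed: 5,  // rendered default above is 2
    },
  },
}
```

Regenerating the YAML from the jsonnet sources then picks up the new threshold in both the warning and critical alerts.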
