Update mixin for TempoIngesterFlushes thresholds (#1354)
* Update TempoIngesterFlushes thresholds and include warning

* Update changelog

* Replace tempo_ingester_failed_flushes_total with tempo_ingester_flush_failed_retries_total in mixin

* Adjust alert message wording slightly

* Adjust unhealthy metric and failed duration
zalegrala authored Apr 18, 2022
1 parent df42ba8 commit d4c5a7c
Showing 4 changed files with 36 additions and 7 deletions.
2 changes: 2 additions & 0 deletions CHANGELOG.md
@@ -5,7 +5,9 @@
 * [CHANGE] Updated storage.trace.pool.queue_depth default from 200->10000. [#1345](https://github.com/grafana/tempo/pull/1345) (@joe-elliott)
 * [CHANGE] Update alpine images to 3.15 [#1330](https://github.com/grafana/tempo/pull/1330) (@zalegrala)
 * [CHANGE] Updated flags `-storage.trace.azure.storage-account-name` and `-storage.trace.s3.access_key` to no longer to be considered as secrets [#1356](https://github.com/grafana/tempo/pull/1356) (@simonswine)
+* [CHANGE] Add warning threshold for TempoIngesterFlushes and adjust critical threshold [#1354](https://github.com/grafana/tempo/pull/1354) (@zalegrala)
 * [CHANGE] Include lambda in serverless e2e tests [#1357](https://github.com/grafana/tempo/pull/1357) (@zalegrala)
+* [CHANGE] Replace mixin TempoIngesterFlushes metric to only look at retries [#1354](https://github.com/grafana/tempo/pull/1354) (@zalegrala)
 * [FEATURE]: v2 object encoding added. This encoding adds a start/end timestamp to every record to reduce proto marshalling and increase search speed.
   **BREAKING CHANGE** After this rollout the distributors will use a new API on the ingesters. As such you must rollout all ingesters before rolling the
   distributors. Also, during this period, the ingesters will use considerably more resources and as such should be scaled up (or incoming traffic should be
20 changes: 18 additions & 2 deletions operations/tempo-mixin/alerts.libsonnet
@@ -84,17 +84,33 @@
   },
   {
     // wait 5m for failed flushes to self-heal using retries
-    alert: 'TempoIngesterFlushesFailing',
+    alert: 'TempoIngesterFlushesUnhealthy',
     expr: |||
       sum by (%s) (increase(tempo_ingester_failed_flushes_total{}[1h])) > %s and
       sum by (%s) (increase(tempo_ingester_failed_flushes_total{}[5m])) > 0
     ||| % [$._config.group_by_cluster, $._config.alerts.flushes_per_hour_failed, $._config.group_by_cluster],
     'for': '5m',
     labels: {
+      severity: 'warning',
+    },
+    annotations: {
+      message: 'Greater than %s flush retries have occurred in the past hour.' % $._config.alerts.flushes_per_hour_failed,
+      runbook_url: 'https://github.com/grafana/tempo/tree/main/operations/tempo-mixin/runbook.md#TempoIngesterFlushesFailing',
+    },
+  },
+  {
+    // wait 10m for failed flushes to self-heal using retries
+    alert: 'TempoIngesterFlushesFailing',
+    expr: |||
+      sum by (%s) (increase(tempo_ingester_flush_failed_retries_total{}[1h])) > %s and
+      sum by (%s) (increase(tempo_ingester_flush_failed_retries_total{}[5m])) > 0
+    ||| % [$._config.group_by_cluster, $._config.alerts.flushes_per_hour_failed, $._config.group_by_cluster],
+    'for': '5m',
+    labels: {
       severity: 'critical',
     },
     annotations: {
-      message: 'Greater than %s flushes have failed in the past hour.' % $._config.alerts.flushes_per_hour_failed,
+      message: 'Greater than %s flush retries have failed in the past hour.' % $._config.alerts.flushes_per_hour_failed,
       runbook_url: 'https://github.com/grafana/tempo/tree/main/operations/tempo-mixin/runbook.md#TempoIngesterFlushesFailing',
     },
   },
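For readers less familiar with jsonnet's `%` string formatting used in the expressions above: once the mixin's config is applied, the critical alert's templated expression expands roughly as follows. This is a standalone sketch rather than code from the repository; the local variable names are illustrative, and the values mirror the rendered yamls/alerts.yaml further down (grouping by cluster and namespace, threshold 2).

```jsonnet
// Standalone sketch of how the templated expr above expands; the locals are
// illustrative and the values mirror the rendered yamls/alerts.yaml below.
local groupByCluster = 'cluster, namespace';
local flushesPerHourFailed = 2;

{
  expr: |||
    sum by (%s) (increase(tempo_ingester_flush_failed_retries_total{}[1h])) > %s and
    sum by (%s) (increase(tempo_ingester_flush_failed_retries_total{}[5m])) > 0
  ||| % [groupByCluster, flushesPerHourFailed, groupByCluster],
}
```

Evaluating this with `jsonnet` produces the same expression text that appears in the generated alerts file below.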
7 changes: 4 additions & 3 deletions operations/tempo-mixin/runbook.md
@@ -86,9 +86,10 @@ How it **works**:
 - If flushing fails, the ingester will keep retrying until restarted
 - Blocks that have been flushed successfully will be deleted from the ingester, by default after 15m
-Failed flushes could be caused by any number of different things: bad block, permissions issues, rate limiting, failing backend,...
-Tempo will continue to retry sending the blocks until it succeeds, but at some point your WAL files will start failing to write due
-to out of disk issues.
+Failed flushes could be caused by any number of different things: bad block,
+permissions issues, rate limiting, failing backend, etc. Tempo will continue to
+retry sending the blocks until it succeeds, but at some point your WAL files
+will start failing to write due to out of disk issues.
 Known issue: this can trigger during a rollout of the ingesters, see [tempo#1035](https://github.com/grafana/tempo/issues/1035).
14 changes: 12 additions & 2 deletions operations/tempo-mixin/yamls/alerts.yaml
@@ -52,14 +52,24 @@
     "for": "5m"
     "labels":
       "severity": "critical"
-  - "alert": "TempoIngesterFlushesFailing"
+  - "alert": "TempoIngesterFlushesUnhealthy"
     "annotations":
-      "message": "Greater than 2 flushes have failed in the past hour."
+      "message": "Greater than 2 flush retries have occurred in the past hour."
       "runbook_url": "https://github.com/grafana/tempo/tree/main/operations/tempo-mixin/runbook.md#TempoIngesterFlushesFailing"
     "expr": |
       sum by (cluster, namespace) (increase(tempo_ingester_failed_flushes_total{}[1h])) > 2 and
       sum by (cluster, namespace) (increase(tempo_ingester_failed_flushes_total{}[5m])) > 0
     "for": "5m"
     "labels":
+      "severity": "warning"
+  - "alert": "TempoIngesterFlushesFailing"
+    "annotations":
+      "message": "Greater than 2 flush retries have failed in the past hour."
+      "runbook_url": "https://github.com/grafana/tempo/tree/main/operations/tempo-mixin/runbook.md#TempoIngesterFlushesFailing"
+    "expr": |
+      sum by (cluster, namespace) (increase(tempo_ingester_flush_failed_retries_total{}[1h])) > 2 and
+      sum by (cluster, namespace) (increase(tempo_ingester_flush_failed_retries_total{}[5m])) > 0
+    "for": "5m"
+    "labels":
       "severity": "critical"
   - "alert": "TempoPollsFailing"
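The threshold of 2 in the rendered alerts above comes from the mixin's `_config` (referenced as `$._config.alerts.flushes_per_hour_failed` in alerts.libsonnet). A consumer vendoring the mixin could raise it with an override along these lines; this is a minimal sketch, assuming a `mixin.libsonnet` entry point at the shown import path and an `alerts` object inside `_config`, both of which should be checked against the mixin's config file.

```jsonnet
// Minimal sketch: raising the flush-retry alert threshold when consuming the
// mixin. The import path is an assumption; the field path follows the
// $._config.alerts.flushes_per_hour_failed reference in alerts.libsonnet.
local tempoMixin = import 'tempo-mixin/mixin.libsonnet';

tempoMixin {
  _config+:: {
    alerts+: {
      flushes_per_hour_failed: 5,  // rendered default above is 2
    },
  },
}
```

Regenerating the YAML from the jsonnet sources then picks up the new threshold in both the warning and critical alerts.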
