Enhance MimirRequestLatency runbook with more advice (#1967)

* Enhance MimirRequestLatency runbook with more advice

Signed-off-by: Arve Knudsen <arve.knudsen@gmail.com>
Co-authored-by: Marco Pracucci <marco@pracucci.com>
grafana · Jun 3, 2022 · 0088e48
1 parent 8bff6d0
commit 0088e48
Showing 1 changed file with 9 additions and 0 deletions.
diff --git a/docs/sources/operators-guide/mimir-runbooks/_index.md b/docs/sources/operators-guide/mimir-runbooks/_index.md
@@ -222,6 +222,15 @@ How to **investigate**:
         - Check `Memcached Overview` dashboard
         - If the memcached eviction rate is high, scale up the memcached replicas. Check the recommendations in the `Mimir / Scaling` dashboard and make reasonable adjustments as necessary.
         - If memcached eviction rate is zero or very low, then it may be caused by "first time" queries
+      - Cache query timeouts
+        - Check the store-gateway logs for warnings about timed-out Memcached queries (example query: `{namespace="example-mimir-cluster", name=~"store-gateway.*"} |= "level=warn" |= "memcached" |= "timeout"`)
+        - If many Memcached queries are timing out, consider whether the store-gateway Memcached timeout setting (`-blocks-storage.bucket-store.chunks-cache.memcached.timeout`) is high enough
+    - Check the "Queue length" panel of the `Mimir / Queries` dashboard to determine whether queries are waiting in the queue because queriers are busy (a queue length above 0 for a sustained period indicates this)
+      - If queries are waiting in the queue
+        - Consider scaling up the number of queriers if they're not auto-scaled; if they are auto-scaled, check the auto-scaling parameters
+      - If queries are not waiting in the queue
+        - Consider [enabling query sharding]({{< relref "../architecture/query-sharding/index.md#how-to-enable-query-sharding" >}}) if not already enabled, to increase query parallelism
+        - If query sharding is already enabled, consider increasing the total number of query shards (`query_sharding_total_shards`) for tenants submitting slow queries, so their queries can be further parallelized
 
 #### Alertmanager
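
The queue-length triage added in this change can be read as a small decision tree. The following is a minimal sketch of that logic, not part of Mimir or of this commit. The function names, the `min_samples` threshold, and the returned strings are hypothetical, and "for some time" is assumed here to mean the last few consecutive samples of the "Queue length" panel.

```python
from typing import Sequence


def sustained_queueing(queue_lengths: Sequence[float], min_samples: int = 3) -> bool:
    """Return True if the most recent `min_samples` samples are all above 0.

    `min_samples` is an illustrative stand-in for "queue length > 0 for some time".
    """
    if len(queue_lengths) < min_samples:
        return False
    return all(value > 0 for value in queue_lengths[-min_samples:])


def suggest_action(
    queue_lengths: Sequence[float],
    autoscaled: bool,
    sharding_enabled: bool,
    min_samples: int = 3,
) -> str:
    """Map the runbook's branches to a suggested next step (hypothetical helper)."""
    if sustained_queueing(queue_lengths, min_samples):
        # Queries are waiting in the queue: queriers are the bottleneck.
        if autoscaled:
            return "check querier auto-scaling parameters"
        return "scale up queriers"
    # Queries are not waiting in the queue: look at query parallelism instead.
    if not sharding_enabled:
        return "enable query sharding"
    return "increase query_sharding_total_shards for tenants with slow queries"


if __name__ == "__main__":
    print(suggest_action([0, 2, 5, 4], autoscaled=False, sharding_enabled=True))
```

A single sample above 0 is deliberately not treated as queueing, since the runbook asks for a queue length above 0 "for some time" rather than a momentary spike.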