From 18c3cb96d086a7e68a527e8c87839f06ec36f2d9 Mon Sep 17 00:00:00 2001
From: Marco Pracucci
Date: Thu, 1 Jul 2021 16:05:04 +0200
Subject: [PATCH 1/2] Added playbook for CortexFrontendQueriesStuck and CortexSchedulerQueriesStuck

Signed-off-by: Marco Pracucci
---
 jsonnet/mimir-mixin/docs/playbooks.md | 24 ++++++++++++++++++++++--
 1 file changed, 22 insertions(+), 2 deletions(-)

diff --git a/jsonnet/mimir-mixin/docs/playbooks.md b/jsonnet/mimir-mixin/docs/playbooks.md
index 4373bad8ac7..99afc0a53cd 100644
--- a/jsonnet/mimir-mixin/docs/playbooks.md
+++ b/jsonnet/mimir-mixin/docs/playbooks.md
@@ -408,11 +408,31 @@ _TODO: this playbook has not been written yet._
 
 ### CortexFrontendQueriesStuck
 
-_TODO: this playbook has not been written yet._
+This alert fires if Cortex is running without query-scheduler and queries are piling up in the query-frontend queue.
+
+The procedure to investigate it is the same as the one for [`CortexSchedulerQueriesStuck`](#CortexSchedulerQueriesStuck): please see the other playbook for more details.
 
 ### CortexSchedulerQueriesStuck
 
-_TODO: this playbook has not been written yet._
+This alert fires if Cortex is queries are piling up in the query-scheduler.
+
+How it **works**:
+- A query-frontend API endpoint is called to execute a query
+- The query-frontend enqueues the request to the query-scheduler
+- The query-scheduler is responsible to dispatch enqueued queries to idle querier workers
+- The querier runs the query, sends the response back directly to the query-frontend and notifies the query-scheduler
+
+How to **investigate**:
+- Are queriers in a crash loop (eg. OOMKilled)?
+  - `OOMKilled`: temporarily increase queriers memory request/limit
+  - `panic`: look for the stack trace in the logs and investigate from there
+- Is QPS increased?
+  - Scale up queriers to satisfy the increased workload
+- Is query latency increased?
+  - An increased latency reduces the number of queries we can run / sec: once all workers are busy, new queries will pile up in the queue
+  - Temporarily scale up queriers to try to stop the bleed
+  - Check the `Cortex / Slow Queries` dashboard to see if a specific tenant is running heavy queries
+  - If it's a multi-tenant Cortex cluster and shuffle-sharing is disabled for queriers, you may consider to enable it only for that specific tenant to reduce its blast radius. To enable queriers shuffle-sharding for a single tenant you need to set the `max_queriers_per_tenant` limit override for the specific tenant (the value should be set to the number of queriers assigned to the tenant).
 
 ### CortexCacheRequestErrors

From 1ba6047c98fad2a80139dfc0111ac5b52a51a3e2 Mon Sep 17 00:00:00 2001
From: Marco Pracucci
Date: Fri, 2 Jul 2021 14:16:04 +0200
Subject: [PATCH 2/2] Addressed review comments

Signed-off-by: Marco Pracucci
---
 jsonnet/mimir-mixin/docs/playbooks.md | 13 ++++++++-----
 1 file changed, 8 insertions(+), 5 deletions(-)

diff --git a/jsonnet/mimir-mixin/docs/playbooks.md b/jsonnet/mimir-mixin/docs/playbooks.md
index 99afc0a53cd..acfc1ff443d 100644
--- a/jsonnet/mimir-mixin/docs/playbooks.md
+++ b/jsonnet/mimir-mixin/docs/playbooks.md
@@ -414,13 +414,13 @@ The procedure to investigate it is the same as the one for [`CortexSchedulerQuer
 
 ### CortexSchedulerQueriesStuck
 
-This alert fires if Cortex is queries are piling up in the query-scheduler.
+This alert fires if queries are piling up in the query-scheduler.
 
 How it **works**:
 - A query-frontend API endpoint is called to execute a query
 - The query-frontend enqueues the request to the query-scheduler
-- The query-scheduler is responsible to dispatch enqueued queries to idle querier workers
-- The querier runs the query, sends the response back directly to the query-frontend and notifies the query-scheduler
+- The query-scheduler is responsible for dispatching enqueued queries to idle querier workers
+- The querier runs the query, sends the response back directly to the query-frontend and notifies the query-scheduler that it can process another query
 
 How to **investigate**:
 - Are queriers in a crash loop (eg. OOMKilled)?
@@ -431,8 +431,11 @@ How to **investigate**:
 - Is query latency increased?
   - An increased latency reduces the number of queries we can run / sec: once all workers are busy, new queries will pile up in the queue
   - Temporarily scale up queriers to try to stop the bleed
-  - Check the `Cortex / Slow Queries` dashboard to see if a specific tenant is running heavy queries
-  - If it's a multi-tenant Cortex cluster and shuffle-sharing is disabled for queriers, you may consider to enable it only for that specific tenant to reduce its blast radius. To enable queriers shuffle-sharding for a single tenant you need to set the `max_queriers_per_tenant` limit override for the specific tenant (the value should be set to the number of queriers assigned to the tenant).
+  - Check if a specific tenant is running heavy queries
+    - Run `sum by (user) (cortex_query_scheduler_queue_length{namespace="<namespace>"}) > 0` to find tenants with enqueued queries
+    - Check the `Cortex / Slow Queries` dashboard to find slow queries
+    - On a multi-tenant Cortex cluster with **shuffle-sharding for queriers disabled**, you may consider enabling it for that specific tenant to reduce its blast radius. To enable shuffle-sharding of queriers for a single tenant, set the `max_queriers_per_tenant` limit override for that tenant (the value should be set to the number of queriers assigned to the tenant).
+    - On a multi-tenant Cortex cluster with **shuffle-sharding for queriers enabled**, you may consider temporarily increasing the shard size for the affected tenants: be aware that this could affect other tenants too, reducing the resources available to run their queries. Alternatively, you may choose to do nothing and let Cortex return errors for that tenant once its per-tenant queue is full.
 
 ### CortexCacheRequestErrors
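For reference, the `max_queriers_per_tenant` override mentioned in the last bullet is a per-tenant limit, typically set in the `overrides` section of the Cortex runtime configuration. A minimal sketch, assuming a hypothetical tenant ID `tenant-a` and a shard size of 10 queriers (both values are placeholders, purely illustrative):

```yaml
# Per-tenant limit overrides in the Cortex runtime configuration file.
# "tenant-a" and the value 10 are placeholders: set the value to the
# number of queriers you want assigned to that tenant.
overrides:
  tenant-a:
    max_queriers_per_tenant: 10
```

The runtime configuration is reloaded periodically, so an override like this should take effect without restarting the queriers or the query-scheduler.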