
Evenly spread queriers across available nodes #6415

Merged: 6 commits into main from salvacorts/queriers-nodes-even-spread on Jun 20, 2022

Conversation

@salvacorts (Contributor) commented Jun 17, 2022

What this PR does / why we need it:

Currently, each querier runs on a different node. We enforce this with the following anti-affinity rule:

affinity:
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
    - labelSelector:
        matchLabels:
          name: querier
      topologyKey: kubernetes.io/hostname

The original reasoning behind this is to make our systems more resilient to node failures. For example, if all 5 queriers were scheduled on the same node and that node failed, we would not be able to serve queries until the pods were rescheduled and running on another node.

When new queriers are created, they need to be scheduled onto different nodes. If there are not enough nodes in the cluster, new nodes need to be provisioned. This has two implications:

  • Cost: Each node has an implicit cost on top of the workload we deploy onto it.
  • Latency: It takes longer to provision a new node and then schedule the pod onto it than to schedule the pod immediately onto an already available node.

This PR uses Kubernetes' topologySpreadConstraints to spread queriers evenly across the available nodes while still allowing more than one querier to be scheduled onto the same node; see the sketch below.
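
For reference, a constraint of this kind looks roughly as follows. This is a sketch of the Kubernetes API shape rather than the exact manifest this PR generates; the maxSkew of 1 and whenUnsatisfiable: ScheduleAnyway here are assumptions:

topologySpreadConstraints:
- maxSkew: 1                           # per-node querier counts may differ by at most one
  topologyKey: kubernetes.io/hostname  # spread across individual nodes
  whenUnsatisfiable: ScheduleAnyway    # assumption: prefer spreading, never block scheduling
  labelSelector:
    matchLabels:
      name: querier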

Checklist

  • Documentation added
  • Tests updated
  • Is this an important fix or new feature? Add an entry in the CHANGELOG.md.
  • Changes that require user attention or interaction to upgrade are documented in docs/sources/upgrading/_index.md

@salvacorts requested a review from a team as a code owner June 17, 2022 09:06
@grafanabot (Collaborator)

./tools/diff_coverage.sh ../loki-main/test_results.txt test_results.txt ingester,distributor,querier,querier/queryrange,iter,storage,chunkenc,logql,loki

Change in test coverage per package. Green indicates 0 or positive change, red indicates that test coverage for a package fell.

+           ingester	0%
+        distributor	0%
+            querier	0%
+ querier/queryrange	0%
+               iter	0%
+            storage	0%
+           chunkenc	0%
+              logql	0%
+               loki	0.6%

@grafanabot (Collaborator)

./tools/diff_coverage.sh ../loki-main/test_results.txt test_results.txt ingester,distributor,querier,querier/queryrange,iter,storage,chunkenc,logql,loki

Change in test coverage per package. Green indicates 0 or positive change, red indicates that test coverage for a package fell.

+           ingester	0%
-        distributor	-0.3%
+            querier	0%
+ querier/queryrange	0%
+               iter	0%
+            storage	0%
+           chunkenc	0%
+              logql	0%
+               loki	0%

@salvacorts (Contributor, Author)

As suggested by @dannykopping in the design doc, I added an entry to the changelog and the upgrade guide.

@grafanabot (Collaborator)

./tools/diff_coverage.sh ../loki-main/test_results.txt test_results.txt ingester,distributor,querier,querier/queryrange,iter,storage,chunkenc,logql,loki

Change in test coverage per package. Green indicates 0 or positive change, red indicates that test coverage for a package fell.

+           ingester	0%
-        distributor	-0.3%
+            querier	0%
+ querier/queryrange	0%
+               iter	0%
+            storage	0%
+           chunkenc	0%
+              logql	0%
+               loki	0%

@dannykopping (Contributor) left a comment

This looks good @salvacorts, but I think we should make it configurable.

> If you want to keep running up to one querier per node, you will need to revert the changes for production/ksonnet/loki/querier.libsonnet made at 6415.

How about we offer both the node affinity & the topology spread via a config value (defaulted to topology spread), and allow configuring the max skew?

This could later be expanded for all components (defaulted to node affinity for now)

@grafanabot (Collaborator)

./tools/diff_coverage.sh ../loki-main/test_results.txt test_results.txt ingester,distributor,querier,querier/queryrange,iter,storage,chunkenc,logql,loki

Change in test coverage per package. Green indicates 0 or positive change, red indicates that test coverage for a package fell.

+           ingester	0%
+        distributor	0.3%
+            querier	0%
+ querier/queryrange	0%
+               iter	0%
+            storage	0%
+           chunkenc	0%
+              logql	0%
+               loki	0%

@salvacorts (Contributor, Author)

> How about we offer both the node affinity & the topology spread via a config value (defaulted to topology spread), and allow configuring the max skew?

@dannykopping That's a good idea. I just added two new config parameters: querier.use_topology_spread and querier.topology_spread_max_skew. I also updated the upgrade guide to reflect these changes.
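
For illustration, opting back into the old behaviour would then look something like the following in an environment's jsonnet. This is a sketch based on the parameter names above; the exact location and defaults in production/ksonnet/loki/config.libsonnet may differ:

{
  _config+:: {
    querier+: {
      use_topology_spread: false,   // fall back to the one-querier-per-node anti-affinity rule
      topology_spread_max_skew: 1,  // assumption: only consulted when use_topology_spread is true
    },
  },
}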

@dannykopping (Contributor) left a comment

LGTM! Two minor nits

@@ -36,8 +36,7 @@ The output is incredibly verbose as it shows the entire internal config struct u
#### Evenly spread queriers across kubernetes nodes

We now evenly spread queriers across the available kubernetes nodes, but allowing more than one querier to be scheduled into the same node.
If you want to keep running up to one querier per node, you will need to revert the changes for `production/ksonnet/loki/querier.libsonnet`
made at [6415](https://github.com/grafana/loki/pull/6415).
If you want to keep running up to one querier per node, set `$._config.querier.use_topology_spread` to false.
@dannykopping (Contributor)

Suggested change:
-If you want to keep running up to one querier per node, set `$._config.querier.use_topology_spread` to false.
+If you want to run at most a single querier per node, set `$._config.querier.use_topology_spread` to false.

@salvacorts (Contributor, Author)

Done

production/ksonnet/loki/config.libsonnet (comment thread resolved)
@grafanabot (Collaborator)

./tools/diff_coverage.sh ../loki-main/test_results.txt test_results.txt ingester,distributor,querier,querier/queryrange,iter,storage,chunkenc,logql,loki

Change in test coverage per package. Green indicates 0 or positive change, red indicates that test coverage for a package fell.

+           ingester	0%
+        distributor	0%
+            querier	0%
- querier/queryrange	-0.1%
+               iter	0%
+            storage	0%
+           chunkenc	0%
+              logql	0.4%
+               loki	0%

@dannykopping merged commit 1cfc30f into main Jun 20, 2022
@dannykopping deleted the salvacorts/queriers-nodes-even-spread branch June 20, 2022 10:54
salvacorts added a commit that referenced this pull request Jun 23, 2022
* Evenly spread queriers across available nodes

* Fix lint issue

* Add entry to the CHANGELOG and the Upgrade Guide

* Make topology spread configurable

* Apply CR feedback
JordanRushing added a commit to JordanRushing/loki that referenced this pull request Jun 23, 2022
@osg-grafana added the type/docs label and removed the area/docs label Oct 19, 2022
Labels: size/S, type/docs