If I restart the query-frontend while queriers are running then we can't achieve -querier.max-concurrent #4391

Closed
@alvinlin123


Describe the bug
If query-frontend and querier are restarted at the same time, or query-frontend is restarted while queriers are running, then -querier.max-concurrent cannot be achieved.

To Reproduce

  1. Restart just the queriers with a rollout restart; do not restart query-frontend.
  2. Make sure your system is in a steady state and you can achieve -querier.max-concurrent.
  3. Restart query-frontend.
  4. Hammer all your query-frontends with expensive queries and observe that -querier.max-concurrent is no longer achievable.

Expected behavior
Should still be able to achieve -querier.max-concurrent.

Environment:
We are running on k8s.

Storage Engine

  • Blocks
  • Chunks

Additional Context
My suspicion is that AddressRemoved in worker.go does not call resetConcurrency().

Imagine the following case:

  • You have 1 querier and 3 query-frontends (fe1, fe2, and fe3).
  • Your -querier.max-concurrent is set to 8.
  • So each query-frontend has at least 2 connections to the querier (8 divided by 3, rounded down). Because 8 is not divisible by 3 and 8 modulo 3 is 2, fe1 and fe2 each get one extra connection to the querier.
  • So fe1 has 3 connections to the querier, fe2 has 3, and fe3 has 2.

Now, we restart the query-frontend, and the DNS Watch on the querier (worker.go) will get to work and start adding and removing addresses.

  • During deployment we will have 6 query-frontends, fe1 to fe6, because we spin up new pods first.
  • So you get into a state where fe1 has 2 connections to the querier, fe2 has 2, and fe3, fe4, fe5, and fe6 have 1 each.
  • Then we spin down the old pods, fe1 to fe3.
  • Because the AddressRemoved method does not call resetConcurrency() to recalculate the load distribution, we end up with fe4, fe5, and fe6 each having 1 connection to the querier. That is just 3 connections in total instead of 8.

Below is a graph showing achievement of -querier.max-concurrent=8 during different phases.

[Image: Grafana graph]
