
unregister_on_shutdown=false + -ingester.heartbeat-period=0 seems to not be working. #4365

Closed
1 of 2 tasks
alanprot opened this issue Jul 15, 2021 · 1 comment · Fixed by #4366

alanprot (Member) commented Jul 15, 2021

Describe the bug
If we set unregister_on_shutdown=false and -ingester.heartbeat-period=0, ingesters are not able to restart: they get stuck during startup.

  • unregister_on_shutdown=false means that the ingester will not remove itself from the ring when shutting down.

Looking at the logs during the rolling update, we can see 2 log lines (both from ingester-11, the first ingester being restarted) that hint at what is going on:

level=info ts=2021-07-14T17:08:55.058532543Z caller=lifecycler.go:590 msg="existing entry found in ring" state=ACTIVE tokens=512 ring=ingester


AND

level=warn ts=2021-07-14T17:13:54.229603834Z caller=lifecycler.go:238 msg="found an existing instance(s) with a problem in the ring, this instance cannot become ready until this problem is resolved. The /ring http endpoint on the distributor (or single binary) provides visibility into the ring." ring=ingester err="instance ingester-11 in state LEAVING"

Looking at the code, we can see that even with the flag unregisterOnShutdown=false, cortex still flips the state to LEAVING on shutdown but does not remove the instance from the ring. This is expected.

The problem arises when the same ingester starts up again. During this stage there is a check that flips the state back from LEAVING to ACTIVE right here, and we know it ran because we see the log from this line. But even though the state is flipped back to ACTIVE in memory, the change is not persisted to the KV, as stated in the comment here: cortex returns nil there.

Because the ingester state is not flipped back to ACTIVE in the KV, the ingester gets stuck starting up: the startup checks read the state from the KV (not from memory), as we can see here and here.

  • Why it works when the heartbeat period != 0
    • When the heartbeat is not disabled, the ingester eventually flushes its in-memory state to the KV here
  • Why it works when unregister_on_shutdown=true
    • When the ingester unregisters on shutdown, it writes to the KV during startup because it re-adds itself to the ring here
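
To make the three cases above concrete, here is a minimal toy model in Go. It is not the Cortex lifecycler: the `lifecycler` type, `kvStore`, and the `restart` helper are all hypothetical names, and the logic only mirrors the behaviour described in this issue (the LEAVING to ACTIVE flip happens in memory only, and readiness is decided from the KV).

```go
package main

import "fmt"

// State is a simplified stand-in for the ring instance state.
type State string

const (
	Active  State = "ACTIVE"
	Leaving State = "LEAVING"
)

// kvStore is a hypothetical in-memory stand-in for the ring KV store.
type kvStore map[string]State

// lifecycler is a toy model for illustration, not the Cortex type.
type lifecycler struct {
	id    string
	state State // in-memory state
	kv    kvStore
}

// shutdown always marks the instance LEAVING and removes the ring entry
// only when unregister is true (unregister_on_shutdown).
func (l *lifecycler) shutdown(unregister bool) {
	l.state = Leaving
	if unregister {
		delete(l.kv, l.id)
		return
	}
	l.kv[l.id] = Leaving
}

// start models the startup path described in the issue.
func (l *lifecycler) start() {
	existing, found := l.kv[l.id]
	if !found {
		// No entry in the ring: re-register, which writes to the KV.
		l.state = Active
		l.kv[l.id] = Active
		return
	}
	if existing == Leaving {
		// The flip happens in memory only; nothing is written back.
		l.state = Active
		return
	}
	l.state = existing
}

// heartbeat flushes the in-memory state to the KV.
func (l *lifecycler) heartbeat() { l.kv[l.id] = l.state }

// ready consults the KV, not the in-memory state.
func (l *lifecycler) ready() bool { return l.kv[l.id] == Active }

// restart runs one shutdown/start cycle and reports readiness.
// heartbeatEnabled stands for a non-zero heartbeat period.
func restart(unregister, heartbeatEnabled bool) bool {
	l := &lifecycler{id: "ingester-11", state: Active, kv: kvStore{"ingester-11": Active}}
	l.shutdown(unregister)
	l.start()
	if heartbeatEnabled {
		l.heartbeat()
	}
	return l.ready()
}

func main() {
	fmt.Println("unregister=false heartbeat=off ready:", restart(false, false))
	fmt.Println("unregister=false heartbeat=on  ready:", restart(false, true))
	fmt.Println("unregister=true  heartbeat=off ready:", restart(true, false))
}
```

In this model only the first combination ends up not ready, which matches the behaviour reported here.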

To Reproduce
Steps to reproduce the behavior:

  1. Start Cortex (6b8bd5a)
  2. Configure unregister_on_shutdown=false and -ingester.heartbeat-period=0
  3. Scale up ingester to 3
  4. Perform a rolling restart

Expected behavior
All ingesters should restart successfully

Environment:

  • Infrastructure: Kubernetes
  • Deployment tool: Helm

Storage Engine

  • Blocks
  • Chunks

Additional Context
I think the solution would be to update the KV if we flip from LEAVING to ACTIVE here
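
A rough sketch of that idea, using a plain map as a stand-in for the KV. This is only an illustration under assumptions: `ringKV`, `startInstance`, and the `persistFlip` switch are hypothetical names, not the actual lifecycler code or the change that was eventually merged.

```go
package main

import "fmt"

// ringKV is a hypothetical stand-in for the ring KV store: instance ID -> state.
type ringKV map[string]string

// startInstance returns the in-memory state the instance ends up with.
// When it finds its own entry in LEAVING it flips to ACTIVE; with
// persistFlip set, the flip is also written back to the KV.
func startInstance(kv ringKV, id string, persistFlip bool) string {
	state, found := kv[id]
	switch {
	case !found:
		state = "ACTIVE"
		kv[id] = state // registering a new entry always writes to the KV
	case state == "LEAVING":
		state = "ACTIVE"
		if persistFlip {
			kv[id] = state // the proposed change: persist the flip
		}
	}
	return state
}

func main() {
	for _, persist := range []bool{false, true} {
		kv := ringKV{"ingester-11": "LEAVING"}
		mem := startInstance(kv, "ingester-11", persist)
		fmt.Printf("persistFlip=%v memory=%s kv=%s\n", persist, mem, kv["ingester-11"])
	}
}
```

Without the write-back the KV entry stays LEAVING while memory says ACTIVE; with it, both agree, so a readiness check that reads the KV no longer depends on a later heartbeat.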

alanprot changed the title from "unregister_on_shutdown=false + -ingester.heartbeat-period=0 does not work." to "unregister_on_shutdown=false + -ingester.heartbeat-period=0 seems to not be working." Jul 15, 2021

tomwilkie (Contributor) commented

Nice find! Looks pretty straightforward to fix, but I wonder if we can build a unit test that covers this whole configuration space like the replication one does...
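
One possible shape for such a test, sketched in Go under heavy assumptions: it simply enumerates every combination of the two settings and collects the ones where a restart does not succeed. `ringConfig`, `allConfigs`, and `failingConfigs` are hypothetical names, and `reportedBehaviour` is a placeholder that encodes what this issue reports; a real test would restart an actual ingester against a test KV instead.

```go
package main

import "fmt"

// ringConfig holds the two settings from this issue. HeartbeatEnabled
// stands for a non-zero -ingester.heartbeat-period.
type ringConfig struct {
	UnregisterOnShutdown bool
	HeartbeatEnabled     bool
}

// allConfigs enumerates every combination of the two settings.
func allConfigs() []ringConfig {
	var out []ringConfig
	for _, unregister := range []bool{false, true} {
		for _, heartbeat := range []bool{false, true} {
			out = append(out, ringConfig{
				UnregisterOnShutdown: unregister,
				HeartbeatEnabled:     heartbeat,
			})
		}
	}
	return out
}

// failingConfigs runs restartOK for each combination and returns the
// combinations for which the restart did not succeed.
func failingConfigs(restartOK func(ringConfig) bool) []ringConfig {
	var failed []ringConfig
	for _, c := range allConfigs() {
		if !restartOK(c) {
			failed = append(failed, c)
		}
	}
	return failed
}

// reportedBehaviour is a placeholder that encodes the behaviour reported
// in this issue; it does not exercise any real ingester code.
func reportedBehaviour(c ringConfig) bool {
	return c.UnregisterOnShutdown || c.HeartbeatEnabled
}

func main() {
	fmt.Printf("%+v\n", failingConfigs(reportedBehaviour))
}
```

Enumerating the combinations rather than listing cases by hand means a newly added setting only needs one more loop to be covered.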
