unregister_on_shutdown=false + -ingester.heartbeat-period=0 seems to not be working. #4365

alanprot · 2021-07-15T17:46:48Z

Describe the bug
If we set unregister_on_shutdown=false and -ingester.heartbeat-period=0, ingesters are not able to restart and get stuck during startup.

unregister_on_shutdown=false means that the ingester will not remove itself from the ring when shutting down.

Looking at the logs during the rolling update, we can see 2 log lines (both from ingester-11, the first ingester being restarted) that hint at what is going on:

level=info ts=2021-07-14T17:08:55.058532543Z caller=lifecycler.go:590 msg="existing entry found in ring" state=ACTIVE tokens=512 ring=ingester


AND

level=warn ts=2021-07-14T17:13:54.229603834Z caller=lifecycler.go:238 msg="found an existing instance(s) with a problem in the ring, this instance cannot become ready until this problem is resolved. The /ring http endpoint on the distributor (or single binary) provides visibility into the ring." ring=ingester err="instance ingester-11 in state LEAVING"

Looking at the code, we can see that even with the flag unregisterOnShutdown=false, cortex still flips the status to LEAVING but does not remove the instance from the ring. This is expected.

The problem appears when the same ingester starts up again. During this stage there is a check that flips the state back from LEAVING to ACTIVE right here, and we can see it happened because the log from this line is present. But even though the state is flipped back to ACTIVE in memory, the change is not persisted to the KV, as stated in the comment here: cortex returns nil there.

As the ingester state is not flipped back to ACTIVE in the KV, the ingester gets stuck on startup. The reason is that we look at the KV during startup (not at the in-memory state), as we can see here and here.
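
The sequence described above can be reproduced with a toy model. This is only a sketch: `KV`, `Lifecycler`, `Shutdown`, `StartBuggy` and `Ready` are hypothetical names made up for illustration and are not the Cortex types or functions. It assumes the readiness check reads the KV while the startup path only updates memory, as described in this issue.

```go
package main

import "fmt"

// State is a simplified stand-in for the ring instance state.
type State string

const (
	Active  State = "ACTIVE"
	Leaving State = "LEAVING"
)

// KV is a toy in-memory substitute for the ring's key-value store.
type KV struct {
	states map[string]State
}

// Lifecycler is a hypothetical, stripped-down model of an ingester lifecycler.
type Lifecycler struct {
	ID    string
	state State // in-memory state only
	kv    *KV
}

// Shutdown models unregister_on_shutdown=false: the entry stays in the
// ring but is marked LEAVING.
func (l *Lifecycler) Shutdown() {
	l.state = Leaving
	l.kv.states[l.ID] = Leaving
}

// StartBuggy models the startup path described in the issue: the state is
// flipped back to ACTIVE in memory, but nothing is written to the KV.
func (l *Lifecycler) StartBuggy() {
	if l.kv.states[l.ID] == Leaving {
		l.state = Active
	}
}

// Ready models the readiness check, which consults the KV rather than the
// in-memory state.
func (l *Lifecycler) Ready() error {
	if s := l.kv.states[l.ID]; s != Active {
		return fmt.Errorf("instance %s in state %s", l.ID, s)
	}
	return nil
}

func main() {
	kv := &KV{states: map[string]State{"ingester-11": Active}}
	l := &Lifecycler{ID: "ingester-11", state: Active, kv: kv}

	l.Shutdown()
	l.StartBuggy() // no heartbeat follows, so the KV is never updated

	fmt.Println("in-memory:", l.state)
	fmt.Println("ready check:", l.Ready())
}
```

In this model the in-memory state ends up ACTIVE while the readiness check keeps returning an error that mentions LEAVING, which matches the two log lines quoted earlier.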

Why it works when heartbeat != 0
- When the heartbeat is not disabled, the ingester eventually flushes its in-memory state to the KV here
Why it works when unregister_on_shutdown=true
- When we unregister the ingester on shutdown, the KV is updated during startup because the ingester re-adds itself to the ring here
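
The two working cases can be sketched the same way. Again this is a simplified model under stated assumptions, not Cortex code: `instance`, `heartbeat` and `start` are hypothetical names, the `ring` map stands in for the KV, and the re-registration path is collapsed into a single write of ACTIVE.

```go
package main

import "fmt"

type State string

const (
	Active  State = "ACTIVE"
	Leaving State = "LEAVING"
)

// instance is a hypothetical model; mem is the in-memory state.
type instance struct {
	id  string
	mem State
}

// heartbeat models the periodic update that copies the in-memory state to
// the KV. With a heartbeat period of 0 it is never called.
func heartbeat(ring map[string]State, in *instance) {
	ring[in.id] = in.mem
}

// start models startup. If the entry is absent (it was removed on shutdown),
// the instance re-adds itself and that write persists ACTIVE. If a LEAVING
// entry exists, only the in-memory state is flipped.
func start(ring map[string]State, in *instance) {
	s, ok := ring[in.id]
	if !ok {
		in.mem = Active
		ring[in.id] = Active
		return
	}
	in.mem = s
	if s == Leaving {
		in.mem = Active // not persisted here
	}
}

func main() {
	// Case 1: entry left in the ring as LEAVING, heartbeat enabled.
	ring := map[string]State{"ingester-11": Leaving}
	in := &instance{id: "ingester-11"}
	start(ring, in)
	fmt.Println("before heartbeat:", ring[in.id])
	heartbeat(ring, in)
	fmt.Println("after heartbeat:", ring[in.id])

	// Case 2: entry was removed on shutdown, no heartbeat needed.
	ring = map[string]State{}
	in = &instance{id: "ingester-11"}
	start(ring, in)
	fmt.Println("after re-register:", ring[in.id])
}
```

In both cases some write reaches the KV after the flip, either through the heartbeat or through re-registration. The failing combination is the one where neither write happens.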

To Reproduce
Steps to reproduce the behavior:

Start Cortex (6b8bd5a)
Configure unregister_on_shutdown=false and -ingester.heartbeat-period=0
Scale ingesters up to 3
Perform a rolling restart

Expected behavior
All ingesters should restart successfully

Environment:

Infrastructure: Kubernetes
Deployment tool: Helm

Storage Engine

Blocks
Chunks

Additional Context
I think the solution would be to update the KV if we flip from LEAVING to ACTIVE here
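
A minimal sketch of that idea, using a toy map in place of the KV. `startFixed` is a hypothetical name, and this is not the actual patch; it only illustrates writing the flipped state back to the store instead of returning without a write.

```go
package main

import "fmt"

type State string

const (
	Active  State = "ACTIVE"
	Leaving State = "LEAVING"
)

// startFixed returns the state to keep in memory. When it finds a LEAVING
// entry it flips it to ACTIVE and also persists the flip to the store.
func startFixed(ring map[string]State, id string) State {
	s, ok := ring[id]
	if ok && s == Leaving {
		ring[id] = Active // persist the flip, not just the in-memory copy
		return Active
	}
	return s
}

func main() {
	ring := map[string]State{"ingester-11": Leaving}
	mem := startFixed(ring, "ingester-11")
	fmt.Println("in-memory:", mem, "kv:", ring["ingester-11"])
}
```

With the write in place, the in-memory state and the stored state agree after startup, so a readiness check that reads the KV no longer depends on a later heartbeat.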


tomwilkie · 2021-07-15T19:16:48Z

Nice find! Looks pretty straightforward to fix, but I wonder if we can build a unit test that covers this whole configuration space like the replication one does...

alanprot changed the title ~~unregister_on_shutdown=false + -ingester.heartbeat-period=0 does not work.~~ unregister_on_shutdown=false + -ingester.heartbeat-period=0 seems to not be working. Jul 15, 2021

alanprot mentioned this issue Jul 15, 2021

Fix Ingesters unable to re-join the cluster then unregister_on_shutdown=false + -ingester.heartbeat-period=0 #4366

Merged

3 tasks

pracucci closed this as completed in #4366 Aug 3, 2021
