Describe the bug
If we set unregister_on_shutdown=false and -ingester.heartbeat-period=0, ingesters will not be able to restart and will get stuck during start-up.
unregister_on_shutdown=false means that the ingester will not remove itself from the ring when shutting down.
Looking at the logs during the rolling update, we can see two log lines (both from ingester-11, the first ingester being restarted) that hint at what is going on:
level=info ts=2021-07-14T17:08:55.058532543Z caller=lifecycler.go:590 msg="existing entry found in ring" state=ACTIVE tokens=512 ring=ingester
AND
level=warn ts=2021-07-14T17:13:54.229603834Z caller=lifecycler.go:238 msg="found an existing instance(s) with a problem in the ring, this instance cannot become ready until this problem is resolved. The /ring http endpoint on the distributor (or single binary) provides visibility into the ring." ring=ingester err="instance ingester-11 in state LEAVING"
Looking at the code, we can see that even with the flag unregisterOnShutdown=false, Cortex still flips the status to LEAVING but does not remove the entry from the ring - this is expected.
The problem arises when the same ingester starts up again. During this stage there is a check to flip the state back from LEAVING to ACTIVE right here, and we can see from the logs emitted by this line that it happened. But even though the state is flipped back to ACTIVE in memory, the change is not persisted to the KV store, as stated in the comment here - Cortex returns nil there.
Because the ingester state is never flipped back to ACTIVE on the KV store, the ingester gets stuck during start-up: the readiness check looks at the KV store (not at memory), as we can see here and here
Why it works when the heartbeat period != 0
When the heartbeat is not disabled, the ingester will eventually flush its in-memory state to the KV store here
Why it works when unregister_on_shutdown=true
When we unregister the ingester on shutdown, the KV store is updated during start-up because the ingester re-adds itself to the ring here
To Reproduce
Steps to reproduce the behavior:
Configure unregister_on_shutdown=false and -ingester.heartbeat-period=0
Scale ingesters up to 3
Perform a rolling restart
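For reference, the reproduction settings might look like the following config fragment. The YAML nesting under `ingester.lifecycler` is an assumption on my part; only the two option names come from this report.

```yaml
ingester:
  lifecycler:
    # disables heartbeats, i.e. -ingester.heartbeat-period=0
    heartbeat_period: 0s
    # keep the ring entry (in LEAVING state) across restarts
    unregister_on_shutdown: false
```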
Expected behavior
All ingesters should restart successfully
Environment:
Infrastructure: Kubernetes
Deployment tool: Helm
Storage Engine
Blocks
Chunks
Additional Context
I think the solution would be to update the KV store if we flip from LEAVING to ACTIVE here
alanprot changed the title from "unregister_on_shutdown=false + -ingester.heartbeat-period=0 does not work." to "unregister_on_shutdown=false + -ingester.heartbeat-period=0 seems to not be working." on Jul 15, 2021
Nice find! Looks pretty straightforward to fix, but I wonder if we can build a unit test that covers this whole configuration space like the replication one does...