[FIXED] Health check must not recreate stream/consumer #6362
Calling healthz could result in streams and/or consumers being restarted.
A race condition can occur when a user recreates a stream/consumer and the health check kicks in at that moment. The health check would revive a just-deleted stream/consumer, leaving behind either dead streams/consumers or, if different RAFT groups remain, potentially leaderless states.
A stream/consumer must not be restarted in the health check, as the health check has no awareness of what's happening in another part of the system. Was the stream just deleted, is it restarting due to an error, or is it actually stopped due to a bug? It can't know, and it shouldn't assume it's safe to restart. The way the JS lock is used, combined with creating copies of the stream/consumer assignment, means that various race conditions can happen where restarting would be the wrong choice.
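To make the hazard concrete, here is a minimal sketch of one such interleaving, written sequentially so it is deterministic. The `streams` type and both helper functions are hypothetical stand-ins for illustration, not nats-server code.

```go
package main

import "fmt"

// streams is a hypothetical map of stream name -> running flag. It stands in
// for the real assignment state and is not a nats-server type.
type streams map[string]bool

// snapshotStopped mimics a health check copying the set of streams it
// believes are stopped.
func snapshotStopped(s streams) []string {
	var stopped []string
	for name, running := range s {
		if !running {
			stopped = append(stopped, name)
		}
	}
	return stopped
}

// restartFromCopy mimics the unsafe step: acting on the earlier copy without
// knowing what happened to the stream in the meantime.
func restartFromCopy(s streams, stale []string) {
	for _, name := range stale {
		s[name] = true
	}
}

func main() {
	s := streams{"ORDERS": false}

	stale := snapshotStopped(s) // health check takes its copy
	delete(s, "ORDERS")         // the stream is deleted in between
	restartFromCopy(s, stale)   // health check acts on the stale copy

	_, revived := s["ORDERS"]
	fmt.Println("revived after delete:", revived)
}
```

The deleted stream reappears because the restart decision was made from a copy that no longer reflects the current state.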
More importantly (and put simply), stream/consumer assignments MUST only be changed via meta entries or meta snapshots. Changing them anywhere else can result in race conditions/ordering issues. (Similarly, snapshotting anywhere other than the monitor routine resulted in race conditions before: #6153)
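The rule above can be sketched as a health check that only reads and reports, with a single mutation path. This is an illustration under assumptions: `registry`, `assignment`, `applyMetaEntry`, and this `healthz` method are hypothetical names, not the actual nats-server implementation.

```go
package main

import (
	"fmt"
	"sort"
	"sync"
)

// assignment is a hypothetical, simplified stream/consumer assignment.
type assignment struct {
	name    string
	running bool
}

// registry is a hypothetical holder of assignments guarded by a lock.
type registry struct {
	mu     sync.RWMutex
	assets map[string]*assignment
}

// applyMetaEntry is the only function in this sketch that adds or removes
// assignments, mirroring "only via meta entries or meta snapshots".
func (r *registry) applyMetaEntry(name string, deleted bool) {
	r.mu.Lock()
	defer r.mu.Unlock()
	if deleted {
		delete(r.assets, name)
		return
	}
	r.assets[name] = &assignment{name: name, running: true}
}

// healthz only reads state and reports problems. It never restarts or
// recreates anything.
func (r *registry) healthz() []string {
	r.mu.RLock()
	defer r.mu.RUnlock()
	var problems []string
	for _, a := range r.assets {
		if !a.running {
			problems = append(problems, fmt.Sprintf("%s: not running", a.name))
		}
	}
	sort.Strings(problems)
	return problems
}

func main() {
	r := &registry{assets: map[string]*assignment{}}
	r.applyMetaEntry("ORDERS", false)
	r.assets["ORDERS"].running = false // simulate a stopped stream

	fmt.Println(r.healthz()) // reports the problem, leaves state untouched

	r.applyMetaEntry("ORDERS", true) // deletion arrives via the meta path
	fmt.Println(r.healthz())
}
```

Because the health check holds only a read lock and never writes, a concurrent delete cannot be undone by it; the worst case is a stale report.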
Detecting and correcting RAFT node skew is kept, although the health check likely shouldn't be doing that either. However, there was a bug where RAFT node skew would be detected for a consumer and the consumer would be deleted, but not recreated if it had initially been created less than 10s earlier. That's now fixed as well.
Signed-off-by: Maurice van Veen <github@mauricevanveen.com>