Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[FIXED] Health check must not recreate stream/consumer #6362

Merged
merged 2 commits into from
Jan 10, 2025

Conversation

MauriceVanVeen
Copy link
Member

Calling healthz could result in streams and/or consumers to be restarted.
There's a race condition that can happen where a user recreates a stream/consumer and the health check kicks in at that moment. This would result in reviving a just-deleted stream/consumer, resulting in either dead streams/consumers remaining or potentially leaderless states if different RAFT groups would remain.

A stream/consumer must not be restarted in the health check as it has no awareness of what's happening in another part of the system. Is the stream just deleted, is it restarting due to an error, is it actually stopped due to a bug? It can't know, and it shouldn't assume it's safe to restart. Due to the way the JS lock is used combined with creating copies of the stream/consumer assignment means that various race conditions can happen where restarting would be the wrong choice.

More importantly (and put simply), stream/consumer assignments MUST only be changed via meta entries or meta snapshots. Doing it in any other place can result in race conditions/ordering issues. (Just like snapshotting in any other place than in the monitor routine resulted in race conditions before: #6153)

Detecting and correcting RAFT node skew is kept, although likely the health check shouldn't be doing that either. However, there was a bug where RAFT node skew would be detected for a consumer, it would be deleted, but not recreated if it was initially created within <10s. That's now fixed as well.

Signed-off-by: Maurice van Veen github@mauricevanveen.com

Signed-off-by: Maurice van Veen <github@mauricevanveen.com>
Signed-off-by: Maurice van Veen <github@mauricevanveen.com>
@MauriceVanVeen MauriceVanVeen requested a review from a team as a code owner January 10, 2025 12:06
Copy link
Member

@neilalexander neilalexander left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@derekcollison derekcollison merged commit 158d7cd into main Jan 10, 2025
5 checks passed
@derekcollison derekcollison deleted the maurice/health-check-must-not-recreate branch January 10, 2025 22:19
neilalexander added a commit that referenced this pull request Jan 13, 2025
Includes the following:

- #6361
- #6362
- #6364
- #6367

Signed-off-by: Neil Twigg <neil@nats.io>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants