node_status_backend: reset backoff on peer checkin #11342
/ci-repeat 4
The change makes sense to me.
This change looks good. I am wondering whether we should also introduce a configuration option to set the max backoff time for node_status connections.
Failures: (unrelated, known)
Done. Network partition is an interesting situation.
When a peer restarts and a backoff is applied locally, the backoff needs to be reset once the peer is available again. Otherwise the transport does not reconnect until the entire backoff elapses, so the peer stays marked unavailable for downstream consumers such as the partition balancer.
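The reset-on-checkin idea can be sketched roughly as follows. This is a minimal illustration in Python, not the node_status_backend code: the class `PeerBackoff`, its method names, and the 100 ms base delay are all hypothetical, and the upper bound on the delay is left out for brevity.

```python
class PeerBackoff:
    """Hypothetical per-peer reconnect backoff (illustration only)."""

    def __init__(self, base_ms=100):
        self.base_ms = base_ms      # assumed base delay, not from the PR
        self.failures = 0           # consecutive failed connection attempts
        self.next_attempt_ms = 0    # earliest time a reconnect may be tried

    def on_connect_failure(self, now_ms):
        # Double the delay on every consecutive failure.
        delay = self.base_ms * (2 ** self.failures)
        self.failures += 1
        self.next_attempt_ms = now_ms + delay
        return delay

    def can_reconnect(self, now_ms):
        return now_ms >= self.next_attempt_ms

    def on_peer_checkin(self):
        # The peer reached us, so it is available again: drop the
        # pending backoff instead of waiting for it to elapse.
        self.failures = 0
        self.next_attempt_ms = 0
```

Without the `on_peer_checkin` step, `can_reconnect` keeps returning false until the accumulated delay has passed, even though the peer is already back.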
Adds the node_status_reconnect_max_backoff_ms cluster configuration, which defaults to 15s.
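As a rough illustration of what a max-backoff setting bounds, here is a small Python sketch. Only the 15s default comes from the commit message; the function name `capped_backoff_ms`, the exponential growth, and the 100 ms base delay are assumptions made for the example and may not match the actual transport.

```python
# Default taken from the commit message above (15s).
NODE_STATUS_RECONNECT_MAX_BACKOFF_MS = 15_000


def capped_backoff_ms(failures, base_ms=100,
                      max_ms=NODE_STATUS_RECONNECT_MAX_BACKOFF_MS):
    """Delay before the next reconnect attempt after `failures`
    consecutive failures, never exceeding max_ms (hypothetical helper)."""
    return min(base_ms * (2 ** failures), max_ms)
```

A lower maximum shortens the worst-case wait before a reconnect attempt, at the cost of more frequent attempts against a peer that is still down.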
The last force push was to fix conflicts.
There seems to be a related failure:
Umm, looks like the debug build strikes again. Taking a look.
Wrap it in a waiter. With debug builds the backend can potentially take some additional time to reach the desired state, especially after invalidating all the transports after resetting a backoff.
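The waiter pattern mentioned here can be sketched as below. This is a generic Python polling helper written for illustration; the name `wait_until` and its parameters are assumptions, and it is not the helper used in the actual test.

```python
import time


def wait_until(condition, timeout_sec, backoff_sec=0.1):
    """Poll `condition` until it returns True or `timeout_sec` elapses.

    Hypothetical helper: instead of asserting on the state once, the
    test retries, which tolerates slower (e.g. debug) builds.
    """
    deadline = time.monotonic() + timeout_sec
    while True:
        if condition():
            return
        if time.monotonic() >= deadline:
            raise TimeoutError("condition not met within timeout")
        time.sleep(backoff_sec)
```

A test would then call something like `wait_until(lambda: peer_is_available(), timeout_sec=10)`, where `peer_is_available` stands in for whatever state check the test performs.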
When a peer restarts and a backoff is applied locally, the backoff needs
to be reset once the peer is available again. Otherwise the transport
does not reconnect until the entire backoff elapses, so the peer stays
marked unavailable for downstream consumers such as the partition balancer.
Fixes #5795
Fixes #11307
Fixes #11276
Backports Required
Release Notes