-
Notifications
You must be signed in to change notification settings - Fork 96
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ra_server: handle higher-term AERs in receive_snapshot
state
#470
Conversation
Thanks for this, it looks ok to me. I will take a closer look tomorrow. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you add a test case for this branch?
Lines 2086 to 2108 in 14eca5f
receive_snapshot_timeout(_Config) -> | |
N1 = ?N1, N2 = ?N2, N3 = ?N3, | |
#{N3 := {_, FState0 = #{cluster := Config, | |
current_term := CurTerm}, _}} | |
= init_servers([N1, N2, N3], {module, ra_queue, #{}}), | |
FState = FState0#{last_applied => 3}, | |
LastTerm = 1, % snapshot term | |
Idx = 6, | |
ISRpc = #install_snapshot_rpc{term = CurTerm, leader_id = N1, | |
meta = snap_meta(Idx, LastTerm, Config), | |
chunk_state = {1, last}, | |
data = []}, | |
{receive_snapshot, FState1, | |
[{next_event, ISRpc}, {record_leader_msg, _}]} = | |
ra_server:handle_follower(ISRpc, FState), | |
%% revert back to follower on timeout | |
{follower, #{log := Log}, _} | |
= ra_server:handle_receive_snapshot(receive_snapshot_timeout, FState1), | |
%% snapshot should be aborted | |
SS = ra_log:snapshot_state(Log), | |
undefined = ra_snapshot:accepting(SS), | |
ok. |
receive_snapshot_timeout
message for an AER from the next term
I've pushed some changes to the CI so if you rebase on main
I believe the CI should pass on the next run
src/ra_server.erl
Outdated
"abdicates term: ~b!", | ||
[LogId, Msg#append_entries_rpc.leader_id, | ||
Term, CurTerm]), | ||
{follower, update_term(Term, clear_leader_id(State)), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we need to call ra_snapshot:abort_accept/1
here?
I think we should add some more changes on how the |
Otherwise, ra_server could become stuck in the `receive_snapshot` state if the current snapshot sender goes away and the new leader gets elected, before snapshot transfer is complete. In this case the new leader will keep sending AERs to this stuck server, which it will ignore but (unfortunately) keep resetting `receive_snapshot_timeout` each time. This constant timer reset prevents `receive_snapshot_timeout` timer to fire, unless `receive_snapshot` timeout option is impractically small.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just a formatting change.
Otherwise, few other events and RPCs, even when not explicitly handled, can inadvertedly cause the `receive_snapshot` timeout to reset. This way `ra_server_proc` can become stuck in that state under certain circumstances.
The RabbitMQ OCI build failure is due to the fact that external contributions do not have access to GitHub secrets. It has nothing to do with the code changes. |
Thanks! |
Proposed Changes
When in
receive_snapshot
state, server should handle at least higher-term AERs gracefully. Otherwise, server could become stuck in thereceive_snapshot
state if the current snapshot sender goes away and the new leader gets elected, before snapshot transfer is complete. In this case the new leader will keep sending AERs to this stuck server, which it will ignore but (unfortunately) keep resettingreceive_snapshot_timeout
each time. This constant timer reset preventsreceive_snapshot_timeout
timer to fire, unlessreceive_snapshot
timeout option is impractically small.Types of Changes
What types of changes does your code introduce to this project?
Put an
x
in the boxes that applyChecklist
Put an
x
in the boxes that apply. You can also fill these out after creatingthe PR. If you're unsure about any of them, don't hesitate to ask on the
mailing list. We're here to help! This is simply a reminder of what we are
going to look for before merging your code.
CONTRIBUTING.md
documentAny dependent changes have been merged and published in related repositories