Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ra_server: handle higher-term AERs in receive_snapshot state #470

Merged
merged 4 commits into from
Sep 12, 2024

Conversation

keynslug
Copy link
Contributor

Proposed Changes

When in receive_snapshot state, server should handle at least higher-term AERs gracefully. Otherwise, server could become stuck in the receive_snapshot state if the current snapshot sender goes away and the new leader gets elected, before snapshot transfer is complete. In this case the new leader will keep sending AERs to this stuck server, which it will ignore but (unfortunately) keep resetting receive_snapshot_timeout each time. This constant timer reset prevents receive_snapshot_timeout timer to fire, unless receive_snapshot timeout option is impractically small.

Types of Changes

What types of changes does your code introduce to this project?
Put an x in the boxes that apply

  • Bug fix (non-breaking change, no corresponding GH issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Documentation (correction or otherwise)
  • Cosmetics (whitespace, appearance)

Checklist

Put an x in the boxes that apply. You can also fill these out after creating
the PR. If you're unsure about any of them, don't hesitate to ask on the
mailing list. We're here to help! This is simply a reminder of what we are
going to look for before merging your code.

  • I have read the CONTRIBUTING.md document
  • I have signed the CA (see https://cla.pivotal.io/sign/rabbitmq)
  • All tests pass locally with my changes
  • I have added tests that prove my fix is effective or that my feature works
  • I have added necessary documentation (if appropriate)
  • Any dependent changes have been merged and published in related repositories

@kjnilsson kjnilsson self-requested a review September 11, 2024 19:00
@kjnilsson
Copy link
Contributor

Thanks for this, it looks ok to me. I will take a closer look tomorrow.

Copy link
Member

@the-mikedavis the-mikedavis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you add a test case for this branch?

ra/test/ra_server_SUITE.erl

Lines 2086 to 2108 in 14eca5f

receive_snapshot_timeout(_Config) ->
N1 = ?N1, N2 = ?N2, N3 = ?N3,
#{N3 := {_, FState0 = #{cluster := Config,
current_term := CurTerm}, _}}
= init_servers([N1, N2, N3], {module, ra_queue, #{}}),
FState = FState0#{last_applied => 3},
LastTerm = 1, % snapshot term
Idx = 6,
ISRpc = #install_snapshot_rpc{term = CurTerm, leader_id = N1,
meta = snap_meta(Idx, LastTerm, Config),
chunk_state = {1, last},
data = []},
{receive_snapshot, FState1,
[{next_event, ISRpc}, {record_leader_msg, _}]} =
ra_server:handle_follower(ISRpc, FState),
%% revert back to follower on timeout
{follower, #{log := Log}, _}
= ra_server:handle_receive_snapshot(receive_snapshot_timeout, FState1),
%% snapshot should be aborted
SS = ra_log:snapshot_state(Log),
undefined = ra_snapshot:accepting(SS),
ok.
can be mostly reused and we can swap out the receive_snapshot_timeout message for an AER from the next term

I've pushed some changes to the CI so if you rebase on main I believe the CI should pass on the next run

"abdicates term: ~b!",
[LogId, Msg#append_entries_rpc.leader_id,
Term, CurTerm]),
{follower, update_term(Term, clear_leader_id(State)),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we need to call ra_snapshot:abort_accept/1 here?

@kjnilsson
Copy link
Contributor

kjnilsson commented Sep 12, 2024

I think we should add some more changes on how the state_timeout is reset to avoid to being reset for any message that isn't snapshot related. So that this action is only emitted when an #install_snapshot_rpc{} message is received. I think it would be fine to check the message type in ra_server_proc at least for now.

Otherwise, ra_server could become stuck in the `receive_snapshot` state
if the current snapshot sender goes away and the new leader gets
elected, before snapshot transfer is complete. In this case the new
leader will keep sending AERs to this stuck server, which it will ignore
but (unfortunately) keep resetting `receive_snapshot_timeout` each time.
This constant timer reset prevents `receive_snapshot_timeout` timer to
fire, unless `receive_snapshot` timeout option is impractically small.
Copy link
Contributor

@kjnilsson kjnilsson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a formatting change.

src/ra_server_proc.erl Show resolved Hide resolved
Otherwise, few other events and RPCs, even when not explicitly
handled, can inadvertedly cause the `receive_snapshot` timeout to
reset. This way `ra_server_proc` can become stuck in that state
under certain circumstances.
@michaelklishin
Copy link
Member

The RabbitMQ OCI build failure is due to the fact that external contributions do not have access to GitHub secrets. It has nothing to do with the code changes.

@the-mikedavis the-mikedavis merged commit f6ab5b9 into rabbitmq:main Sep 12, 2024
7 of 8 checks passed
@the-mikedavis
Copy link
Member

Thanks!

@the-mikedavis the-mikedavis added this to the 2.14.1 milestone Sep 12, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants