ra_server: handle higher-term AERs in `receive_snapshot` state #470

keynslug · 2024-09-11T11:44:34Z

Proposed Changes

When in the `receive_snapshot` state, the server should handle at least higher-term AERs gracefully. Otherwise, the server can become stuck in `receive_snapshot` if the current snapshot sender goes away and a new leader gets elected before the snapshot transfer is complete. In that case the new leader keeps sending AERs to the stuck server, which ignores them but (unfortunately) resets `receive_snapshot_timeout` each time. This constant reset prevents the `receive_snapshot_timeout` timer from ever firing, unless the `receive_snapshot` timeout option is impractically small.
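
For illustration, the intended behaviour can be sketched as a small, self-contained module. This is not the actual `ra_server` code: the module name, the `on_msg/2` function, the trimmed-down `#append_entries_rpc{}` record and the map-based state are assumptions made for this sketch, and re-delivering the message via `next_event` is likewise assumed.

```erlang
-module(receive_snapshot_sketch).
-export([on_msg/2]).

%% Trimmed-down stand-in for ra's record; only the fields used here.
-record(append_entries_rpc, {term, leader_id}).

%% A higher-term AER means a new leader exists, so the snapshot being
%% received is stale: step down to follower, adopt the new term, forget
%% the old leader and re-deliver the message to the follower state.
on_msg(#append_entries_rpc{term = Term} = Msg,
       #{current_term := CurTerm} = State) when Term > CurTerm ->
    {follower,
     State#{current_term => Term, leader_id => undefined},
     [{next_event, Msg}]};
%% Anything else is ignored here; the server keeps waiting for chunks.
on_msg(_Msg, State) ->
    {receive_snapshot, State, []}.
```

With this shape, an AER carrying term 5 sent to a server at term 4 moves it to `follower`, while a same-term AER leaves it in `receive_snapshot`.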

Types of Changes

What types of changes does your code introduce to this project?
Put an x in the boxes that apply

Bug fix (non-breaking change, no corresponding GH issue)
New feature (non-breaking change which adds functionality)
Breaking change (fix or feature that would cause existing functionality to not work as expected)
Documentation (correction or otherwise)
Cosmetics (whitespace, appearance)

Checklist

Put an x in the boxes that apply. You can also fill these out after creating
the PR. If you're unsure about any of them, don't hesitate to ask on the
mailing list. We're here to help! This is simply a reminder of what we are
going to look for before merging your code.

I have read the CONTRIBUTING.md document
I have signed the CA (see https://cla.pivotal.io/sign/rabbitmq)
All tests pass locally with my changes
I have added tests that prove my fix is effective or that my feature works
I have added necessary documentation (if appropriate)
~~Any dependent changes have been merged and published in related repositories~~

kjnilsson · 2024-09-11T19:01:33Z

Thanks for this, it looks ok to me. I will take a closer look tomorrow.

the-mikedavis

Could you add a test case for this branch?

ra/test/ra_server_SUITE.erl

Lines 2086 to 2108 in 14eca5f

    receive_snapshot_timeout(_Config) ->
        N1 = ?N1, N2 = ?N2, N3 = ?N3,
        #{N3 := {_, FState0 = #{cluster := Config,
                                current_term := CurTerm}, _}}
        = init_servers([N1, N2, N3], {module, ra_queue, #{}}),
        FState = FState0#{last_applied => 3},
        LastTerm = 1, % snapshot term
        Idx = 6,
        ISRpc = #install_snapshot_rpc{term = CurTerm, leader_id = N1,
                                      meta = snap_meta(Idx, LastTerm, Config),
                                      chunk_state = {1, last},
                                      data = []},
        {receive_snapshot, FState1,
         [{next_event, ISRpc}, {record_leader_msg, _}]} =
            ra_server:handle_follower(ISRpc, FState),
        %% revert back to follower on timeout
        {follower, #{log := Log}, _}
        = ra_server:handle_receive_snapshot(receive_snapshot_timeout, FState1),
        %% snapshot should be aborted
        SS = ra_log:snapshot_state(Log),
        undefined = ra_snapshot:accepting(SS),
        ok.

This test can be mostly reused: we can swap out the `receive_snapshot_timeout` message for an AER from the next term.

I've pushed some changes to the CI, so if you rebase on main I believe the CI should pass on the next run.

kjnilsson · 2024-09-12T09:31:00Z

src/ra_server.erl

+          "abdicates term: ~b!",
+          [LogId, Msg#append_entries_rpc.leader_id,
+           Term, CurTerm]),
+    {follower, update_term(Term, clear_leader_id(State)),


I think we need to call ra_snapshot:abort_accept/1 here?

kjnilsson · 2024-09-12T09:35:44Z

I think we should also change how the state_timeout is reset, so that it is not reset by any message that isn't snapshot related. That is, the action should only be emitted when an `#install_snapshot_rpc{}` message is received. I think it would be fine to check the message type in `ra_server_proc`, at least for now.
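
A minimal sketch of such a check, under stated assumptions: the module and the `timeout_actions/2` helper are hypothetical, and the record below is a local stand-in for ra's `#install_snapshot_rpc{}`. Only the `{state_timeout, Time, EventContent}` action shape is standard `gen_statem`.

```erlang
-module(snapshot_timer_sketch).
-export([timeout_actions/2]).

%% Local stand-in for ra's record, declared so the sketch compiles alone.
-record(install_snapshot_rpc, {term, leader_id, meta, chunk_state, data}).

%% Only a snapshot chunk counts as progress, so only it (re)arms the
%% receive_snapshot state timeout.
timeout_actions(#install_snapshot_rpc{}, Timeout) ->
    [{state_timeout, Timeout, receive_snapshot_timeout}];
%% Every other message leaves the running timer untouched, so it can
%% still fire if the snapshot sender has gone away.
timeout_actions(_OtherMsg, _Timeout) ->
    [].
```

The returned list would be appended to the actions of the state callback; an empty list means the previously armed timer keeps counting down.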

kjnilsson

Just a formatting change.

src/ra_server_proc.erl

michaelklishin · 2024-09-12T15:22:18Z

The RabbitMQ OCI build failure is due to the fact that external contributions do not have access to GitHub secrets. It has nothing to do with the code changes.

the-mikedavis · 2024-09-12T15:27:22Z

Thanks!

kjnilsson self-requested a review September 11, 2024 19:00

the-mikedavis reviewed Sep 11, 2024

kjnilsson reviewed Sep 12, 2024

keynslug added 2 commits September 12, 2024 13:20

ra_server: abort snapshot accept before turning into follower

2a31c1c

keynslug force-pushed the main branch from 559b77a to bd78f98 Compare September 12, 2024 11:47

kjnilsson approved these changes Sep 12, 2024

keynslug added 2 commits September 12, 2024 16:38

ra_server_proc: reset receive_snapshot timer only on progress

7851f91

Otherwise, a few other events and RPCs, even when not explicitly handled, can inadvertently cause the `receive_snapshot` timeout to reset. As a result, `ra_server_proc` can become stuck in that state under certain circumstances.

ra_server: test higher-term AERs abort receive snapshot

5174d50

keynslug force-pushed the main branch from bd78f98 to 5174d50 Compare September 12, 2024 14:40

the-mikedavis approved these changes Sep 12, 2024

the-mikedavis merged commit f6ab5b9 into rabbitmq:main Sep 12, 2024
7 of 8 checks passed

the-mikedavis added this to the 2.14.1 milestone Sep 12, 2024
