Race condition causes fullsync replication to never complete #599

Merged: kellymclaughlin merged 1 commit into develop from bugfix/fullsync-stall-during-handoff on Jun 6, 2014

Conversation

@kellymclaughlin (Contributor)

Description

There is a race condition between partition ownership handoff and fullsync replication that can cause fullsync to hang indefinitely when using the keylist strategy. The race occurs when ownership handoff for a partition begins, fullsync replication then starts for that same partition before the handoff has completed, and the handoff finishes before the fullsync process has fully initiated building its keylists. When handoff completes and the vnode process shuts down, the riak_repl_fullsync_helper process is left permanently stuck in a receive call, and the associated riak_repl_keylist_server or riak_repl_keylist_client process is left perpetually waiting for a response from the riak_repl_fullsync_helper process that will never arrive.

The reason this happens can be seen on these two lines: the vnode process shuts down normally after handoff completes, and the guard explicitly ignores such a message, leaving the riak_repl_fullsync_helper process waiting in the receive.
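For illustration, the shape of the problem is roughly the following (a hedged sketch with made-up names, not the actual riak_repl_fullsync_helper source). The helper monitors the vnode but only matches 'DOWN' notifications carrying a non-normal reason, so a 'DOWN' with reason normal fails the guard, stays in the mailbox, and the receive never returns:

%% Hedged sketch of the failure mode; MonRef, VNodePid and FoldRef are
%% illustrative names, not the actual riak_repl_fullsync_helper code.
MonRef = erlang:monitor(process, VNodePid),
receive
    {FoldRef, keylist_built} ->
        %% expected completion path
        ok;
    {'DOWN', MonRef, process, VNodePid, Reason} when Reason =/= normal ->
        %% only abnormal vnode exits are handled; a 'DOWN' with reason
        %% normal fails the guard, is left in the mailbox, and the
        %% receive blocks forever
        {error, Reason}
end.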

Observations

I first observed this issue while reviewing and testing #590 (relevant comment here).

Using riak_test I was able to trigger the same race condition and examine the state of some of the relevant fullsync processes. After seeing repeated riak_repl2_fscoordinator:refresh_stats_worker:866 Gathering source data for messages on a source cluster node, the following logging on a sink cluster node led me to the first process for examination:

2014-06-05 12:07:14.962 [info] <0.3761.0>@riak_core_handoff_sender:start_fold:152 Starting ownership_transfer transfer of riak_kv_vnode from 'dev4@127.0.0.1' 1118962191081472546749696200048404186924073353216 to 'dev5@127.0.0.1' 1118962191081472546749696200048404186924073353216
2014-06-05 12:07:14.965 [info] <0.3760.0>@riak_core_handoff_sender:start_fold:152 Starting ownership_transfer transfer of riak_kv_vnode from 'dev4@127.0.0.1' 1141798154164767904846628775559596109106197299200 to 'dev6@127.0.0.1' 1141798154164767904846628775559596109106197299200
2014-06-05 12:07:15.118 [info] <0.3760.0>@riak_core_handoff_sender:start_fold:242 ownership_transfer transfer of riak_kv_vnode from 'dev4@127.0.0.1' 1141798154164767904846628775559596109106197299200 to 'dev6@127.0.0.1' 1141798154164767904846628775559596109106197299200 completed: sent 494.00 B bytes in 1 of 1 objects in 0.14 seconds (3.47 KB/second)
2014-06-05 12:07:15.118 [info] <0.3761.0>@riak_core_handoff_sender:start_fold:242 ownership_transfer transfer of riak_kv_vnode from 'dev4@127.0.0.1' 1118962191081472546749696200048404186924073353216 to 'dev5@127.0.0.1' 1118962191081472546749696200048404186924073353216 completed: sent 494.00 B bytes in 1 of 1 objects in 0.14 seconds (3.46 KB/second)
2014-06-05 12:07:15.160 [info] <0.3889.0>@riak_core_service_mgr:try_ssl:358 "B" and "A" agreed to not use SSL
2014-06-05 12:07:15.163 [info] <0.3894.0>@riak_repl_keylist_client:wait_for_fullsync:94 Full-sync with site "A" starting; 1 partitions.
2014-06-05 12:07:15.163 [info] <0.3894.0>@riak_repl_keylist_client:request_partition:127 Full-sync with site "A"; starting fullsync for 1118962191081472546749696200048404186924073353216
2014-06-05 12:07:15.163 [info] <0.3894.0>@riak_repl_keylist_client:request_partition:133 Full-sync with site "A"; building keylist for 1118962191081472546749696200048404186924073353216, 0 remain

Examination of the process state of <0.3894.0> led me to conclude that it was in a healthy state and waiting to receive a message from another process:

(dev4@127.0.0.1)4> rp(sys:get_status(pid(0,3894,0))).
{status,<0.3894.0>,
        {module,gen_fsm},
        [[{'$ancestors',[<0.3893.0>,riak_repl2_fssink_sup,
                         riak_repl_sup,<0.748.0>]},
          {'$initial_call',{riak_repl_keylist_client,init,1}}],
         running,<0.3893.0>,[],
         [{header,"Status for state machine <0.3894.0>"},
          {data,[{"Status",running},
                 {"Parent",<0.3893.0>},
                 {"Logged events",[]},
                 {"StateName",request_partition}]},
          {data,[{"StateData",
                  #state{sitename = "A",socket = #Port<0.21130>,
                         transport = ranch_tcp,
                         work_dir = "./data/riak_repl/work/83238997/A-127.0.0.1:10046-127.0.0.1:62346",
                         partitions = [],
                         partition = 1118962191081472546749696200048404186924073353216,
                         kl_fn = "./data/riak_repl/work/83238997/A-127.0.0.1:10046-127.0.0.1:62346/1118962191081472546749696200048404186924073353216.ours.sterm",
                         kl_fh = undefined,kl_pid = <0.3895.0>,
                         kl_ref = #Ref<0.0.0.19915>,kl_ack_freq = 5,
                         kl_counter = undefined,our_kl_ready = false,
                         their_kl_ready = true,
                         stage_start = {1401,991635,163854},
                         partition_start = {1401,991635,163854},
                         skipping = false}}]}]]}

Looking at the links for the process led me to process <0.3895.0>:

{links,[<0.3893.0>,<0.3895.0>]},

The process state for <0.3895.0> was as follows:

(dev4@127.0.0.1)8> rp(sys:get_status(pid(0,3895,0))).
{status,<0.3895.0>,
        {module,riak_core_gen_server},
        [[{'$ancestors',[<0.3894.0>,<0.3893.0>,
                         riak_repl2_fssink_sup,riak_repl_sup,<0.748.0>]},
          {'$initial_call',{gen,init_it,6}}],
         running,<0.3894.0>,[],
         [{header,"Status for generic server <0.3895.0>"},
          {data,[{"Status",running},
                 {"Parent",<0.3894.0>},
                 {"Logged events",[]},
                 {"Queued messages",[]}]},
          {data,[{"State",
                  #state{transport = <0.3894.0>,socket = #Ref<0.0.0.19915>,
                         cluster = <0.3896.0>,
                         fullsync_worker = {file_descriptor,prim_file,
                                                            {#Port<0.21134>,61}},
                         work_dir = undefined,
                         strategy = "./data/riak_repl/work/83238997/A-127.0.0.1:10046-127.0.0.1:62346/1118962191081472546749696200048404186924073353216.ours.sterm",
                         proto = [],ver = 0}}]}]]}

It also seemed to be waiting on another process. Its links were: {links,[<0.3894.0>,<0.3896.0>,#Port<0.21134>]}

Process <0.3896.0> turned out to be the riak_repl_fullsync_helper process and its process_info output revealed the key information:

(dev4@127.0.0.1)10> rp(process_info(pid(0,3896,0))).
[{current_function,{riak_repl_fullsync_helper,'-handle_call/3-fun-1-',
                                              4}},
 {initial_call,{erlang,apply,2}},
 {status,waiting},
 {message_queue_len,1},
 {messages,[{'DOWN',#Ref<0.0.0.19921>,process,<0.1516.0>,
                    normal}]},
 {links,[<0.3895.0>]},
 {dictionary,[]},
 {trap_exit,false},
 {error_handler,error_handler},
 {priority,normal},
 {group_leader,<0.747.0>},
 {total_heap_size,376},
 {heap_size,376},
 {stack_size,10},
 {reductions,62},
 {garbage_collection,[{min_bin_vheap_size,46422},
                      {min_heap_size,233},
                      {fullsweep_after,0},
                      {minor_gcs,0}]},
 {suspending,[]}]

The 'DOWN' message sitting in the mailbox came from the vnode that had completed handoff and shut down with reason normal.
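As a minimal, standalone illustration (not taken from the issue): a monitored process that exits normally still delivers a 'DOWN' notification, and its reason is the atom normal:

%% spawn_monitor/1 spawns the fun and monitors it atomically; the fun
%% returns immediately, so the process exits with reason normal.
{Pid, Ref} = spawn_monitor(fun() -> ok end),
receive
    {'DOWN', Ref, process, Pid, Reason} ->
        Reason  %% Reason =:= normal
end.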

In this particular case the problem was on the sink cluster, but I have observed it happen on the source cluster as well.

I have also found evidence of this failure on giddyup. Here is a console.log file that shows the same characteristic log messages associated with this issue.

Reproduction

I have had success triggering this problem using the verify_counter_repl riak_test. Running a loop of 20 iterations as follows has never failed to trigger the race:

for i in {1..20}; do ./riak_test -c riak_ee -t verify_counter_repl | tee verify-counter-$i.out; done

The failure manifests as the test getting stuck waiting for fullsync to complete and eventually timing out. At least one of the console.log files should contain repeated riak_repl2_fscoordinator:refresh_stats_worker:866 Gathering source data for messages for the same partition near the end of the file.

Resolution

The resolution is to remove the guard that excludes the monitor notifications resulting from normal vnode shutdown. Additionally, the error reason used to exit the fold worker process must be changed to indicate that, even though the vnode shut down normally, the situation should be treated as an error condition from the perspective of fullsync replication and the riak_repl_fullsync_helper process.
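A hedged sketch of the intent of the change (illustrative only, not the exact diff in 77dd766): without the non-normal guard, a clean vnode shutdown is consumed from the mailbox, and the fold worker exits with a non-normal reason so the keylist server or client sees the partition as failed rather than waiting forever:

%% Hedged sketch; the names and the exit reason are illustrative.
receive
    {FoldRef, keylist_built} ->
        ok;
    {'DOWN', MonRef, process, VNodePid, _Reason} ->
        %% even a normal vnode shutdown (for example after ownership
        %% handoff) is an error from fullsync's point of view
        exit({vnode_exited_before_fold_completed, VNodePid})
end.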

Fix a race condition between partition ownership handoff and fullsync
replication that can cause fullsync to hang indefinitely when using
the keylist strategy. The race occurs when the process of ownership
handoff for a partition begins, then fullsync replication begins for
that same partition before the ownership handoff has completed, and
then handoff completes before the fullsync process has fully initiated
the building of keylists. When handoff completes for the partition and
the vnode process shuts down, the riak_repl_fullsync_helper process is
left permanently stuck in a receive call and the associated
riak_repl_keylist_server or riak_repl_keylist_client process is left
perpetually waiting for a response from the riak_repl_fullsync_helper
process that never comes.
@kellymclaughlin added this to the 2.0-RC milestone on Jun 6, 2014
@bsparrow435 (Contributor)

This looks good and passed all repl tests locally for me.

@bsparrow435 (Contributor)

👍

@bsparrow435 (Contributor)

👍 77dd766

borshop added a commit that referenced this pull request Jun 6, 2014
Race condition causes fullsync replication to never complete

Reviewed-by: bsparrow435
@kellymclaughlin (Contributor, Author)

@borshop merge

@borshop merged commit 77dd766 into develop on Jun 6, 2014
@kellymclaughlin deleted the bugfix/fullsync-stall-during-handoff branch on June 6, 2014 17:12
@cmeiklejohn (Contributor)

Thanks @kellymclaughlin.
