Race condition causes fullsync replication to never complete #599

Merged: kellymclaughlin merged 1 commit into develop from bugfix/fullsync-stall-during-handoff on Jun 6, 2014

Conversation

@kellymclaughlin (Contributor)

Description

There is a race condition between partition ownership handoff and fullsync replication that can cause fullsync to hang indefinitely when using the keylist strategy. The race occurs when ownership handoff for a partition begins, fullsync replication then starts for that same partition before the handoff has completed, and the handoff finishes before the fullsync process has fully initiated building its keylists. When handoff completes and the vnode process shuts down, the riak_repl_fullsync_helper process is left permanently stuck in a receive call, and the associated riak_repl_keylist_server or riak_repl_keylist_client process is left perpetually waiting for a response from the riak_repl_fullsync_helper process that will never arrive.

The reason this happens can be seen on these two lines: the vnode process shuts down normally after handoff completes, and the guard explicitly ignores such a message, leaving the riak_repl_fullsync_helper process waiting in the receive.
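For illustration, the shape of the problem is roughly the following (a hedged sketch with made-up names, not the actual riak_repl_fullsync_helper source). The helper monitors the vnode but only matches 'DOWN' notifications carrying a non-normal reason, so a 'DOWN' with reason normal fails the guard, stays in the mailbox, and the receive never returns:

%% Hedged sketch of the failure mode; MonRef, VNodePid and FoldRef are
%% illustrative names, not the actual riak_repl_fullsync_helper code.
MonRef = erlang:monitor(process, VNodePid),
receive
    {FoldRef, keylist_built} ->
        %% expected completion path
        ok;
    {'DOWN', MonRef, process, VNodePid, Reason} when Reason =/= normal ->
        %% only abnormal vnode exits are handled; a 'DOWN' with reason
        %% normal fails the guard, is left in the mailbox, and the
        %% receive blocks forever
        {error, Reason}
end.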

Observations

I first observed this issue while reviewing and testing #590 (relevant comment here).

Using riak_test I was able to trigger the same race condition and examine the state of some of the relevant fullsync processes. After seeing repeated riak_repl2_fscoordinator:refresh_stats_worker:866 Gathering source data for messages on a source cluster node, the following logging on a sink cluster node led me to the first process for examination:

2014-06-05 12:07:14.962 [info] <0.3761.0>@riak_core_handoff_sender:start_fold:152 Starting ownership_transfer transfer of riak_kv_vnode from 'dev4@127.0.0.1' 1118962191081472546749696200048404186924073353216 to 'dev5@127.0.0.1' 1118962191081472546749696200048404186924073353216
2014-06-05 12:07:14.965 [info] <0.3760.0>@riak_core_handoff_sender:start_fold:152 Starting ownership_transfer transfer of riak_kv_vnode from 'dev4@127.0.0.1' 1141798154164767904846628775559596109106197299200 to 'dev6@127.0.0.1' 1141798154164767904846628775559596109106197299200
2014-06-05 12:07:15.118 [info] <0.3760.0>@riak_core_handoff_sender:start_fold:242 ownership_transfer transfer of riak_kv_vnode from 'dev4@127.0.0.1' 1141798154164767904846628775559596109106197299200 to 'dev6@127.0.0.1' 1141798154164767904846628775559596109106197299200 completed: sent 494.00 B bytes in 1 of 1 objects in 0.14 seconds (3.47 KB/second)
2014-06-05 12:07:15.118 [info] <0.3761.0>@riak_core_handoff_sender:start_fold:242 ownership_transfer transfer of riak_kv_vnode from 'dev4@127.0.0.1' 1118962191081472546749696200048404186924073353216 to 'dev5@127.0.0.1' 1118962191081472546749696200048404186924073353216 completed: sent 494.00 B bytes in 1 of 1 objects in 0.14 seconds (3.46 KB/second)
2014-06-05 12:07:15.160 [info] <0.3889.0>@riak_core_service_mgr:try_ssl:358 "B" and "A" agreed to not use SSL
2014-06-05 12:07:15.163 [info] <0.3894.0>@riak_repl_keylist_client:wait_for_fullsync:94 Full-sync with site "A" starting; 1 partitions.
2014-06-05 12:07:15.163 [info] <0.3894.0>@riak_repl_keylist_client:request_partition:127 Full-sync with site "A"; starting fullsync for 1118962191081472546749696200048404186924073353216
2014-06-05 12:07:15.163 [info] <0.3894.0>@riak_repl_keylist_client:request_partition:133 Full-sync with site "A"; building keylist for 1118962191081472546749696200048404186924073353216, 0 remain

Examination of the process state of <0.3894.0> led me to conclude that it was in a healthy state and waiting to receive a message from another process:

(dev4@127.0.0.1)4> rp(sys:get_status(pid(0,3894,0))).
{status,<0.3894.0>,
        {module,gen_fsm},
        [[{'$ancestors',[<0.3893.0>,riak_repl2_fssink_sup,
                         riak_repl_sup,<0.748.0>]},
          {'$initial_call',{riak_repl_keylist_client,init,1}}],
         running,<0.3893.0>,[],
         [{header,"Status for state machine <0.3894.0>"},
          {data,[{"Status",running},
                 {"Parent",<0.3893.0>},
                 {"Logged events",[]},
                 {"StateName",request_partition}]},
          {data,[{"StateData",
                  #state{sitename = "A",socket = #Port<0.21130>,
                         transport = ranch_tcp,
                         work_dir = "./data/riak_repl/work/83238997/A-127.0.0.1:10046-127.0.0.1:62346",
                         partitions = [],
                         partition = 1118962191081472546749696200048404186924073353216,
                         kl_fn = "./data/riak_repl/work/83238997/A-127.0.0.1:10046-127.0.0.1:62346/1118962191081472546749696200048404186924073353216.ours.sterm",
                         kl_fh = undefined,kl_pid = <0.3895.0>,
                         kl_ref = #Ref<0.0.0.19915>,kl_ack_freq = 5,
                         kl_counter = undefined,our_kl_ready = false,
                         their_kl_ready = true,
                         stage_start = {1401,991635,163854},
                         partition_start = {1401,991635,163854},
                         skipping = false}}]}]]}

Looking at the links for the process led me to process <0.3895.0>:

{links,[<0.3893.0>,<0.3895.0>]},

The process state for <0.3895.0> was as follows:

(dev4@127.0.0.1)8> rp(sys:get_status(pid(0,3895,0))).
{status,<0.3895.0>,
        {module,riak_core_gen_server},
        [[{'$ancestors',[<0.3894.0>,<0.3893.0>,
                         riak_repl2_fssink_sup,riak_repl_sup,<0.748.0>]},
          {'$initial_call',{gen,init_it,6}}],
         running,<0.3894.0>,[],
         [{header,"Status for generic server <0.3895.0>"},
          {data,[{"Status",running},
                 {"Parent",<0.3894.0>},
                 {"Logged events",[]},
                 {"Queued messages",[]}]},
          {data,[{"State",
                  #state{transport = <0.3894.0>,socket = #Ref<0.0.0.19915>,
                         cluster = <0.3896.0>,
                         fullsync_worker = {file_descriptor,prim_file,
                                                            {#Port<0.21134>,61}},
                         work_dir = undefined,
                         strategy = "./data/riak_repl/work/83238997/A-127.0.0.1:10046-127.0.0.1:62346/1118962191081472546749696200048404186924073353216.ours.sterm",
                         proto = [],ver = 0}}]}]]}

It also seemed to be waiting on another process. Its links were: {links,[<0.3894.0>,<0.3896.0>,#Port<0.21134>]}

Process <0.3896.0> turned out to be the riak_repl_fullsync_helper process and its process_info output revealed the key information:

(dev4@127.0.0.1)10> rp(process_info(pid(0,3896,0))).
[{current_function,{riak_repl_fullsync_helper,'-handle_call/3-fun-1-',
                                              4}},
 {initial_call,{erlang,apply,2}},
 {status,waiting},
 {message_queue_len,1},
 {messages,[{'DOWN',#Ref<0.0.0.19921>,process,<0.1516.0>,
                    normal}]},
 {links,[<0.3895.0>]},
 {dictionary,[]},
 {trap_exit,false},
 {error_handler,error_handler},
 {priority,normal},
 {group_leader,<0.747.0>},
 {total_heap_size,376},
 {heap_size,376},
 {stack_size,10},
 {reductions,62},
 {garbage_collection,[{min_bin_vheap_size,46422},
                      {min_heap_size,233},
                      {fullsweep_after,0},
                      {minor_gcs,0}]},
 {suspending,[]}]

The 'DOWN' message sitting in the mailbox came from the vnode that had completed handoff and shut down with reason normal.
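As a minimal, standalone illustration (not taken from the issue): a monitored process that exits normally still delivers a 'DOWN' notification, and its reason is the atom normal:

%% spawn_monitor/1 spawns the fun and monitors it atomically; the fun
%% returns immediately, so the process exits with reason normal.
{Pid, Ref} = spawn_monitor(fun() -> ok end),
receive
    {'DOWN', Ref, process, Pid, Reason} ->
        Reason  %% Reason =:= normal
end.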

In this particular case the problem was on the sink cluster, but I have observed it happen on the source cluster as well.

I have also found evidence of this failure on giddyup. Here is a console.log file that shows the same characteristic log messages associated with this issue.

Reproduction

I have had success triggering this problem using the verify_counter_repl riak_test. Running a loop of 20 iterations as follows has never failed to trigger the race:

for i in {1..20}; do ./riak_test -c riak_ee -t verify_counter_repl | tee verify-counter-$i.out; done

The failure manifests as the test getting stuck waiting for fullsync to complete and eventually timing out. At least one of the console.log files should contain repeated riak_repl2_fscoordinator:refresh_stats_worker:866 Gathering source data for messages for the same partition near the end of the file.

Resolution

The resolution is to remove the guard that excludes the monitor notifications resulting from normal vnode shutdown. Additionally, the error reason used to exit the fold worker process must be changed to indicate that, even though the vnode shut down normally, the situation should be treated as an error condition from the perspective of fullsync replication and the riak_repl_fullsync_helper process.
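A hedged sketch of the intent of the change (illustrative only, not the exact diff in 77dd766): without the non-normal guard, a clean vnode shutdown is consumed from the mailbox, and the fold worker exits with a non-normal reason so the keylist server or client sees the partition as failed rather than waiting forever:

%% Hedged sketch; the names and the exit reason are illustrative.
receive
    {FoldRef, keylist_built} ->
        ok;
    {'DOWN', MonRef, process, VNodePid, _Reason} ->
        %% even a normal vnode shutdown (for example after ownership
        %% handoff) is an error from fullsync's point of view
        exit({vnode_exited_before_fold_completed, VNodePid})
end.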

Fix a race condition between partition ownership handoff and fullsync
replication that can cause fullsync to hang indefinitely when using
the keylist strategy. The race occurs when the process of ownership
handoff for a partition begins, then fullsync replication begins for
that same partition before the ownership handoff has completed, and
then handoff completes before the fullsync process has fully initiated
the building of keylists. When handoff completes for the partition and
the vnode process shuts down, the riak_repl_fullsync_helper process is
left permanently stuck in a receive call and the associated
riak_repl_keylist_server or riak_repl_keylist_client process is left
perpetually waiting for a response from the riak_repl_fullsync_helper
process that never comes.
@kellymclaughlin added this to the 2.0-RC milestone on Jun 6, 2014
@bsparrow435 (Contributor)

This looks good and passed all repl tests locally for me.

@bsparrow435 (Contributor)

👍

@bsparrow435 (Contributor)

👍 77dd766

borshop added a commit that referenced this pull request Jun 6, 2014
Race condition causes fullsync replication to never complete

Reviewed-by: bsparrow435
@kellymclaughlin (Contributor, Author)

@borshop merge

@borshop merged commit 77dd766 into develop on Jun 6, 2014
@kellymclaughlin deleted the bugfix/fullsync-stall-during-handoff branch on June 6, 2014 17:12
@cmeiklejohn (Contributor)

Thanks @kellymclaughlin.
