Race condition causes fullsync replication to never complete #599
Description
There is a race condition between partition ownership handoff and fullsync replication that can cause fullsync to hang indefinitely when using the keylist strategy. The race occurs when ownership handoff for a partition begins, fullsync replication then begins for that same partition before the handoff has completed, and the handoff completes before the fullsync process has fully initiated the building of keylists. When handoff completes for the partition and the vnode process shuts down, the `riak_repl_fullsync_helper` process is left permanently stuck in a `receive` call, and the associated `riak_repl_keylist_server` or `riak_repl_keylist_client` process is left perpetually waiting for a response from the `riak_repl_fullsync_helper` process that never comes.

The reason this happens can be seen on these two lines. The vnode process shuts down normally after handoff completes, and the guard explicitly ignores the resulting monitor (`DOWN`) notification, leaving the process waiting in the `receive`.
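For illustration, here is a minimal sketch of the shape of the problem. The function and message names are hypothetical, not the actual `riak_repl_fullsync_helper` code: a `receive` whose only `'DOWN'` clause guards out normal exits has nothing left to wait for once the vnode has shut down after handoff.

```erlang
%% Illustrative sketch only, not the riak_repl code: a fold worker on the
%% vnode is monitored, and the receive deliberately ignores a 'DOWN'
%% notification whose reason is 'normal'. Once the vnode shuts down after
%% handoff, no other message will ever arrive, so the receive blocks forever.
wait_for_keylist(MonRef, WorkerPid) ->
    receive
        {WorkerPid, keylist_built, KeyList} ->
            {ok, KeyList};
        {'DOWN', MonRef, process, WorkerPid, Reason} when Reason =/= normal ->
            %% Abnormal worker exits are reported, but a normal exit (the
            %% vnode shutting down after handoff completes) never matches
            %% this clause and is left sitting in the mailbox.
            {error, Reason}
    end.
```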
Observations
I first observed this issue while reviewing and testing #590 (relevant comment here).
Using riak_test I was able to trigger the same race condition and examine the state of some of the relevant fullsync processes. After seeing repeated `riak_repl2_fscoordinator:refresh_stats_worker:866 Gathering source data for` messages on a source cluster node, the following logging on a sink cluster node led me to the first process for examination:

Examination of the process state of `<0.3894.0>` led me to conclude that it was in a healthy state and waiting to receive a message from another process:

Looking at the links for the process led me to process `<0.3895.0>`:

The process state for `<0.3895.0>` was as follows:

It also seemed to be waiting on another process. Its links were:

`{links,[<0.3894.0>,<0.3896.0>,#Port<0.21134>]}`

Process `<0.3896.0>` turned out to be the `riak_repl_fullsync_helper` process, and its `process_info` output revealed the key information:

The `DOWN` message sitting in the mailbox represented a message from the vnode that had completed handoff and shut down with a reason of `normal`.

In this particular case the problem was on the sink cluster, but I have observed it happen on the source cluster as well.
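For anyone retracing this, the same inspection can be done from a shell attached to the stuck node; a minimal sketch, with the pids above standing in for whatever the logs point at:

```erlang
%% From a remote shell on the affected node (e.g. via riak attach). The pids
%% are the ones from this particular run; substitute the ones your logs show.
P1 = pid(0, 3894, 0).
erlang:process_info(P1, [current_function, status, message_queue_len, links]).
%% Follow the links to the next process and repeat. For the
%% riak_repl_fullsync_helper process, dumping the mailbox shows the ignored
%% 'DOWN' message from the vnode:
erlang:process_info(pid(0, 3896, 0), messages).
```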
I have also found evidence of this failure on giddyup. Here is a console.log file that shows the same characteristic log messages associated with this issue.
Reproduction
I have had success triggering this problem using the `verify_counter_repl` riak_test. Running a loop of 20 iterations as follows has never failed to trigger the race:

The failure manifests as the test getting stuck waiting for a fullsync completion and eventually timing out. At least one of the console.log files should contain `riak_repl2_fscoordinator:refresh_stats_worker:866 Gathering source data for` messages pertaining to the same partition repeated near the end of the file.

Resolution
The resolution is to remove the guard that excludes the monitor notifications resulting from normal vnode shutdown. Additionally, the error reason used to exit the fold worker process must be changed to indicate that, even though the vnode shut down normally, the situation should be treated as an error condition from the perspective of fullsync replication and the `riak_repl_fullsync_helper` process.
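A sketch of the shape of the fix, using the same illustrative names as the earlier example (not the actual patch): match `'DOWN'` for any reason and turn a normal vnode shutdown into an explicit error, so the keylist server/client is told the fold failed instead of waiting forever.

```erlang
%% Illustrative sketch of the fix: the guard is gone, and a normal vnode
%% shutdown is surfaced as an error from the perspective of fullsync.
wait_for_keylist(MonRef, WorkerPid) ->
    receive
        {WorkerPid, keylist_built, KeyList} ->
            {ok, KeyList};
        {'DOWN', MonRef, process, WorkerPid, normal} ->
            %% The vnode exited cleanly after handoff, but the keylist was
            %% never built; report an error so fullsync can retry the
            %% partition instead of hanging.
            {error, vnode_shutdown};
        {'DOWN', MonRef, process, WorkerPid, Reason} ->
            {error, Reason}
    end.
```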