
Better handling of fscoordinator communication on notification and clearing of rt_dirty #304

Open
jcapricebasho opened this issue Jun 12, 2013 · 5 comments


@jcapricebasho
Contributor

(All links and line refs are to 1.3.1 tag)

There are two areas in the riak_repl code that could result in the fscoordinator not knowing about a node with an rt_dirty count > 0.

The first is in the rt_dirty function of riak_repl_stats [0]. Once a node is marked as dirty, the notification sent to the fscoordinator may fail. Currently, this failure is logged only at the debug level:

lager:debug("Failed to notify coordinator of rt_dirty status")

However, logging this event at the warning level would make it visible at lager's info level, which would make it easier to diagnose the issue and to instruct clients to manually clear the rt_dirty flag using riak_repl_stats:clear_rt_dirty().
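A minimal sketch of what that branch could look like (the try/catch shape is an assumption based on the description above; riak_repl2_fscoordinator:node_dirty/1 is the notification call suggested later in this thread):

    %% Sketch: in riak_repl_stats:rt_dirty/0, log the notification
    %% failure at warning so it surfaces at lager's info level.
    %% The try/catch structure here is an assumption.
    try
        riak_repl2_fscoordinator:node_dirty(node())
    catch
        Class:Reason ->
            lager:warning("Failed to notify coordinator of rt_dirty"
                          " status on ~p: ~p:~p",
                          [node(), Class, Reason])
    end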


The second is in the notify_rt_dirty_nodes function of riak_repl2_fscoordinator [1]. After a successful fullsync, the multicall on line 723:

rpc:multicall(NodesToNotify, riak_repl_stats, clear_rt_dirty, []),

does not track the results of the multicall. Immediately after the multicall, the list of dirty nodes is cleared:

State#state{dirty_nodes=ordsets:new()};

If there is an error during the multicall, or a node is unreachable for some reason, the rt_dirty flag is not cleared on the remote node and the coordinator no longer considers it dirty.

Tracking these failures, and keeping the affected nodes in the list of dirty nodes, would let the coordinator retry clearing the rt_dirty counter on the next fullsync.
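A minimal sketch of how the multicall results could be retained (names other than those quoted above, such as the warning text, are assumptions):

    %% rpc:multicall/4 returns {Results, BadNodes}; BadNodes lists the
    %% nodes that could not be reached. Keeping them in dirty_nodes
    %% lets the coordinator retry the clear after the next fullsync.
    %% (Errors raised on reachable nodes land in Results without node
    %% attribution; tracking those would require per-node rpc:call.)
    {_Results, BadNodes} =
        rpc:multicall(NodesToNotify, riak_repl_stats, clear_rt_dirty, []),
    case BadNodes of
        [] -> ok;
        _  -> lager:warning("Failed to clear rt_dirty on ~p", [BadNodes])
    end,
    State#state{dirty_nodes = ordsets:from_list(BadNodes)};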


Both of these scenarios leave nodes with an rt_dirty count that remains constant across fullsyncs (until the count is incremented again, at which point rt_dirty() will again attempt to update the fscoordinator).

@jcapricebasho
Contributor Author

basho/riak_ee-issues#7

@bookshelfdave
Contributor

Suggested by @joecaswell: re-sync dirty status with every fullsync

Before starting a fullsync:

    %% riak_core_ring:all_owners/1 returns {Index, Node} pairs, so
    %% extract and dedupe the node names before polling them.
    Nodes = lists:usort([N || {_Idx, N} <- riak_core_ring:all_owners(Ring)]),
    [ case rpc:call(Node, riak_repl_stats, is_rt_dirty, []) of
          false -> ok;
          %% {badrpc, _} also matches here, so an unreachable node is
          %% conservatively treated as dirty
          _ -> riak_repl2_fscoordinator:node_dirty(Node)
      end || Node <- Nodes ],

@cmeiklejohn
Contributor

Moving to 2.1.

@cmeiklejohn cmeiklejohn modified the milestones: 2.1, 2.0 Mar 24, 2014
@bookshelfdave bookshelfdave removed their assignment Mar 27, 2014
@engelsanchez
Contributor

While reviewing the fullsync coordinator code as a group, we noticed this problem with the node dirty notifications. It does seem we can do better than our current approach in the case of temporary network partitions, and the failure should definitely be logged at least at the warning level.

@engelsanchez
Contributor

#389 has relevant changes to warn when propagation of rt_dirty node information fails.
