fix: reset waitId unexpectedly #847

Merged
merged 2 commits into master from fix/waitmeta-leak on Jun 20, 2022

Conversation

killme2008 (Contributor) commented Jun 18, 2022

Motivation:

Trying to fix #838 and #842.

Right now, when there are no more logs to replicate, the replicator waits for new logs by invoking the LogManager#wait(expectLogIndex, closure, threadId) method. The log manager then creates a WaitMeta, saves it in the in-memory map waitMap, and returns a valid waitId (starting from one) to the replicator. The replicator checks whether waitId >= 0 to avoid waiting more than once, so in the normal case there is at most one WaitMeta instance per replicator.
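To make that mechanism concrete, here is a minimal sketch of the register-and-notify pattern described above. It is not jraft's actual LogManagerImpl; the class name SimpleLogWaiter, the callback interface, and the field names are illustrative assumptions.

import java.util.HashMap;
import java.util.Map;

// Minimal sketch of the waitId / WaitMeta pattern described above.
// Not jraft's LogManagerImpl; all names here are illustrative.
public class SimpleLogWaiter {

    /** Callback invoked when logs beyond the expected index arrive. */
    public interface NewLogCallback {
        void onNewLog(Object arg, int errorCode);
    }

    /** Book-keeping for one pending wait, held in the in-memory waitMap. */
    private static class WaitMeta {
        final long expectedLogIndex;
        final NewLogCallback callback;
        final Object arg;

        WaitMeta(long expectedLogIndex, NewLogCallback callback, Object arg) {
            this.expectedLogIndex = expectedLogIndex;
            this.callback = callback;
            this.arg = arg;
        }
    }

    private long nextWaitId = 1; // waitId starts from one; -1 means "not waiting"
    private final Map<Long, WaitMeta> waitMap = new HashMap<>();

    /** Register a waiter and return its waitId; the replicator stores this id. */
    public synchronized long wait(long expectedLogIndex, NewLogCallback cb, Object arg) {
        final long waitId = nextWaitId++;
        waitMap.put(waitId, new WaitMeta(expectedLogIndex, cb, arg));
        return waitId;
    }

    /** Called when new log entries are appended: notify every matching waiter. */
    public synchronized void onNewLogs(long newLastLogIndex) {
        waitMap.values().removeIf(wm -> {
            if (newLastLogIndex >= wm.expectedLogIndex) {
                wm.callback.onNewLog(wm.arg, 0); // in jraft this runs as a thread task
                return true;
            }
            return false;
        });
    }
}

As long as the replicator only calls wait when its stored waitId is negative, each replicator keeps at most one live WaitMeta in the map.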

But when a user submits large log entries to the jraft node, bolt's connection check may fail because of netty's write-buffer limitation. Checking the bolt log ~/logs/bolt/remoting-rpc.log shows:

2022-06-18 20:26:15,967 WARN  [?#] [Append-Entries-Thread-Send1] check failed. address: 192.168.165.240:5004, connection: com.alipay.remoting.Connection@d36d504
com.alipay.remoting.exception.RemotingException: Check connection failed for address: Origin url [192.168.165.240:5004], Unique key [192.168.165.240:5004]., maybe write overflow!
	at com.alipay.remoting.DefaultConnectionManager.check(DefaultConnectionManager.java:363)
	at com.alipay.remoting.rpc.RpcClient.checkConnection(RpcClient.java:447)
	at com.alipay.sofa.jraft.rpc.impl.BoltRpcClient.checkConnection(BoltRpcClient.java:79)
	at com.alipay.sofa.jraft.rpc.impl.AbstractClientService.checkConnection(AbstractClientService.java:90)
	at com.alipay.sofa.jraft.rpc.impl.core.DefaultRaftClientService.appendEntries(DefaultRaftClientService.java:119)
	at com.alipay.sofa.jraft.core.Replicator.sendEntries(Replicator.java:1681)
	at com.alipay.sofa.jraft.core.Replicator.sendEntries(Replicator.java:1595)
	at com.alipay.sofa.jraft.core.Replicator.onRpcReturned(Replicator.java:1366)

Then the replicator blocks for a short while and resumes replication after the timeout:

private static boolean onAppendEntriesReturned(...) {
    ....
    if (!status.isOk()) {
        // If the follower crashes, any RPC to the follower fails immediately,
        // so we need to block the follower for a while instead of looping until
        // it comes back or is removed
        // dummy_id is unlocked in block
        if (isLogDebugEnabled) {
            sb.append(" fail, sleep, status=") //
                .append(status);
            LOG.debug(sb.toString());
        }
        notifyReplicatorStatusListener(r, ReplicatorEvent.ERROR, status);
        if ((r.consecutiveErrorTimes++) % 10 == 0) {
            LOG.warn("Fail to issue RPC to {}, consecutiveErrorTimes={}, error={}", r.options.getPeerId(),
                r.consecutiveErrorTimes, status);
        }
        r.resetInflights();
        r.setState(State.Probe);
        // id is unlocked in block
        r.block(startTimeMs, status.getCode());
        return false;
    }
    ....

After the block timeout, it resets the waitId and sends entries again:

static boolean continueSending(final ThreadId id, final int errCode) {
    ....
    r.waitId = -1; // reset the waitId
    if (errCode == RaftError.ETIMEDOUT.getNumber()) {
        r.blockTimer = null;
        // Send empty entries after block timeout to check the correct
        // _next_index, otherwise the replicator is likely to wait in
        // _wait_more_entries and no further logs would be replicated even if the
        // last_index of this follower is less than |next_index - 1|
        r.sendProbeRequest();
    } else if (errCode != RaftError.ESTOP.getNumber()) {
        // id is unlock in _send_entries
        r.sendEntries();
    }
    ....
}

So the waitId is reset unexpectedly, and too many WaitMeta instances are created in LogManager. When new log entries are appended, LogManager notifies each WaitMeta closure in a thread task, which may make the closure thread pool too busy and throw a RejectedExecutionException.
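A simplified, self-contained illustration of that failure mode (the pool sizes and queue bound below are made up, not jraft's actual closure-executor configuration): once thousands of stale WaitMeta notifications are submitted to a bounded executor, the queue fills up and execute throws RejectedExecutionException.

import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class WaitMetaFlood {
    public static void main(String[] args) {
        // A small, bounded pool standing in for the closure executor.
        ThreadPoolExecutor closurePool = new ThreadPoolExecutor(
            2, 2, 60, TimeUnit.SECONDS, new ArrayBlockingQueue<>(100));

        // Pretend thousands of stale WaitMeta entries accumulated because the
        // waitId was reset on every block timeout.
        int leakedWaitMetas = 10_000;
        try {
            for (int i = 0; i < leakedWaitMetas; i++) {
                // LogManager would submit one notification task per WaitMeta.
                closurePool.execute(() -> {
                    try {
                        Thread.sleep(10); // simulate closure work
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                });
            }
        } catch (RejectedExecutionException e) {
            // With a bounded queue, the flood of notifications is rejected.
            System.out.println("Closure pool overwhelmed: " + e);
        } finally {
            closurePool.shutdown();
        }
    }
}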

Modification:

Reset the waitId only right before sending entries, and log the error status the first time it occurs.
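In other words, the unconditional reset at the top of continueSending moves into the branch that actually calls sendEntries, so a block timeout followed by a probe no longer discards the waitId of an already-registered WaitMeta. A rough sketch of the intent, reusing the names from the excerpt above (not the exact merged change):

static boolean continueSending(final ThreadId id, final int errCode) {
    // ... lock handling elided ...
    if (errCode == RaftError.ETIMEDOUT.getNumber()) {
        // Block timeout: probe first and keep the current waitId, so the
        // replicator does not register another WaitMeta in LogManager.
        r.blockTimer = null;
        r.sendProbeRequest();
    } else if (errCode != RaftError.ESTOP.getNumber()) {
        // Reset the waitId only right before actually sending entries.
        r.waitId = -1;
        // id is unlock in _send_entries
        r.sendEntries();
    }
    // ... remaining handling elided ...
}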

Result:

Fixes #838 #842

TODO

  • Guideline for submitting large log entries.

@killme2008 killme2008 changed the title fix: waitId may case memory leak when replicator blocks on network is… fix: waitId may cause memory leak Jun 18, 2022
@killme2008 killme2008 changed the title fix: waitId may cause memory leak fix: waitMeta leak Jun 18, 2022
@killme2008 killme2008 changed the title fix: waitMeta leak fix: reset waitId unexpectedly Jun 18, 2022
fengjiachun (Contributor) left a comment:
LGTM

@fengjiachun fengjiachun merged commit aec90ee into master Jun 20, 2022
@fengjiachun fengjiachun deleted the fix/waitmeta-leak branch June 20, 2022 03:24
@fengjiachun fengjiachun mentioned this pull request Jun 20, 2022
Successfully merging this pull request may close these issues.

线程池满问题排查 (Troubleshooting the thread-pool-full problem)