// We will regard a adding peer as caught up if the margin between the
// last_log_index of this peer and the last_log_index of leader is less than
// |catchup_margin|
//
// Default: 1000
private int catchupMargin = 1000;
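Background and symptoms:
This is similar to the issue I filed earlier ([https://github.com//issues/830]). Our newly started follower has a very large number of log entries to catch up on, on the order of hundreds of millions.
Appending logs can block the jraft log disruptor, which causes append log requests to time out.
Eventually, in the check in the code below, because the append log request has been timing out for longer than ElectionTimeoutMs, the add peer task fails, the replicator stops, and the follower can never join the cluster. (We use the default ElectionTimeoutMs of 1000ms.)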
sofa-jraft/jraft-core/src/main/java/com/alipay/sofa/jraft/core/NodeImpl.java
Line 2174 in 1d57d08
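To make the failure mode concrete, here is a simplified, self-contained model of the decision described above. It is not the code at the referenced line of NodeImpl.java; the class name, the enum, and the `stalledMs` parameter (how long the peer has gone without a successful append) are hypothetical names used only for illustration.

```java
// Simplified model of the catch-up decision described in this issue.
// NOT the actual jraft implementation; all names here are hypothetical.
final class CatchUpCheck {
    enum Decision { CAUGHT_UP, KEEP_WAITING, FAIL_ADD_PEER }

    // caughtUp:          the peer is within catchupMargin of the leader
    // stalledMs:         time since the peer last responded in time (assumption)
    // electionTimeoutMs: the configured election timeout (default 1000 per this issue)
    static Decision decide(boolean caughtUp, long stalledMs, long electionTimeoutMs) {
        if (caughtUp) {
            return Decision.CAUGHT_UP;
        }
        if (stalledMs <= electionTimeoutMs) {
            return Decision.KEEP_WAITING;
        }
        // One stall longer than the election timeout ends the whole add peer task.
        return Decision.FAIL_ADD_PEER;
    }

    public static void main(String[] args) {
        System.out.println(decide(false, 400, 1000));
        System.out.println(decide(false, 1500, 1000));
    }
}
```

Under this model, a follower that is still making progress but stalls once for more than ElectionTimeoutMs is treated the same as a dead peer, which matches the behavior reported here.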
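How to solve this problem
On rate limiting
It was suggested earlier to limit the concurrency of the leader's append log requests, so that the follower does not get blocked while catching up and heartbeats or append log requests are not affected.
I found that the following parameters have a significant effect on concurrency:
private int maxEntriesSize = 1024; // number of log entries per batch
private int maxReplicatorInflightMsgs = 256; // size of the pipeline in-flight queue
Would lowering these two parameters reduce append log concurrency? Is there a recommended configuration that maps a given TPS to specific values? (I found earlier that adjusting maxReplicatorInflightMsgs helps quite a bit, but setting it too low triggers "because of too many pending responses", after which the follower actively disconnects from the leader.)
I am currently stuck without such a best-practice configuration. Or is there a better way to configure this?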
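As a rough way to reason about these two parameters, the sketch below computes an upper bound on the number of log entries a single replicator could have in flight. It assumes every in-flight request carries a full batch of maxEntriesSize entries, which is a worst case; real batches may be smaller, and other limits may apply. The class and method names are hypothetical.

```java
// Back-of-the-envelope bound on in-flight entries per replicator.
// Assumption: each in-flight request carries a full batch.
final class InflightBound {
    static long maxInflightEntries(int maxEntriesSize, int maxReplicatorInflightMsgs) {
        // Widen to long before multiplying to avoid int overflow on large settings.
        return (long) maxEntriesSize * maxReplicatorInflightMsgs;
    }

    public static void main(String[] args) {
        // Defaults quoted in this issue: 1024 entries per batch, 256 in-flight messages.
        System.out.println(maxInflightEntries(1024, 256)); // 262144
        // Halving either parameter halves the bound.
        System.out.println(maxInflightEntries(1024, 128)); // 131072
    }
}
```

With the defaults quoted above, the bound is 262144 entries per replicator, so lowering either value shrinks the worst-case burst the follower has to absorb proportionally.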
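On the onCaughtUp timeout check
Is this check too sensitive? If add peer inherently requires catching up on a large number of logs, the check greatly increases the probability that add peer fails. Even if a few append log timeouts occur along the way, as long as the log eventually catches up, add peer should still be able to succeed, right? Is there a way to address this?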
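One way to picture the more tolerant behavior asked about here is a check based on progress rather than on a single timeout. The sketch below is purely hypothetical and is not an existing jraft option or API: it keeps the add peer attempt alive as long as the peer's last log index keeps advancing, and gives up only after the index has not moved for `maxStallMs`.

```java
// Hypothetical progress-aware catch-up policy; not part of jraft.
final class ProgressAwareCatchUp {
    private final long maxStallMs;   // how long the peer may go without any progress
    private long lastIndex;          // highest peer log index seen so far
    private long lastProgressMs;     // time at which lastIndex last advanced

    ProgressAwareCatchUp(long maxStallMs, long startIndex, long startMs) {
        this.maxStallMs = maxStallMs;
        this.lastIndex = startIndex;
        this.lastProgressMs = startMs;
    }

    // Returns true if the add peer attempt should be kept alive.
    boolean observe(long peerLastLogIndex, long nowMs) {
        if (peerLastLogIndex > lastIndex) {
            // The peer advanced, so earlier timeouts are forgiven.
            lastIndex = peerLastLogIndex;
            lastProgressMs = nowMs;
            return true;
        }
        return nowMs - lastProgressMs <= maxStallMs;
    }

    public static void main(String[] args) {
        ProgressAwareCatchUp policy = new ProgressAwareCatchUp(1000, 0, 0);
        System.out.println(policy.observe(100, 5000)); // advanced after a long gap
        System.out.println(policy.observe(100, 7000)); // no progress for 2000 ms
    }
}
```

The design choice in this sketch is that a slow but advancing follower is never failed, while a follower that has truly stopped is still detected within `maxStallMs`.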
Thanks