Append log timeout during add peer causes the add peer task to be dropped #865

Closed
jackjoesh opened this issue Jul 12, 2022 · 1 comment

Comments

jackjoesh commented Jul 12, 2022

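Background:
This problem is similar to the issue I filed earlier ([https://github.com//issues/830]). Our newly started follower has a lot of log to catch up on, on the order of hundreds of millions of entries.
Appending log can block the jraft log disruptor, which causes append log requests to time out.

Eventually, in the check shown in the code below, because the append log request has been timing out for longer than ElectionTimeoutMs, the add peer task fails, the replicator stops, and the follower can never join the cluster. (We use the default ElectionTimeoutMs of 1000ms.)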

&& Utils.monotonicMs() - this.replicatorGroup.getLastRpcSendTimestamp(peer) <= this.options
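
To make the effect of that condition concrete, here is a minimal self-contained sketch of its shape. It assumes the truncated right-hand side compares against the election timeout, as the description above suggests. The class and method names are hypothetical and are not part of jraft.

```java
// Hypothetical model of the quoted check; not jraft code.
class CatchUpRetryCheck {

    // Returns true when the last RPC to the peer was sent recently enough
    // (within electionTimeoutMs) for the peer to still count as responsive.
    // Assumption: the elided part of the quoted line is the election timeout.
    static boolean peerStillResponsive(long nowMs, long lastRpcSendTimestampMs, long electionTimeoutMs) {
        return nowMs - lastRpcSendTimestampMs <= electionTimeoutMs;
    }

    public static void main(String[] args) {
        long electionTimeoutMs = 1000; // default mentioned in this issue
        long nowMs = 50_000;

        // Last RPC sent 300 ms ago: within the timeout.
        System.out.println(peerStillResponsive(nowMs, nowMs - 300, electionTimeoutMs));

        // Replication stalled for 2500 ms: outside the timeout, so the check fails.
        System.out.println(peerStillResponsive(nowMs, nowMs - 2500, electionTimeoutMs));
    }
}
```

With a 1000ms timeout, any stall in sending longer than one second is enough to fail the check, regardless of how much log has already been replicated.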

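How to solve this problem

About throttling
It was suggested earlier to throttle the concurrency of the leader's append log, so that a follower that is catching up does not get blocked in a way that affects heartbeat or append log.
I found that the following parameters have a significant effect on concurrency:
private int maxEntriesSize = 1024; // number of log entries per batch
private int maxReplicatorInflightMsgs = 256; // size of the pipeline in-flight queue
Would lowering these two parameters reduce the concurrency of append log? Is there a recommended configuration that maps a given TPS to specific values? (I previously found that adjusting maxReplicatorInflightMsgs helps quite a bit, but if it is set too small it triggers "because of too many pending responses", and the follower then actively disconnects from the leader.)
Right now I am stuck without a best-practice configuration. Or is there a better way to configure this?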
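
As a rough way to reason about these two parameters, the sketch below computes an upper bound on the number of log entries one replicator could have in flight. It assumes every in-flight message carries a full batch and ignores any byte-size limits. The class and method names are hypothetical. It is an estimate for comparing settings, not a recommended configuration.

```java
// Hypothetical back-of-the-envelope helper; not jraft code.
class ReplicationWindow {

    // Upper bound on in-flight entries per replicator, assuming each of the
    // maxReplicatorInflightMsgs messages holds maxEntriesSize entries.
    static long maxInflightEntries(int maxEntriesSize, int maxReplicatorInflightMsgs) {
        return (long) maxEntriesSize * maxReplicatorInflightMsgs;
    }

    public static void main(String[] args) {
        // Defaults quoted in this issue: 1024 * 256.
        System.out.println(maxInflightEntries(1024, 256)); // prints 262144

        // An illustrative smaller setting: 256 * 64.
        System.out.println(maxInflightEntries(256, 64));   // prints 16384
    }
}
```

Lowering either value shrinks this bound proportionally, which is one way to compare candidate settings before testing them against the actual workload.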

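About the onCaughtUp timeout check
Is this check too sensitive? If add peer has to catch up on a lot of log, it greatly increases the probability that add peer fails. Even if a few append log timeouts occur along the way, as long as the log eventually catches up, add peer should still be able to succeed, right? Is there a way to solve this?

Thanks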

killme2008 commented Jul 13, 2022

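Is your snapshot interval too long? If throughput is high, consider shortening the snapshot interval, so that a newly added node can start faster by copying the snapshot directly and then replicating the remaining log.

That said, I have long wanted to make an optimization here: the timeout for newly added nodes could be made larger. In the meantime, you can adjust the catchupMargin parameter to set the catch-up margin. Increasing it appropriately avoids this situation. The cost is that the new node may not be able to serve normally for a short time, which the application can handle by avoiding access to the new node shortly after it is added.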

https://github.com/sofastack/sofa-jraft/blob/master/jraft-core/src/main/java/com/alipay/sofa/jraft/option/NodeOptions.java#L84

    // We will regard a adding peer as caught up if the margin between the
    // last_log_index of this peer and the last_log_index of leader is less than
    // |catchup_margin|
    //
    // Default: 1000
    private int                             catchupMargin          = 1000;
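
The margin rule described in that comment can be sketched as follows. This is a self-contained illustration under the assumption that the comparison is a plain "gap less than margin" test, as the comment states. The class and method names are hypothetical and are not jraft's implementation.

```java
// Hypothetical model of the catch-up margin rule; not jraft code.
class CatchUpMarginCheck {

    // A peer is regarded as caught up when the gap between the leader's
    // last log index and the peer's last log index is less than the margin.
    static boolean regardedAsCaughtUp(long leaderLastLogIndex, long peerLastLogIndex, int catchupMargin) {
        return leaderLastLogIndex - peerLastLogIndex < catchupMargin;
    }

    public static void main(String[] args) {
        long leaderLastLogIndex = 100_000_000L;
        long peerLastLogIndex = 99_995_000L; // 5000 entries behind

        // Default margin of 1000: the peer is not yet regarded as caught up.
        System.out.println(regardedAsCaughtUp(leaderLastLogIndex, peerLastLogIndex, 1000));

        // Illustrative larger margin of 10000: the same peer now qualifies.
        System.out.println(regardedAsCaughtUp(leaderLastLogIndex, peerLastLogIndex, 10_000));
    }
}
```

A larger margin lets the peer qualify while it is still further behind, which is the trade-off mentioned above: the node joins sooner but may lag for a short time afterwards.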
