Leader changes occur under load #845

Closed
LinHuiG opened this issue Jun 16, 2022 · 12 comments · Fixed by #969
LinHuiG commented Jun 16, 2022

Your question

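We create multiple Raft instances in one cluster (distinguished by groupId). Under load, leader changes keep occurring (Raft node receives higher term RequestVoteRequest), even though CPU load is below 70% and the network is not congested. The state machine uses a thread pool internally for asynchronous processing.
Could a full business queue be causing the leader's heartbeats to fail?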

thread_pool_metrics.log.2022-06-16_01-10-33.txt
node_metrics.log.2022-06-16_01-10-33.txt
node_describe.log.2022-06-16_01-10-33.txt


Environment

  • SOFAJRaft version: 1.3.10.bugfix_2
  • JVM version (e.g. java -version): 1.8.
  • OS version (e.g. uname -a): CentOS 7.8
  • Maven version: 3.6.1
  • IDE version:

LinHuiG commented Jun 16, 2022

Is this the same problem as #830?


LinHuiG commented Jun 16, 2022

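The network speed between the test nodes is about 800 Mb/s, and the service used less than 25 Mb/s of bandwidth at full load. The load generator also keeps the number of in-flight, unfinished tasks (uncommitted log entries) below 2000.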
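For readers who want to reproduce the in-flight cap described above, here is a minimal sketch of one way a load generator could enforce it. It is not the author's actual load program: the class `InFlightLimiter` and its method names are hypothetical, and it uses only `java.util.concurrent.Semaphore` from the JDK.

```java
import java.util.concurrent.Semaphore;

/**
 * Hypothetical helper for a load generator (not part of SOFAJRaft):
 * caps the number of submitted-but-uncommitted tasks.
 */
final class InFlightLimiter {
    private final int maxInFlight;
    private final Semaphore permits;

    InFlightLimiter(int maxInFlight) {
        if (maxInFlight < 1) {
            throw new IllegalArgumentException("maxInFlight must be >= 1");
        }
        this.maxInFlight = maxInFlight;
        this.permits = new Semaphore(maxInFlight);
    }

    /** Returns true if a new task may be submitted right now. */
    boolean trySubmit() {
        return permits.tryAcquire();
    }

    /** Call exactly once per successful trySubmit(), when the task's log entry is committed. */
    void onCommitted() {
        permits.release();
    }

    /** Number of tasks submitted but not yet committed. */
    int inFlight() {
        return maxInFlight - permits.availablePermits();
    }
}
```

In a real test the call to `onCommitted()` would sit in the completion callback of the submitted task, and a limit of 2000 would match the figure quoted in the comment.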


LinHuiG commented Jun 16, 2022

The log contains the entry below. It looks like the election timer timed out. Can replication of business logs cause heartbeat timeouts?
2022-06-16 03:17:13.888 com.alipay.sofa.jraft.core.StateMachineAdapter [JRaft-FSMCaller-Disruptor-0]-[INFO] onStopFollowing: LeaderChangeContext [leaderId=10.10.184.171:9551, term=1, status=Status[ERAFTTIMEDOUT<10001>: Lost connection from leader 10.10.184.171:9551.]].


LinHuiG commented Jun 16, 2022

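After adjusting the timeout, the problem no longer occurs. My guess is that business log replication and heartbeats share the same RPC channel without any prioritization, so the log traffic crowds out the heartbeats. As I understand it, heartbeats should have higher priority than business logs.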
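The crowding-out guess above can be illustrated with a toy model. This is not SOFAJRaft code and does not claim to describe how its RPC layer actually schedules messages: `HeartbeatPriorityModel`, the 30 ms batch cost, and the queue depth are all made-up assumptions. It only compares how long a heartbeat waits behind queued log batches on a FIFO channel versus a channel that sends heartbeats first.

```java
import java.util.ArrayDeque;
import java.util.Comparator;
import java.util.PriorityQueue;
import java.util.Queue;

/** Toy model (hypothetical, not SOFAJRaft): a channel that sends one message at a time. */
final class HeartbeatPriorityModel {

    static final class Msg {
        final boolean heartbeat;
        final long costMs; // how long the channel is busy sending this message
        final long seq;    // arrival order, used as a tie-breaker

        Msg(boolean heartbeat, long costMs, long seq) {
            this.heartbeat = heartbeat;
            this.costMs = costMs;
            this.seq = seq;
        }
    }

    /** Plain FIFO channel: messages leave in arrival order. */
    static Queue<Msg> fifo() {
        return new ArrayDeque<>();
    }

    /** Prioritized channel: heartbeats leave first, then arrival order. */
    static Queue<Msg> heartbeatFirst() {
        return new PriorityQueue<>(
            Comparator.<Msg, Boolean>comparing(m -> !m.heartbeat)
                      .thenComparingLong(m -> m.seq));
    }

    /** Enqueues pendingBatches log batches, then one heartbeat behind them. */
    static Queue<Msg> fill(Queue<Msg> queue, int pendingBatches, long batchCostMs) {
        long seq = 0;
        for (int i = 0; i < pendingBatches; i++) {
            queue.add(new Msg(false, batchCostMs, seq++));
        }
        queue.add(new Msg(true, 1, seq));
        return queue;
    }

    /** Drains the queue and returns how long the heartbeat waited before being sent. */
    static long heartbeatDelayMs(Queue<Msg> queue) {
        long elapsedMs = 0;
        Msg m;
        while ((m = queue.poll()) != null) {
            if (m.heartbeat) {
                return elapsedMs;
            }
            elapsedMs += m.costMs;
        }
        throw new IllegalStateException("no heartbeat in queue");
    }
}
```

With 50 queued batches at an assumed 30 ms each, the FIFO heartbeat waits for all of them, which would exceed an election timeout of around one second, while the prioritized heartbeat does not wait at all. Raising the timeout, as done above, widens the margin without changing the ordering.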

killme2008 commented

This is probably the same problem as #830.


killme2008 commented Jun 16, 2022

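This problem can be mitigated with some throttling, by tuning the following parameters in RaftOptions:

  • maxByteCountPerRpc: limits the size of a single RPC request
  • maxEntriesSize: number of log entries sent per batch
  • maxBodySize: number of log bytes sent per batch
  • maxAppendBufferSize: maximum number of bytes buffered before a forced flush to disk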
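To see why smaller caps can help, the following is a simplified, self-contained model of per-batch limits. It is an assumption-laden sketch, not SOFAJRaft's replicator logic: `BatchCapModel` and its parameters `maxEntries` and `maxBodyBytes` only mimic the idea behind maxEntriesSize and maxBodySize, and the real library may apply its limits differently.

```java
import java.util.ArrayList;
import java.util.List;

/** Simplified model of per-batch caps (hypothetical, not SOFAJRaft's implementation). */
final class BatchCapModel {

    /**
     * Splits a run of log entries into batches. A batch is closed when it already
     * holds maxEntries entries, or when adding the next entry would push it past
     * maxBodyBytes. A single oversized entry still gets a batch of its own.
     *
     * @return the number of entries in each batch, in send order
     */
    static List<Integer> batchCounts(int[] entrySizes, int maxEntries, int maxBodyBytes) {
        if (maxEntries < 1 || maxBodyBytes < 1) {
            throw new IllegalArgumentException("caps must be >= 1");
        }
        List<Integer> batches = new ArrayList<>();
        int count = 0;
        int bytes = 0;
        for (int size : entrySizes) {
            boolean full = count == maxEntries
                || (count > 0 && bytes + size > maxBodyBytes);
            if (full) {
                batches.add(count);
                count = 0;
                bytes = 0;
            }
            count++;
            bytes += size;
        }
        if (count > 0) {
            batches.add(count);
        }
        return batches;
    }
}
```

Smaller caps produce more, shorter batches, which leaves more gaps in which other traffic such as heartbeats can be handled. The trade-off is more round trips per unit of log data, so the values are worth measuring rather than guessing.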


LinHuiG commented Jun 28, 2022

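I looked into this further. The pressure on the servers is on the disk (disk utilization above 90%), so the blocking happens where the logs are persisted, and that is what affects the heartbeats. Throttling did not help much.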


j9kkk commented Nov 19, 2022

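hello, has this problem been resolved? Could the election (candidate) timer be refreshed when the state machine applies entries? That would also solve this problem.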
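The suggestion above can be sketched as a small timer model. This is only an illustration of the proposal, not SOFAJRaft's election timer and not the fix that was eventually merged: `ElectionTimerModel`, its methods, and the millisecond values are hypothetical, and whether refreshing on apply is safe in a real Raft implementation is not established in this thread.

```java
/** Toy follower-side election timer (hypothetical, not SOFAJRaft's implementation). */
final class ElectionTimerModel {
    private final long timeoutMs;
    private final boolean refreshOnApply;
    private long lastRefreshMs;

    ElectionTimerModel(long timeoutMs, boolean refreshOnApply, long nowMs) {
        this.timeoutMs = timeoutMs;
        this.refreshOnApply = refreshOnApply;
        this.lastRefreshMs = nowMs;
    }

    /** A heartbeat from the leader always refreshes the timer. */
    void onHeartbeat(long nowMs) {
        lastRefreshMs = nowMs;
    }

    /** Applying an entry refreshes the timer only if the proposed behavior is enabled. */
    void onApply(long nowMs) {
        if (refreshOnApply) {
            lastRefreshMs = nowMs;
        }
    }

    /** True if the follower would give up on the leader and start an election. */
    boolean wouldStartElection(long nowMs) {
        return nowMs - lastRefreshMs >= timeoutMs;
    }
}
```

In this model, a follower that is still applying entries at 800 ms but has seen no heartbeat since 0 ms starts an election at 1200 ms with the default behavior, and stays a follower when `refreshOnApply` is enabled.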


LinHuiG commented Mar 22, 2023

hello, has this problem been resolved?


ReycoLi commented Apr 11, 2023

We have run into the same problem and are waiting for a follow-up.

@killme2008 killme2008 self-assigned this Apr 12, 2023
killme2008 commented

I will try to put together a fix.


LinHuiG commented Aug 3, 2023

"I will try to put together a fix."

Will this fix be released in version 1.3.14, and is there a planned release date?
