Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Proposal] A proposal to use IdleStateHandler to replace using Timer to send HeartBeat. #3151

Closed
lexburner opened this issue Jan 7, 2019 · 14 comments

Comments

@lexburner
Copy link
Contributor

Dubbo 在应用层面发送心跳包保证连接的可用性,使用了定时器的设计,在客户端和服务端分别设置一个定时器,发送心跳,当发现连接断开时,客户端负责重连,服务端负责 close。使用定时器并不是一个好的设计,在忙通信时,心跳是不必要的。建议使用 Netty 的 IdleStateHandler,仅仅在检测到空闲连接时发送心跳。

修改建议:

  • 使用 IdleStateHandler 代替 Timer 发送心跳
  • 关闭 ChannelOption.SO_KEEPALIVE,网络层面的 TCP 断连需要在机器级别设置,默认是 2 小时,几乎没有必要存在,却发出了无必要的 TCP 探测包,仅仅依赖于应用层的心跳来给连接保活即可。

Dubbo sends a heartbeat packet at the application level to ensure the availability of the connection. A timer is set on the client and the server to send a heartbeat. When the connection is found to be disconnected, the client is responsible for reconnection and the server is responsible for close. Using a timer is not a good design, and the heartbeat is unnecessary when communicating busy. It is recommended to use Netty's IdleStateHandler to send a heartbeat only when an idle connection is detected.

Proposed changes:

  • Send heartbeats using IdleStateHandler instead of Timer
  • Close ChannelOption.SO_KEEPALIVE, TCP disconnection at the network level needs to be set at the machine level. The default is 2 hours. There is almost no need to exist, but an unnecessary TCP probe packet is issued. It only depends on the heartbeat of the application layer to keep the connection alive. Just fine.
@carryxyh
Copy link
Member

carryxyh commented Jan 7, 2019

if ((lastRead != null && now() - lastRead > heartbeat)
    || (lastWrite != null && now() - lastWrite > heartbeat)) {

目前在忙通讯的时候确实没有发送心跳包。只是定时检查,不一定定时发送。

关闭 ChannelOption.SO_KEEPALIVE,网络层面的 TCP 断连需要在机器级别设置,默认是 2 小时

同意。但是优化空间不大,聊胜于无。

@lexburner
Copy link
Contributor Author

@chickenlj did you find that the Timer of HeartBeat send may be out of the time set?
@carryxyh I think the IdleStateHandler is a graceful design and the Timer design can use at those Non-Netty cases.

@kexianjun
Copy link
Member

kexianjun commented Jan 7, 2019

@chickenlj did you find that the Timer of HeartBeat send may be out of the time set?
@carryxyh I think the IdleStateHandler is a graceful design and the Timer design can use at those Non-Netty cases.

netty的设计确实更合理一些,目前dubbo的心跳检查并不能非常及时的检测超时,可能会有一个heartbeat的延时才能检测出来,而netty的IdleStateHandler设计每次检测超时都重新计算下一次检测的时间,因此相对来说就能比较及时的检查到超时了

@lexburner
Copy link
Contributor Author

@kexianjun Do you have some existing cases to describe why dubbo's Timer will out of time at some time? I did receive some feedbacks about Timer HeartBeat, but can't recreate it.

@carryxyh
Copy link
Member

carryxyh commented Jan 7, 2019

netty的设计确实更合理一些,目前dubbo的心跳检查并不能非常及时的检测超时,可能会有一个heartbeat的延时才能检测出来

目前是1/4的heartbeat延时,这个已经调整过了。

netty的IdleStateHandler设计每次检测超时都重新计算下一次检测的时间,因此相对来说就能比较及时的检查到超时了

netty的这个我还没有细看,抽空看下,如果延迟更小,效果更好,同意使用idle。

@carryxyh
Copy link
Member

carryxyh commented Jan 7, 2019

beiwei在邮件列表上提醒了我们
由于dubbo存在多个通讯框架的支持,最好还是使用相同的方式来进行心跳检测。否则会使得dubbo的代码比较复杂。

我们完全没有必要在代码中加入类似

if transport == netty {
     don't start heartbeat timer
}

另外,由于目前心跳检测的频率已经提高了四倍,1/4心跳周期的延迟我认为完全可以接受 @kexianjun

@lexburner
Copy link
Contributor Author

lexburner commented Jan 7, 2019 via email

@kexianjun
Copy link
Member

kexianjun commented Jan 7, 2019

@kexianjun Do you have some existing cases to describe why dubbo's Timer will out of time at some time? I did receive some feedbacks about Timer HeartBeat, but can't recreate it.

@lexburner 检查时间间隔在启动以后就是确定的,比如说如下场景:
超时检查时间间隔为5秒,在00秒进行了一次检查,从01秒开始就没有接收到心跳(包括其他数据包),05秒检查,没有超时,10秒才能检测出来,其实在06秒就已经超时了

@carryxyh
Copy link
Member

carryxyh commented Jan 7, 2019

我说一下我的观点。
首先,timer和心跳完全没关系,timer抽象的定时器,不是心跳。跟心跳有关的抽象是Header层,尤其是HeaderExchangeClient只有一个心跳功能,就是因为有这样一层抽象在,我们才可以不关心底层是netty还是mina,而只需要关心具体实现,我觉得这个设计没有问题,非常棒。

其次,功能层面看,idle也好,timer也好,无非都是定时检查,schedule或者说轮训的,都不怎么优雅,区别是拿到外面还是放到pipeline里面。如果说netty的idle实现更精确的检查,个人认为更好的方式是借鉴它的实现方式。

@kexianjun
Copy link
Member

kexianjun commented Jan 8, 2019

@lexburner @carryxyh 我看了下目前dubbo触发心跳检查的实现,每次检测完以后会往定时器重新放置任务

timer.newTimeout(timeout.task(), tick, TimeUnit.MILLISECONDS);

而且每次启动心跳检检测的时候都会重新创建HashedWheelTimer

heartbeatTimer = new HashedWheelTimer(new NamedThreadFactory("dubbo-server-heartbeat", true), tickDuration,
TimeUnit.MILLISECONDS, Constants.TICKS_PER_WHEEL);

是否可以整个应用共用一个HashedWheelTimer,给每个已经建立连接的socketChannel添加心跳检测任务,然后每次执行完心跳检查任务以后重新计算并设置触发下一次心跳检测的时间,再往定时器里面添加心跳检测任务,这样不仅避免了创建多个HashedWheelTimer,而且能更及时的检测到是否超时,并且也不用针对netty进行特例判断,不知道这种方式是不是更好一些?

@lexburner
Copy link
Contributor Author

lexburner commented Jan 15, 2019

Here is my suggestion to improve Dubbo’s HeartBeat Design.

  • The design of the two-way heartbeat is unnecessary, compatible with the existing logic, allowing the client to send a one-way heartbeat when the connection is idle, and the server periodically detects the connection availability. Timed time to ensure: client timeout * 3 ≈ server timeout
  • Remove the timing tasks for reconnection and disconnection. Dubbo can judge whether the heartbeat request fails to respond. You can learn from the design of the improved scheme, maintain a mark of the number of heartbeat failures at the connection level, and successfully respond to any failures. Clear the mark; continuous heartbeat failure. The client initiates a reconnection. This can reduce an unnecessary timer, and any polling method is not elegant.

I've described more details in my blog, about Dubbo's existing heartbeat program, its shortcomings, and the advantages of replacing it with a new solution, as well as some forward thinking, welcome to discuss.

bolg:https://www.cnkirito.moe/heartbeat-design

@lexburner
Copy link
Contributor Author

I will submit a pr for this issue.

@lexburner
Copy link
Contributor Author

Since the pull requests have been merged. I close the issue now. You can find the details as below:
#3276
#3299

@wangkaish
Copy link

不能同意更多,大老远跑来给你点赞👍

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants