Some issues with service call timeouts #1968

baigod · 2018-06-20T16:23:03Z

dubbo version: 2.5.6

issues code 1

@Component
public class ConsumerA {

    @Reference(version = "1.0.0",retries = 0, timeout = 180000)
    public ProviderB providerB ;

    public void a(){
        try{
            providerB.b(); 
        }catch(Exception e){
            // do something
        }
    }

}

@Service(version = "1.0.0")
public class ProviderB {

    public void b(){ 
        try{
            Thread.sleep(10000); // sleep 10 seconds to simulate slow execution
        }catch(Exception e){
        }
    }

}

Scenario: provider B needs an emergency restart.
B goes down while calls are in flight. How can a() catch the exception immediately instead of waiting 3 minutes for the timeout? With many concurrent requests in flight, requests could pile up on service A. Is there a friendlier way to handle this, or is my configuration incomplete?

issues code 2

@Component
public class ConsumerA {

    @Reference(version = "1.0.0",retries = 0, timeout = 3000)
    public ProviderB providerB ;

    public void a(){
        updateSomeRows(); // update some rows
        try{
            providerB.b(); 
        }catch(Exception e){
            throw new XXException();
        }
    }

}

@Service(version = "1.0.0")
public class ProviderB {

    public void b(){ 
        insertRow();  // insert a row
        try{
            Thread.sleep(10000); // sleep 10 seconds to simulate slow execution
        }catch(Exception e){
        }
    }

}

Scenario: provider B hits a database performance bottleneck (updateSomeRows() executes slowly).
A catches the timeout exception and a()'s transaction rolls back because of it, but B's row has already been inserted, which is not the expected result. How can this be handled simply and gracefully, without resorting to distributed transactions?

baigod · 2018-06-25T03:11:43Z

#519

wanglei-sky · 2018-06-29T04:16:43Z

Bump.

luyuanwan · 2018-07-02T15:26:18Z

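For issue 1, consider using Dubbo's heartbeat: if the heartbeat cannot be detected, the call can return immediately instead of continuing to wait. @baigod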

baigod · 2018-07-03T11:27:07Z

@luyuanwan How exactly would that be implemented? :)

chickenlj · 2018-07-10T07:59:05Z

For case 2, I am afraid there is no perfect solution: when a heuristic exception occurs, you can only retry or check offline.

For case 1, I think it's meaningful to notify the hanging client requests as soon as possible. When the server goes down, it triggers a disconnect on the client, and we can then return all requests related to that channel.
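
The idea for case 1 can be sketched roughly as follows. This is not Dubbo's actual code, only a minimal illustration of "fail every pending request bound to a channel as soon as that channel disconnects". The class `PendingRequests`, its methods, and the string channel ids are all hypothetical names chosen for the example.

```java
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical registry of in-flight requests (not a Dubbo class).
class PendingRequests {
    // One entry per in-flight request: the channel it was sent on and its future.
    private static final class Pending {
        final String channelId;
        final CompletableFuture<Object> future = new CompletableFuture<>();
        Pending(String channelId) { this.channelId = channelId; }
    }

    private final Map<Long, Pending> inFlight = new ConcurrentHashMap<>();
    private final AtomicLong nextId = new AtomicLong();

    // Called when a request is written to a channel; the caller waits on the future.
    CompletableFuture<Object> register(String channelId) {
        long id = nextId.incrementAndGet();
        Pending p = new Pending(channelId);
        inFlight.put(id, p);
        // Drop the entry once the request finishes, normally or exceptionally.
        p.future.whenComplete((result, error) -> inFlight.remove(id));
        return p.future;
    }

    // Called from the disconnect callback: fail every unfinished request on that
    // channel right away instead of letting it run into its timeout.
    int closeChannel(String channelId) {
        int failed = 0;
        for (Pending p : inFlight.values()) {
            if (p.channelId.equals(channelId)
                    && p.future.completeExceptionally(
                        new IllegalStateException("Channel " + channelId + " is inactive"))) {
                failed++;
            }
        }
        return failed;
    }

    int inFlightCount() { return inFlight.size(); }
}
```

With something like this in place, a caller blocked on the future sees an exception as soon as the disconnect is observed, while requests on other channels keep waiting for their own responses.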

carryxyh · 2018-08-29T11:01:23Z

@baigod
Hi, I have fixed the issue in case 1 with this PR:
#2185

For case 2, I agree with chickenlj.
You can keep it "eventually consistent".
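
One common way to approach "eventually consistent" without distributed transactions is to make the provider call idempotent and let the caller retry after a timeout, since a timeout leaves the outcome unknown. The sketch below only illustrates that idea under assumptions: `IdempotentInsertSketch`, `insertRowOnce`, `callWithRetry`, and the caller-supplied request id are hypothetical, and an in-memory map stands in for B's table.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch, not part of Dubbo or of the code in this issue.
class IdempotentInsertSketch {
    // Stand-in for provider B's table, keyed by a caller-supplied request id.
    private final Map<String, String> rows = new ConcurrentHashMap<>();

    // Idempotent variant of b(): repeating the call with the same request id
    // never creates a second row. Returns true only if this call inserted it.
    boolean insertRowOnce(String requestId, String payload) {
        return rows.putIfAbsent(requestId, payload) == null;
    }

    int rowCount() { return rows.size(); }

    // A remote call that may fail, for example with a timeout.
    interface RemoteCall { void invoke() throws Exception; }

    // Caller side: after a failure the outcome is unknown, so retry.
    // This is only safe because the remote operation is idempotent.
    static boolean callWithRetry(RemoteCall call, int maxAttempts) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                call.invoke();
                return true;
            } catch (Exception e) {
                // The provider may or may not have applied the change; try again.
            }
        }
        return false; // still unknown: leave it to an offline check
    }
}
```

If every attempt fails, the state is still unknown, which is where the "check offline" option mentioned earlier in the thread comes in.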

Are there any other questions about this issue? If not, we will close it soon. :)

diecui1202 · 2018-09-05T07:26:45Z

@carryxyh Does PR #2185 need to be merged back to 2.6.x?

carryxyh · 2018-09-05T07:31:57Z

@diecui1202
Thanks for reminding me, I had not considered it before. I will submit a PR as soon as possible.

diecui1202 · 2018-09-05T08:01:00Z

@carryxyh Great thanks.

diecui1202 · 2018-09-06T01:29:51Z

#2451 has been merged to 2.6.x.

falcondsc · 2019-07-29T06:59:04Z

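@carryxyh Hello. Our project (on 2.5.3) also runs into the problem in issues code 1. After seeing this issue (fixed in 2.6.4), I downloaded the 2.6.6 source and ran it locally. However, I found that after the consumer sends a call and it reaches the provider, if I manually kill the provider process before it returns a result, the consumer still keeps waiting. I debugged the Dubbo 2.6.6 internals for this. I am not familiar with Dubbo's communication architecture, so all I can tell from debugging is that the consumer does get notified that the provider disconnected, but the class that receives the notification is HeaderExchangeChannel. In the files changed by the fix on git, the key change is that the disconnected method of HeaderExchangeHandler calls DefaultFuture.closeChannel(channel);
So, as an experiment, I modified the close method of my local HeaderExchangeChannel to call DefaultFuture.closeChannel(channel); as well. After repackaging and running it locally, that actually resolved issues code 1. Is this modification correct? Thanks!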

carryxyh · 2019-07-29T07:41:21Z

@falcondsc
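Hello.

First, let's confirm how your provider is being shut down. My guess is that you are not shutting it down normally but with something like kill -9 (or some other way that prevents the TCP connection from closing normally through the four-way handshake).

If the provider's TCP connection is closed normally, the sequence triggered on the consumer side should be:
TCP close -> channelInactive in Netty -> HeaderExchangeHandler#disconnected in Dubbo

In that case, as fixed in my PR, everything happens as expected.

Conversely, if you shut it down more violently, with kill -9 or by cutting the provider's network, the provider does not send a FIN packet to the consumer. In that case, as far as the TCP connection is concerned, the consumer still believes the provider process is alive and able to handle requests.

So why is HeaderExchangeChannel's close method triggered?
I think it is triggered by Dubbo's own service discovery mechanism: the provider goes offline, its data on ZooKeeper is removed, and ZooKeeper notifies the consumer. On receiving the notification, the consumer updates its provider list, sees that the provider is offline, and actively closes the TCP connection, which calls HeaderExchangeChannel's close.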

falcondsc · 2019-07-29T08:25:45Z

@carryxyh
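Hello, and thanks for the reply. Yes, locally I shut the provider down forcibly. Specifically, my provider Java process is started from the Eclipse IDE, and so is the consumer. Once the consumer's request has reached the provider, I terminate the provider process directly from the IDE.

It is just as you said: when I kill the provider process, the call stack for the consumer's HeaderExchangeChannel close originates from Dubbo's ZookeeperRegistry notify.

The description of issues code 1 says "B goes down; how can a() catch the exception immediately", so I simply killed the provider from the IDE to verify this. In our production environment we deploy provider nodes with docker swarm and use docker health checks to automatically circuit-break nodes that stop responding (sometimes because DB resources are insufficient and the node is actually waiting on the DB; we do make our service calls idempotent, though. I may still need to check whether docker sends kill -9 to the nodes it circuit-breaks). So I'd like to ask one more thing: when the service discovery mechanism detects that a provider is offline, it ends up calling HeaderExchangeChannel's close method. If I add a call to DefaultFuture.closeChannel(channel); inside that close method, would it cause any problems?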

carryxyh · 2019-07-29T08:36:29Z

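"If I add a call to DefaultFuture.closeChannel(channel); inside the close method, would it cause any problems?"

I don't think it would cause any problems; on the contrary, adding this call makes the behavior more robust. In any situation, once execution actually reaches HeaderExchangeChannel#close, it means we no longer care whether the hanging requests can return normally.

If you like, you can create a new issue to track this and submit a pull request to improve it.
:)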

falcondsc · 2019-07-29T08:50:20Z

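@carryxyh OK, thanks. I registered on GitHub only today because of this problem. Once we have verified it in our production environment, I will come back and figure out how to open a new issue and submit a pull request. :)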

beiwei30 · 2019-07-30T07:40:08Z

@falcondsc I verified on the master branch. This solution doesn't work.

falcondsc · 2019-07-30T11:24:16Z

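@beiwei30 Hello. By "master branch", do you mean the main 2.7.x branch? I tested with modified 2.6.6 code. The registry is ZooKeeper and the consumer timeout is set to 10 minutes. I have not modified or tested 2.7.x for this problem.
Locally, on 2.6.6, I changed HeaderExchangeChannel.close() to call DefaultFuture.closeChannel(channel); first, or alternatively to call it in the finally block. Either way, after the consumer sends a request and the provider has accepted it, killing the provider process makes the consumer receive the "Channel xx is inactive. Directly return the unFinished request" exception within a few seconds. Before the change, my consumer had to wait the full 10 minutes and then got a timeout exception.
Over the next few days I will first deploy my modified version to our test environment for verification, and then roll it forward to our production environment.
(Additional note: I also specifically modified a few places where Dubbo looks up the class loader so that it is taken from the current thread, but I don't think those changes should affect the results of this test.)

I hope this information is helpful.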

falcondsc · 2019-07-30T11:45:23Z

One more note: my consumer's retries is set to 0.

falcondsc · 2019-07-30T12:11:40Z

I also tested with Multicast as the registry. In that case the method invoked on the consumer's HeaderExchangeChannel is
public void close(int timeout), and in this scenario the consumer can only keep waiting until the timeout.

beiwei30 · 2019-09-05T06:31:51Z

@falcondsc Yes, the analysis and fix were done against the master branch. Could you please verify on 2.6.x? If it works, you are welcome to open a pull request to backport the fix from master.

falcondsc · 2019-09-09T01:17:04Z

@beiwei30
OK. I will verify on the 2.6.x branch by this Wednesday (9/11).

falcondsc · 2019-09-10T17:52:58Z

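I checked out the 2.6.x branch and added a call to DefaultFuture.closeChannel(channel); in HeaderExchangeChannel.close(). But testing showed that the call stack leading to HeaderExchangeChannel.close() differs somewhat from what I saw when I modified 2.6.6: on the 2.6.x branch it is much slower (after shutting down the Dubbo provider, it takes more than 30 seconds) to reach HeaderExchangeChannel.close() on the consumer side. I will keep testing to see whether the difference comes from my two test environments.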

falcondsc · 2019-09-11T16:17:15Z

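@beiwei30 One more question. I am new to GitHub, so I read through the whole issue and the linked content several times.
When you say verify on 2.6.x, do you mean applying the changes from your PRs 4700 and 4698 to my 2.6.x repository and verifying there?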

CrazyHZM · 2021-10-10T13:05:21Z

Try it with the latest version; if you still have problems, you can reopen the issue.

chickenlj added the type/enhancement label Jul 10, 2018

chickenlj added this to the 2.6.4 milestone Jul 10, 2018

chickenlj mentioned this issue Aug 4, 2018

Return the hanging client requests ASAP. when connection is broken unexpectedly. #2184

Closed

diecui1202 added the status/fix-in-next-release label Sep 5, 2018

diecui1202 closed this as completed Sep 6, 2018

bigwg mentioned this issue Jul 30, 2019

When the provider interrupts abnormally, the consumer cannot return quickly and still waits for the timeout to end #4694

Closed

1 task

This was referenced Jul 30, 2019

org.apache.dubbo.remoting.exchange.support.DefaultFuture#closeChannel doesn't work as expected #4699

Closed

issue #4699: org.apache.dubbo.remoting.exchange.support.DefaultFuture closeChannel doesn't work as expected #4700

Merged

carryxyh mentioned this issue Jul 31, 2019

[Dubbo-4694] Fix consumer can't return quickly, when the provider interrupts abnormally #4694 #4698

Merged

4 tasks

beiwei30 reopened this Sep 5, 2019

beiwei30 modified the milestones: 2.6.4, 2.6.8 Sep 5, 2019

dangit815 mentioned this issue Oct 24, 2019

After the dubbo connection from consumer to provider is broken, the consumer does not rebuild the connection immediately, but waits for more than a minute before rebuilding the connection. #5234

Closed

2 tasks

CrazyHZM added type/proposal Everything you want Dubbo have and removed status/fix-in-next-release labels Sep 14, 2021

CrazyHZM closed this as completed Oct 10, 2021
