
Channel destruction hangs after creating a channel with consul #584

Closed
brianjcj opened this issue Nov 26, 2018 · 19 comments

Comments

brianjcj commented Nov 26, 2018

Describe the bug

After creating a channel with consul, destroying the channel hangs: the thread that triggers the destruction blocks forever.

To Reproduce

TEST_F(ProxyRpcTest, test_consul_ns) {
  LOGV(LL_NOTICE, "hello consul");

  brpc::Channel* channel = new brpc::Channel();

  brpc::ChannelOptions options;
  options.protocol = "h2:grpc";

  std::string url = "consul://proxy";

  int ret = channel->Init(url.c_str(), "rr", &options);

  EXPECT_EQ(0, ret);

  delete channel;   // Hangs here.

  LOGV(LL_NOTICE, "done!");
}

The backtrace captured with gdb:

(gdb) bt
#0  syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
#1  0x0000000000a2a412 in bthread::futex_wait_private (timeout=0x0, expected=0, addr1=0x7fff82443cf0) at ./src/bthread/sys_futex.h:37
#2  bthread::wait_pthread (pw=..., ptimeout=ptimeout@entry=0x0) at src/bthread/butex.cpp:138
#3  0x0000000000a2b8af in bthread::butex_wait_from_pthread (abstime=0x0, expected_value=1, b=0x2a5ce00, g=<optimized out>) at src/bthread/butex.cpp:585
#4  bthread::butex_wait (arg=0x2a5ce00, expected_value=expected_value@entry=1, abstime=abstime@entry=0x0) at src/bthread/butex.cpp:618
#5  0x00000000008c3c03 in bthread::TaskGroup::join (tid=<optimized out>, return_value=return_value@entry=0x0) at src/bthread/task_group.cpp:479
#6  0x00000000008cf5ba in bthread_join (tid=<optimized out>, thread_return=thread_return@entry=0x0) at src/bthread/bthread.cpp:244
#7  0x0000000000acf510 in brpc::NamingServiceThread::~NamingServiceThread (this=0x2a4cf80, __in_chrg=<optimized out>) at src/brpc/details/naming_service_thread.cpp:244
#8  0x0000000000acfa01 in brpc::NamingServiceThread::~NamingServiceThread (this=0x2a4cf80, __in_chrg=<optimized out>) at src/brpc/details/naming_service_thread.cpp:265
#9  0x000000000094e3b6 in brpc::SharedObject::RemoveRefManually (this=<optimized out>) at ./src/brpc/shared_object.h:50
#10 brpc::intrusive_ptr_release (obj=<optimized out>) at ./src/brpc/shared_object.h:65
#11 butil::intrusive_ptr<brpc::NamingServiceThread>::~intrusive_ptr (this=0x2a4cb98, __in_chrg=<optimized out>) at ./src/butil/intrusive_ptr.hpp:89
#12 brpc::LoadBalancerWithNaming::~LoadBalancerWithNaming (this=0x2a4cb20, __in_chrg=<optimized out>) at src/brpc/details/load_balancer_with_naming.cpp:22
#13 0x000000000094e401 in brpc::LoadBalancerWithNaming::~LoadBalancerWithNaming (this=0x2a4cb20, __in_chrg=<optimized out>) at src/brpc/details/load_balancer_with_naming.cpp:26
#14 0x00000000009211a6 in brpc::SharedObject::RemoveRefManually (this=<optimized out>) at ./src/brpc/shared_object.h:50
#15 brpc::intrusive_ptr_release (obj=<optimized out>) at ./src/brpc/shared_object.h:65
#16 butil::intrusive_ptr<brpc::SharedLoadBalancer>::~intrusive_ptr (this=0x2a25c38, __in_chrg=<optimized out>) at ./src/butil/intrusive_ptr.hpp:89
#17 brpc::Channel::~Channel (this=0x2a25c00, __in_chrg=<optimized out>) at src/brpc/channel.cpp:136
#18 0x0000000000921431 in brpc::Channel::~Channel (this=0x2a25c00, __in_chrg=<optimized out>) at src/brpc/channel.cpp:141
#19 0x00000000006fac54 in std::_Sp_counted_ptr<brpc::Channel*, (__gnu_cxx::_Lock_policy)2>::_M_dispose (this=0x2a4d290) at /usr/include/c++/5/bits/shared_ptr_base.h:374
#20 0x00000000006f5ea2 in std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release (this=0x2a4d290) at /usr/include/c++/5/bits/shared_ptr_base.h:150
#21 0x00000000006f0e3f in std::__shared_count<(__gnu_cxx::_Lock_policy)2>::~__shared_count (this=0x7fff82444128, __in_chrg=<optimized out>) at /usr/include/c++/5/bits/shared_ptr_base.h:659
#22 0x00000000006f17c6 in std::__shared_ptr<brpc::Channel, (__gnu_cxx::_Lock_policy)2>::~__shared_ptr (this=0x7fff82444120, __in_chrg=<optimized out>) at /usr/include/c++/5/bits/shared_ptr_base.h:925
#23 0x00000000006f17e2 in std::shared_ptr<brpc::Channel>::~shared_ptr (this=0x7fff82444120, __in_chrg=<optimized out>) at /usr/include/c++/5/bits/shared_ptr.h:93
#24 0x00000000006f2816 in cim::test::RpcTest::CallRpc<cim::proto::P2PNotifyRequest, cim::proto::P2PNotifyResponse, cim::proto::ProxyService_Stub, void (cim::proto::ProxyService_Stub::*)(google::protobuf::RpcController*, cim::proto::P2PNotifyRequest const*, cim::proto::P2PNotifyResponse*, google::protobuf::Closure*)> (this=0x29f6b60, rpc_type=cim::rpc::RPReq, service_name="proxy", method_name="", req=0x7fff82444860, res=0x7fff82444830, sub=...,
    fp=&virtual table offset 368, uid=1, use_grpc=true) at /docker-v/ubuntu/my-projects/zixia/libs/yy-rpc/test/rpc_sender.h:290
#25 0x00000000006ecb80 in (anonymous namespace)::ProxyRpcTest_temp_Test::TestBody (this=0x29f6b60) at /docker-v/ubuntu/my-projects/zixia/server/proxy/test/proxy_functional_test.cc:220
#26 0x000000000089fb33 in testing::internal::HandleSehExceptionsInMethodIfSupported<testing::Test, void> (location=0xc20a96 "the test body", method=<optimized out>, object=<optimized out>) at ./src/gtest.cc:2078
#27 testing::internal::HandleExceptionsInMethodIfSupported<testing::Test, void> (object=object@entry=0x29f6b60, method=<optimized out>, location=location@entry=0xc20a96 "the test body") at ./src/gtest.cc:2114
#28 0x000000000089260d in testing::Test::Run (this=this@entry=0x29f6b60) at ./src/gtest.cc:2151
#29 0x00000000008926a4 in testing::TestInfo::Run (this=0x28c8c60) at ./src/gtest.cc:2326
#30 0x00000000008927a5 in testing::TestCase::Run (this=0x28c8520) at ./src/gtest.cc:2444
#31 0x0000000000892a1d in testing::internal::UnitTestImpl::RunAllTests (this=0x28c8210) at ./src/gtest.cc:4315
#32 0x0000000000892cfe in testing::internal::HandleSehExceptionsInMethodIfSupported<testing::internal::UnitTestImpl, bool> (location=<optimized out>, method=<optimized out>, object=<optimized out>) at ./src/gtest.cc:2078
#33 testing::internal::HandleExceptionsInMethodIfSupported<testing::internal::UnitTestImpl, bool> (location=0xc21e48 "auxiliary test code (environments or event listeners)",
    method=(bool (testing::internal::UnitTestImpl::*)(testing::internal::UnitTestImpl * const)) 0x892810 <testing::internal::UnitTestImpl::RunAllTests()>, object=0x28c8210) at ./src/gtest.cc:2114
#34 testing::UnitTest::Run (this=<optimized out>) at ./src/gtest.cc:3929
#35 0x00000000006dfae0 in RUN_ALL_TESTS () at ./include/gtest/gtest.h:2288
#36 main (argc=1, argv=0x7fff82444b78) at src/gtest_main.cc:37

Expected behavior
Destroying the channel should not hang.

Versions
OS: ubuntu 16.04
Compiler: gcc (Ubuntu 5.4.0-6ubuntu1~16.04.10) 5.4.0 20160609
brpc:
protobuf:

Additional context/screenshots

jamesge (Contributor) commented Nov 26, 2018

Does it hang with other protocols or naming services?

brianjcj (Author) commented Nov 26, 2018

I tried file and list; neither hangs. For example, this does not hang:

TEST_F(ProxyRpcTest, test_file_ns) {
  LOGV(LL_NOTICE, "hello file");

  brpc::Channel* channel = new brpc::Channel();

  brpc::ChannelOptions options;
  options.protocol = "h2:grpc";

  std::string url = "file:///docker-v/ubuntu/tmp/mt.list";

  int ret = channel->Init(url.c_str(), "rr", &options);

  EXPECT_EQ(0, ret);

  delete channel;

  LOGV(LL_NOTICE, "done!");
}

jamesge (Contributor) commented Nov 26, 2018

OK, I have forwarded this issue to the contributor of the consul naming service.

gydong (Contributor) commented Nov 27, 2018

@brianjcj I tried to reproduce it with your code but could not. (OS: CentOS 6.4; compiler: GCC 4.8.2)
Code: [screenshot not preserved]

Output: [screenshot not preserved]

brianjcj (Author) commented Nov 27, 2018

@gydong
If consul is not started and the channel connects to the default 127.0.0.1:8500 (no consul service is running at that address), it does not hang for me either, matching your result:

[NOTICE][2018-11-27 10:04:22 732804][47731][rpc_sender.h:SetUpTestCase:61][========SetUpTestCase========]
[NOTICE][2018-11-27 10:04:22 733258][47731][rpc_sender.h:RpcTest:175][========RpcTest========]
[NOTICE][2018-11-27 10:04:22 733310][47731][rpc_sender.h:SetUp:222][========SetUp========]
[NOTICE][2018-11-27 10:04:22 733329][47731][proxy_functional_test.cc:TestBody:338][hello consul]
I1127 10:04:22.741934 47743 src/brpc/details/naming_service_thread.cpp:200] brpc::policy::DomainNamingService("127.0.0.1:8500"): added 1
E1127 10:04:22.743072 47743 src/brpc/policy/consul_naming_service.cpp:106] Fail to access /v1/health/service/proxy?stale&passing: [E111]Fail to connect Socket{id=1 addr=127.0.0.1:8500} (0x0x7fe2500391d0): Connection refused [R1][E112]Fail to select server from http://127.0.0.1:8500 lb=rr [R2][E112]Fail to select server from http://127.0.0.1:8500 lb=rr [R3][E112]Fail to select server from http://127.0.0.1:8500 lb=rr
W1127 10:04:22.743218 47731 src/brpc/details/naming_service_thread.cpp:303] `consul://proxy' is empty! RPC over the channel will fail until servers appear
[NOTICE][2018-11-27 10:04:22 743453][47731][proxy_functional_test.cc:TestBody:356][done!]
[NOTICE][2018-11-27 10:04:22 743486][47731][rpc_sender.h:TearDown:228][========TearDown========]
[NOTICE][2018-11-27 10:04:22 743505][47731][rpc_sender.h:~RpcTest:213][========~RpcTest========]
[NOTICE][2018-11-27 10:04:22 743582][47731][rpc_sender.h:TearDownTestCase:160][========TearDownTestCase========]

But after I started a consul service (http://172.25.42.17:8500) and configured brpc to use it, it hangs:

namespace brpc {
namespace policy {
DECLARE_string(consul_agent_addr);
}
}

class RpcTest : public ::testing::Test {
protected:
  void SetUp() override {
    // Code here will be called immediately after the constructor (right
    // before each test).
    brpc::policy::FLAGS_consul_agent_addr = "http://172.25.42.17:8500";
    LOGV(LL_NOTICE, "========SetUp========");
  }
};

Program output:

[NOTICE][2018-11-27 10:12:16 739259][48348][rpc_sender.h:SetUpTestCase:61][========SetUpTestCase========]
[NOTICE][2018-11-27 10:12:16 739709][48348][rpc_sender.h:RpcTest:175][========RpcTest========]
[NOTICE][2018-11-27 10:12:16 739736][48348][rpc_sender.h:SetUp:222][========SetUp========]
[NOTICE][2018-11-27 10:12:16 739753][48348][proxy_functional_test.cc:TestBody:338][hello consul]
I1127 10:12:16.747703 48357 src/brpc/details/naming_service_thread.cpp:200] brpc::policy::DomainNamingService("172.25.42.17:8500"): added 1
W1127 10:12:16.749762 48348 src/brpc/details/naming_service_thread.cpp:303] `consul://proxy' is empty! RPC over the channel will fail until servers appear

gydong (Contributor) commented Nov 27, 2018

Reproduced. We will look into the cause; thanks for the report.

cdjingit (Contributor) commented Nov 27, 2018

The cause is that when ~NamingServiceThread runs, the consul ns thread does not return after bthread_stop(_tid). The branch below has no point at which it checks that the thread has been stopped and returns. One fix would be to sleep for a while. [screenshot of the branch not preserved]

cdjingit (Contributor) commented Nov 27, 2018

Alternatively, check errno == EINTR after GetServers; it indicates that the CallMethod join inside GetServers failed and the bthread is exiting.

brianjcj (Author) commented:

@cdjingit The implementation of FileNamingService::RunNamingService is much the same; why doesn't it hang?

cdjingit (Contributor) commented:

@brianjcj
The file naming service always performs the check below: after the bthread is stopped, bthread_usleep fails and returns, so the thread exits. (A runnable illustration of this pattern follows.)

if (bthread_usleep(100000L /*100ms*/) < 0) {
    if (errno == ESTOP) {
        return 0;
    }
    PLOG(ERROR) << "Fail to sleep";
    return -1;
}
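
To make that concrete, here is a minimal, self-contained sketch of the pattern (an editor's illustration, not brpc source; it assumes brpc/bthread headers and libraries are available, and RunLoop is a hypothetical stand-in for a naming-service loop). bthread_stop() marks the bthread as stopped and interrupts its blocking operations, so bthread_usleep() returns -1 with errno == ESTOP and bthread_join() completes:

#include <cerrno>
#include <bthread/bthread.h>
#include <butil/logging.h>

static void* RunLoop(void*) {
    for (;;) {
        // Stand-in for one round of naming-service work. After
        // bthread_stop(), this sleep is interrupted and fails with
        // errno == ESTOP, giving the loop a way out.
        if (bthread_usleep(100000L /*100ms*/) < 0) {
            if (errno == ESTOP) {
                return NULL;  // stopped: exit so bthread_join() returns
            }
            PLOG(ERROR) << "Fail to sleep";
            return NULL;
        }
    }
}

int main() {
    bthread_t tid;
    bthread_start_background(&tid, NULL, RunLoop, NULL);
    bthread_usleep(300000L);  // let the loop run a few rounds
    bthread_stop(tid);        // set the stop flag and interrupt the sleep
    bthread_join(tid, NULL);  // returns promptly; without the ESTOP check
                              // it would hang exactly as in this issue
    return 0;
}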

zyearn (Member) commented Nov 28, 2018

consul is really just a PeriodicNamingService with a different sleep interval. Should the abstraction be reworked here? If it were unified under PeriodicNamingService, this problem would be solved naturally.

cdjingit (Contributor) commented Nov 28, 2018

The request inside GetServers is long polling. Would it be feasible to check bthread_stopped() in the loop to decide when to exit? (A sketch of this idea follows.)
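
A minimal, self-contained sketch of that suggestion (again an editor's illustration under the same assumptions; LongPollLoop is a hypothetical stand-in, not the actual consul naming-service code):

#include <bthread/bthread.h>

static void* LongPollLoop(void*) {
    // Re-check the built-in stop flag on every iteration.
    while (!bthread_stopped(bthread_self())) {
        // Stand-in for the long-polling GetServers() request; a stopped
        // bthread is woken out of blocking waits and re-checks the
        // loop condition instead of polling forever.
        bthread_usleep(50000L /*50ms*/);
    }
    return NULL;
}

int main() {
    bthread_t tid;
    bthread_start_background(&tid, NULL, LongPollLoop, NULL);
    bthread_usleep(200000L);
    bthread_stop(tid);        // sets the stopped flag and interrupts waits
    bthread_join(tid, NULL);  // returns because the loop condition now fails
    return 0;
}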

gydong (Contributor) commented Nov 28, 2018

#586

yitian134 commented:

Chiming in from the sidelines: checking inside the running function whether one's own bthread has been stopped looks a bit odd. By that analogy every loop would have to check for stop; handling it with something like a signal seems cleaner.

jamesge (Contributor) commented Dec 6, 2018

@yitian134 This amounts to a built-in stop flag; users can also use a stop flag of their own. The basic way to end a thread is still to set a stop flag and then interrupt the thread.

jamesge (Contributor) commented Dec 6, 2018

@brianjcj Please verify whether the code on master has resolved your problem.

brianjcj (Author) commented:

It works now.

htner commented Aug 26, 2023

May I ask: this problem still has not been fixed, right?

chenBright (Contributor) commented:

#586 fixed it.
