Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GRpc No match found for server name cause TiFlash crash #8535

Open
aytrack opened this issue Dec 15, 2023 · 6 comments · Fixed by #9068
Open

GRpc No match found for server name cause TiFlash crash #8535

aytrack opened this issue Dec 15, 2023 · 6 comments · Fixed by #9068

Comments

@aytrack
Copy link

aytrack commented Dec 15, 2023

Bug Report

Please answer these questions before submitting your issue. Thanks!

1. Minimal reproduce step (Required)

[2023/12/13 03:09:58.571 +00:00] [ERROR] [MPPTask.cpp:469] ["task running meets error: Code: 49, e.displayText() = DB::Exception: Check tunnel != nullptr failed: tunnel4+10: Can't find task [446283641820020850,4] within 10s., e.what() = DB::Exception, Stack trace:\n\n\n       0x1c54d3e\tDB::Exception::Exception(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, int) [tiflash+29707582]\n                \tdbms/src/Common/Exception.h:46\n       0x82636b6\tDB::GRPCReceiverContext::establishMPPConnectionLocalV2(DB::ExchangeRecvRequest const&, unsigned long, DB::LocalRequestHandler&, bool, bool) [tiflash+136722102]\n                \tdbms/src/Flash/Mpp/GRPCReceiverContext.cpp:296\n       0x1c14f84\tDB::ExchangeReceiverBase<DB::GRPCReceiverContext>::ExchangeReceiverBase(std::__1::shared_ptr<DB::GRPCReceiverContext>, unsigned long, unsigned long, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, unsigned long, int, std::__1::vector<std::__1::tuple<std::__1::shared_ptr<mpp::DispatchTaskRequest>, std::__1::vector<pingcap::kv::RegionVerID, std::__1::allocator<pingcap::kv::RegionVerID> >, unsigned long>, std::__1::allocator<std::__1::tuple<std::__1::shared_ptr<mpp::DispatchTaskRequest>, std::__1::vector<pingcap::kv::RegionVerID, std::__1::allocator<pingcap::kv::RegionVerID> >, unsigned long> > > const&) [tiflash+29446020]\n                \tdbms/src/Flash/Mpp/ExchangeReceiver.cpp:338\n       0x827e991\tstd::__1::__function::__func<DB::MPPTask::initExchangeReceivers()::$_18, std::__1::allocator<DB::MPPTask::initExchangeReceivers()::$_18>, bool (tipb::Executor const&)>::operator()(tipb::Executor const&) [tiflash+136833425]\n                \t/usr/local/bin/../include/c++/v1/__functional/function.h:345\n       0x8189682\tvoid DB::traverseExecutorTree<std::__1::function<bool (tipb::Executor const&)>&>(tipb::Executor const&, std::__1::function<bool (tipb::Executor const&)>&)::'lambda'(tipb::Executor const&)::operator()(tipb::Executor const&) const [tiflash+135829122]\n                \tdbms/src/Flash/Statistics/traverseExecutors.h:79\n       0x8189186\tvoid DB::traverseExecutors<std::__1::function<bool (tipb::Executor const&)> >(tipb::DAGRequest const*, std::__1::function<bool (tipb::Executor const&)>&&) [tiflash+135827846]\n                \tdbms/src/Flash/Statistics/traverseExecutors.h:94\n       0x8279898\tDB::MPPTask::runImpl() [tiflash+136812696]\n                \tdbms/src/Flash/Mpp/MPPTask.cpp:389\n       0x1d60168\tauto DB::wrapInvocable<std::__1::function<void ()> >(bool, std::__1::function<void ()>&&)::'lambda'()::operator()() [tiflash+30802280]\n                \tdbms/src/Common/wrapInvocable.h:36\n       0x1c64c53\tDB::DynamicThreadPool::executeTask(std::__1::unique_ptr<DB::IExecutableTask, std::__1::default_delete<DB::IExecutableTask> >&) [tiflash+29772883]\n                \tdbms/src/Common/DynamicThreadPool.cpp:101\n       0x1c642b4\tDB::DynamicThreadPool::fixedWork(unsigned long) [tiflash+29770420]\n                \tdbms/src/Common/DynamicThreadPool.cpp:115\n       0x1c657c2\tvoid* std::__1::__thread_proxy<std::__1::tuple<std::__1::unique_ptr<std::__1::__thread_struct, std::__1::default_delete<std::__1::__thread_struct> >, std::__1::thread DB::ThreadFactory::newThread<void (DB::DynamicThreadPool::*)(unsigned long), DB::DynamicThreadPool*, unsigned long&>(bool, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >, void (DB::DynamicThreadPool::*&&)(unsigned long), DB::DynamicThreadPool*&&, unsigned long&)::'lambda'(auto&&...), DB::DynamicThreadPool*, unsigned long> >(void*) [tiflash+29775810]\n                \t/usr/local/bin/../include/c++/v1/thread:291\n  0x7f1b09d5a802\tstart_thread [libc.so.6+653314]\n  0x7f1b09cfa450\t__GI___clone3 [libc.so.6+259152]"] [source="MPP<query:<query_ts:1702436988566420096, local_query_id:18097, server_id:1085732, start_ts:446283641820020850>,task_id:10>"] [thread_id=5707]
[2023/12/13 03:09:58.571 +00:00] [WARN] [MPPTaskManager.cpp:155] ["Begin to abort query: <query_ts:1702436988566420096, local_query_id:18097, server_id:1085732, start_ts:446283641820020850>, abort type: ONERROR, reason: From MPP<query:<query_ts:1702436988566420096, local_query_id:18097, server_id:1085732, start_ts:446283641820020850>,task_id:10>: Code: 49, e.displayText() = DB::Exception: Check tunnel != nullptr failed: tunnel4+10: Can't find task [446283641820020850,4] within 10s., e.what() = DB::Exception,"] [thread_id=5707]
[2023/12/13 03:09:58.571 +00:00] [WARN] [MPPTask.cpp:511] ["Begin abort task: MPP<query:<query_ts:1702436988566420096, local_query_id:18097, server_id:1085732, start_ts:446283641820020850>,task_id:10>, abort type: ONERROR"] [source="MPP<query:<query_ts:1702436988566420096, local_query_id:18097, server_id:1085732, start_ts:446283641820020850>,task_id:10>"] [thread_id=5707]
[2023/12/13 03:10:47.239 +00:00] [ERROR] [MPPTask.cpp:469] ["task running meets error: Code: 0, e.displayText() = DB::Exception: tunnel4+-1 is timeout, e.what() = DB::Exception, Stack trace:\n\n\n       0x1c54d3e\tDB::Exception::Exception(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, int) [tiflash+29707582]\n                \tdbms/src/Common/Exception.h:46\n       0x829220e\tDB::MPPTunnel::waitUntilConnectedOrFinished(std::__1::unique_lock<std::__1::mutex>&) [tiflash+136913422]\n                \tdbms/src/Flash/Mpp/MPPTunnel.cpp:343\n       0x8291918\tDB::MPPTunnel::write(std::__1::shared_ptr<DB::TrackedMppDataPacket>&&) [tiflash+136911128]\n                \tdbms/src/Flash/Mpp/MPPTunnel.cpp:163\n       0x1c2406b\tDB::MPPTunnelSetBase<DB::MPPTunnel>::sendExecutionSummary(tipb::SelectResponse const&) [tiflash+29507691]\n                \tdbms/src/Flash/Mpp/MPPTunnelSet.cpp:39\n       0x827af34\tDB::MPPTask::runImpl() [tiflash+136818484]\n                \tdbms/src/Flash/Mpp/MPPTask.cpp:421\n       0x1d60168\tauto DB::wrapInvocable<std::__1::function<void ()> >(bool, std::__1::function<void ()>&&)::'lambda'()::operator()() [tiflash+30802280]\n                \tdbms/src/Common/wrapInvocable.h:36\n       0x1c64c53\tDB::DynamicThreadPool::executeTask(std::__1::unique_ptr<DB::IExecutableTask, std::__1::default_delete<DB::IExecutableTask> >&) [tiflash+29772883]\n                \tdbms/src/Common/DynamicThreadPool.cpp:101\n       0x1c642b4\tDB::DynamicThreadPool::fixedWork(unsigned long) [tiflash+29770420]\n                \tdbms/src/Common/DynamicThreadPool.cpp:115\n       0x1c657c2\tvoid* std::__1::__thread_proxy<std::__1::tuple<std::__1::unique_ptr<std::__1::__thread_struct, std::__1::default_delete<std::__1::__thread_struct> >, std::__1::thread DB::ThreadFactory::newThread<void (DB::DynamicThreadPool::*)(unsigned long), DB::DynamicThreadPool*, unsigned long&>(bool, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >, void (DB::DynamicThreadPool::*&&)(unsigned long), DB::DynamicThreadPool*&&, unsigned long&)::'lambda'(auto&&...), DB::DynamicThreadPool*, unsigned long> >(void*) [tiflash+29775810]\n                \t/usr/local/bin/../include/c++/v1/thread:291\n  0x7f1b09d5a802\tstart_thread [libc.so.6+653314]\n  0x7f1b09cfa450\t__GI___clone3 [libc.so.6+259152]"] [source="MPP<query:<query_ts:1702436987230751199, local_query_id:17967, server_id:1085732, start_ts:446283641570984098>,task_id:4>"] [thread_id=6468]
[2023/12/13 03:10:47.239 +00:00] [WARN] [MPPTaskManager.cpp:155] ["Begin to abort query: <query_ts:1702436987230751199, local_query_id:17967, server_id:1085732, start_ts:446283641570984098>, abort type: ONERROR, reason: From MPP<query:<query_ts:1702436987230751199, local_query_id:17967, server_id:1085732, start_ts:446283641570984098>,task_id:4>: Code: 0, e.displayText() = DB::Exception: tunnel4+-1 is timeout, e.what() = DB::Exception,"] [thread_id=6468]
[2023/12/13 03:10:47.239 +00:00] [WARN] [MPPTask.cpp:511] ["Begin abort task: MPP<query:<query_ts:1702436987230751199, local_query_id:17967, server_id:1085732, start_ts:446283641570984098>,task_id:4>, abort type: ONERROR"] [source="MPP<query:<query_ts:1702436987230751199, local_query_id:17967, server_id:1085732, start_ts:446283641570984098>,task_id:4>"] [thread_id=6468]
[2023/12/13 04:11:31.597 +00:00] [ERROR] [BaseDaemon.cpp:376] [########################################] [source=BaseDaemon] [thread_id=6514]
[2023/12/13 04:11:31.597 +00:00] [ERROR] [BaseDaemon.cpp:377] ["(from thread 5804) Received signal Segmentation fault(11)."] [source=BaseDaemon] [thread_id=6514]
[2023/12/13 04:11:31.597 +00:00] [ERROR] [BaseDaemon.cpp:407] ["Address: 0x7f1970bfbb"] [source=BaseDaemon] [thread_id=6514]
[2023/12/13 04:11:31.597 +00:00] [ERROR] [BaseDaemon.cpp:413] ["Access: read."] [source=BaseDaemon] [thread_id=6514]
[2023/12/13 04:11:31.597 +00:00] [ERROR] [BaseDaemon.cpp:422] ["Address not mapped to object."] [source=BaseDaemon] [thread_id=6514]
[2023/12/13 04:11:31.600 +00:00] [ERROR] [Server.cpp:379] ["/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tics/contrib/grpc/src/core/tsi/ssl_transport_security.cc, line number: 1874, log msg : No match found for server name: tc-tiflash-10.tc-tiflash-peer.comp-tiflash-tps-5131024-1-679.svc."] [source=grpc] [thread_id=5852]
[2023/12/13 04:11:31.604 +00:00] [ERROR] [BaseDaemon.cpp:569] ["\n       0x7554c51\tfaultSignalHandler(int, siginfo_t*, void*) [tiflash+123030609]\n                \tlibs/libdaemon/src/BaseDaemon.cpp:220\n  0x7f1b09d0fdb0\t<unknown symbol> [libc.so.6+347568]\n       0x8ccee29\tbio_ctrl [tiflash+147648041]\n                \tcontrib/boringssl/crypto/bio/pair.c:411\n       0x8ccd3f5\tBIO_pending [tiflash+147641333]\n                \tcontrib/boringssl/crypto/bio/bio.c:312\n       0x8c586ac\tssl_handshaker_next(tsi_handshaker*, unsigned char const*, unsigned long, unsigned char const**, unsigned long*, tsi_handshaker_result**, void (*)(tsi_result, void*, unsigned char const*, unsigned long, tsi_handshaker_result*), void*) [tiflash+147162796]\n                \tcontrib/grpc/src/core/tsi/ssl_transport_security.cc:1568\n       0x8b0568b\tgrpc_core::(anonymous namespace)::SecurityHandshaker::OnHandshakeDataReceivedFromPeerFn(void*, grpc_error*) [tiflash+145774219]\n                \tcontrib/grpc/src/core/lib/security/transport/security_handshaker.cc:463\n       0x8aa3141\tpollset_work(grpc_pollset*, grpc_pollset_worker**, long) [tiflash+145371457]\n                \tcontrib/grpc/src/core/lib/iomgr/ev_epollex_linux.cc:1139\n       0x8b20fae\tcq_next(grpc_completion_queue*, gpr_timespec, void*) [tiflash+145887150]\n                \tcontrib/grpc/src/core/lib/surface/completion_queue.cc:1047\n       0x1e8af99\tDB::(anonymous namespace)::handleRpcs(grpc::ServerCompletionQueue*, std::__1::shared_ptr<DB::Logger> const&) [tiflash+32026521]\n                \tdbms/src/Server/FlashGrpcServerHolder.cpp:52\n       0x1e8ab7d\tvoid* std::__1::__thread_proxy<std::__1::tuple<std::__1::unique_ptr<std::__1::__thread_struct, std::__1::default_delete<std::__1::__thread_struct> >, std::__1::thread DB::ThreadFactory::newThread<DB::FlashGrpcServerHolder::FlashGrpcServerHolder(DB::Context&, Poco::Util::LayeredConfiguration&, DB::TiFlashRaftConfig const&, std::__1::shared_ptr<DB::Logger> const&)::$_5>(bool, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >, DB::FlashGrpcServerHolder::FlashGrpcServerHolder(DB::Context&, Poco::Util::LayeredConfiguration&, DB::TiFlashRaftConfig const&, std::__1::shared_ptr<DB::Logger> const&)::$_5&&)::'lambda'(auto&&...)> >(void*) [tiflash+32025469]\n                \t/usr/local/bin/../include/c++/v1/thread:291\n  0x7f1b09d5a802\tstart_thread [libc.so.6+653314]"] [source=BaseDaemon] [thread_id=6514]
[2023/12/13 04:11:31.604 +00:00] [ERROR] [BaseDaemon.cpp:376] [########################################] [source=BaseDaemon] [thread_id=6514]
[2023/12/13 04:11:31.604 +00:00] [ERROR] [BaseDaemon.cpp:377] ["(from thread 5852) Received signal Segmentation fault(11)."] [source=BaseDaemon] [thread_id=6514]
[2023/12/13 04:11:31.604 +00:00] [ERROR] [BaseDaemon.cpp:407] ["Address: 0x4400"] [source=BaseDaemon] [thread_id=6514]
[2023/12/13 04:11:31.604 +00:00] [ERROR] [BaseDaemon.cpp:413] ["Access: read."] [source=BaseDaemon] [thread_id=6514]
[2023/12/13 04:11:31.604 +00:00] [ERROR] [BaseDaemon.cpp:422] ["Address not mapped to object."] [source=BaseDaemon] [thread_id=6514]
[2023/12/13 04:11:31.609 +00:00] [ERROR] [BaseDaemon.cpp:569] ["\n       0x7554c51\tfaultSignalHandler(int, siginfo_t*, void*) [tiflash+123030609]\n                \tlibs/libdaemon/src/BaseDaemon.cpp:220\n  0x7f1b09d0fdb0\t<unknown symbol> [libc.so.6+347568]\n       0x8c57503\tserver_handshaker_factory_alpn_callback(ssl_st*, unsigned char const**, unsigned char*, unsigned char const*, unsigned int, void*) [tiflash+147158275]\n                \tcontrib/grpc/src/core/tsi/ssl_transport_security.cc:1884\n       0x8c803ae\tbssl::ssl_negotiate_alpn(bssl::SSL_HANDSHAKE*, unsigned char*, ssl_early_callback_ctx const*) [tiflash+147325870]\n                \tcontrib/boringssl/ssl/extensions.cc:1570\n       0x8cbf426\tbssl::tls13_server_handshake(bssl::SSL_HANDSHAKE*) [tiflash+147584038]\n                \tcontrib/boringssl/ssl/tls13_server.cc:1242\n       0x8c9e8dc\tbssl::ssl_server_handshake(bssl::SSL_HANDSHAKE*) [tiflash+147450076]\n                \tcontrib/boringssl/ssl/handshake_server.cc:1835\n       0x8c9018a\tbssl::ssl_run_handshake(bssl::SSL_HANDSHAKE*, bool*) [tiflash+147390858]\n                \tcontrib/boringssl/ssl/handshake.cc:738\n       0x8caae2a\tSSL_do_handshake [tiflash+147500586]\n                \tcontrib/boringssl/ssl/ssl_lib.cc:841\n       0x8c58680\tssl_handshaker_next(tsi_handshaker*, unsigned char const*, unsigned long, unsigned char const**, unsigned long*, tsi_handshaker_result**, void (*)(tsi_result, void*, unsigned char const*, unsigned long, tsi_handshaker_result*), void*) [tiflash+147162752]\n                \tcontrib/grpc/src/core/tsi/ssl_transport_security.cc:1568\n       0x8b0568b\tgrpc_core::(anonymous namespace)::SecurityHandshaker::OnHandshakeDataReceivedFromPeerFn(void*, grpc_error*) [tiflash+145774219]\n                \tcontrib/grpc/src/core/lib/security/transport/security_handshaker.cc:463\n       0x8aa3141\tpollset_work(grpc_pollset*, grpc_pollset_worker**, long) [tiflash+145371457]\n                \tcontrib/grpc/src/core/lib/iomgr/ev_epollex_linux.cc:1139\n       0x8b20fae\tcq_next(grpc_completion_queue*, gpr_timespec, void*) [tiflash+145887150]\n                \tcontrib/grpc/src/core/lib/surface/completion_queue.cc:1047\n       0x1e8af99\tDB::(anonymous namespace)::handleRpcs(grpc::ServerCompletionQueue*, std::__1::shared_ptr<DB::Logger> const&) [tiflash+32026521]\n                \tdbms/src/Server/FlashGrpcServerHolder.cpp:52\n       0x1e8c05d\tvoid* std::__1::__thread_proxy<std::__1::tuple<std::__1::unique_ptr<std::__1::__thread_struct, std::__1::default_delete<std::__1::__thread_struct> >, std::__1::thread DB::ThreadFactory::newThread<DB::FlashGrpcServerHolder::FlashGrpcServerHolder(DB::Context&, Poco::Util::LayeredConfiguration&, DB::TiFlashRaftConfig const&, std::__1::shared_ptr<DB::Logger> const&)::$_6>(bool, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >, DB::FlashGrpcServerHolder::FlashGrpcServerHolder(DB::Context&, Poco::Util::LayeredConfiguration&, DB::TiFlashRaftConfig const&, std::__1::shared_ptr<DB::Logger> const&)::$_6&&)::'lambda'(auto&&...)> >(void*) [tiflash+32030813]\n                \t/usr/local/bin/../include/c++/v1/thread:291\n  0x7f1b09d5a802\tstart_thread [libc.so.6+653314]"] [source=BaseDaemon] [thread_id=6514]

similar case #7024 updated: not similar

2. What did you expect to see? (Required)

3. What did you see instead (Required)

4. What is your TiFlash version? (Required)

v7.1.3

@aytrack aytrack added the type/bug The issue is confirmed as a bug. label Dec 15, 2023
@Lloyd-Pottiger Lloyd-Pottiger self-assigned this Jan 10, 2024
@Lloyd-Pottiger Lloyd-Pottiger changed the title tiflash crash with Received signal Segmentation fault GRpc No match found for server name cause TiFlash crash Jan 11, 2024
@gengliqi
Copy link
Contributor

Another same case from one customer

[2024/03/08 16:38:22.237 +00:00] [ERROR] [BaseDaemon.cpp:422] ["Address not mapped to object."] [source=BaseDaemon] [thread_id=258943]
[2024/03/08 16:38:22.237 +00:00] [ERROR] [BaseDaemon.cpp:407] ["Address: 0x4400"] [source=BaseDaemon] [thread_id=258943]
[2024/03/08 16:38:22.237 +00:00] [ERROR] [BaseDaemon.cpp:377] ["(from thread 4908) Received signal Segmentation fault(11)."] [source=BaseDaemon] [thread_id=258943]
[2024/03/08 16:38:22.234 +00:00] [ERROR] [BaseDaemon.cpp:376] [########################################] [source=BaseDaemon] [thread_id=258943]
[2024/03/08 16:38:22.254 +00:00] [ERROR] [BaseDaemon.cpp:569] ["
       0x68b831c    faultSignalHandler(int, siginfo_t*, void*) [tiflash+109806364]
                    libs/libdaemon/src/BaseDaemon.cpp:220
  0xffffbbe3881c    <unknown symbol> [linux-vdso.so.1+2076]
       0x7d59d20    server_handshaker_factory_alpn_callback(ssl_st*, unsigned char const**, unsigned char*, unsigned char const*, unsigned int, void*) [tiflash+131439904]
                    contrib/grpc/src/core/tsi/ssl_transport_security.cc:1884
       0x7d80490    bssl::ssl_negotiate_alpn(bssl::SSL_HANDSHAKE*, unsigned char*, ssl_early_callback_ctx const*) [tiflash+131597456]
                    contrib/boringssl/ssl/extensions.cc:1570
       0x7dbae28    bssl::tls13_server_handshake(bssl::SSL_HANDSHAKE*) [tiflash+131837480]
                    contrib/boringssl/ssl/tls13_server.cc:1242
       0x7d9d050    bssl::ssl_server_handshake(bssl::SSL_HANDSHAKE*) [tiflash+131715152]
                    contrib/boringssl/ssl/handshake_server.cc:1835
       0x7d90104    bssl::ssl_run_handshake(bssl::SSL_HANDSHAKE*, bool*) [tiflash+131662084]
                    contrib/boringssl/ssl/handshake.cc:738
       0x7da810c    SSL_do_handshake [tiflash+131760396]
                    contrib/boringssl/ssl/ssl_lib.cc:841
       0x7d5ae5c    ssl_handshaker_next(tsi_handshaker*, unsigned char const*, unsigned long, unsigned char const**, unsigned long*, tsi_handshaker_result**, void (*)(tsi_result, void*, unsigned char const*, unsigned long, tsi_handshaker_result*), void*) [tiflash+131444316]
                    contrib/grpc/src/core/tsi/ssl_transport_security.cc:1568
       0x7c21db0    grpc_core::(anonymous namespace)::SecurityHandshaker::OnHandshakeDataReceivedFromPeerFn(void*, grpc_error*) [tiflash+130162096]
                    contrib/grpc/src/core/lib/security/transport/security_handshaker.cc:463
       0x7bbf500    pollset_work(grpc_pollset*, grpc_pollset_worker**, long) [tiflash+129758464]
                    contrib/grpc/src/core/lib/iomgr/ev_epollex_linux.cc:1139
       0x7c3afb8    cq_next(grpc_completion_queue*, gpr_timespec, void*) [tiflash+130265016]
                    contrib/grpc/src/core/lib/surface/completion_queue.cc:1047
       0x1b0e8c0    DB::(anonymous namespace)::handleRpcs(grpc::ServerCompletionQueue*, std::__1::shared_ptr<DB::Logger> const&) [tiflash+28371136]
                    dbms/src/Server/FlashGrpcServerHolder.cpp:52
       0x1b0f6fc    void* std::__1::__thread_proxy<std::__1::tuple<std::__1::unique_ptr<std::__1::__thread_struct, std::__1::default_delete<std::__1::__thread_struct> >, std::__1::thread DB::ThreadFactory::newThread<DB::FlashGrpcServerHolder::FlashGrpcServerHolder(DB::Context&, Poco::Util::LayeredConfiguration&, DB::TiFlashRaftConfig const&, std::__1::shared_ptr<DB::Logger> const&)::$_6>(bool, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >, DB::FlashGrpcServerHolder::FlashGrpcServerHolder(DB::Context&, Poco::Util::LayeredConfiguration&, DB::TiFlashRaftConfig const&, std::__1::shared_ptr<DB::Logger> const&)::$_6&&)::'lambda'(auto&&...)> >(void*) [tiflash+28374780]
                    /usr/local/bin/../include/c++/v1/thread:291
  0xffffb7da6a28    start_thread [libc.so.6+535080]"] [source=BaseDaemon] [thread_id=258943]

@JaySon-Huang
Copy link
Contributor

Seems this is an issue with the ssl lib. And we can't locate the root cause of it so far.

@JaySon-Huang
Copy link
Contributor

Another similar case observed

[2024/03/26 16:51:20.125 +00:00] [ERROR] [Server.cpp:379] ["/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tics/contrib/grpc/src/core/tsi/ssl_transport_security.cc, line number: 1874, log msg : No match found for server name: db-tiflash-3.db-tiflash-peer.tidb1379xxx5254.svc."] [source=grpc] [thread_id=5588]
[2024/03/26 16:51:20.125 +00:00] [ERROR] [BaseDaemon.cpp:376] [########################################] [source=BaseDaemon] [thread_id=11240]
[2024/03/26 16:51:20.127 +00:00] [ERROR] [BaseDaemon.cpp:377] ["(from thread 5588) Received signal Segmentation fault(11)."] [source=BaseDaemon] [thread_id=11240]
[2024/03/26 16:51:20.127 +00:00] [ERROR] [BaseDaemon.cpp:407] ["Address: 0x4400"] [source=BaseDaemon] [thread_id=11240]
[2024/03/26 16:51:20.127 +00:00] [ERROR] [BaseDaemon.cpp:422] ["Address not mapped to object."] [source=BaseDaemon] [thread_id=11240]
  0xffff94bc0a28    start_thread [libc.so.6+535080]"] [source=BaseDaemon] [thread_id=11240]
                    /usr/local/bin/../include/c++/v1/thread:291
       0x1b0f6fc    void* std::__1::__thread_proxy<std::__1::tuple<std::__1::unique_ptr<std::__1::__thread_struct, std::__1::default_delete<std::__1::__thread_struct> >, std::__1::thread DB::ThreadFactory::newThread<DB::FlashGrpcServerHolder::FlashGrpcServerHolder(DB::Context&, Poco::Util::LayeredConfiguration&, DB::TiFlashRaftConfig const&, std::__1::shared_ptr<DB::Logger> const&)::$_6>(bool, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >, DB::FlashGrpcServerHolder::FlashGrpcServerHolder(DB::Context&, Poco::Util::LayeredConfiguration&, DB::TiFlashRaftConfig const&, std::__1::shared_ptr<DB::Logger> const&)::$_6&&)::'lambda'(auto&&...)> >(void*) [tiflash+28374780]
                    dbms/src/Server/FlashGrpcServerHolder.cpp:52
       0x1b0e8c0    DB::(anonymous namespace)::handleRpcs(grpc::ServerCompletionQueue*, std::__1::shared_ptr<DB::Logger> const&) [tiflash+28371136]
                    contrib/grpc/src/core/lib/surface/completion_queue.cc:1047
       0x7c3afb8    cq_next(grpc_completion_queue*, gpr_timespec, void*) [tiflash+130265016]
                    contrib/grpc/src/core/lib/iomgr/ev_epollex_linux.cc:1139
       0x7bbf500    pollset_work(grpc_pollset*, grpc_pollset_worker**, long) [tiflash+129758464]
                    contrib/grpc/src/core/lib/security/transport/security_handshaker.cc:463
       0x7c21db0    grpc_core::(anonymous namespace)::SecurityHandshaker::OnHandshakeDataReceivedFromPeerFn(void*, grpc_error*) [tiflash+130162096]
                    contrib/grpc/src/core/tsi/ssl_transport_security.cc:1568
       0x7d5ae5c    ssl_handshaker_next(tsi_handshaker*, unsigned char const*, unsigned long, unsigned char const**, unsigned long*, tsi_handshaker_result**, void (*)(tsi_result, void*, unsigned char const*, unsigned long, tsi_handshaker_result*), void*) [tiflash+131444316]
                    contrib/boringssl/ssl/ssl_lib.cc:841
       0x7da810c    SSL_do_handshake [tiflash+131760396]
                    contrib/boringssl/ssl/handshake.cc:738
       0x7d90104    bssl::ssl_run_handshake(bssl::SSL_HANDSHAKE*, bool*) [tiflash+131662084]
                    contrib/boringssl/ssl/handshake_server.cc:1835
       0x7d9d050    bssl::ssl_server_handshake(bssl::SSL_HANDSHAKE*) [tiflash+131715152]
                    contrib/boringssl/ssl/tls13_server.cc:1242
       0x7dbae28    bssl::tls13_server_handshake(bssl::SSL_HANDSHAKE*) [tiflash+131837480]
                    contrib/boringssl/ssl/extensions.cc:1570
       0x7d80490    bssl::ssl_negotiate_alpn(bssl::SSL_HANDSHAKE*, unsigned char*, ssl_early_callback_ctx const*) [tiflash+131597456]
                    contrib/grpc/src/core/tsi/ssl_transport_security.cc:1884
       0x7d59d20    server_handshaker_factory_alpn_callback(ssl_st*, unsigned char const**, unsigned char*, unsigned char const*, unsigned int, void*) [tiflash+131439904]
  0xffff98c52800    <unknown symbol> [linux-vdso.so.1+2048]
                    libs/libdaemon/src/BaseDaemon.cpp:220
       0x68b831c    faultSignalHandler(int, siginfo_t*, void*) [tiflash+109806364]
[2024/03/26 16:51:20.141 +00:00] [ERROR] [BaseDaemon.cpp:569] ["
[2024/03/26 16:51:20.141 +00:00] [ERROR] [BaseDaemon.cpp:376] [########################################] [source=BaseDaemon] [thread_id=11240]
[2024/03/26 16:51:20.141 +00:00] [ERROR] [BaseDaemon.cpp:377] ["(from thread 5322) Received signal Segmentation fault(11)."] [source=BaseDaemon] [thread_id=11240]
[2024/03/26 16:51:20.141 +00:00] [ERROR] [BaseDaemon.cpp:407] ["Address: 0xfff52c58f2"] [source=BaseDaemon] [thread_id=11240]
[2024/03/26 16:51:20.141 +00:00] [ERROR] [BaseDaemon.cpp:422] ["Address not mapped to object."] [source=BaseDaemon] [thread_id=11240]
  0xffff94bc0a28    start_thread [libc.so.6+535080]"] [source=BaseDaemon] [thread_id=11240]
                    /usr/local/bin/../include/c++/v1/thread:291
       0x1b0e444    void* std::__1::__thread_proxy<std::__1::tuple<std::__1::unique_ptr<std::__1::__thread_struct, std::__1::default_delete<std::__1::__thread_struct> >, std::__1::thread DB::ThreadFactory::newThread<DB::FlashGrpcServerHolder::FlashGrpcServerHolder(DB::Context&, Poco::Util::LayeredConfiguration&, DB::TiFlashRaftConfig const&, std::__1::shared_ptr<DB::Logger> const&)::$_5>(bool, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >, DB::FlashGrpcServerHolder::FlashGrpcServerHolder(DB::Context&, Poco::Util::LayeredConfiguration&, DB::TiFlashRaftConfig const&, std::__1::shared_ptr<DB::Logger> const&)::$_5&&)::'lambda'(auto&&...)> >(void*) [tiflash+28369988]
                    dbms/src/Server/FlashGrpcServerHolder.cpp:52
       0x1b0e8c0    DB::(anonymous namespace)::handleRpcs(grpc::ServerCompletionQueue*, std::__1::shared_ptr<DB::Logger> const&) [tiflash+28371136]
                    contrib/grpc/src/core/lib/surface/completion_queue.cc:1047
       0x7c3afb8    cq_next(grpc_completion_queue*, gpr_timespec, void*) [tiflash+130265016]
                    contrib/grpc/src/core/lib/iomgr/ev_epollex_linux.cc:1139
       0x7bbf500    pollset_work(grpc_pollset*, grpc_pollset_worker**, long) [tiflash+129758464]
                    contrib/grpc/src/core/lib/security/transport/security_handshaker.cc:463
       0x7c21db0    grpc_core::(anonymous namespace)::SecurityHandshaker::OnHandshakeDataReceivedFromPeerFn(void*, grpc_error*) [tiflash+130162096]
                    contrib/grpc/src/core/tsi/ssl_transport_security.cc:1576
       0x7d5aee8    ssl_handshaker_next(tsi_handshaker*, unsigned char const*, unsigned long, unsigned char const**, unsigned long*, tsi_handshaker_result**, void (*)(tsi_result, void*, unsigned char const*, unsigned long, tsi_handshaker_result*), void*) [tiflash+131444456]
                    contrib/boringssl/crypto/bio/pair.c:158
       0x7dc976c    bio_read [tiflash+131897196]
  0xffff98c52800    <unknown symbol> [linux-vdso.so.1+2048]
                    libs/libdaemon/src/BaseDaemon.cpp:220
       0x68b831c    faultSignalHandler(int, siginfo_t*, void*) [tiflash+109806364]
[2024/03/26 16:51:20.142 +00:00] [ERROR] [BaseDaemon.cpp:569] ["

@JaySon-Huang
Copy link
Contributor

Each time we can see a similar logging like .../grpc/src/core/tsi/ssl_transport_security.cc, line number: 1874, log msg : No match found for server name: db-tiflash-0.db-tiflash-peer.tidb?????????.svc."]

[2024/05/19 22:30:24.142 +00:00] [ERROR] [BaseDaemon.cpp:569] ["
       0x68c2c3c    faultSignalHandler(int, siginfo_t*, void*) [tiflash+109849660]
                    libs/libdaemon/src/BaseDaemon.cpp:220
  0xffffb0044820    <unknown symbol> [linux-vdso.so.1+2080]
       0x7de1824    bio_ctrl [tiflash+131995684]
                    contrib/boringssl/crypto/bio/pair.c:411
       0x7ddfb3c    BIO_pending [tiflash+131988284]
                    contrib/boringssl/crypto/bio/bio.c:312
       0x7d72d94    ssl_handshaker_next(tsi_handshaker*, unsigned char const*, unsigned long, unsigned char const**, unsigned long*, tsi_handshaker_result**, void (*)(tsi_result, void*, unsigned char const*, unsigned long, tsi_handshaker_result*), void*) [tiflash+131542420]
                    contrib/grpc/src/core/tsi/ssl_transport_security.cc:1568
       0x7c39cbc    grpc_core::(anonymous namespace)::SecurityHandshaker::OnHandshakeDataReceivedFromPeerFn(void*, grpc_error*) [tiflash+130260156]
                    contrib/grpc/src/core/lib/security/transport/security_handshaker.cc:463
       0x7bd740c    pollset_work(grpc_pollset*, grpc_pollset_worker**, long) [tiflash+129856524]
                    contrib/grpc/src/core/lib/iomgr/ev_epollex_linux.cc:1139
       0x7c52ec4    cq_next(grpc_completion_queue*, gpr_timespec, void*) [tiflash+130363076]
                    contrib/grpc/src/core/lib/surface/completion_queue.cc:1047
       0x1b14598    DB::(anonymous namespace)::handleRpcs(grpc::ServerCompletionQueue*, std::__1::shared_ptr<DB::Logger> const&) [tiflash+28394904]
                    dbms/src/Server/FlashGrpcServerHolder.cpp:52
       0x1b153d4    void* std::__1::__thread_proxy<std::__1::tuple<std::__1::unique_ptr<std::__1::__thread_struct, std::__1::default_delete<std::__1::__thread_struct> >, std::__1::thread DB::ThreadFactory::newThread<DB::FlashGrpcServerHolder::FlashGrpcServerHolder(DB::Context&, Poco::Util::LayeredConfiguration&, DB::TiFlashRaftConfig const&, std::__1::shared_ptr<DB::Logger> const&)::$_6>(bool, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >, DB::FlashGrpcServerHolder::FlashGrpcServerHolder(DB::Context&, Poco::Util::LayeredConfiguration&, DB::TiFlashRaftConfig const&, std::__1::shared_ptr<DB::Logger> const&)::$_6&&)::'lambda'(auto&&...)> >(void*) [tiflash+28398548]
                    /usr/local/bin/../include/c++/v1/thread:291
  0xffffabfb2a38    start_thread [libc.so.6+535096]"] [source=BaseDaemon] [thread_id=299017]Show context
[2024/05/19 22:30:24.142 +00:00] [ERROR] [BaseDaemon.cpp:422] ["Address not mapped to object."] [source=BaseDaemon] [thread_id=299017]
[2024/05/19 22:30:24.142 +00:00] [ERROR] [BaseDaemon.cpp:407] ["Address: 0xffdf201318"] [source=BaseDaemon] [thread_id=299017]
[2024/05/19 22:30:24.142 +00:00] [ERROR] [BaseDaemon.cpp:377] ["(from thread 2417) Received signal Segmentation fault(11)."] [source=BaseDaemon] [thread_id=299017]
[2024/05/19 22:30:24.142 +00:00] [ERROR] [BaseDaemon.cpp:376] [########################################] [source=BaseDaemon] [thread_id=299017]
[2024/05/19 22:30:24.142 +00:00] [ERROR] [BaseDaemon.cpp:569] ["
       0x68c2c3c    faultSignalHandler(int, siginfo_t*, void*) [tiflash+109849660]
                    libs/libdaemon/src/BaseDaemon.cpp:220
  0xffffb0044820    <unknown symbol> [linux-vdso.so.1+2080]
       0x7d71c30    server_handshaker_factory_alpn_callback(ssl_st*, unsigned char const**, unsigned char*, unsigned char const*, unsigned int, void*) [tiflash+131537968]
                    contrib/grpc/src/core/tsi/ssl_transport_security.cc:1884
       0x7d983a0    bssl::ssl_negotiate_alpn(bssl::SSL_HANDSHAKE*, unsigned char*, ssl_early_callback_ctx const*) [tiflash+131695520]
                    contrib/boringssl/ssl/extensions.cc:1570
       0x7dd2d38    bssl::tls13_server_handshake(bssl::SSL_HANDSHAKE*) [tiflash+131935544]
                    contrib/boringssl/ssl/tls13_server.cc:1242
       0x7db4f60    bssl::ssl_server_handshake(bssl::SSL_HANDSHAKE*) [tiflash+131813216]
                    contrib/boringssl/ssl/handshake_server.cc:1835
       0x7da8014    bssl::ssl_run_handshake(bssl::SSL_HANDSHAKE*, bool*) [tiflash+131760148]
                    contrib/boringssl/ssl/handshake.cc:738
       0x7dc001c    SSL_do_handshake [tiflash+131858460]
                    contrib/boringssl/ssl/ssl_lib.cc:841
       0x7d72d6c    ssl_handshaker_next(tsi_handshaker*, unsigned char const*, unsigned long, unsigned char const**, unsigned long*, tsi_handshaker_result**, void (*)(tsi_result, void*, unsigned char const*, unsigned long, tsi_handshaker_result*), void*) [tiflash+131542380]
                    contrib/grpc/src/core/tsi/ssl_transport_security.cc:1568
       0x7c39cbc    grpc_core::(anonymous namespace)::SecurityHandshaker::OnHandshakeDataReceivedFromPeerFn(void*, grpc_error*) [tiflash+130260156]
                    contrib/grpc/src/core/lib/security/transport/security_handshaker.cc:463
       0x7bd740c    pollset_work(grpc_pollset*, grpc_pollset_worker**, long) [tiflash+129856524]
                    contrib/grpc/src/core/lib/iomgr/ev_epollex_linux.cc:1139
       0x7c52ec4    cq_next(grpc_completion_queue*, gpr_timespec, void*) [tiflash+130363076]
                    contrib/grpc/src/core/lib/surface/completion_queue.cc:1047
       0x1b14598    DB::(anonymous namespace)::handleRpcs(grpc::ServerCompletionQueue*, std::__1::shared_ptr<DB::Logger> const&) [tiflash+28394904]
                    dbms/src/Server/FlashGrpcServerHolder.cpp:52
       0x1b1411c    void* std::__1::__thread_proxy<std::__1::tuple<std::__1::unique_ptr<std::__1::__thread_struct, std::__1::default_delete<std::__1::__thread_struct> >, std::__1::thread DB::ThreadFactory::newThread<DB::FlashGrpcServerHolder::FlashGrpcServerHolder(DB::Context&, Poco::Util::LayeredConfiguration&, DB::TiFlashRaftConfig const&, std::__1::shared_ptr<DB::Logger> const&)::$_5>(bool, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >, DB::FlashGrpcServerHolder::FlashGrpcServerHolder(DB::Context&, Poco::Util::LayeredConfiguration&, DB::TiFlashRaftConfig const&, std::__1::shared_ptr<DB::Logger> const&)::$_5&&)::'lambda'(auto&&...)> >(void*) [tiflash+28393756]
                    /usr/local/bin/../include/c++/v1/thread:291
  0xffffabfb2a38    start_thread [libc.so.6+535096]"] [source=BaseDaemon] [thread_id=299017]
[2024/05/19 22:30:24.132 +00:00] [ERROR] [BaseDaemon.cpp:422] ["Address not mapped to object."] [source=BaseDaemon] [thread_id=299017]
[2024/05/19 22:30:24.132 +00:00] [ERROR] [BaseDaemon.cpp:407] ["Address: 0x4400"] [source=BaseDaemon] [thread_id=299017]
[2024/05/19 22:30:24.132 +00:00] [ERROR] [BaseDaemon.cpp:377] ["(from thread 2787) Received signal Segmentation fault(11)."] [source=BaseDaemon] [thread_id=299017]
[2024/05/19 22:30:24.132 +00:00] [ERROR] [BaseDaemon.cpp:376] [########################################] [source=BaseDaemon] [thread_id=299017]
[2024/05/19 22:30:24.132 +00:00] [ERROR] [Server.cpp:379] ["/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tics/contrib/grpc/src/core/tsi/ssl_transport_security.cc, line number: 1874, log msg : No match found for server name: db-tiflash-0.db-tiflash-peer.tidb?????????.svc."] [source=grpc] [thread_id=2787]

@windtalker
Copy link
Contributor

The issue is introduced by #6346, and the root cause is there is data race in grpc-core when update ssh cert online: grpc/grpc#22647 (review)

@gengliqi
Copy link
Contributor

gengliqi commented May 22, 2024

The bug has been located in gRPC core. See grpc/grpc#36693.

@gengliqi gengliqi reopened this May 22, 2024
@jebter jebter added the impact/crash crash/fatal label Jun 3, 2024
@ti-chi-bot ti-chi-bot bot added report/customer Customers have encountered this bug. affects-8.5 labels Oct 28, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

8 participants