TiFlash will crash if sending more than 2GB message through grpc #3785

Closed
windtalker opened this issue Dec 31, 2021 · 2 comments · Fixed by #3803
Labels
affects-5.3 severity/critical type/bug The issue is confirmed as a bug.

Comments

@windtalker
Contributor

Bug Report

Please answer these questions before submitting your issue. Thanks!

1. Minimal reproduce step (Required)

We use gRPC and protobuf to send messages between TiFlash nodes. Protobuf has a hard limit of 2GB per message, so if a message exceeds 2GB we should throw an error. However, after #3184 the size of an MPPDataPacket can exceed 2GB, and when it does the TiFlash server crashes. A typical crash stack is:

[2021/12/28 19:54:34.978 +09:00] [ERROR] [<unknown>] ["grpc: /usr/local/include/grpcpp/impl/codegen/proto_buffer_writer.h, line number : 83, log msg : assertion failed: byte_count_ < total_size_"] [thread_id=237]
[2021/12/28 19:54:34.979 +09:00] [ERROR] [<unknown>] ["BaseDaemon: ########################################"] [thread_id=238]
[2021/12/28 19:54:34.979 +09:00] [ERROR] [<unknown>] ["BaseDaemon: (from thread 237) Received signal Aborted (6)."] [thread_id=238]
[2021/12/28 19:54:35.012 +09:00] [ERROR] [<unknown>] ["BaseDaemon: 0. /lib/x86_64-linux-gnu/libc.so.6(gsignal+0xc7) [0x7fc5233bff47]"] [thread_id=238]
[2021/12/28 19:54:35.012 +09:00] [ERROR] [<unknown>] ["BaseDaemon: 1. /lib/x86_64-linux-gnu/libc.so.6(abort+0x141) [0x7fc5233c18b1]"] [thread_id=238]
[2021/12/28 19:54:35.012 +09:00] [ERROR] [<unknown>] ["BaseDaemon: 2. bin/tiflash/tiflash() [0x86a7fb4]"] [thread_id=238]
[2021/12/28 19:54:35.012 +09:00] [ERROR] [<unknown>] ["BaseDaemon: 3. bin/tiflash/tiflash(grpc::ProtoBufferWriter::Next(void**, int*)+0x19d) [0x82dd75d]"] [thread_id=238]
[2021/12/28 19:54:35.012 +09:00] [ERROR] [<unknown>] ["BaseDaemon: 4. bin/tiflash/tiflash(google::protobuf::io::CodedOutputStream::Refresh()+0x1a) [0x898344a]"] [thread_id=238]
[2021/12/28 19:54:35.012 +09:00] [ERROR] [<unknown>] ["BaseDaemon: 5. bin/tiflash/tiflash(google::protobuf::io::CodedOutputStream::CodedOutputStream(google::protobuf::io::ZeroCopyOutputStream*, bool)+0x41) [0x8983741]"] [thread_id=238]
[2021/12/28 19:54:35.012 +09:00] [ERROR] [<unknown>] ["BaseDaemon: 6. bin/tiflash/tiflash(google::protobuf::MessageLite::SerializeToZeroCopyStream(google::protobuf::io::ZeroCopyOutputStream*) const+0x14) [0x8985b94]"] [thread_id=238]
[2021/12/28 19:54:35.012 +09:00] [ERROR] [<unknown>] ["BaseDaemon: 7. bin/tiflash/tiflash(grpc::Status grpc::GenericSerialize<grpc::ProtoBufferWriter, mpp::MPPDataPacket>(google::protobuf::MessageLite const&, grpc::ByteBuffer*, bool*)+0x137) [0x8622097]"] [thread_id=238]
[2021/12/28 19:54:35.013 +09:00] [ERROR] [<unknown>] ["BaseDaemon: 8. bin/tiflash/tiflash(std::_Function_handler<grpc::Status (void const*), grpc::Status grpc::internal::CallOpSendMessage::SendMessage<mpp::MPPDataPacket>(mpp::MPPDataPacket const&, grpc::WriteOptions)::{lambda(void const*)#1}>::_M_invoke(std::_Any_data const&, void const*&&)+0x53) [0x8622263]"] [thread_id=238]
[2021/12/28 19:54:35.013 +09:00] [ERROR] [<unknown>] ["BaseDaemon: 9. bin/tiflash/tiflash(grpc::internal::CallOpSendMessage::AddOp(grpc_op*, unsigned long*)+0x7c) [0x832c68c]"] [thread_id=238]
[2021/12/28 19:54:35.013 +09:00] [ERROR] [<unknown>] ["BaseDaemon: 10. bin/tiflash/tiflash(grpc::internal::CallOpSet<grpc::internal::CallOpSendInitialMetadata, grpc::internal::CallOpSendMessage, grpc::internal::CallNoOp<3>, grpc::internal::CallNoOp<4>, grpc::internal::CallNoOp<5>, grpc::internal::CallNoOp<6> >::ContinueFillOpsAfterInterception()+0x43) [0x86b42f3]"] [thread_id=238]
[2021/12/28 19:54:35.013 +09:00] [ERROR] [<unknown>] ["BaseDaemon: 11. bin/tiflash/tiflash(grpc_impl::ServerWriter<mpp::MPPDataPacket>::Write(mpp::MPPDataPacket const&, grpc::WriteOptions)+0x127) [0x858cfe7]"] [thread_id=238]
[2021/12/28 19:54:35.013 +09:00] [ERROR] [<unknown>] ["BaseDaemon: 12. bin/tiflash/tiflash(DB::MPPTunnelBase<grpc_impl::ServerWriter<mpp::MPPDataPacket> >::sendLoop()+0x128) [0x7d44668]"] [thread_id=238]
[2021/12/28 19:54:35.013 +09:00] [ERROR] [<unknown>] ["BaseDaemon: 13. bin/tiflash/tiflash() [0x8b729ef]"] [thread_id=238]
[2021/12/28 19:54:35.013 +09:00] [ERROR] [<unknown>] ["BaseDaemon: 14. /lib/x86_64-linux-gnu/libpthread.so.0(+0x76db) [0x7fc523b176db]"] [thread_id=238]

The root cause of such a huge MPPDataPacket is #3436, but until #3436 is fixed, we should at least throw an error explicitly instead of letting TiFlash crash.
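
A minimal sketch of the "throw an error explicitly" idea, for illustration only: check the serialized size before handing the packet to gRPC, and throw if it is above what protobuf can encode. This is not necessarily what the actual fix does. `checkPacketSize`, `kMaxProtobufBytes` and `FakePacket` are hypothetical names; the only real API assumed is protobuf's `ByteSizeLong()`, and `FakePacket` stands in for `mpp::MPPDataPacket` so the snippet compiles without protobuf.

```cpp
#include <cstddef>
#include <limits>
#include <stdexcept>
#include <string>

// Protobuf cannot serialize a message whose encoded size exceeds INT_MAX bytes (about 2GB).
constexpr std::size_t kMaxProtobufBytes =
    static_cast<std::size_t>(std::numeric_limits<int>::max());

// Hypothetical helper: throws instead of letting the write path hit the gRPC assertion.
// Packet only needs a ByteSizeLong() method, which protobuf messages provide.
template <typename Packet>
void checkPacketSize(const Packet & packet)
{
    const std::size_t size = packet.ByteSizeLong();
    if (size > kMaxProtobufBytes)
        throw std::length_error(
            "MPPDataPacket is too large to send: " + std::to_string(size)
            + " bytes exceeds the 2GB protobuf limit");
}

// Stand-in for mpp::MPPDataPacket (assumption, for a self-contained example).
struct FakePacket
{
    std::size_t bytes;
    std::size_t ByteSizeLong() const { return bytes; }
};
```

In a send loop, such a check would run just before the writer's `Write` call, so the oversized packet surfaces as a query error rather than an abort of the whole process.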

2. What did you expect to see? (Required)

3. What did you see instead (Required)

4. What is your TiFlash version? (Required)

@windtalker windtalker added the type/bug The issue is confirmed as a bug. label Dec 31, 2021
@ilovesoup
Contributor

What exactly do you mean by "there is a chance"? @windtalker

@windtalker
Contributor Author

According to #3436, the size of a block generated by a Join expression is effectively unlimited. If a join generates a huge block (for example, when a wrong query plan is used for TPC-H Q5), then before #3184 TiFlash would throw an error; after #3184, it crashes.
