
TiKV panics when the size of response exceeds 4GB #9012

Closed
youjiali1995 opened this issue Nov 11, 2020 · 11 comments · Fixed by #10971
Labels
affects-4.0 This bug affects 4.0.x versions. affects-5.0 This bug affects 5.0.x versions. affects-5.1 This bug affects 5.1.x versions. affects-5.2 This bug affects 5.2.x versions. component/gRPC Component: gRPC difficulty/medium Medium task. help wanted Denotes an issue that needs help from a contributor. Must meet "help wanted" guidelines. severity/major type/bug The issue is confirmed as a bug.

Comments

@youjiali1995
Contributor

Bug Report

What version of TiKV are you using?

3.x, 4.x, master

What operating system and CPU are you using?

Not relevant.

Steps to reproduce

Deploy the TiKV build from #9010, which produces a 6GB result for mvcc_key, and send an mvcc_key request to it.

What did you expect?

The request is canceled or finishes, and TiKV stays alive.

What happened?

TiKV panics.

#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
#1  0x00007f120c3e0859 in __GI_abort () at abort.c:79
#2  0x00005630b1749ebb in grpc_core::SliceBufferByteStream::SliceBufferByteStream (this=0x7f11fdafd718, slice_buffer=0x7f11fda98e58, flags=<optimized out>)
    at /rust/registry/src/github.com-1ecc6299db9ec823/grpcio-sys-0.6.0/grpc/src/core/lib/transport/byte_stream.cc:40
#3  0x00005630b173da4e in grpc_core::ManualConstructor<grpc_core::SliceBufferByteStream>::Init<grpc_slice_buffer*, unsigned int&> (this=0x7f11fdafd718)
    at /rust/registry/src/github.com-1ecc6299db9ec823/grpcio-sys-0.6.0/grpc/src/core/lib/gprpp/manual_constructor.h:195
#4  call_start_batch (call=call@entry=0x7f11fdafd060, ops=ops@entry=0x7f11adacd1f0, nops=nops@entry=3, notify_tag=notify_tag@entry=0x7f11ab8df3c0, is_notify_tag_closure=is_notify_tag_closure@entry=0)
    at /rust/registry/src/github.com-1ecc6299db9ec823/grpcio-sys-0.6.0/grpc/src/core/lib/surface/call.cc:1701
#5  0x00005630b173f764 in grpc_call_start_batch (call=0x7f11fdafd060, ops=0x7f11adacd1f0, nops=3, tag=0x7f11ab8df3c0, reserved=<optimized out>)
    at /rust/registry/src/github.com-1ecc6299db9ec823/grpcio-sys-0.6.0/grpc/src/core/lib/surface/call.cc:1974
#6  0x00005630b1758a11 in grpcwrap_call_send_status_from_server (call=0x7f11fdafd060, ctx=0x7f11fda9d9e0, status_code=GRPC_STATUS_OK, status_details=<optimized out>, status_details_len=<optimized out>, trailing_metadata=0x0,
    send_empty_initial_metadata=1,
    optional_send_buffer=0x7f0960200600 "\032\266\365\352\327\t\022\202\002\"\377\001\377\020\255#\325\340\251\020x\354>2\006\214\312O\374\257\344\243\324^\016g \335 `\003\254\350\354\r\225\356`k\327\022U\201\372\215\022\001\r\333\356\211\340>\362u\245L\222\217\027\271\"\230\346M,\264\203\350\234y\003v\363]\271\313\367\262\333}\221Kh}<\364\256\224\224\235\250\244*'\273v\016\320\032vѬ", optional_send_buffer_len=6895090364, write_flags=0, tag=0x7f11ab8df3c0)
    at grpc_wrap.cc:714
#7  0x00005630b0604c63 in grpcio::call::Call::start_send_status_from_server::{{closure}} (ctx=0x7f11fda9d9e0, tag=0x7f11ab8df3c0) at /rust/registry/src/github.com-1ecc6299db9ec823/grpcio-0.6.0/src/call/mod.rs:367
#8  grpcio::call::check_run (bt=grpcio::task::promise::BatchType::Finish, f=...) at /rust/registry/src/github.com-1ecc6299db9ec823/grpcio-0.6.0/src/call/mod.rs:268
#9  grpcio::call::Call::start_send_status_from_server (self=0x7f11adacd4f0, status=0x7f11adacd5f0, send_empty_metadata=<optimized out>, payload=<optimized out>, write_flags=<optimized out>)
    at /rust/registry/src/github.com-1ecc6299db9ec823/grpcio-0.6.0/src/call/mod.rs:361
#10 0x00005630b016c4ae in grpcio::call::server::UnarySink<T>::success (self=..., t=...) at /rust/registry/src/github.com-1ecc6299db9ec823/grpcio-0.6.0/src/call/server.rs:346
#11 0x00005630b03c0915 in <futures_util::future::future::Map<Fut,F> as core::future::future::Future>::poll (self=..., cx=0x7f11adace190) at /home/jenkins/agent/workspace/tikv_ghpr_build_release/tikv/src/server/service/kv.rs:131
#12 0x00005630b060d33b in grpcio::task::executor::poll (task=..., woken=<optimized out>) at /rust/registry/src/github.com-1ecc6299db9ec823/grpcio-0.6.0/src/task/executor.rs:201
#13 0x00005630b060b255 in grpcio::task::executor::resolve (task=..., success=<optimized out>) at /rust/registry/src/github.com-1ecc6299db9ec823/grpcio-0.6.0/src/task/executor.rs:141
#14 grpcio::task::CallTag::resolve (self=..., cq=0x7f11adace2d0, success=<optimized out>) at /rust/registry/src/github.com-1ecc6299db9ec823/grpcio-0.6.0/src/task/mod.rs:179
#15 grpcio::env::poll_queue (tx=...) at /rust/registry/src/github.com-1ecc6299db9ec823/grpcio-0.6.0/src/env.rs:30
#16 grpcio::env::EnvBuilder::build::{{closure}} () at /rust/registry/src/github.com-1ecc6299db9ec823/grpcio-0.6.0/src/env.rs:84
#17 std::sys_common::backtrace::__rust_begin_short_backtrace (f=...) at /rustc/b1496c6e606dd908dd651ac2cce89815e10d7fc5/library/std/src/sys_common/backtrace.rs:125
#18 0x00005630b060a795 in std::thread::Builder::spawn_unchecked::{{closure}}::{{closure}} () at /rustc/b1496c6e606dd908dd651ac2cce89815e10d7fc5/library/std/src/thread/mod.rs:470
#19 <std::panic::AssertUnwindSafe<F> as core::ops::function::FnOnce<()>>::call_once (self=..., _args=<optimized out>) at /rustc/b1496c6e606dd908dd651ac2cce89815e10d7fc5/library/std/src/panic.rs:308
#20 std::panicking::try::do_call (data=<optimized out>) at /rustc/b1496c6e606dd908dd651ac2cce89815e10d7fc5/library/std/src/panicking.rs:381
#21 std::panicking::try (f=...) at /rustc/b1496c6e606dd908dd651ac2cce89815e10d7fc5/library/std/src/panicking.rs:345
#22 std::panic::catch_unwind (f=...) at /rustc/b1496c6e606dd908dd651ac2cce89815e10d7fc5/library/std/src/panic.rs:382
#23 std::thread::Builder::spawn_unchecked::{{closure}} () at /rustc/b1496c6e606dd908dd651ac2cce89815e10d7fc5/library/std/src/thread/mod.rs:469
#24 core::ops::function::FnOnce::call_once{{vtable-shim}} () at /rustc/b1496c6e606dd908dd651ac2cce89815e10d7fc5/library/core/src/ops/function.rs:227
#25 0x00005630b0b3d9aa in std::sys::unix::thread::Thread::new::thread_start () at /rustc/b1496c6e606dd908dd651ac2cce89815e10d7fc5/library/alloc/src/boxed.rs:1042
#26 0x00007f120c731609 in start_thread (arg=<optimized out>) at pthread_create.c:477
#27 0x00007f120c4dd293 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
@youjiali1995 youjiali1995 added type/bug The issue is confirmed as a bug. severity/critical labels Nov 11, 2020
@youjiali1995 youjiali1995 added the component/gRPC Component: gRPC label Nov 11, 2020
@BusyJay
Member

BusyJay commented Nov 11, 2020

It's by design that gRPC can't send a message larger than 4GB, though it should report an error instead of core dumping.

To solve this, maybe we should change the RPC to a server-streaming call or add paging logic.
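As a sketch of the kind of server-side guard this implies: gRPC's wire format carries the message length in a 32-bit field, so anything above ~4GiB cannot be sent. The `MAX_MESSAGE_SIZE` constant and `check_message_size` helper below are illustrative names, not part of the grpcio API:

```rust
// gRPC encodes the message length as a u32, so u32::MAX is the hard ceiling.
const MAX_MESSAGE_SIZE: u64 = u32::MAX as u64;

// Hypothetical guard: reject an oversized serialized response up front
// instead of letting grpc-core abort the process.
fn check_message_size(serialized_len: u64) -> Result<(), String> {
    if serialized_len > MAX_MESSAGE_SIZE {
        Err(format!(
            "response too large: {} bytes exceeds gRPC's {} byte limit",
            serialized_len, MAX_MESSAGE_SIZE
        ))
    } else {
        Ok(())
    }
}

fn main() {
    assert!(check_message_size(1024).is_ok());
    // The length from the backtrace above (optional_send_buffer_len=6895090364)
    // would be rejected rather than crashing the server.
    assert!(check_message_size(6_895_090_364).is_err());
}
```

With such a check in place, the server can fail the single RPC with a status error while the process keeps serving other requests.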

@youjiali1995
Contributor Author

We can avoid this panic on the TiKV side, but it would be better if gRPC could report errors.

@BusyJay
Member

BusyJay commented Nov 17, 2020

I'm afraid this is not possible at the moment with the latest gRPC. The binary will either core dump or panic inside rust-protobuf. See also stepancheg/rust-protobuf#530.

@youjiali1995
Contributor Author

Lowering its severity since it rarely occurs.

@Connor1996
Member

Encountered it again: https://asktug.com/t/topic/69426

@BusyJay
Member

BusyJay commented May 6, 2021

Now that we are using a forked version of protobuf, we can also add such protection inside the fork.

@BusyJay BusyJay added difficulty/medium Medium task. help wanted Denotes an issue that needs help from a contributor. Must meet "help wanted" guidelines. labels May 6, 2021
BusyJay added a commit to BusyJay/tikv that referenced this issue Sep 22, 2021
gRPC can't handle messages larger than 4GiB. This PR solves the issue by
checking the response's binary length during serialization. Before the
change, TiKV would either core dump in gRPC or panic in protobuf; after
the change, TiKV prints a log on the server side and the call is
canceled on the client side.

Close tikv#9012

Signed-off-by: Jay Lee <BusyJayLee@gmail.com>
ti-chi-bot added a commit that referenced this issue Sep 26, 2021
* change protobuf and update logs
* server: tolerate large response
* fix format

Close #9012

Signed-off-by: Jay Lee <BusyJayLee@gmail.com>
Co-authored-by: Ti Chi Robot <ti-community-prow-bot@tidb.io>
@BusyJay
Member

BusyJay commented Sep 26, 2021

After the fix, TiKV will log the failure instead of panicking.

v01dstar pushed a commit to v01dstar/tikv that referenced this issue Oct 26, 2021
* change protobuf and update logs
* server: tolerate large response
* fix format

Close tikv#9012

Signed-off-by: Jay Lee <BusyJayLee@gmail.com>
Co-authored-by: Ti Chi Robot <ti-community-prow-bot@tidb.io>
@youjiali1995 youjiali1995 added affects-4.0 This bug affects 4.0.x versions. affects-5.0 This bug affects 5.0.x versions. labels Nov 24, 2021
@youjiali1995 youjiali1995 added affects-5.1 This bug affects 5.1.x versions. affects-5.2 This bug affects 5.2.x versions. labels Nov 24, 2021
@Connor1996
Member

Encountered it again. Do you have any idea when the response size could get this large without a large region? @youjiali1995 @BusyJay

@BusyJay
Member

BusyJay commented Feb 17, 2022

You mean panic? What version did you use?

Connor1996 pushed a commit to Connor1996/tikv that referenced this issue Feb 17, 2022
* change protobuf and update logs
* server: tolerate large response
* fix format

Close tikv#9012

Signed-off-by: Jay Lee <BusyJayLee@gmail.com>
Co-authored-by: Ti Chi Robot <ti-community-prow-bot@tidb.io>
@Connor1996
Member

@BusyJay It doesn't panic, but the query can't succeed. After investigation, it's confirmed that the chunk codec (Arrow-style codec) reserves space for each fixed-length field even when the value is null. So when there are many fixed-length fields filled with nulls, the size of the coprocessor response can be amplified several times over and exceed 4GB.
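Illustrative back-of-the-envelope arithmetic for that amplification (this is not TiDB's exact chunk layout; the row count, column count, and the `fixed_column_bytes` helper are all hypothetical):

```rust
// A fixed-length chunk column reserves `elem_size` bytes per row even for
// NULL values, plus roughly one bit per row for the null bitmap.
fn fixed_column_bytes(rows: u64, elem_size: u64) -> u64 {
    rows * elem_size + (rows + 7) / 8
}

fn main() {
    // Hypothetical scan: 100M rows, 8 nullable 8-byte fixed-length columns,
    // almost all values NULL. Sparse data would be tiny, but the
    // fixed-length encoding still pays full price per row.
    let rows = 100_000_000u64;
    let total: u64 = (0..8).map(|_| fixed_column_bytes(rows, 8)).sum();
    // ~6.4GB, already past gRPC's 4GiB message limit.
    assert!(total > 4u64 * 1024 * 1024 * 1024);
}
```

So a response can blow past the limit even though no single region is unusually large, because the amplification comes from the encoding, not the raw data volume.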

@BusyJay
Member

BusyJay commented Feb 21, 2022

/cc @coocood
