Description
If I load a PyTorch model that uses implicit state management with a sufficiently large state, immediately run a properly terminated sequence of requests (the sequence can be of length 1), and then immediately request a model unload, tritonserver crashes with a segmentation fault.
Triton Information
Are you using the Triton container or did you build it yourself?
I am using the Triton container, v23.09 (not a custom build). The container is started with the Docker flags --shm-size 1g --ulimit stack=67108864 --rm, and tritonserver is run with --model-control-mode=explicit --disable-auto-complete-config.
To Reproduce
Here is an example sequence batching config that triggers the segfault.
sequence_batching {
  max_sequence_idle_microseconds: 50000000
  state [ {
    input_name: "state__4"
    output_name: "state__1"
    data_type: TYPE_FP32
    dims: [ 1000, 1, 150000 ]
    initial_state: {
      data_type: TYPE_FP32
      dims: [ 1000, 1, 150000 ]
      zero_data: true
      name: "initial state"
    }
  } ]
}
The model's inputs and outputs are all small and shouldn't contribute to the issue. With the dims above, the FP32 state is roughly 600 MB (1000 × 1 × 150000 elements × 4 bytes). If I change the first dimension of the state from 1000 to 100 (roughly 60 MB), I no longer get the segfault, so the large state size is at least partly to blame.
The fault does not occur if I add a delay of a fraction of a second between the inference and the unload, so I suspect a race between a background thread and the model-unloading logic.
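For concreteness, here is a minimal client-side sketch of the steps that trigger the crash. It assumes the tritonclient gRPC Python API; the input name "input__0" and its shape are placeholders standing in for my model's real inputs, while "mymodel" matches the model name in the log below.

# Minimal repro sketch (tritonclient gRPC API; input name/shape are placeholders).
import time

import numpy as np
import tritonclient.grpc as grpcclient

client = grpcclient.InferenceServerClient(url="localhost:8001")

# Explicit model control mode: load the model first.
client.load_model("mymodel")

# One properly terminated sequence: a single request with both
# sequence_start and sequence_end set.
inp = grpcclient.InferInput("input__0", [1, 1], "FP32")
inp.set_data_from_numpy(np.zeros((1, 1), dtype=np.float32))
client.infer("mymodel", [inp], sequence_id=1,
             sequence_start=True, sequence_end=True)

# time.sleep(0.5)  # adding a short delay here avoids the crash

# Unload immediately after the sequence completes -> tritonserver segfaults.
client.unload_model("mymodel")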
Here is the verbose log and backtrace from tritonserver around the crash:
I1113 18:22:27.807515 1 grpc_server.cc:149] Process for RepositoryModelUnload, rpc_ok=1, 0 step START
I1113 18:22:27.807567 1 grpc_server.cc:101] Ready for RPC 'RepositoryModelUnload', 1
I1113 18:22:27.807884 1 model_lifecycle.cc:382] AsyncUnload() 'mymodel'
I1113 18:22:27.807993 1 model_lifecycle.cc:286] VersionStates() 'mymodel'
I1113 18:22:27.808021 1 sequence_batch_scheduler.cc:1133] Cleaning-up resources on sequence-batch clean-up thread...
I1113 18:22:27.808039 1 sequence_batch_scheduler.cc:1137] Stopping sequence-batch clean-up thread...
I1113 18:22:27.808741 1 grpc_server.cc:149] Process for RepositoryModelUnload, rpc_ok=1, 0 step WRITEREADY
I1113 18:22:27.808822 1 grpc_server.cc:149] Process for RepositoryModelUnload, rpc_ok=1, 0 step COMPLETE
I1113 18:22:27.808820 1 sequence_batch_scheduler.cc:1108] Stopping sequence-batch reaper thread...
I1113 18:22:27.808867 1 grpc_server.cc:342] Done for RepositoryModelUnload, 0
I1113 18:22:27.808942 1 dynamic_batch_scheduler.cc:432] Stopping dynamic-batcher thread for mymodel...
I1113 18:22:27.809019 1 pinned_memory_manager.cc:191] non-pinned memory deallocation: addr 0x7ff5f044b010
I1113 18:22:27.810580 1 pinned_memory_manager.cc:191] non-pinned memory deallocation: addr 0x7ff52a44b010
I1113 18:22:27.812046 1 pinned_memory_manager.cc:191] non-pinned memory deallocation: addr 0x7ff66244b010
[ff5ea7f7b42d:1 :0:122] Caught signal 11 (Segmentation fault: Sent by the kernel at address (nil))
I1113 18:22:27.928306 1 pinned_memory_manager.cc:191] pinned memory deallocation: addr 0x7ff882000090
I1113 18:22:27.928355 1 pinned_memory_manager.cc:191] pinned memory deallocation: addr 0x7ff88229c020
I1113 18:22:27.928366 1 pinned_memory_manager.cc:191] pinned memory deallocation: addr 0x7ff88229c050
I1113 18:22:27.928376 1 pinned_memory_manager.cc:191] pinned memory deallocation: addr 0x7ff88229d000
==== backtrace (tid: 122) ====
0 0x0000000000042520 __sigaction() ???:0
1 0x000000000019a3df triton::core::TritonModelInstance::TritonBackendThread::StopBackendThread() :0
2 0x00000000001a0438 triton::core::TritonModelInstance::~TritonModelInstance() :0
3 0x00000000001a7106 std::_Sp_counted_ptr<triton::core::TritonModelInstance*, (__gnu_cxx::_Lock_policy)2>::_M_dispose() :0
4 0x000000000018f34f triton::core::TritonModel::~TritonModel() :0
5 0x000000000018fa4d triton::core::TritonModel::~TritonModel() backend_model.cc:0
6 0x00000000002723a7 std::thread::_State_impl<std::thread::_Invoker<std::tuple<triton::core::(anonymous namespace)::ModelDeleter::operator()(triton::core::Model*)::{lambda()#1}> > >::_M_run() model_lifecycle.cc:0
7 0x00000000000dc253 std::error_code::default_error_condition() ???:0
8 0x0000000000094b43 pthread_condattr_setpshared() ???:0
9 0x0000000000126a00 __xmknodat() ???:0
Signal (11) received.
0# 0x00005559338286ED in tritonserver
1# 0x00007FF8D33A9520 in /usr/lib/x86_64-linux-gnu/libc.so.6
2# 0x00007FF8D3D973DF in /opt/tritonserver/bin/../lib/libtritonserver.so
3# 0x00007FF8D3D9D438 in /opt/tritonserver/bin/../lib/libtritonserver.so
4# 0x00007FF8D3DA4106 in /opt/tritonserver/bin/../lib/libtritonserver.so
5# 0x00007FF8D3D8C34F in /opt/tritonserver/bin/../lib/libtritonserver.so
6# 0x00007FF8D3D8CA4D in /opt/tritonserver/bin/../lib/libtritonserver.so
7# 0x00007FF8D3E6F3A7 in /opt/tritonserver/bin/../lib/libtritonserver.so
8# 0x00007FF8D366B253 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6
9# 0x00007FF8D33FBB43 in /usr/lib/x86_64-linux-gnu/libc.so.6
10# 0x00007FF8D348DA00 in /usr/lib/x86_64-linux-gnu/libc.so.6