Triton server failed: exited with core dump #4010
Comments
the format was stronger
I put the code here
The segfault seems to be coming from the Torch library; do you still encounter the segfault with a different framework? If not, then the issue may not be within Triton.
@jackzhou121 was the segfault during inference? If so, you should attempt to run the model outside Triton using the PyTorch C++ API directly, to confirm whether the issue is specific to PyTorch or lies in Triton.
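For reference, a minimal standalone libtorch check along these lines would confirm whether the crash reproduces outside Triton. This is only a sketch: the model path and the input tensor shape are placeholders, not the reporter's actual model.

#include <torch/script.h>  // torch::jit::load / torch::jit::Module

#include <iostream>
#include <vector>

int main() {
  torch::jit::Module module;
  try {
    // Placeholder path: substitute the actual TorchScript model file.
    module = torch::jit::load("/path/to/model.pt");
    module.to(torch::kCUDA);  // put the model on the GPU, as Triton would

    // Placeholder input shape: substitute the model's real input.
    std::vector<torch::jit::IValue> inputs;
    inputs.push_back(torch::rand({1, 80, 100}, torch::kCUDA));

    at::Tensor out = module.forward(inputs).toTensor();
    std::cout << "output numel: " << out.numel() << std::endl;
  } catch (const c10::Error& e) {
    std::cerr << "libtorch error: " << e.what() << std::endl;
    return 1;
  }
  // If the process also aborts here, during CUDA/libtorch teardown,
  // the crash is not specific to Triton.
  return 0;
}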
static std::shared_ptr<TRITONSERVER_Server> g_server; if "g_server" is static, whether global or local to my function, I have to call TRITONSERVER_ServerDelete(g_server) before my program exits. If "g_server" becomes local and non-static, the program can exit successfully.
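A minimal sketch of the lifetime pattern being described, assuming server creation (TRITONSERVER_ServerOptionsNew / TRITONSERVER_ServerNew) happens elsewhere; the point is the custom deleter and the local scope, so the server is torn down before static destruction:

#include <memory>

#include "triton/core/tritonserver.h"

// Sketch of the handle-lifetime pattern only; creating the raw
// TRITONSERVER_Server* (options setup, TRITONSERVER_ServerNew) is omitted.
void run_with_server(TRITONSERVER_Server* raw_server)
{
  // The custom deleter ties server teardown to this local object's scope,
  // so TRITONSERVER_ServerDelete runs before main() returns and before
  // static destructors and backend/CUDA teardown.
  std::shared_ptr<TRITONSERVER_Server> server(
      raw_server, [](TRITONSERVER_Server* s) {
        TRITONSERVER_ServerStop(s);    // stop accepting work first
        TRITONSERVER_ServerDelete(s);  // release backend/model resources once
      });

  // ... issue inference requests through `server` ...

}  // `server` is released here, while the PyTorch backend is still alive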
When the program exits, I guess some resource is being double-freed; the last free operation occurred after tritonserver had already released all of its resources.
Do you mean in this case you will need to call TRITONSERVER_ServerDelete explicitly before exiting?
Closing issue due to lack of activity. Please re-open the issue if you would like to follow up with this issue. |
Description
Triton server exited with a core dump.
Triton Information
Triton version: 2.12
Are you using the Triton container or did you build it yourself?
container: nvcr.io/nvidia/tritonserver:21.07-py3
HARDWARE: A30, NVIDIA-SMI 470.57.02 Driver Version: 470.57.02 CUDA Version: 11.4
To Reproduce
Steps to reproduce the behavior.
Here is the code:
#3808
// NOTE: most header names in the original post were stripped by GitHub's
// markdown (bare <...> is swallowed), so they are not recoverable; the
// standard headers the snippet below actually needs are restored here.
#include <memory>         // std::shared_ptr
#include <string>         // std::string
#include <unordered_map>
#include <torch/torch.h>
#include "triton/core/tritonserver.h"
#include "common/common.h"
#define TRITON_ENABLE_GPU 1
#ifdef TRITON_ENABLE_GPU
#include <cuda_runtime_api.h>
#endif
static std::shared_ptr<TRITONSERVER_Server> g_server;
int create_server(std::shared_ptr<TRITONSERVER_Server> &g_server)
{
  std::string model_repository_path = "/workspace/triton_tts/triton_tts/build/tts_model_repo_separate";
  int verbose_level = 1;
  TRITONSERVER_MemoryType requested_memory_type = TRITONSERVER_MEMORY_CPU_PINNED;
#ifdef TRITON_ENABLE_GPU
  double min_compute_capability = TRITON_MIN_COMPUTE_CAPABILITY;
#else
  double min_compute_capability = 0;
#endif
  // NOTE: the rest of the function (server options setup and
  // TRITONSERVER_ServerNew, as in the linked #3808 example) appears to
  // have been truncated in the original post.
  return 0;
}
int run(void) {
  create_server(g_server);
  return 0;
}
int main(){
run();
return 0;
}
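For completeness, a sketch of how the truncated part of create_server() would typically continue, following the Triton C API from triton/core/tritonserver.h as used in the linked #3808 example. The helper name finish_create_server is an assumption, the backend directory is the one seen in the stack trace below, and error returns are ignored for brevity.

// Assumed continuation (not the reporter's actual code): build the server
// options and create the server, reusing the variable names from the
// snippet above. Error returns (TRITONSERVER_Error*) should be checked.
static int finish_create_server(
    std::shared_ptr<TRITONSERVER_Server>& server,
    const std::string& model_repository_path,
    int verbose_level,
    double min_compute_capability)
{
  TRITONSERVER_ServerOptions* server_options = nullptr;
  TRITONSERVER_ServerOptionsNew(&server_options);
  TRITONSERVER_ServerOptionsSetModelRepositoryPath(
      server_options, model_repository_path.c_str());
  TRITONSERVER_ServerOptionsSetLogVerbose(server_options, verbose_level);
  TRITONSERVER_ServerOptionsSetBackendDirectory(
      server_options, "/opt/tritonserver/backends");
  TRITONSERVER_ServerOptionsSetMinSupportedComputeCapability(
      server_options, min_compute_capability);

  TRITONSERVER_Server* server_ptr = nullptr;
  TRITONSERVER_ServerNew(&server_ptr, server_options);
  TRITONSERVER_ServerOptionsDelete(server_options);

  // Hand ownership to the shared_ptr so TRITONSERVER_ServerDelete runs
  // exactly once before the program exits (see the discussion above).
  server.reset(server_ptr, TRITONSERVER_ServerDelete);
  return 0;
}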
Describe the models (framework, inputs, outputs), ideally include the model configuration file (if using an ensemble include the model configuration file for that as well).
torchscript model
Expected behavior
The code ran well in the old environment, but now it breaks with the following error:
terminate called after throwing an instance of 'c10::Error'
what(): invalid device pointer: 0x7f5c9b400000
Exception raised from free at ../c10/cuda/CUDACachingAllocator.cpp:1223 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x6c (0x7f5ea8ffb24c in /opt/tritonserver/backends/pytorch/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0xfa (0x7f5ea8fc6a66 in /opt/tritonserver/backends/pytorch/libc10.so)
frame #2: c10::cuda::CUDACachingAllocator::raw_delete(void*) + 0x24f (0x7f5ea8f8b9af in /opt/tritonserver/backends/pytorch/libc10_cuda.so)
frame #3: c10::TensorImpl::release_resources() + 0x9c (0x7f5ea8fe51ec in /opt/tritonserver/backends/pytorch/libc10.so)
frame #4: + 0x11b5595 (0x7f5e6f61e595 in /opt/tritonserver/backends/pytorch/libtorch_cpu.so)
frame #5: + 0x23993 (0x7f5ea98e5993 in /opt/tritonserver/backends/pytorch/libtriton_pytorch.so)
frame #6: + 0x1779b (0x7f5ea98d979b in /opt/tritonserver/backends/pytorch/libtriton_pytorch.so)
frame #7: TRITONBACKEND_ModelInstanceFinalize + 0x1e4 (0x7f5ea98d9c64 in /opt/tritonserver/backends/pytorch/libtriton_pytorch.so)
frame #8: + 0x309d21 (0x7f5ee7d40d21 in /opt/tritonserver/lib/libtritonserver.so)
frame #9: + 0x305531 (0x7f5ee7d3c531 in /opt/tritonserver/lib/libtritonserver.so)
frame #10: + 0x305bdd (0x7f5ee7d3cbdd in /opt/tritonserver/lib/libtritonserver.so)
frame #11: + 0x1857a7 (0x7f5ee7bbc7a7 in /opt/tritonserver/lib/libtritonserver.so)
frame #12: + 0xd6de4 (0x7f5ee7922de4 in /lib/x86_64-linux-gnu/libstdc++.so.6)
frame #13: + 0x9609 (0x7f5eccdef609 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #14: clone + 0x43 (0x7f5ee7761293 in /lib/x86_64-linux-gnu/libc.so.6)