
[PyTorch model] Triton Inference Server doesn't respond to the second request from the client (only the first request runs) #6593

Closed
nhthanh0809 opened this issue Nov 17, 2023 · 12 comments

Comments

@nhthanh0809

nhthanh0809 commented Nov 17, 2023

Description
Hi all,
I got this problem when running my own PyTorch model (converted to TorchScript) on Triton Inference Server.
I followed this guide to deploy my own PyTorch model on Triton: https://github.com/triton-inference-server/tutorials/tree/main/Quick_Deploy/PyTorch

Following this guide, the client received all responses from the Triton server when serving the ResNet50 model.
However, when I run the Triton server with my own model (a keypoint detection model) and send requests to it, the server only responds to the first request. For any request after that, the client doesn't receive a response and eventually fails with a timeout error, as below:
File "/src/Triton_inference/ceph/server/2.Client.rev.py", line 103, in <module>
    results = client.infer(model_name="Steiner", inputs=[inputs], outputs=[outputs])
  File "/home/xx/.local/lib/python3.10/site-packages/tritonclient/http/_client.py", line 1462, in infer
    response = self._post(
  File "/home/xx/.local/lib/python3.10/site-packages/tritonclient/http/_client.py", line 290, in _post
    response = self._client_stub.post(
  File "/home/xx/.local/lib/python3.10/site-packages/geventhttpclient/client.py", line 275, in post
    return self.request(METHOD_POST, request_uri, body=body, headers=headers)
  File "/home/xx/.local/lib/python3.10/site-packages/geventhttpclient/client.py", line 256, in request
    response = HTTPSocketPoolResponse(sock, self._connection_pool,
  File "/home/xx/.local/lib/python3.10/site-packages/geventhttpclient/response.py", line 292, in __init__
    super(HTTPSocketPoolResponse, self).__init__(sock, **kw)
  File "/home/xx/.local/lib/python3.10/site-packages/geventhttpclient/response.py", line 164, in __init__
    self._read_headers()
  File "/home/xx/.local/lib/python3.10/site-packages/geventhttpclient/response.py", line 184, in _read_headers
    data = self._sock.recv(self.block_size)
  File "/home/xx/.local/lib/python3.10/site-packages/gevent/_socketcommon.py", line 666, in recv
    self._wait(self._read_event)
  File "src/gevent/_hub_primitives.py", line 317, in gevent._gevent_c_hub_primitives.wait_on_socket
  File "src/gevent/_hub_primitives.py", line 322, in gevent._gevent_c_hub_primitives.wait_on_socket
  File "src/gevent/_hub_primitives.py", line 313, in gevent._gevent_c_hub_primitives._primitive_wait
  File "src/gevent/_hub_primitives.py", line 314, in gevent._gevent_c_hub_primitives._primitive_wait
  File "src/gevent/_hub_primitives.py", line 46, in gevent._gevent_c_hub_primitives.WaitOperationsGreenlet.wait
  File "src/gevent/_hub_primitives.py", line 46, in gevent._gevent_c_hub_primitives.WaitOperationsGreenlet.wait
  File "src/gevent/_hub_primitives.py", line 55, in gevent._gevent_c_hub_primitives.WaitOperationsGreenlet.wait
  File "src/gevent/_waiter.py", line 154, in gevent._gevent_c_waiter.Waiter.get
  File "src/gevent/_greenlet_primitives.py", line 61, in gevent._gevent_c_greenlet_primitives.SwitchOutGreenletWithLoop.switch
  File "src/gevent/_greenlet_primitives.py", line 61, in gevent._gevent_c_greenlet_primitives.SwitchOutGreenletWithLoop.switch
  File "src/gevent/_greenlet_primitives.py", line 65, in gevent._gevent_c_greenlet_primitives.SwitchOutGreenletWithLoop.switch
  File "src/gevent/_gevent_c_greenlet_primitives.pxd", line 35, in gevent._gevent_c_greenlet_primitives._greenlet_switch
TimeoutError: timed out

Triton Information
I'm using Triton 23.10, CUDA 12.2, and NVIDIA driver version 535.129.03.
I've tested on a GTX 1090 Ti and a GTX 3080 and got the same problem on both.
I'm using the Triton Inference Server container nvcr.io/nvidia/tritonserver:23.10-py3.

Here is the configuration for my model:

name: "Steiner"
platform: "pytorch_libtorch"
max_batch_size : 0
input [
  {
    name: "input__0"
    data_type: TYPE_FP32
    dims: [ 1, 3, 512, 480 ]
  }
]
output [
  {
    name: "output__0"
    data_type: TYPE_FP32
    dims: [ 1, 22, 512, 480 ]
  }
]

My client code:

import cv2
import numpy as np
import tritonclient.http as httpclient

# (Assumed) create the Triton HTTP client; in the original script `client` is created elsewhere
client = httpclient.InferenceServerClient(url="localhost:8000")

image = cv2.imread(IMAGE_PATH)
img_h, img_w, _ = image.shape

scal_ratio_w = img_w / 480
scal_ratio_h = img_h / 512

img_resize = cv2.resize(image, (480, 512))
output_image = img_resize.copy()

# HWC -> CHW
img_data = np.transpose(img_resize, (2, 0, 1))
# img_data = np.reshape(img_data, (1, 3, 512, 480))
img_data = np.array(img_data, dtype='float32')

# Normalize the image (divide by 255 to bring values between 0 and 1)
img_data = img_data / 255.0

# Convert to float32 and add the batch dimension -> [batch, channels, height, width]
img_data = np.expand_dims(img_data, axis=0).astype(np.float32)

print(img_data.shape)
inputs = httpclient.InferInput("input__0", img_data.shape, datatype="FP32")
inputs.set_data_from_numpy(img_data)

outputs = httpclient.InferRequestedOutput("output__0", binary_data=True)

# Query the server
results = client.infer(model_name="Steiner", inputs=[inputs], outputs=[outputs])
inference_output = results.as_numpy("output__0")
print(inference_output.shape)

For the first request from the client, the code above prints the input shape and the output shape as below:

(1, 3, 512, 480)
(1, 22, 512, 480)

However, for any subsequent request the server doesn't respond at all, which leads to the timeout error on the client side.
I checked the log on the Triton server and got the following:

  1. The first request from the client:

I1117 06:29:29.192152 94 http_server.cc:3514] HTTP request: 2 /v2/models/Steiner/infer
I1117 06:29:29.192202 94 model_lifecycle.cc:328] GetModel() 'Steiner' version -1
I1117 06:29:29.192223 94 model_lifecycle.cc:328] GetModel() 'Steiner' version -1
I1117 06:29:29.192325 94 infer_request.cc:117] [request id: <id_unknown>] Setting state from INITIALIZED to INITIALIZED
I1117 06:29:29.192344 94 infer_request.cc:857] [request id: <id_unknown>] prepared: [0x0x7f6c68002510] request id: , model: Steiner, requested version: -1, actual version: 1, flags: 0x0, correlation id: 0, batch size: 1, priority: 0, timeout (us): 0
original inputs:
[0x0x7f6c68002d38] input: input__0, type: FP32, original shape: [1,3,512,480], batch + shape: [1,3,512,480], shape: [3,512,480]
override inputs:
inputs:
[0x0x7f6c68002d38] input: input__0, type: FP32, original shape: [1,3,512,480], batch + shape: [1,3,512,480], shape: [3,512,480]
original requested outputs:
output__0
requested outputs:
output__0

I1117 06:29:29.192362 94 infer_request.cc:117] [request id: <id_unknown>] Setting state from INITIALIZED to PENDING
I1117 06:29:29.192412 94 infer_request.cc:117] [request id: <id_unknown>] Setting state from PENDING to EXECUTING
I1117 06:29:29.192452 94 libtorch.cc:2666] model Steiner, instance Steiner_0, executing 1 requests
I1117 06:29:29.192461 94 libtorch.cc:1224] TRITONBACKEND_ModelExecute: Running Steiner_0 with 1 requests
I1117 06:29:29.195628 94 pinned_memory_manager.cc:162] pinned memory allocation: size 2949120, addr 0x7f6db0000090
I1117 06:29:29.675656 94 infer_response.cc:167] add response output: output: output__0, type: FP32, shape: [1,22,512,480]
I1117 06:29:29.675683 94 http_server.cc:1103] HTTP: unable to provide 'output__0' in GPU, will use CPU
I1117 06:29:29.675707 94 http_server.cc:1123] HTTP using buffer for: 'output__0', size: 21626880, addr: 0x7f6c497fe040
I1117 06:29:29.675729 94 pinned_memory_manager.cc:162] pinned memory allocation: size 21626880, addr 0x7f6db02d00a0
I1117 06:29:29.690363 94 pinned_memory_manager.cc:191] pinned memory deallocation: addr 0x7f6db02d00a0
I1117 06:29:29.690433 94 http_server.cc:1197] HTTP release: size 21626880, addr 0x7f6c497fe040
I1117 06:29:29.690465 94 infer_request.cc:117] [request id: <id_unknown>] Setting state from EXECUTING to RELEASED
I1117 06:29:29.690496 94 pinned_memory_manager.cc:191] pinned memory deallocation: addr 0x7f6db0000090

  2. A subsequent request from the client (same image, sent to the same model):
I1117 06:30:06.292550 94 http_server.cc:3514] HTTP request: 2 /v2/models/Steiner/infer
I1117 06:30:06.292592 94 model_lifecycle.cc:328] GetModel() 'Steiner' version -1
I1117 06:30:06.292604 94 model_lifecycle.cc:328] GetModel() 'Steiner' version -1
I1117 06:30:06.292670 94 infer_request.cc:117] [request id: <id_unknown>] Setting state from INITIALIZED to INITIALIZED
I1117 06:30:06.292688 94 infer_request.cc:857] [request id: <id_unknown>] prepared: [0x0x7f6c68004060] request id: , model: Steiner, requested version: -1, actual version: 1, flags: 0x0, correlation id: 0, batch size: 1, priority: 0, timeout (us): 0
original inputs:
[0x0x7f6c68003a18] input: input__0, type: FP32, original shape: [1,3,512,480], batch + shape: [1,3,512,480], shape: [3,512,480]
override inputs:
inputs:
[0x0x7f6c68003a18] input: input__0, type: FP32, original shape: [1,3,512,480], batch + shape: [1,3,512,480], shape: [3,512,480]
original requested outputs:
output__0
requested outputs:
output__0

I1117 06:30:06.292723 94 infer_request.cc:117] [request id: <id_unknown>] Setting state from INITIALIZED to PENDING
I1117 06:30:06.292753 94 infer_request.cc:117] [request id: <id_unknown>] Setting state from PENDING to EXECUTING
I1117 06:30:06.292776 94 libtorch.cc:2666] model Steiner, instance Steiner_0, executing 1 requests
I1117 06:30:06.292786 94 libtorch.cc:1224] TRITONBACKEND_ModelExecute: Running Steiner_0 with 1 requests
I1117 06:30:06.301017 94 pinned_memory_manager.cc:162] pinned memory allocation: size 2949120, addr 0x7f6db0000090

Expected behavior

For every request from the client, Triton should return the same output.
In my case, however, the server only handles the first request correctly and never responds to any subsequent one.

@kthui
Contributor

kthui commented Nov 20, 2023

Hi @nhthanh0809, I am wondering if the model supports batching?

dims: [ 1, 3, 512, 480 ]
        ^---- Is this a batch dimension?

@nhthanh0809
Author

nhthanh0809 commented Nov 21, 2023

Hi @kthui
The model supports batching.

dims: [ 1, 3, 512, 480 ]
        ^---- Is this a batch dimension?

Yes, it is.
Because I set max_batch_size = 0, I set the dims config like that.
I also tried setting max_batch_size > 0 with dims [ 3, 512, 480 ]. The Triton server still runs smoothly for the first request, but from the second request on it doesn't respond to anything and the client still gets a timeout error.
I assume the problem comes from the Triton server, because the first request works fine.
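
For reference, a minimal sketch of the batched variant of that configuration (assuming the TorchScript model itself accepts a leading batch dimension; model and tensor names are taken from the config above):

name: "Steiner"
platform: "pytorch_libtorch"
max_batch_size: 1
input [
  {
    name: "input__0"
    data_type: TYPE_FP32
    dims: [ 3, 512, 480 ]
  }
]
output [
  {
    name: "output__0"
    data_type: TYPE_FP32
    dims: [ 22, 512, 480 ]
  }
]

With max_batch_size > 0, Triton prepends the batch dimension itself, so the dims entries omit it.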

@nhthanh0809
Author

nhthanh0809 commented Nov 21, 2023

I compared the Triton server logs for the first and the second request (in the log above) and saw that the server emitted this part of the log for the first request:

I1117 06:29:29.675656 94 infer_response.cc:167] add response output: output: output__0, type: FP32, shape: [1,22,512,480]
I1117 06:29:29.675683 94 http_server.cc:1103] HTTP: unable to provide 'output__0' in GPU, will use CPU
I1117 06:29:29.675707 94 http_server.cc:1123] HTTP using buffer for: 'output__0', size: 21626880, addr: 0x7f6c497fe040
I1117 06:29:29.675729 94 pinned_memory_manager.cc:162] pinned memory allocation: size 21626880, addr 0x7f6db02d00a0
I1117 06:29:29.690363 94 pinned_memory_manager.cc:191] pinned memory deallocation: addr 0x7f6db02d00a0
I1117 06:29:29.690433 94 http_server.cc:1197] HTTP release: size 21626880, addr 0x7f6c497fe040
I1117 06:29:29.690465 94 infer_request.cc:117] [request id: <id_unknown>] Setting state from EXECUTING to RELEASED
I1117 06:29:29.690496 94 pinned_memory_manager.cc:191] pinned memory deallocation: addr 0x7f6db0000090 

But this part never appears in the log for the second request.

@nhthanh0809
Author

[Update] I downgraded the Triton server container to 22.09 and the problem went away.

@kthui
Contributor

kthui commented Nov 21, 2023

Thanks for the update. I have filed a ticket to investigate whether this was introduced after 22.09.

@kthui kthui added the bug Something isn't working label Nov 21, 2023
@kthui
Contributor

kthui commented Mar 13, 2024

Hi @nhthanh0809, apologies that it took a while for the ticket to get triaged. I am able to reproduce the server no longer printing any new logs after

...
I1117 06:30:06.292776 94 libtorch.cc:2666] model Steiner, instance Steiner_0, executing 1 requests
I1117 06:30:06.292786 94 libtorch.cc:1224] TRITONBACKEND_ModelExecute: Running Steiner_0 with 1 requests
I1117 06:30:06.301017 94 pinned_memory_manager.cc:162] pinned memory allocation: size 2949120, addr 0x7f6db0000090

(extracted from your second log) when the LibTorch framework inside the PyTorch backend gets stuck inferencing the model. Unfortunately, I am not able to reproduce the LibTorch hang itself.

Are you still able to reproduce the issue with a later release of Triton, e.g. 24.02? The issue might have been resolved by the PyTorch team. If it is still reproducible with the latest release of Triton, can you provide a complete minimal reproduction?

  • A model with the model_file.pt and the complete config.pbtxt; and
  • Step-by-step instructions on how you launch the server; and
  • The minimal but complete client.py with which you issue both inferences and get stuck on the second inference request (a rough sketch of such a script is shown below this list), plus the image (or a link to it) if needed.
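
For illustration, a minimal two-request client along the lines requested above could look like the following. This is only a sketch: it assumes the model and tensor names from the original config (Steiner, input__0, output__0), a server on localhost:8000, and random input data instead of the real image; it is not the author's actual script.

import numpy as np
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")

# Dummy input with the shape the model expects: [batch, channels, height, width]
img_data = np.random.rand(1, 3, 512, 480).astype(np.float32)

for i in range(2):
    inputs = httpclient.InferInput("input__0", img_data.shape, datatype="FP32")
    inputs.set_data_from_numpy(img_data)
    outputs = httpclient.InferRequestedOutput("output__0", binary_data=True)

    # On the affected versions, the second iteration never returns and times out
    results = client.infer(model_name="Steiner", inputs=[inputs], outputs=[outputs])
    print(f"request {i}:", results.as_numpy("output__0").shape)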

@vonchenplus

vonchenplus commented Mar 14, 2024

Hello @kthui ,

I encountered the same issue in version 23.08-py3, but it was not present in version 24.02. I'm not sure which version fixed this problem.

The following is the terminal log when the server is interrupted (Ctrl+C) after it hangs:

[W graph_fuser.cpp:104] Warning: operator() profile_node %718 : int[] = prim::profile_ivalue(%716)
does not have profile information (function operator())
^CSignal (2) received.
I0314 08:29:03.133663 3214233 server.cc:305] Waiting for in-flight requests to complete.
I0314 08:29:03.133751 3214233 server.cc:321] Timeout 30: Found 0 model versions that have in-flight inferences
I0314 08:29:03.133782 3214233 server.cc:336] All models are stopped, unloading models
I0314 08:29:03.133809 3214233 server.cc:343] Timeout 30: Found 2 live models and 0 in-flight non-inference requests
I0314 08:29:04.133948 3214233 server.cc:343] Timeout 29: Found 2 live models and 0 in-flight non-inference requests
I0314 08:29:05.134164 3214233 server.cc:343] Timeout 28: Found 2 live models and 0 in-flight non-inference requests
I0314 08:29:06.134346 3214233 server.cc:343] Timeout 27: Found 2 live models and 0 in-flight non-inference requests
I0314 08:29:07.134530 3214233 server.cc:343] Timeout 26: Found 2 live models and 0 in-flight non-inference requests
I0314 08:29:08.134705 3214233 server.cc:343] Timeout 25: Found 2 live models and 0 in-flight non-inference requests
I0314 08:29:09.134855 3214233 server.cc:343] Timeout 24: Found 2 live models and 0 in-flight non-inference requests
I0314 08:29:10.135017 3214233 server.cc:343] Timeout 23: Found 2 live models and 0 in-flight non-inference requests
I0314 08:29:11.135188 3214233 server.cc:343] Timeout 22: Found 2 live models and 0 in-flight non-inference requests
I0314 08:29:12.135404 3214233 server.cc:343] Timeout 21: Found 2 live models and 0 in-flight non-inference requests
I0314 08:29:13.135590 3214233 server.cc:343] Timeout 20: Found 2 live models and 0 in-flight non-inference requests
I0314 08:29:14.135760 3214233 server.cc:343] Timeout 19: Found 2 live models and 0 in-flight non-inference requests
I0314 08:29:15.135923 3214233 server.cc:343] Timeout 18: Found 2 live models and 0 in-flight non-inference requests
I0314 08:29:16.136099 3214233 server.cc:343] Timeout 17: Found 2 live models and 0 in-flight non-inference requests
I0314 08:29:17.136272 3214233 server.cc:343] Timeout 16: Found 2 live models and 0 in-flight non-inference requests
I0314 08:29:18.136441 3214233 server.cc:343] Timeout 15: Found 2 live models and 0 in-flight non-inference requests
I0314 08:29:19.136620 3214233 server.cc:343] Timeout 14: Found 2 live models and 0 in-flight non-inference requests
I0314 08:29:20.136798 3214233 server.cc:343] Timeout 13: Found 2 live models and 0 in-flight non-inference requests
I0314 08:29:21.136992 3214233 server.cc:343] Timeout 12: Found 2 live models and 0 in-flight non-inference requests
I0314 08:29:22.137131 3214233 server.cc:343] Timeout 11: Found 2 live models and 0 in-flight non-inference requests
I0314 08:29:23.137326 3214233 server.cc:343] Timeout 10: Found 2 live models and 0 in-flight non-inference requests
I0314 08:29:24.137487 3214233 server.cc:343] Timeout 9: Found 2 live models and 0 in-flight non-inference requests
I0314 08:29:25.137660 3214233 server.cc:343] Timeout 8: Found 2 live models and 0 in-flight non-inference requests
I0314 08:29:26.137830 3214233 server.cc:343] Timeout 7: Found 2 live models and 0 in-flight non-inference requests
I0314 08:29:27.138005 3214233 server.cc:343] Timeout 6: Found 2 live models and 0 in-flight non-inference requests
I0314 08:29:28.138162 3214233 server.cc:343] Timeout 5: Found 2 live models and 0 in-flight non-inference requests
I0314 08:29:29.138354 3214233 server.cc:343] Timeout 4: Found 2 live models and 0 in-flight non-inference requests
I0314 08:29:30.138567 3214233 server.cc:343] Timeout 3: Found 2 live models and 0 in-flight non-inference requests
I0314 08:29:31.138757 3214233 server.cc:343] Timeout 2: Found 2 live models and 0 in-flight non-inference requests
I0314 08:29:32.138934 3214233 server.cc:343] Timeout 1: Found 2 live models and 0 in-flight non-inference requests
I0314 08:29:33.139113 3214233 server.cc:343] Timeout 0: Found 2 live models and 0 in-flight non-inference requests
E0314 08:29:33.139223 3214233 main.cc:517] failed to stop server: Internal - Exit timeout expired. Exiting immediately.
end to execute models
I0314 08:29:33.401635 3214233 libtorch.cc:2487] Failed to capture elapsed time: Internal - Failed to capture elapsed time: driver shutting down
I0314 08:29:33.401672 3214233 libtorch.cc:2487] Failed to capture elapsed time: Internal - Failed to capture elapsed time: driver shutting down
I0314 08:29:33.401761 3214233 backend_memory.cc:186] failed to free memory buffer: Internal - Failed to get device: driver shutting down
I0314 08:29:33.404131 3214233 libtorch.cc:2634] TRITONBACKEND_ModelInstanceFinalize: delete instance state
terminate called after throwing an instance of 'c10::Error'
what(): invalid device pointer: 0x7fb8f1200000
Exception raised from free at /opt/pytorch/pytorch/c10/cuda/CUDACachingAllocator.cpp:3170 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits, std::allocator >) + 0xae (0x7fb9dd1b12ce in /opt/tritonserver/backends/pytorch/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::__cxx11::basic_string<char, std::char_traits, std::allocator > const&) + 0xf3 (0x7fb9dd16798b in /opt/tritonserver/backends/pytorch/libc10.so)
frame #2: + 0x18834 (0x7fb9dd0e8834 in /opt/tritonserver/backends/pytorch/libc10_cuda.so)
frame #3: c10::StorageImpl::~StorageImpl() + 0x42 (0x7fb9ddbfe332 in /opt/tritonserver/backends/pytorch/libtorchtrt_runtime.so)
frame #4: c10::TensorImpl::~TensorImpl() + 0xd (0x7fb9dd18cedd in /opt/tritonserver/backends/pytorch/libc10.so)
frame #5: c10::intrusive_ptr<c10::intrusive_ptr_target, c10::UndefinedTensorImpl>::reset_() + 0xed (0x7fb9ddd3ea01 in /opt/tritonserver/backends/pytorch/libtorchvision.so)
frame #6: c10::intrusive_ptr<c10::intrusive_ptr_target, c10::UndefinedTensorImpl>::~intrusive_ptr() + 0x1c (0x7fb9ddd3dabe in /opt/tritonserver/backends/pytorch/libtorchvision.so)
frame #7: c10::IValue::destroy() + 0x97 (0x7fb9ddd3c483 in /opt/tritonserver/backends/pytorch/libtorchvision.so)
frame #8: c10::ivalue::Object::~Object() + 0x3c (0x7fb9ddc01c9c in /opt/tritonserver/backends/pytorch/libtorchtrt_runtime.so)
frame #9: + 0x3058b (0x7fb9de4cc58b in /opt/tritonserver/backends/pytorch/libtriton_pytorch.so)
frame #10: + 0x340e8 (0x7fb9de4d00e8 in /opt/tritonserver/backends/pytorch/libtriton_pytorch.so)
frame #11: + 0x30e7a (0x7fb9de4cce7a in /opt/tritonserver/backends/pytorch/libtriton_pytorch.so)
frame #12: + 0x18b15 (0x7fb9de4b4b15 in /opt/tritonserver/backends/pytorch/libtriton_pytorch.so)
frame #13: TRITONBACKEND_ModelInstanceFinalize + 0x218 (0x7fb9de4b5818 in /opt/tritonserver/backends/pytorch/libtriton_pytorch.so)
frame #14: + 0x19f35a (0x7fb9ea7a335a in /opt/tritonserver/bin/../lib/libtritonserver.so)
frame #15: + 0x1a5e96 (0x7fb9ea7a9e96 in /opt/tritonserver/bin/../lib/libtritonserver.so)
frame #16: + 0x18e6ef (0x7fb9ea7926ef in /opt/tritonserver/bin/../lib/libtritonserver.so)
frame #17: + 0x18eddd (0x7fb9ea792ddd in /opt/tritonserver/bin/../lib/libtritonserver.so)
frame #18: + 0x26c187 (0x7fb9ea870187 in /opt/tritonserver/bin/../lib/libtritonserver.so)
frame #19: + 0xdc253 (0x7fb9ea072253 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6)
frame #20: + 0x94b43 (0x7fb9e9e02b43 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #21: + 0x126a00 (0x7fb9e9e94a00 in /usr/lib/x86_64-linux-gnu/libc.so.6)

@kthui
Contributor

kthui commented Mar 19, 2024

Hi @vonchenplus, I think the quickest way to find the version that fixed the problem is to check every release from 23.08 onward using the pre-built containers from NGC until you find one that works. Otherwise, you could also reach out to the PyTorch team for help.

@kthui
Contributor

kthui commented Mar 19, 2024

Hi @nhthanh0809, without a complete reproduction we are not able to pinpoint whether the issue is in the PyTorch framework or somewhere else. Please feel free to re-open this issue if you need this followed up.

@kthui kthui closed this as completed Mar 19, 2024
@kthui kthui removed the bug Something isn't working label Mar 19, 2024
@thanhnguyentung95

thanhnguyentung95 commented Apr 18, 2024

I encountered the same issue in version 23.08-py3, but it was not present in version 24.02. I'm not sure which version fixed this problem.

I encountered the same issue in the two newest versions, 24.02-py3 and 24.03-py3. I think this issue is model-dependent.

@thanhnguyentung95

thanhnguyentung95 commented Apr 19, 2024

I found that this issue only occurs when running on GPU. Inference on CPU is fine.
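
For anyone who wants to try the CPU path as a workaround, a minimal sketch of the relevant config.pbtxt addition (standard Triton instance_group syntax; the count value is only an example):

instance_group [
  {
    count: 1
    kind: KIND_CPU
  }
]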

@lutianming

We encountered the same issue. After several attempts, we found that setting DISABLE_OPTIMIZED_EXECUTION resolves it.
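
For reference, a sketch of how that flag is typically set in the model's config.pbtxt for the PyTorch backend (parameter name taken from the comment above; "true" disables optimized execution):

parameters: {
  key: "DISABLE_OPTIMIZED_EXECUTION"
  value: {
    string_value: "true"
  }
}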
