
unable to load custom python environment with python backend #3480

Closed
stellaywu opened this issue Oct 18, 2021 · 24 comments
Labels
bug Something isn't working

Comments

@stellaywu

stellaywu commented Oct 18, 2021

I'm trying to use a custom environment for a PyTorch model served with the Python backend.
This is the config file:

name: "model1"
backend: "python"

input [
  {
    name: "INPUT0"
    data_type: TYPE_FP32
    dims: [ 3 ]
  }
]
output [
  {
    name: "OUTPUT0"
    data_type: TYPE_FP32
    dims: [ 2 ]
  }
]

instance_group [{ kind: KIND_CPU }]

parameters: {
  key: "EXECUTION_ENV_PATH",
  value: {string_value: "$$TRITON_MODEL_DIRECTORY/model1/python-3-8.tar.gz"}
}

The file structure is like this:

|-- model1
|   |-- 1
|   |   |-- model_ckpt.pb
|   |   `-- model.py
|   |-- config.pbtxt
|   |-- python-3-8.tar.gz
|   `-- triton_python_backend_stub
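
For reference, the model.py in the tree above has to implement the Python backend's TritonPythonModel interface. A minimal sketch matching the INPUT0/OUTPUT0 config (the checkpoint loading and the placeholder computation here are hypothetical, not the actual model):

import numpy as np
import triton_python_backend_utils as pb_utils

class TritonPythonModel:
    def initialize(self, args):
        # args contains the model repository path, version, etc.;
        # loading model_ckpt.pb would happen here (omitted).
        pass

    def execute(self, requests):
        responses = []
        for request in requests:
            in0 = pb_utils.get_input_tensor_by_name(request, "INPUT0").as_numpy()
            # Placeholder: a real model would run inference on in0 here.
            out0 = np.zeros(2, dtype=np.float32)
            out_tensor = pb_utils.Tensor("OUTPUT0", out0)
            responses.append(pb_utils.InferenceResponse(output_tensors=[out_tensor]))
        return responses

    def finalize(self):
        pass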

I'm getting the error:
UNAVAILABLE: Internal: Failed to get the canonical path for $$TRITON_MODEL_DIRECTORY/model1/python-3-8.tar.gz.

Please help!

@Tabrizian
Member

What is the Triton version that you are using? $$TRITON_MODEL_DIRECTORY was introduced in 21.09.

@stellaywu
Author

Thank you @Tabrizian! Updating the Triton version to 21.09 solved the original problem, but now I've encountered another problem: the server doesn't start. Below are the details of the setup and the error; I appreciate your help!

We are trying to deploy the 21.09 version using Kubernetes; however, when pulling the image and trying to run it, we get the following issues:

- command:
    - /bin/bash
    - -c
  args:
    - triton-server gs://bucket/model-path

The error for the above command is:

/bin/bash: triton-server: command not found

When we try to run it using the following command:

- command:
    - /bin/bash
    - -c
  args:
    - /opt/tritonserver/bin/tritonserver gs://bucket/model-path

Instead of running the server, it prints out:

Error: Failed to initialize NVML
W1019 14:21:36.711581 1 metrics.cc:213] DCGM unable to start: DCGM initialization error
I1019 14:21:37.316686 1 libtorch.cc:1030] TRITONBACKEND_Initialize: pytorch
I1019 14:21:37.316761 1 libtorch.cc:1040] Triton TRITONBACKEND API version: 1.5
I1019 14:21:37.316781 1 libtorch.cc:1046] 'pytorch' TRITONBACKEND API version: 1.5
2021-10-19 14:21:37.599704: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
I1019 14:21:37.704036 1 tensorflow.cc:2170] TRITONBACKEND_Initialize: tensorflow
I1019 14:21:37.704099 1 tensorflow.cc:2180] Triton TRITONBACKEND API version: 1.5
I1019 14:21:37.704118 1 tensorflow.cc:2186] 'tensorflow' TRITONBACKEND API version: 1.5
I1019 14:21:37.704136 1 tensorflow.cc:2210] backend configuration:
{}
I1019 14:21:37.709603 1 onnxruntime.cc:1997] TRITONBACKEND_Initialize: onnxruntime
I1019 14:21:37.709657 1 onnxruntime.cc:2007] Triton TRITONBACKEND API version: 1.5
I1019 14:21:37.709695 1 onnxruntime.cc:2013] 'onnxruntime' TRITONBACKEND API version: 1.5
I1019 14:21:37.752110 1 openvino.cc:1193] TRITONBACKEND_Initialize: openvino
I1019 14:21:37.752172 1 openvino.cc:1203] Triton TRITONBACKEND API version: 1.5
I1019 14:21:37.752196 1 openvino.cc:1209] 'openvino' TRITONBACKEND API version: 1.5
W1019 14:21:37.752478 1 pinned_memory_manager.cc:236] Unable to allocate pinned system memory, pinned memory pool will not be available: CUDA driver version is insufficient for CUDA runtime version
I1019 14:21:37.752620 1 cuda_memory_manager.cc:115] CUDA memory pool disabled
I1019 14:21:39.417718 1 model_repository_manager.cc:1022] loading: detectron2:1
I1019 14:21:43.432737 1 python.cc:1529] Using Python execution env /tmp/folder5yvdog/python-3-8.tar.gz
I1019 14:21:43.432918 1 python.cc:1796] TRITONBACKEND_ModelInstanceInitialize: detectron2_0 (CPU device 0)
/opt/tritonserver
total 2984
-rw-rw-r-- 1 triton-server triton-server  1485 Sep 23 17:32 LICENSE
-rw-rw-r-- 1 triton-server triton-server 3012640 Sep 23 17:32 NVIDIA_Deep_Learning_Container_License.pdf
-rw-rw-r-- 1 triton-server triton-server    7 Sep 23 17:32 TRITON_VERSION
drwxr-xr-x 1 triton-server triton-server  4096 Sep 23 18:18 backends
drwxr-xr-x 2 triton-server triton-server  4096 Sep 23 18:17 bin
drwxr-xr-x 1 triton-server triton-server  4096 Sep 23 18:17 include
drwxr-xr-x 2 triton-server triton-server  4096 Sep 23 18:17 lib
-rwxrwxr-x 1 triton-server triton-server  3982 Sep 23 17:32 nvidia_entrypoint.sh
drwxr-xr-x 3 triton-server triton-server  4096 Sep 23 18:20 repoagents
drwxr-xr-x 2 triton-server triton-server  4096 Sep 23 18:17 third-party-src

@Tabrizian
Member

I think you need to wait long enough, since your execution environment can be very large and it sometimes needs to be downloaded from S3.

@stellaywu
Author

@Tabrizian We've waited for a long time (hours) and it's still stuck at the same place, with no further output. I'm wondering if it has to do with the custom environment: when we remove the customized environment it goes through to loading the models, but with the custom environment it stops at the above message.

@stellaywu
Author

stellaywu commented Oct 19, 2021

@Tabrizian To provide more context, it's a Detectron2 model. We are also trying to serve it with the PyTorch backend while figuring out the Python backend.
Environment

ubuntu 20.04
python 3.8.10
pytorch==1.9.1+cpu
torchvision==0.10.1+cpu
detectron2==0.5
tritonserver=21.08

Following the answer here, #2025 (comment),
I run into the following issue:

class Wrapper(torch.nn.Module):
    def __init__(self):
        super(Wrapper, self).__init__()
        self.model = torch.jit.load("traced_pytorch_model_ckpt.pt")

    def forward(self, x: torch.Tensor, y: torch.Tensor):
        return self.model.forward((x, y))

m = torch.jit.script(Wrapper())

returns this error:

RuntimeError                              Traceback (most recent call last)
/tmp/ipykernel_72/2879316277.py in <module>
      8         return self.model.forward((x, y))
      9 
---> 10 m = torch.jit.script(Wrapper())

/opt/conda/lib/python3.8/site-packages/torch/jit/_script.py in script(obj, optimize, _frames_up, _rcb)
   1094     if isinstance(obj, torch.nn.Module):
   1095         obj = call_prepare_scriptable_func(obj)
-> 1096         return torch.jit._recursive.create_script_module(
   1097             obj, torch.jit._recursive.infer_methods_to_compile
   1098         )

/opt/conda/lib/python3.8/site-packages/torch/jit/_recursive.py in create_script_module(nn_module, stubs_fn, share_types)
    410     concrete_type = get_module_concrete_type(nn_module, share_types)
    411     AttributeTypeIsSupportedChecker().check(nn_module)
--> 412     return create_script_module_impl(nn_module, concrete_type, stubs_fn)
    413 
    414 def create_script_module_impl(nn_module, concrete_type, stubs_fn):

/opt/conda/lib/python3.8/site-packages/torch/jit/_recursive.py in create_script_module_impl(nn_module, concrete_type, stubs_fn)
    476     # Compile methods if necessary
    477     if concrete_type not in concrete_type_store.methods_compiled:
--> 478         create_methods_and_properties_from_stubs(concrete_type, method_stubs, property_stubs)
    479         # Create hooks after methods to ensure no name collisions between hooks and methods.
    480         # If done before, hooks can overshadow methods that aren't exported.

/opt/conda/lib/python3.8/site-packages/torch/jit/_recursive.py in create_methods_and_properties_from_stubs(concrete_type, method_stubs, property_stubs)
    353     property_rcbs = [p.resolution_callback for p in property_stubs]
    354 
--> 355     concrete_type._create_methods_and_properties(property_defs, property_rcbs, method_defs, method_rcbs, method_defaults)
    356 
    357 def create_hooks_from_stubs(concrete_type, hook_stubs, pre_hook_stubs):

RuntimeError: 

forward(__torch__.detectron2.export.flatten.TracingAdapter self, Tensor argument_1) -> ((Tensor, Tensor, Tensor, Tensor)):
Expected a value of type 'Tensor' for argument 'argument_1' but instead found type 'Tuple[Tensor, Tensor]'.
:
  File "/tmp/ipykernel_72/2879316277.py", line 8
    def forward(self, x: torch.Tensor, y: torch.Tensor):
        return self.model.forward((x, y))
               ~~~~~~~~~~~~~~~~~~ <--- HERE

The error message above appears to be asking for a single tensor input instead of a tuple of two inputs; a possible fix is sketched below.
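
Based on the signature in the error (forward of the traced TracingAdapter takes a single Tensor), one possible fix is a wrapper like this (sketch only, assuming the traced model really does take a single image tensor):

import torch

class Wrapper(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.model = torch.jit.load("traced_pytorch_model_ckpt.pt")

    def forward(self, x: torch.Tensor):
        # The traced forward() reported above accepts one Tensor, not a tuple.
        return self.model(x)

m = torch.jit.script(Wrapper())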

If I skip this dummy wrapper and use the traced model for inference directly, I get this error, which indicates it needs two inputs:

input shape (720, 1280, 3) int64
Traceback (most recent call last):
  File "triton_client.py", line 109, in <module>
    run_inference('/root/image_large.jpeg')
  File "triton_client.py", line 85, in run_inference
    response = client.infer(model_name, inputs, request_id=str(1), outputs=outputs)
  File "/opt/conda/lib/python3.8/site-packages/tritonclient/http/__init__.py", line 1256, in infer
    _raise_if_error(response)
  File "/opt/conda/lib/python3.8/site-packages/tritonclient/http/__init__.py", line 64, in _raise_if_error
    raise error
tritonclient.utils.InferenceServerException: PyTorch execute failure: The following operation failed in the TorchScript interpreter.
Traceback of TorchScript, serialized code (most recent call last):
  File "code/__torch__/detectron2/export/flatten/___torch_mangle_610.py", line 16, in forward
    _5 = self.model.pixel_mean
    x = torch.to(argument_1, dtype=0, layout=0, device=torch.device("cpu"))
    t = torch.div(torch.sub(x, _5), _4)
                  ~~~~~~~~~ <--- HERE
    _6 = ops.prim.NumToTensor(torch.size(t, 1))
    _7 = ops.prim.NumToTensor(torch.size(t, 2))

Traceback of TorchScript, original code (most recent call last):
/opt/conda/lib/python3.8/site-packages/detectron2/modeling/meta_arch/retinanet.py(493): <listcomp>
/opt/conda/lib/python3.8/site-packages/detectron2/modeling/meta_arch/retinanet.py(493): preprocess_image
/opt/conda/lib/python3.8/site-packages/detectron2/modeling/meta_arch/retinanet.py(251): forward
/tmp/ipykernel_7849/3278119026.py(3): inference_func
/opt/conda/lib/python3.8/site-packages/detectron2/export/flatten.py(293): forward
/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py(1039): _slow_forward
/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py(1051): _call_impl
/opt/conda/lib/python3.8/site-packages/torch/jit/_trace.py(952): trace_module
/opt/conda/lib/python3.8/site-packages/torch/jit/_trace.py(735): trace
/tmp/ipykernel_7849/3278119026.py(11): <module>
/opt/conda/lib/python3.8/site-packages/IPython/core/interactiveshell.py(3444): run_code
/opt/conda/lib/python3.8/site-packages/IPython/core/interactiveshell.py(3364): run_ast_nodes
/opt/conda/lib/python3.8/site-packages/IPython/core/interactiveshell.py(3172): run_cell_async
/opt/conda/lib/python3.8/site-packages/IPython/core/async_helpers.py(68): _pseudo_sync_runner
/opt/conda/lib/python3.8/site-packages/IPython/core/interactiveshell.py(2947): _run_cell
/opt/conda/lib/python3.8/site-packages/IPython/core/interactiveshell.py(2901): run_cell
/opt/conda/lib/python3.8/site-packages/ipykernel/zmqshell.py(533): run_cell
/opt/conda/lib/python3.8/site-packages/ipykernel/ipkernel.py(353): do_execute
/opt/conda/lib/python3.8/site-packages/ipykernel/kernelbase.py(648): execute_request
/opt/conda/lib/python3.8/site-packages/ipykernel/kernelbase.py(353): dispatch_shell
/opt/conda/lib/python3.8/site-packages/ipykernel/kernelbase.py(446): process_one
/opt/conda/lib/python3.8/site-packages/ipykernel/kernelbase.py(457): dispatch_queue
/opt/conda/lib/python3.8/asyncio/events.py(81): _run
/opt/conda/lib/python3.8/asyncio/base_events.py(1859): _run_once
/opt/conda/lib/python3.8/asyncio/base_events.py(570): run_forever
/opt/conda/lib/python3.8/site-packages/tornado/platform/asyncio.py(199): start
/opt/conda/lib/python3.8/site-packages/ipykernel/kernelapp.py(677): start
/opt/conda/lib/python3.8/site-packages/traitlets/config/application.py(846): launch_instance
/opt/conda/lib/python3.8/site-packages/ipykernel_launcher.py(16): <module>
/opt/conda/lib/python3.8/runpy.py(87): _run_code
/opt/conda/lib/python3.8/runpy.py(194): _run_module_as_main
RuntimeError: The size of tensor a (720) must match the size of tensor b (3) at non-singleton dimension 0

Thank you!

@Tabrizian
Member

@CoderHam Could you help with how they can serve it with PyTorch backend?

@stellaywu Can you provide all the details about how you created the environment so that we can look into it?

@CoderHam
Contributor

@stellaywu you would need to trace/script your model. Why do you need to create the wrapper for your model before scripting/tracing it?
Refer to https://pytorch.org/docs/stable/generated/torch.jit.trace.html for the steps to produce the TorchScript model; a short example follows.
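
For illustration, the basic flow looks like this (the model and example input below are placeholders, not your Detectron2 model):

import torch

model = torch.nn.Linear(3, 2).eval()   # placeholder eager-mode model
example_input = torch.randn(1, 3)      # placeholder example input

traced = torch.jit.trace(model, example_input)
traced.save("model.pt")                # place under <model_repo>/<model>/1/ for the PyTorch backend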

@stellaywu
Author

@CoderHam thanks for replying! I traced the model before creating the wrapper; the traced model is the input to the wrapper. traced_pytorch_model_ckpt.pt is the traced model.

@arshia-shakudo

@Tabrizian Appreciate your help so far. We installed the custom packages with conda, used conda-pack to pack them, and we are serving the packed environment from a cloud provider (GCP).

The CUDA version used during the custom package installation was cuda-11-0. The Triton version we used to create the custom environment is also 21.09.

We then set the CUDA root directory using:

cmake -D CUDA_TOOLKIT_ROOT_DIR=/usr/local/cuda-11.0 ..

And then we followed the steps outlined in the documentation here.
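
For reference, a minimal sketch of the packing step using conda-pack's Python API (the environment name and output path are placeholders; the command-line form of conda-pack is equivalent):

import conda_pack

# Pack the conda environment that contains torch, torchvision, detectron2, etc.
conda_pack.pack(name="python-3-8", output="python-3-8.tar.gz")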

@Tabrizian
Member

Tabrizian commented Oct 20, 2021

@arshiamalek We have not tested the Custom Python Execution Environments with GCS. We have only tested this with Amazon S3, so there could be issues here. Have you tried not using GCS to see whether it works or not? I have filed a ticket to add testing for GCS.

@stellaywu
Author

stellaywu commented Oct 21, 2021

@CoderHam I managed to get past the original issue and run model inference on Triton with the PyTorch backend. However, the inference results from Triton are quite different from direct inference with the original PyTorch model. The scripted model was created following this code in the Detectron2 repo. Could you please help identify what could be the cause? A local-versus-Triton comparison sketch follows.
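
One way to narrow this down is to feed the exact same preprocessed tensor to the scripted model locally and to Triton and compare the outputs. A sketch (the model name, input/output names, and shapes below are hypothetical and need to match the actual config.pbtxt):

import numpy as np
import torch
import tritonclient.http as httpclient

arr = np.random.rand(3, 720, 1280).astype(np.float32)   # placeholder preprocessed image

# Local TorchScript inference.
scripted = torch.jit.load("model.pt").eval()
with torch.no_grad():
    local_out = scripted(torch.from_numpy(arr))
local_out = local_out[0] if isinstance(local_out, tuple) else local_out

# Triton inference with the same tensor.
client = httpclient.InferenceServerClient(url="localhost:8000")
inp = httpclient.InferInput("INPUT__0", list(arr.shape), "FP32")
inp.set_data_from_numpy(arr)
result = client.infer("detectron2", inputs=[inp])
triton_out = result.as_numpy("OUTPUT__0")

# Maximum elementwise difference between the two outputs.
print(np.abs(local_out.numpy() - triton_out).max())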

@stella-ds

@CoderHam I'm having another issue when running the deployment on Triton with GPU, similar to issue #2024.

I'm using Triton 21.09, and the traced model is able to run outside of Triton.

Here is the full trace. Appreciate your help!

InferenceServerException: PyTorch execute failure: The following operation failed in the TorchScript interpreter.
Traceback of TorchScript, serialized code (most recent call last):
  File "code/__torch__/detectron2/export/flatten.py", line 21, in forward
    image_size = torch.stack([_6, _7])
    max_size, _8 = torch.max(torch.stack([image_size]), 0)
    _9 = torch.floor_divide(torch.add(max_size, CONSTANTS.c0), CONSTANTS.c1)
         ~~~~~~~~~~~~~~~~~~ <--- HERE
    max_size0 = torch.mul(_9, CONSTANTS.c1)
    _10 = torch.sub(torch.select(max_size0, 0, -1), torch.select(image_size, 0, 1))

Traceback of TorchScript, original code (most recent call last):
/usr/local/lib/python3.7/dist-packages/torch/_tensor.py(575): __floordiv__
/usr/local/lib/python3.7/dist-packages/torch/_tensor.py(29): wrapped
/usr/local/lib/python3.7/dist-packages/detectron2/structures/image_list.py(99): from_tensors
/usr/local/lib/python3.7/dist-packages/detectron2/modeling/meta_arch/retinanet.py(494): preprocess_image
/usr/local/lib/python3.7/dist-packages/detectron2/modeling/meta_arch/retinanet.py(251): forward
<ipython-input-21-7b355d3f80b6>(3): inference_func
/usr/local/lib/python3.7/dist-packages/detectron2/export/flatten.py(293): forward
/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py(1039): _slow_forward
/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py(1051): _call_impl
/usr/local/lib/python3.7/dist-packages/torch/jit/_trace.py(959): trace_module
/usr/local/lib/python3.7/dist-packages/torch/jit/_trace.py(744): trace
<ipython-input-21-7b355d3f80b6>(10): <module>
/usr/local/lib/python3.7/dist-packages/IPython/core/interactiveshell.py(2882): run_code
/usr/local/lib/python3.7/dist-packages/IPython/core/interactiveshell.py(2822): run_ast_nodes
/usr/local/lib/python3.7/dist-packages/IPython/core/interactiveshell.py(2718): run_cell
/usr/local/lib/python3.7/dist-packages/ipykernel/zmqshell.py(537): run_cell
/usr/local/lib/python3.7/dist-packages/ipykernel/ipkernel.py(208): do_execute
/usr/local/lib/python3.7/dist-packages/ipykernel/kernelbase.py(399): execute_request
/usr/local/lib/python3.7/dist-packages/ipykernel/kernelbase.py(233): dispatch_shell
/usr/local/lib/python3.7/dist-packages/ipykernel/kernelbase.py(283): dispatcher
/usr/local/lib/python3.7/dist-packages/tornado/stack_context.py(300): null_wrapper
/usr/local/lib/python3.7/dist-packages/zmq/eventloop/zmqstream.py(431): _run_callback
/usr/local/lib/python3.7/dist-packages/zmq/eventloop/zmqstream.py(481): _handle_recv
/usr/local/lib/python3.7/dist-packages/zmq/eventloop/zmqstream.py(452): _handle_events
/usr/local/lib/python3.7/dist-packages/tornado/stack_context.py(300): null_wrapper
/usr/local/lib/python3.7/dist-packages/tornado/platform/asyncio.py(122): _handle_events
/usr/lib/python3.7/asyncio/events.py(88): _run
/usr/lib/python3.7/asyncio/base_events.py(1786): _run_once
/usr/lib/python3.7/asyncio/base_events.py(541): run_forever
/usr/local/lib/python3.7/dist-packages/tornado/platform/asyncio.py(132): start
/usr/local/lib/python3.7/dist-packages/ipykernel/kernelapp.py(499): start
/usr/local/lib/python3.7/dist-packages/traitlets/config/application.py(846): launch_instance
/usr/local/lib/python3.7/dist-packages/ipykernel_launcher.py(16): <module>
/usr/lib/python3.7/runpy.py(85): _run_code
/usr/lib/python3.7/runpy.py(193): _run_module_as_main
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu!

@arshia-shakudo

@Tabrizian One issue we hit when adding our custom Python backend stub is this:

triton_python_backend_stub: ../nptl/pthread_mutex_lock.c:81: __pthread_mutex_lock: Assertion `mutex->__data.__owner == 0' failed.

This is thrown when we create a custom backend stub as outlined here.

We are on Triton version 21.09, and the backend stub was created using conda 11.0.

@CoderHam
Contributor

I'm using Triton 21.09, and the traced model is able to run outside of Triton.

Did you run the model using the PyTorch C API (libtorch) or the Python API? Since Triton uses the C API, we want to ensure we are making an apples-to-apples comparison.

@Tabrizian
Member

@arshiamalek Have you built the triton_python_backend_stub using the correct branch? You need to clone the r21.09 branch of the Python backend.

@arshia-shakudo

@Tabrizian Yes I used the r21.09 branch

@Tabrizian
Member

Can you open a separate issue and put all the details there (i.e. your model repository, your client, the steps you followed to build the Custom Python Execution Environment)? Thanks

@stella-ds

@Tabrizian made a new issue #3495, thanks for your help!

@stella-ds

stella-ds commented Oct 27, 2021

@Tabrizian I gave it another try with the Python backend on GPU, and produced the python-3-8.tar.gz file and the triton_python_backend_stub file on Triton 21.09 with Python 3.8 and Ubuntu 20.04.

I used this command in my YAML to start the Triton server:
/opt/tritonserver/bin/tritonserver --model-store=gs://triton/model_repository --model-control-mode=poll --repository-poll-secs=10

We also tried starting the Triton pod and running the server with the above command inside the pod, with the same results.

I can see the files for the Python backend model in my Triton server pod at /tmp/folderJjSKvf:

/tmp/folderJjSKvf:
total 2524064
drwx------ 3 root root       4096 Oct 27 17:10 1
---x------ 1 root root     480608 Oct 27 17:09 triton_python_backend_stub
-rw-r--r-- 1 root root 2584142637 Oct 27 17:09 python-3-8.tar.gz
-rw-r--r-- 1 root root        590 Oct 27 17:09 config.pbtxt

The Triton server didn't throw any error message but hangs with this message at the end:

I1027 18:14:48.374443 1 metrics.cc:290] Collecting metrics for GPU 0: Tesla P4
I1027 18:14:48.655109 1 libtorch.cc:1030] TRITONBACKEND_Initialize: pytorch
I1027 18:14:48.655140 1 libtorch.cc:1040] Triton TRITONBACKEND API version: 1.5
I1027 18:14:48.655145 1 libtorch.cc:1046] 'pytorch' TRITONBACKEND API version: 1.5
2021-10-27 18:14:48.788321: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
I1027 18:14:48.833502 1 tensorflow.cc:2170] TRITONBACKEND_Initialize: tensorflow
I1027 18:14:48.833533 1 tensorflow.cc:2180] Triton TRITONBACKEND API version: 1.5
I1027 18:14:48.833538 1 tensorflow.cc:2186] 'tensorflow' TRITONBACKEND API version: 1.5
I1027 18:14:48.833543 1 tensorflow.cc:2210] backend configuration:
{}
I1027 18:14:48.835179 1 onnxruntime.cc:1997] TRITONBACKEND_Initialize: onnxruntime
I1027 18:14:48.835202 1 onnxruntime.cc:2007] Triton TRITONBACKEND API version: 1.5
I1027 18:14:48.835207 1 onnxruntime.cc:2013] 'onnxruntime' TRITONBACKEND API version: 1.5
I1027 18:14:48.855404 1 openvino.cc:1193] TRITONBACKEND_Initialize: openvino
I1027 18:14:48.855430 1 openvino.cc:1203] Triton TRITONBACKEND API version: 1.5
I1027 18:14:48.855435 1 openvino.cc:1209] 'openvino' TRITONBACKEND API version: 1.5
I1027 18:14:49.169905 1 pinned_memory_manager.cc:240] Pinned memory pool is created at '0x7f162e000000' with size 268435456
I1027 18:14:49.170365 1 cuda_memory_manager.cc:105] CUDA memory pool is created on device 0 with size 67108864
I1027 18:14:50.788433 1 model_repository_manager.cc:1022] loading: detectron2_python:1
I1027 18:15:02.811446 1 python.cc:1529] Using Python execution env /tmp/folderiuMPlh/python-3-8.tar.gz
I1027 18:15:02.812793 1 python.cc:1796] TRITONBACKEND_ModelInstanceInitialize: detectron2_python_0 (GPU device 0)
/opt/tritonserver
total 2984
-rw-rw-r-- 1 triton-server triton-server    1485 Sep 23 17:32 LICENSE
-rw-rw-r-- 1 triton-server triton-server 3012640 Sep 23 17:32 NVIDIA_Deep_Learning_Container_License.pdf
-rw-rw-r-- 1 triton-server triton-server       7 Sep 23 17:32 TRITON_VERSION
drwxr-xr-x 1 triton-server triton-server    4096 Sep 23 18:18 backends
drwxr-xr-x 2 triton-server triton-server    4096 Sep 23 18:17 bin
drwxr-xr-x 1 triton-server triton-server    4096 Sep 23 18:17 include
drwxr-xr-x 2 triton-server triton-server    4096 Sep 23 18:17 lib
-rwxrwxr-x 1 triton-server triton-server    3982 Sep 23 17:32 nvidia_entrypoint.sh
drwxr-xr-x 3 triton-server triton-server    4096 Sep 23 18:20 repoagents
drwxr-xr-x 2 triton-server triton-server    4096 Sep 23 18:17 third-party-src

Any suggestions? Thanks!

@Tabrizian
Member

There is a known issue in the Python backend with regard to polling mode. Does it work properly when you do not use the --model-control-mode=poll flag?

@stella-ds

stella-ds commented Oct 27, 2021

Thanks for replying, @Tabrizian. I tried with --model-control-mode="none" and also with the flag removed entirely; both produce the same logs and get stuck.

root@hyperplane-triton-gpu-78d8f84688-s7spv:/opt/tritonserver# /opt/tritonserver/bin/tritonserver --model-store=gs://moichor-hyperplane/triton/model_repository_gpu --model-control-mode="none"
I1027 23:03:17.731165 55 metrics.cc:290] Collecting metrics for GPU 0: Tesla P4
I1027 23:03:18.024020 55 libtorch.cc:1030] TRITONBACKEND_Initialize: pytorch
I1027 23:03:18.024055 55 libtorch.cc:1040] Triton TRITONBACKEND API version: 1.5
I1027 23:03:18.024061 55 libtorch.cc:1046] 'pytorch' TRITONBACKEND API version: 1.5
2021-10-27 23:03:18.168753: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
I1027 23:03:18.217710 55 tensorflow.cc:2170] TRITONBACKEND_Initialize: tensorflow
I1027 23:03:18.217753 55 tensorflow.cc:2180] Triton TRITONBACKEND API version: 1.5
I1027 23:03:18.217760 55 tensorflow.cc:2186] 'tensorflow' TRITONBACKEND API version: 1.5
I1027 23:03:18.217767 55 tensorflow.cc:2210] backend configuration:
{}
I1027 23:03:18.219686 55 onnxruntime.cc:1997] TRITONBACKEND_Initialize: onnxruntime
I1027 23:03:18.219722 55 onnxruntime.cc:2007] Triton TRITONBACKEND API version: 1.5
I1027 23:03:18.219729 55 onnxruntime.cc:2013] 'onnxruntime' TRITONBACKEND API version: 1.5
I1027 23:03:18.241311 55 openvino.cc:1193] TRITONBACKEND_Initialize: openvino
I1027 23:03:18.241359 55 openvino.cc:1203] Triton TRITONBACKEND API version: 1.5
I1027 23:03:18.241366 55 openvino.cc:1209] 'openvino' TRITONBACKEND API version: 1.5
I1027 23:03:18.558964 55 pinned_memory_manager.cc:240] Pinned memory pool is created at '0x7ff5d0000000' with size 268435456
I1027 23:03:18.559431 55 cuda_memory_manager.cc:105] CUDA memory pool is created on device 0 with size 67108864
I1027 23:03:20.197244 55 model_repository_manager.cc:1022] loading: detectron2_python:1
I1027 23:03:29.449002 55 python.cc:1529] Using Python execution env /tmp/folder3GM7qk/python-3-8.tar.gz
I1027 23:03:29.450132 55 python.cc:1796] TRITONBACKEND_ModelInstanceInitialize: detectron2_python_0 (GPU device 0)
/opt/tritonserver
total 2984
-rw-rw-r-- 1 triton-server triton-server    1485 Sep 23 17:32 LICENSE
-rw-rw-r-- 1 triton-server triton-server 3012640 Sep 23 17:32 NVIDIA_Deep_Learning_Container_License.pdf
-rw-rw-r-- 1 triton-server triton-server       7 Sep 23 17:32 TRITON_VERSION
drwxr-xr-x 1 triton-server triton-server    4096 Sep 23 18:18 backends
drwxr-xr-x 2 triton-server triton-server    4096 Sep 23 18:17 bin
drwxr-xr-x 1 triton-server triton-server    4096 Sep 23 18:17 include
drwxr-xr-x 2 triton-server triton-server    4096 Sep 23 18:17 lib
-rwxrwxr-x 1 triton-server triton-server    3982 Sep 23 17:32 nvidia_entrypoint.sh
drwxr-xr-x 3 triton-server triton-server    4096 Sep 23 18:20 repoagents
drwxr-xr-x 2 triton-server triton-server    4096 Sep 23 18:17 third-party-src

@Tabrizian
Member

Strange. As I mentioned earlier, we have not tested Execution Environments with GCS in the Python backend. I have filed a ticket to add testing for it.

cc @msalehiNV

@tanmayv25
Contributor

@Tabrizian Can you attach appropriate labels to this issue and link it to the ticket you have created?

@Tabrizian added the bug (Something isn't working) label on Dec 7, 2021
@dyastremsky
Contributor

Closing this issue due to its age and inactivity. Triton has changed a lot in the last two years. A feature that allows custom execution environments was also added recently here, if helpful.

For any issues in Triton's recent releases, please open a new issue following the bug template. We need clear steps to reproduce your issue, including the models and files needed. If you cannot share your model, feel free to use a sample model that reproduces the same issue.
