
[BUG and QUESTION] Merlin inference container 22.03 from NGC not working #142

Closed
leiterenato opened this issue Mar 10, 2022 · 10 comments · Fixed by #135

@leiterenato

BUG

The tritonserver binary is not installed in this container (merlin-inference:22.03): https://catalog.ngc.nvidia.com/orgs/nvidia/teams/merlin/containers/merlin-inference.

root@c29b08874c52:/# tritonserver
bash: tritonserver: command not found
root@c29b08874c52:/# cd opt/tritonserver/
root@c29b08874c52:/opt/tritonserver# ls
backends

To reproduce this error, start a new merlin-inference container (22.03) and try to invoke tritonserver.
How should I invoke tritonserver with this new container?

QUESTION
I am trying to load an ensemble model trained with HugeCTR into Triton.
I am using the following script to start the server with the ensemble:

#!/bin/bash

# Use /model as the default repository when no path is passed as the first argument.
if [ -z "$1" ]; then
    MODEL_REPOSITORY=/model
else
    MODEL_REPOSITORY=$1
fi

echo "Copying model ensemble to local folder"
mkdir -p "${MODEL_REPOSITORY}"
gsutil -m cp -r "${GCP_PATH}"/* "${MODEL_REPOSITORY}"

# Create an empty version folder for the ensemble model
ENSEMBLE_DIR=$(ls "${MODEL_REPOSITORY}" | grep ens)
mkdir -p "${MODEL_REPOSITORY}/${ENSEMBLE_DIR}/1"

echo "Starting Triton Server"
LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libjemalloc.so.2 tritonserver \
    --allow-vertex-ai=false \
    --model-repository="${MODEL_REPOSITORY}" \
    --backend-config=hugectr,ps="${MODEL_REPOSITORY}/ps.json"

Am I starting the container correctly?
I really appreciate any help.
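
For reference, the container is launched with something along these lines (the host paths, script location, and bucket name are placeholders, not the exact setup):

# Hypothetical launch command: mounts the startup script, sets the GCS path,
# and exposes Triton's default HTTP (8000), gRPC (8001), and metrics (8002) ports.
docker run --gpus=all --rm -it --ipc=host \
  -p 8000:8000 -p 8001:8001 -p 8002:8002 \
  -v $(pwd)/start_triton.sh:/src/start_triton.sh \
  -e GCP_PATH=gs://my-bucket/hugectr-ensemble \
  nvcr.io/nvidia/merlin/merlin-inference:22.03 \
  bash /src/start_triton.sh /model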

@rnyak
Contributor

rnyak commented Mar 10, 2022

@leiterenato thanks for reporting the issue. We are currently looking into it.

@albert17 fyi.

@albert17
Contributor

Hi @leiterenato

This issue has been identified. Working on it: #135

Will provide an updated nightly container today or tomorrow with the latest changes included.

Sorry for the problems.

albert17 linked a pull request Mar 10, 2022 that will close this issue
@leiterenato
Author

Thank you!

@viswa-nvidia

@albert17, will the linked PR release the inference containers with tritonclient?

viswa-nvidia added this to the Merlin 22.04 milestone Mar 15, 2022
@albert17
Contributor

The nightly container has been updated. Please try it, @leiterenato:

docker pull nvcr.io/nvidia/merlin/merlin-inference:nightly
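
A quick way to verify the binary is present after pulling (a minimal sketch, not part of the original comment):

# Check that the tritonserver binary is on the PATH inside the refreshed image.
docker run --rm --gpus=all nvcr.io/nvidia/merlin/merlin-inference:nightly which tritonserver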

@leiterenato
Author

Thanks @albert17.
I will try it and post the result here.

@leiterenato
Author

@albert17
I am receiving this error:
"tritonserver: error while loading shared libraries: libdcgm.so.2: cannot open shared object file: No such file or directory"

Is there a way to export a variable in my Dockerfile to solve this?
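
A possible workaround (just a sketch, assuming the NVIDIA apt repository is already configured in the image; this may not be the proper fix) would be to install DCGM in a Dockerfile RUN step so that libdcgm.so.2 becomes available:

# Hypothetical workaround: install NVIDIA DCGM so tritonserver can resolve libdcgm.so.2.
# Assumes the NVIDIA CUDA apt repository is already set up in the base image.
apt-get update && apt-get install -y --no-install-recommends datacenter-gpu-manager && ldconfig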

@leiterenato
Author

Hi @albert17,
At least for me, tritonserver is still not working:

root@6419ea59e1f0:/src# tritonserver 
tritonserver: error while loading shared libraries: libdcgm.so.2: cannot open shared object file: No such file or directory

Could you please verify?

Thank you

@albert17
Contributor

@leiterenato I have updated the Dockerfile and the containers.

Try again with nvcr.io/nvidia/merlin/merlin-inference:nightly or nvcr.io/nvidia/merlin/merlin-inference:22.03

This should not happen anymore.

@albert17
Contributor

albertoa@pursuit-dgxstation:~/Projects/blossom/merlin/build$ docker run --pull always --gpus=all -it --ipc=host --cap-add SYS_NICE nvcr.io/nvidia/merlin/merlin-inference:22.03 /bin/bash
22.03: Pulling from nvidia/merlin/merlin-inference
Digest: sha256:e0fc192e46308714595f8abc6c408bcfcf5bb6ec9145febaef4465e00c69ca20
Status: Image is up to date for nvcr.io/nvidia/merlin/merlin-inference:22.03

==========
== CUDA ==
==========

NVIDIA Release  (build )
CUDA Version 11.6.0.021

Container image Copyright (c) 2022, NVIDIA CORPORATION & AFFILIATES. All rights reserved.

Various files include modifications (c) NVIDIA CORPORATION & AFFILIATES.  All rights reserved.

This container image and its contents are governed by the NVIDIA Deep Learning Container License.
By pulling and using the container, you accept the terms and conditions of this license:
https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license

NOTE: CUDA Forward Compatibility mode ENABLED.
  Using CUDA 11.6 driver version 510.39.01 with kernel driver version 470.57.02.
  See https://docs.nvidia.com/deploy/cuda-compatibility/ for details.

NOTE: MOFED driver for multi-node communication was not detected.
      Multi-node communication performance may be reduced.

root@b473cae12035:/opt/tritonserver# tritonserver
I0317 23:36:37.385408 101 metrics.cc:623] Collecting metrics for GPU 0: Tesla V100-DGXS-16GB
I0317 23:36:37.385716 101 metrics.cc:623] Collecting metrics for GPU 1: Tesla V100-DGXS-16GB
I0317 23:36:37.385745 101 metrics.cc:623] Collecting metrics for GPU 2: Tesla V100-DGXS-16GB
I0317 23:36:37.385769 101 metrics.cc:623] Collecting metrics for GPU 3: Tesla V100-DGXS-16GB
I0317 23:36:37.387029 101 tritonserver.cc:1932]
+----------------------------------+------------------------------------------------------------------------------------------------------------+
| Option                           | Value                                                                                                      |
+----------------------------------+------------------------------------------------------------------------------------------------------------+
| server_id                        | triton                                                                                                     |
| server_version                   | 2.19.0                                                                                                     |
| server_extensions                | classification sequence model_repository model_repository(unload_dependents) schedule_policy model_configu |
|                                  | ration system_shared_memory cuda_shared_memory binary_tensor_data statistics trace                         |
| model_control_mode               | MODE_NONE                                                                                                  |
| strict_model_config              | 1                                                                                                          |
| rate_limit                       | OFF                                                                                                        |
| pinned_memory_pool_byte_size     | 268435456                                                                                                  |
| response_cache_byte_size         | 0                                                                                                          |
| min_supported_compute_capability | 6.0                                                                                                        |
| strict_readiness                 | 1                                                                                                          |
| exit_timeout                     | 30                                                                                                         |
+----------------------------------+------------------------------------------------------------------------------------------------------------+

I0317 23:36:37.387073 101 server.cc:249] No server context available. Exiting immediately.
error: creating server: Invalid argument - --model-repository must be specified
^C
root@b473cae12035:/opt/tritonserver#
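
With a model repository mounted, the same container starts end to end; a minimal sketch (hypothetical host path, following the script posted above):

# Hypothetical end-to-end launch: mount a model repository and point Triton at it.
docker run --gpus=all --rm -it --ipc=host --cap-add SYS_NICE \
  -v /path/to/model_repository:/model \
  nvcr.io/nvidia/merlin/merlin-inference:22.03 \
  tritonserver --model-repository=/model \
  --backend-config=hugectr,ps=/model/ps.json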
