
[BUG and QUESTION] Merlin inference container 22.03 from NGC not working #142

Closed
leiterenato opened this issue Mar 10, 2022 · 10 comments · Fixed by #135

@leiterenato

BUG

The tritonserver binary is not installed in this container (merlin-inference:22.03): https://catalog.ngc.nvidia.com/orgs/nvidia/teams/merlin/containers/merlin-inference.

root@c29b08874c52:/# tritonserver
bash: tritonserver: command not found
root@c29b08874c52:/# cd opt/tritonserver/
root@c29b08874c52:/opt/tritonserver# ls
backends

To reproduce this error, start a new merlin-inference container (22.03) and try to invoke tritonserver.
How should I invoke tritonserver with this new container?

QUESTION
I am trying to load an ensemble model trained with HugeCTR into Triton.
I am using the following script to start the server with the ensemble:

#!/bin/bash

# Use /model as the default repository when no path is passed as the first argument.
if [ -z "$1" ]; then
    MODEL_REPOSITORY=/model
else
    MODEL_REPOSITORY=$1
fi

echo "Copying model ensemble to local folder"
mkdir -p "${MODEL_REPOSITORY}"
gsutil -m cp -r "${GCP_PATH}"/* "${MODEL_REPOSITORY}"

# Create an empty version folder for the ensemble model
ENSEMBLE_DIR=$(ls "${MODEL_REPOSITORY}" | grep ens)
mkdir -p "${MODEL_REPOSITORY}/${ENSEMBLE_DIR}/1"

echo "Starting Triton Server"
LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libjemalloc.so.2 tritonserver \
    --allow-vertex-ai=false \
    --model-repository="${MODEL_REPOSITORY}" \
    --backend-config=hugectr,ps="${MODEL_REPOSITORY}/ps.json"

Am I starting the container correctly?
I really appreciate any help.
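
For reference, the container is launched with something along these lines (the host paths, script location, and bucket name are placeholders, not the exact setup):

# Hypothetical launch command: mounts the startup script, sets the GCS path,
# and exposes Triton's default HTTP (8000), gRPC (8001), and metrics (8002) ports.
docker run --gpus=all --rm -it --ipc=host \
  -p 8000:8000 -p 8001:8001 -p 8002:8002 \
  -v $(pwd)/start_triton.sh:/src/start_triton.sh \
  -e GCP_PATH=gs://my-bucket/hugectr-ensemble \
  nvcr.io/nvidia/merlin/merlin-inference:22.03 \
  bash /src/start_triton.sh /model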

@rnyak
Contributor

rnyak commented Mar 10, 2022

@leiterenato thanks for reporting the issue. We are currently looking into it.

@albert17 fyi.

@albert17
Contributor

Hi @leiterenato

This issue has been identified. Working on it: #135

Will provide an updated nightly container today or tomorrow with the latest changes included.

Sorry for the problems.

albert17 linked a pull request Mar 10, 2022 that will close this issue
@leiterenato
Author

Thank you!

@viswa-nvidia

@albert17, will the linked PR release the inference containers with tritonclient?

viswa-nvidia added this to the Merlin 22.04 milestone Mar 15, 2022
@albert17
Contributor

The nightly container has been updated. Please try it, @leiterenato:

docker pull nvcr.io/nvidia/merlin/merlin-inference:nightly
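
A quick way to verify the binary is present after pulling (a minimal sketch, not part of the original comment):

# Check that the tritonserver binary is on the PATH inside the refreshed image.
docker run --rm --gpus=all nvcr.io/nvidia/merlin/merlin-inference:nightly which tritonserver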

@leiterenato
Author

Thanks @albert17.
I will try it and post the result here.

@leiterenato
Author

@albert17
I am receiving this error:
"tritonserver: error while loading shared libraries: libdcgm.so.2: cannot open shared object file: No such file or directory"

Is there a way to export a variable in my Dockerfile to solve this?
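
A possible workaround (just a sketch, assuming the NVIDIA apt repository is already configured in the image; this may not be the proper fix) would be to install DCGM in a Dockerfile RUN step so that libdcgm.so.2 becomes available:

# Hypothetical workaround: install NVIDIA DCGM so tritonserver can resolve libdcgm.so.2.
# Assumes the NVIDIA CUDA apt repository is already set up in the base image.
apt-get update && apt-get install -y --no-install-recommends datacenter-gpu-manager && ldconfig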

@leiterenato
Author

Hi @albert17,
At least for me, tritonserver is still not working:

root@6419ea59e1f0:/src# tritonserver 
tritonserver: error while loading shared libraries: libdcgm.so.2: cannot open shared object file: No such file or directory

Could you please verify?

Thank you

@albert17
Contributor

@leiterenato I have updated the Dockerfile and the containers.

Try again with nvcr.io/nvidia/merlin/merlin-inference:nightly or nvcr.io/nvidia/merlin/merlin-inference:22.03

This should not happen anymore.

@albert17
Contributor

albertoa@pursuit-dgxstation:~/Projects/blossom/merlin/build$ docker run --pull always --gpus=all -it --ipc=host --cap-add SYS_NICE nvcr.io/nvidia/merlin/merlin-inference:22.03 /bin/bash
22.03: Pulling from nvidia/merlin/merlin-inference
Digest: sha256:e0fc192e46308714595f8abc6c408bcfcf5bb6ec9145febaef4465e00c69ca20
Status: Image is up to date for nvcr.io/nvidia/merlin/merlin-inference:22.03

==========
== CUDA ==
==========

NVIDIA Release  (build )
CUDA Version 11.6.0.021

Container image Copyright (c) 2022, NVIDIA CORPORATION & AFFILIATES. All rights reserved.

Various files include modifications (c) NVIDIA CORPORATION & AFFILIATES.  All rights reserved.

This container image and its contents are governed by the NVIDIA Deep Learning Container License.
By pulling and using the container, you accept the terms and conditions of this license:
https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license

NOTE: CUDA Forward Compatibility mode ENABLED.
  Using CUDA 11.6 driver version 510.39.01 with kernel driver version 470.57.02.
  See https://docs.nvidia.com/deploy/cuda-compatibility/ for details.

NOTE: MOFED driver for multi-node communication was not detected.
      Multi-node communication performance may be reduced.

root@b473cae12035:/opt/tritonserver# tritonserver
I0317 23:36:37.385408 101 metrics.cc:623] Collecting metrics for GPU 0: Tesla V100-DGXS-16GB
I0317 23:36:37.385716 101 metrics.cc:623] Collecting metrics for GPU 1: Tesla V100-DGXS-16GB
I0317 23:36:37.385745 101 metrics.cc:623] Collecting metrics for GPU 2: Tesla V100-DGXS-16GB
I0317 23:36:37.385769 101 metrics.cc:623] Collecting metrics for GPU 3: Tesla V100-DGXS-16GB
I0317 23:36:37.387029 101 tritonserver.cc:1932]
+----------------------------------+------------------------------------------------------------------------------------------------------------+
| Option                           | Value                                                                                                      |
+----------------------------------+------------------------------------------------------------------------------------------------------------+
| server_id                        | triton                                                                                                     |
| server_version                   | 2.19.0                                                                                                     |
| server_extensions                | classification sequence model_repository model_repository(unload_dependents) schedule_policy model_configu |
|                                  | ration system_shared_memory cuda_shared_memory binary_tensor_data statistics trace                         |
| model_control_mode               | MODE_NONE                                                                                                  |
| strict_model_config              | 1                                                                                                          |
| rate_limit                       | OFF                                                                                                        |
| pinned_memory_pool_byte_size     | 268435456                                                                                                  |
| response_cache_byte_size         | 0                                                                                                          |
| min_supported_compute_capability | 6.0                                                                                                        |
| strict_readiness                 | 1                                                                                                          |
| exit_timeout                     | 30                                                                                                         |
+----------------------------------+------------------------------------------------------------------------------------------------------------+

I0317 23:36:37.387073 101 server.cc:249] No server context available. Exiting immediately.
error: creating server: Invalid argument - --model-repository must be specified
^C
root@b473cae12035:/opt/tritonserver#
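
With a model repository mounted, the same container starts end to end; a minimal sketch (hypothetical host path, following the script posted above):

# Hypothetical end-to-end launch: mount a model repository and point Triton at it.
docker run --gpus=all --rm -it --ipc=host --cap-add SYS_NICE \
  -v /path/to/model_repository:/model \
  nvcr.io/nvidia/merlin/merlin-inference:22.03 \
  tritonserver --model-repository=/model \
  --backend-config=hugectr,ps=/model/ps.json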
