Replies: 2 comments 15 replies
-
Building everything from source takes super long, so I think I am not doing this right.
-
Hi! To efficiently build and deploy your customized Triton server with the TensorRT-LLM backend and a Llama 3 model, here's a streamlined approach:
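For illustration, a minimal sketch of the compose step being discussed, assuming the stock `compose.py` from the triton-inference-server/server repo, that it can pull in the tensorrtllm backend this way, and with the `24.04` release tag as a placeholder. Note that `compose.py` layers published container images; it does not compile local source changes, so a customized server binary would need to be built into an image first.

```bash
# Placeholder release tag -- use the branch matching your target container.
git clone -b r24.04 https://github.com/triton-inference-server/server.git
cd server

# Assemble a custom image: take the trtllm image as the "min" base and
# pull in only the backend you need. Adding --dry-run emits the generated
# Dockerfile.compose for inspection before anything is built.
python3 compose.py \
    --backend tensorrtllm \
    --container-version 24.04 \
    --image min,nvcr.io/nvidia/tritonserver:24.04-trtllm-python-py3 \
    --output-name tritonserver-custom
```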
Let me know if you need further details on any specific step!
-
Hi folks,
Happy Friday. I am currently customising Triton server to add some meta information to response headers. My changes are only in the server, so I only want to rebuild the server itself, then deploy my custom tritonserver with the TensorRT-LLM backend and a Llama 3 model. What's the best, most efficient way to do this? Should I make my changes and then run compose with the min image of trtllm-python-py3? And how do I then use the image built by compose to run my inference?
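For context, a minimal sketch of what deploying the composed image and sending a request could look like, assuming a TensorRT-LLM engine and model repository have already been built for Llama 3; the repository path, the image name `tritonserver-custom`, and the `ensemble` model name are placeholders:

```bash
# Serve the model repository from the image produced by compose.py.
docker run --rm --gpus all \
    -p 8000:8000 -p 8001:8001 -p 8002:8002 \
    -v /path/to/llama3_model_repo:/models \
    tritonserver-custom \
    tritonserver --model-repository=/models

# Query it through Triton's HTTP generate endpoint.
curl -X POST localhost:8000/v2/models/ensemble/generate \
    -d '{"text_input": "What is Triton Inference Server?", "max_tokens": 64}'
```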