A simple demo introducing a standard way for model developers to incorporate Python-based projects into Triton. It also shows how users and developers can easily deploy MONAI inference code for field testing, and demonstrates low-latency classification and validation inference with Triton.
The steps below describe how to set up a model repository, pull the Triton container, launch the Triton inference server, and then send inference requests to the running server.
This demo and description borrow heavily from the Triton Python Backend repo. The demo assumes you have at least one GPU.
Pull down the demo repository and start with the [Quick Start](#quick-start) guide.
$ git clone https://github.com/Project-MONAI/tutorials.git
The Triton backend for Python lets you serve models written in Python with the Triton Inference Server without having to write any C++ code. We will use it to demonstrate implementing MONAI code inside Triton.
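A Python-backend model is a `model.py` file that exposes a `TritonPythonModel` class whose method names are fixed by the backend. The skeleton below is a minimal sketch, not this demo's actual implementation; the `pb_utils` import is guarded so the file can be read outside the Triton container, where that module is not installed.

```python
import json

# triton_python_backend_utils is only available inside the Triton container;
# guard the import so this skeleton can also be inspected outside Triton.
try:
    import triton_python_backend_utils as pb_utils
except ImportError:
    pb_utils = None


class TritonPythonModel:
    """Minimal Python-backend skeleton; Triton calls these methods by name."""

    def initialize(self, args):
        # args["model_config"] holds the JSON-serialized config.pbtxt.
        self.model_config = json.loads(args["model_config"])

    def execute(self, requests):
        responses = []
        for request in requests:
            # A real implementation would read input tensors with
            # pb_utils.get_input_tensor_by_name(request, "INPUT0"),
            # run the MONAI model, and build output tensors.
            responses.append(pb_utils.InferenceResponse(output_tensors=[]))
        return responses

    def finalize(self):
        # Called once when the model is unloaded.
        pass
```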
- Build the Triton container image and copy the model repository files using the provided shell script
$ ./triton_build.sh
- Run the Triton container image in a background terminal using the provided shell script. The script starts the demo container with Triton and exposes to localhost the three ports the application needs to send inference requests.
$ ./run_triton_local.sh
- Install the environment for the client. The client environment should have Python 3 and the necessary packages installed.
$ python3 -m pip install -r requirements.txt
- Other dependent libraries for the Python Triton client are available as Python packages
$ pip install nvidia-pyindex
$ pip install tritonclient[all]
- Run the client program. The client program takes an optional file input and performs classification of body parts using the MedNIST data set. A small subset of the data set is included.
$ mkdir -p client/test_data/MedNist
$ python -u client/client_mednist.py client/test_data/MedNist
Alternatively, if steps 1-4 of the Quick Start were followed, the user can just run the provided shell script.
$ ./mednist_client_run.sh
The expected result is a variety of classification results for body images, along with local inference times.
## Examples
The example demonstrates running a Triton Python Backend on a single image classification problem.
1. First, a Dockerfile and build script are used to build a container that runs the Triton service and to copy the model-specific files into the container.
```Dockerfile
# Use the desired Triton container as the base image for our app
FROM nvcr.io/nvidia/tritonserver:21.04-py3

# Create the model directory in the container
RUN mkdir -p /models/monai_covid/1

# Install project-specific dependencies
COPY requirements.txt .
RUN pip install -r requirements.txt
RUN rm requirements.txt

# Copy the contents of the model project into the model repo in the container image
COPY models/monai_covid/config.pbtxt /models/monai_covid
COPY models/monai_covid/1/model.py /models/monai_covid/1

ENTRYPOINT ["tritonserver", "--model-repository=/models"]
```
Note: The Triton service expects a certain directory structure discussed in Model Config File to load the model definitions.
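Based on the `COPY` and `RUN mkdir` steps in the Dockerfile above, the repository layout Triton expects for this demo is one numbered version directory per model, with `config.pbtxt` at the model root:

```text
models/
└── monai_covid/
    ├── config.pbtxt        # model configuration
    └── 1/                  # version 1 of the model
        └── model.py        # Python backend implementation
```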
- Next, the container with the Triton service runs as a service (in the background or a separate terminal for the demo). In this example, the client-facing port used by the Triton service is set to `8000`.
```bash
demo_app_image_name="monai_triton:demo"
docker run --shm-size=128G --rm -p 127.0.0.1:8000:8000 -p 127.0.0.1:8001:8001 -p 127.0.0.1:8090:8002 ${demo_app_image_name}
```
- See Model Config File to see the expected file structure for Triton.
- Modify the `models/monai_covid/1/model.py` file to satisfy any model configuration requirements, while keeping the required components in the model definition. See the Usage section for background.
- In the `models/monai_covid/config.pbtxt` file, configure the number of GPUs and specify which ones are used, e.g. using two available GPUs and two parallel instances of the model per GPU:
```
instance_group [
  {
    kind: KIND_GPU
    count: 2
    gpus: [ 0, 1 ]
  }
]
```
e.g. using three of the four available GPUs and four parallel instances of the model per GPU:
```
instance_group [
  {
    kind: KIND_GPU
    count: 4
    gpus: [ 0, 1, 3 ]
  }
]
```
Other settings, such as dynamic batching and the corresponding batch sizes, can also be configured. See the Triton model configuration documentation for more information.
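For example, dynamic batching is enabled by adding a `dynamic_batching` block to `config.pbtxt`; the preferred batch sizes and queue delay below are illustrative values, not settings from this demo:

```
dynamic_batching {
  preferred_batch_size: [ 4, 8 ]
  max_queue_delay_microseconds: 100
}
```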
- Finally, be sure to include the TorchScript model definition (`*.ts` file) in the directory structure. In this example, a COVID-19 classification model implemented in PyTorch is used:
`covid19_model.ts`
The Dockerfile copies the model definition structure into the Triton container. When the container runs, the Python backend implementation downloads the covid19_model.ts file from Google Drive for the demo. The container should therefore be rebuilt after any modification to the GPU or model configuration in the example.
- A Python client program configures the model and makes an HTTP request to Triton as a service. (Note: Triton also supports other interfaces, such as gRPC.) The client reads an input image, converted from Nifti to a byte array, for classification.
- In this example, a model trained to detect COVID-19 is given an image either with or without COVID-19 present.
```python
filename = 'client/test_data/volume-covid19-A-0000.nii.gz'
```
- The client calls the Triton Service using the external port configured previously.
```python
with httpclient.InferenceServerClient("localhost:8000") as client:
```
- The Triton inference response is returned:
```python
response = client.infer(model_name,
                        inputs,
                        request_id=str(uuid4().hex),
                        outputs=outputs)
result = response.get_response()
```
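The Nifti-to-byte-array step the client performs before building the request could be sketched as follows; the helper name is illustrative, not a function from the tutorial code:

```python
from pathlib import Path

def load_volume_bytes(filename: str) -> bytes:
    """Read a (gzipped) NIfTI volume as raw bytes, suitable for packing
    into a Triton BYTES input tensor."""
    return Path(filename).read_bytes()

# e.g. payload = load_volume_bytes('client/test_data/volume-covid19-A-0000.nii.gz')
```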
- An alternate demo using the MedNIST dataset has been added as a classification example.
- To run the MedNIST example, use the same steps shown in the Quick Start, with the following changes at step 5.
- Run the client program (for the MedNIST example). The client program takes an optional file input and performs classification of body parts using the MedNIST data set. A small subset of the data set is included.
$ mkdir -p client/test_data/MedNist
$ python -u client/client_mednist.py client/test_data/MedNist
Alternatively, if steps 1-4 of the Quick Start were followed, the user can just run the provided shell script.
$ ./mednist_client_run.sh
The expected result is a variety of classification results for body images, along with local inference times.
- The requirements.txt file is used to install dependencies into the Triton server container, and also into the client environment.
- Take care with the version of PyTorch (torch) used based on the specific GPU and installed driver versions. The --extra-index-url flag may need to be modified to correspond with the CUDA version installed on the local GPU.
- Determine your driver and CUDA version with the following command:
nvidia-smi
- Then choose the appropriate PyTorch build by adding the helper flag to the `requirements.txt` file.
--extra-index-url https://download.pytorch.org/whl/cu116
- Note: in the example above, `cu116` instructs pip to install the latest torch build that supports CUDA 11.6.
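Combining the flag and the package line, a `requirements.txt` targeting CUDA 11.6 might contain (the unpinned `torch` entry here is illustrative; this demo's actual requirements file may pin specific versions):

```
--extra-index-url https://download.pytorch.org/whl/cu116
torch
```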
See the Triton Inference Server `python_backend` documentation for more information.