Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: ORCA Format KV Cache Utilization in Inference Response Header #7839

Open
wants to merge 2 commits into
base: r24.10
Choose a base branch
from

Conversation

BenjaminBraunDev
Copy link

@BenjaminBraunDev BenjaminBraunDev commented Nov 27, 2024

What does the PR do?

This PR adds code to HTTPAPIServer::GenerateRequestClass::StartResponse inside src/http_server.cc to add both kv_cache_utilization and max_token_capacity metrics composed from the existing prometheus metrics in TensorRT-LLM Backend's nv_trt_llm_kv_cache_block_metrics metric family.

This is acomplished by parsing the serialized prometheus metrics text object provided to the Triton Sever frontend by the Triton Core libraries into a structured vector of metrics for a specific metric family.

Checklist

  • I have read the Contribution guidelines and signed the Contributor License
    Agreement
  • PR title reflects the change and is of format <commit_type>: <Title>
  • Changes are described in the pull request.
  • Related issues are referenced.
  • Populated github labels field
  • Added test plan and verified test passes.
  • Verified that the PR passes existing CI.
  • I ran pre-commit locally (pre-commit install, pre-commit run --all)
  • Verified copyright is correct on all changed files.
  • Added succinct git squash message before merging ref.
  • All template sections are filled out.
  • Optional: Additional screenshots for behavior/output changes with before/after.

Commit Type:

Check the conventional commit type
box here and add the label to the github PR.

  • build
  • ci
  • docs
  • feat
  • fix
  • perf
  • refactor
  • revert
  • style
  • test

Where should the reviewer start?

Changes are contained to 2 files:

  • src/http_server.cc
  • src/http_server.h (the former's header file)

The changes start in HTTPAPIServer::GenerateRequestClass::StartResponse() where the environment variable is checked and the header is written. There are 3 other funcitons below: MetricFamilyExtractor() which parses serialized prometheus metrics into a vector of PromMetric (which have a map of their metric labels), and ExtractKVMetrics() that pulls the values from the structured metrics and calculates the composite kv metrics, and finally OrcaKVMetricHeader() which forms the metrics into an endpoint-load-metrics header in the ORCA format specified by ORCA_METRIC_FORMAT. If there are no TensorRT-LLM Backend metrics, no metrics found for the header, or an invalid format type the header is simply not written.

The valid values for ORCA_METRIC_FORMAT are documented in the feature request (related issue linked below) and comments in StartResponse()

Test plan:

The feature is gated behind a feature flag in the form of the ORCA_METRIC_FORMAT environment variable. If unset, the feature is effectively disabled. Beyond that, the changes have been manually tested to not cause issue if either the queried metrics are not present (such as if TensorRT-LLM is not being used as the backend), or if the ORCA header metric type is invalid. In either case, nothing is parsed and no header is written. All code changes are wrapped in an #ifdef and are only included if metrics are enabled during the Triton Server build.

Caveats:

  1. This feature only works on Triton Inference Server running with TensorRT-LLM Backend, as otherwise the KV-cache metrics are not included in the server metrics.

  2. This change only implements the kv-cache utilization metics, but the functions it adds allows other metrics to be added easily (including metrics that don't require TensorRT-LLM Backend).

Background

This doc captures the overall requirements for model servers to integrate with llm instance gateway. More details in the Feature Request below.

Related Issues:

Screenshots

Response header before changes (or if ORCA_METRIC_FORMAT environment variable is unset):
orca_before

Response header with ORCA_METRIC_FORMAT="json":
orca_json

Response header with ORCA_METRIC_FORMAT="http":
orca_http

cc @yinggeh @krishung5 @jbkyang-nvi

@BenjaminBraunDev BenjaminBraunDev changed the title ORCA Format KV Cache Utilization in Inference Response Header feat: ORCA Format KV Cache Utilization in Inference Response Header Dec 10, 2024
… for use in HandleGenerate to add kv_utilization and max_token_capacity to the inference request response header.
@BenjaminBraunDev BenjaminBraunDev marked this pull request as ready for review December 11, 2024 02:21
src/http_server.h Outdated Show resolved Hide resolved
src/http_server.h Outdated Show resolved Hide resolved
src/http_server.cc Outdated Show resolved Hide resolved
src/http_server.cc Outdated Show resolved Hide resolved
src/http_server.cc Outdated Show resolved Hide resolved
src/http_server.cc Outdated Show resolved Hide resolved
src/http_server.cc Outdated Show resolved Hide resolved
src/http_server.cc Outdated Show resolved Hide resolved
src/http_server.cc Outdated Show resolved Hide resolved
…nctionality to HTTPAPIServer::GenerateRequestClass::StartResponse() to extract metrics after inference request is processed for up-to-date metrics.
@nnshah1 nnshah1 requested a review from indrajit96 December 13, 2024 03:13
@jbkyang-nvi
Copy link
Contributor

@nnshah1 @indrajit96 are we merging into the already frozen r24.10? Or should this be onto main instead?

@nnshah1
Copy link
Contributor

nnshah1 commented Dec 13, 2024

@nnshah1 @indrajit96 are we merging into the already frozen r24.10? Or should this be onto main instead?

good catch we should target main

@BenjaminBraunDev
Copy link
Author

@jbkyang-nvi @nnshah1 Is Ubuntu 24.04 required to build the main branch, with container version 24.11? I ported my changes over but am having trouble building the server image. I see in the nvidia container support matrix Ubuntu 24.04 is required for CUDA Deep-Learning base container for 24.11, but I also see in the Release 2.52.0 Known Issues that TRT-LLM backend is built from TensorRT-LLM version 0.15.0 and built out of nvcr.io/nvidia/tritonserver:24.10-py3-min.

To build, I'm using a base image of nvcr.io/nvidia/tritonserver:24.10-trtllm-python-py3 as a base image (ran into errors using 24.11 base image) but I'm getting an error while building target grpc-repo without much of an error log. Is this the right base image or should I update my Ubuntu and and use 24.11?

For context the full build command I'm using is:

TRTLLM_BASE_IMAGE=nvcr.io/nvidia/tritonserver:24.10-trtllm-python-py3
./build.py -v --no-container-interactive --enable-logging --enable-stats --enable-tracing \
              --enable-metrics --enable-gpu-metrics --enable-cpu-metrics \
              --filesystem=gcs \
              --endpoint=http --endpoint=grpc --endpoint=sagemaker --endpoint=vertex-ai \
              --backend=ensemble --enable-gpu --no-container-pull \
              --repoagent=checksum --cache=local --cache=redis \
              --image=base,${TRTLLM_BASE_IMAGE}

@BenjaminBraunDev
Copy link
Author

BenjaminBraunDev commented Dec 20, 2024

Follow up, I got the image built, but when trying to run it I get a version error:

/opt/tritonserver/bin/tritonserver: /lib/x86_64-linux-gnu/libc.so.6: version `GLIBC_2.36' not found (required by /opt/tritonserver/bin/tritonserver)
/opt/tritonserver/bin/tritonserver: /lib/x86_64-linux-gnu/libc.so.6: version `GLIBC_2.38' not found (required by /opt/tritonserver/bin/tritonserver)
/opt/tritonserver/bin/tritonserver: /lib/x86_64-linux-gnu/libstdc++.so.6: version `GLIBCXX_3.4.32' not found (required by /opt/tritonserver/bin/tritonserver)
/opt/tritonserver/bin/tritonserver: /lib/x86_64-linux-gnu/libc.so.6: version `GLIBC_2.38' not found (required by /opt/tritonserver/bin/../lib/libtritonserver.so)
/opt/tritonserver/bin/tritonserver: /lib/x86_64-linux-gnu/libstdc++.so.6: version `GLIBCXX_3.4.32' not found (required by /opt/tritonserver/bin/../lib/libtritonserver.so)

From what I looked up I don't think I can get GLIBC_2.36 (or higher) on Ubuntu 22.04?

@nnshah1
Copy link
Contributor

nnshah1 commented Dec 21, 2024

@nv-kmcgill53 , @nvda-mesharma

After the holiday can we help with updated instructions on building? I think we are moving everything to 24.04 - but may be mistaken.

@indrajit96
Copy link
Contributor

indrajit96 commented Jan 6, 2025

@BenjaminBraunDev @liu-cong Could you add some tests to help with the review?
A few examples of tests are here
https://github.com/triton-inference-server/server/tree/main/qa/L0_http
You can also have a simple python script that does the test with triton running in the background.
Also can you post the latest build issue you are facing after rebasing it to main?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging this pull request may close these issues.

6 participants