- Purpose
- Benchmarking Tool
- Metrics Measured
- Prerequisites
- Running the Performance Benchmark
- Data Collection
This guide describes how to benchmark the inference performance (throughput and latency) of a deployed CodeGen service. The results help you understand the service's capacity under load and compare different deployment configurations or models. This benchmark primarily targets Kubernetes deployments but can be adapted for Docker.
We use the GenAIEval tool for performance benchmarking; it simulates concurrent users sending requests to the service endpoint.
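Conceptually, the load generator fires many requests at the endpoint at once and records per-request timings. GenAIEval handles this (plus token accounting and statistics) for you; the sketch below only illustrates the idea of concurrent load, and the host, port, and payload are placeholders to adjust for your deployment:

```bash
# Conceptual illustration only: fire CONCURRENCY requests in parallel
# and wait for all of them to finish. This is not the GenAIEval tool itself.
CONCURRENCY=8
for i in $(seq 1 "$CONCURRENCY"); do
  curl -s -X POST "http://{your_service_ip_or_hostname}:{port}/v1/codegen" \
    -H "Content-Type: application/json" \
    -d '{"messages": "Write a Python function that reverses a string."}' \
    -o /dev/null &
done
wait
```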
The benchmark reports several key performance indicators:
- Concurrency: Number of concurrent requests simulated.
- End-to-End Latency: Time from request submission to final response received (P50, P90, P99 in ms).
- End-to-End First Token Latency: Time from request submission to first token received (P50, P90, P99 in ms).
- Average Next Token Latency: Average time between subsequent generated tokens (in ms).
- Average Token Latency: Average time per generated token (in ms).
- Requests Per Second (RPS): Throughput of the service.
- Output Tokens Per Second: Rate of token generation.
- Input Tokens Per Second: Rate of token consumption.
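GenAIEval reports all of the figures above directly. If you want to sanity-check them, they are easy to derive from raw per-request timings; a minimal sketch, assuming a plain text file with one end-to-end latency (in ms) per line and a known wall-clock test duration:

```bash
# Hypothetical raw data: one end-to-end latency (ms) per line.
LATENCY_FILE="e2e_latency_ms.txt"
TOTAL=$(wc -l < "$LATENCY_FILE")
for PCT in 50 90 99; do
  RANK=$(( (TOTAL * PCT + 99) / 100 ))                 # ceiling rank for the percentile
  VALUE=$(sort -n "$LATENCY_FILE" | sed -n "${RANK}p")
  echo "P${PCT} end-to-end latency: ${VALUE} ms"
done
# Throughput (RPS): completed requests divided by wall-clock test duration (seconds).
DURATION_S=60                                          # assumed duration of the run
awk -v n="$TOTAL" -v d="$DURATION_S" 'BEGIN {printf "RPS: %.2f\n", n/d}'
```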
- A running CodeGen service accessible via an HTTP endpoint. Refer to the main CodeGen README for deployment options (Kubernetes recommended for load balancing/scalability).
- If using Kubernetes:
  - A working Kubernetes cluster (refer to OPEA K8s setup guides if needed).
  - `kubectl` configured to access the cluster from the node where the benchmark will run (typically the master node).
  - Ensure sufficient `ulimit` for network connections on worker nodes hosting the service pods (e.g., `LimitNOFILE=65536` or higher in the containerd/docker config).
- General:
  - Python 3.8+ on the node running the benchmark script.
  - Network access from the benchmark node to the CodeGen service endpoint.
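A few quick checks from the benchmark node can catch setup problems before a long run (all optional; the pod name filter below is just an example):

```bash
kubectl get nodes                            # is the cluster reachable from this node?
kubectl get pods -o wide | grep -i codegen   # are the CodeGen pods running? (name pattern may differ)
ulimit -n                                    # open-file limit for the current shell
python3 --version                            # the benchmark script needs Python 3.8+
```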
1. **Deploy CodeGen Service:** Ensure your CodeGen service is deployed and accessible. Note the service endpoint URL (e.g., obtained via `kubectl get svc` or your ingress configuration if using Kubernetes, or `http://{host_ip}:{port}` for Docker).
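   Before benchmarking, it can help to confirm the endpoint responds. A minimal sanity check, assuming the gateway exposes the usual `/v1/codegen` route with a `messages` payload (adjust host, port, and prompt to your deployment):

   ```bash
   # Kubernetes: find the gateway service and its port
   kubectl get svc
   # Send a single request; a generated code snippet should come back
   curl http://{your_service_ip_or_hostname}:{port}/v1/codegen \
     -H "Content-Type: application/json" \
     -d '{"messages": "Write a Python function that reverses a string."}'
   ```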
2. **Configure Benchmark Parameters (Optional):** Set environment variables to customize the test queries and output directory. The `USER_QUERIES` variable defines the number of concurrent requests for each test run.

   ```bash
   # Example: Four runs with 128 concurrent requests each
   export USER_QUERIES="[128, 128, 128, 128]"
   # Example: Output directory
   export TEST_OUTPUT_DIR="/tmp/benchmark_output"
   # Set the target endpoint URL
   export CODEGEN_ENDPOINT_URL="http://{your_service_ip_or_hostname}:{port}/v1/codegen"
   ```

   Replace `{your_service_ip_or_hostname}:{port}` with the actual accessible URL of your CodeGen gateway service.
3. **Execute the Benchmark Script:** Run the script, optionally specifying the number of Kubernetes nodes involved if relevant for reporting context (the script itself runs from one node).

   ```bash
   # Clone GenAIExamples if you haven't already
   cd GenAIExamples/CodeGen/benchmark/performance
   bash benchmark.sh # Add '-n <node_count>' if desired for logging purposes
   ```

   Ensure the `benchmark.sh` script is adapted to use `CODEGEN_ENDPOINT_URL` and, potentially, `USER_QUERIES` and `TEST_OUTPUT_DIR`.
Benchmark results will be displayed in the terminal upon completion. Detailed results, typically including raw data and summary statistics, will be saved in the directory specified by `TEST_OUTPUT_DIR` (defaulting to `/tmp/benchmark_output`). CSV files (e.g., `1_testspec.yaml.csv`) containing metrics for each run are usually generated here.
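For a quick look at the collected data from the shell (CSV file names vary with the test spec, so treat the name below as an example):

```bash
# List everything the run produced
ls "${TEST_OUTPUT_DIR:-/tmp/benchmark_output}"
# Pretty-print one of the per-run CSVs as a table
column -s, -t "${TEST_OUTPUT_DIR:-/tmp/benchmark_output}/1_testspec.yaml.csv" | head -n 20
```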