Objective
To provide detailed and insightful traces & events in the IGW, aligned with the llm-d distributed tracing proposal.
Requirements
- Aligns with the llm-d distributed tracing proposal.
- The trace & event solution should follow the OSS standard, namely the OpenTelemetry Semantic Conventions for GenAI spans: https://opentelemetry.io/docs/specs/semconv/gen-ai/gen-ai-spans/.
- The trace spans should cover the critical steps of the request handling logic in the IGW, with each critical step in its own span and useful information in attributes. Not every internal function call and parameter will be covered.
- Vendor Agnostic.
- Request tracing and prompt/response logging are opt-in.
Definition
The trace will track the end-to-end lifecycle of a request in the IGW, where end-to-end is defined as:
from
“Inference Gateway external processor starts processing the request”
to
“Inference Gateway external processor returns the response to the user”.
Out of scope
- Multi-modal support is not in scope; it can be a follow-up to the project's support for multi-modal servers.
Request Flow
Reference https://gateway-api-inference-extension.sigs.k8s.io/ for the request flow.

This proposal covers steps 2, 3, and 4.
Detailed breakdown of span (with attributes) in EPP
The OpenTelemetry community recently merged a change that updated the span and event attributes for GenAI. The proposal below follows the latest semantic convention. The order of the steps reflects the parent-child hierarchy of the spans.
Common attributes
- service.name: gateway-api-inference-extension.
- service.version: Release version of the Inference Gateway.
- gen_ai.request.model: Model name in the request.
- gateway.inferenceobjective: The InferenceObjective associated with the request; a default one is set if none is found. The default InferenceObjective is named default_inferenceObjective.
- gateway.inferencepool: The name of the InferencePool that received the forwarded traffic.
- gateway.streaming: Boolean value indicating whether the request is a streaming request.
- gateway.model.rewrite: Value of the x-gateway-model-name-rewrite request header.
Step 1: Ext_proc starts processing the request
```go
func (s *StreamingServer) Process(srv extProcPb.ExternalProcessor_ProcessServer) error
```
This will be the root span of the request handling process, as the Process function is the entry point of all requests in EPP.
Span Name: gateway.ext_proc.epp.request
Attributes:
- gen_ai.usage.input_tokens: The number of tokens used in the GenAI input (prompt).
- gen_ai.usage.output_tokens: The number of tokens used in the GenAI response (completion).
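For illustration, a minimal sketch of opening and closing this root span with the OpenTelemetry Go SDK; the tracer name and helper functions are assumptions, not the actual EPP wiring:

```go
package epp

import (
	"context"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/attribute"
	"go.opentelemetry.io/otel/trace"
)

var tracer = otel.Tracer("gateway-api-inference-extension")

// startRequestSpan opens the root span for one ext_proc request.
// Hypothetical helper; names and wiring are illustrative only.
func startRequestSpan(ctx context.Context, model string, streaming bool) (context.Context, trace.Span) {
	return tracer.Start(ctx, "gateway.ext_proc.epp.request",
		trace.WithAttributes(
			attribute.String("gen_ai.request.model", model),
			attribute.Bool("gateway.streaming", streaming),
		))
}

// finishRequestSpan records token usage once the response is complete.
func finishRequestSpan(span trace.Span, inputTokens, outputTokens int) {
	span.SetAttributes(
		attribute.Int("gen_ai.usage.input_tokens", inputTokens),
		attribute.Int("gen_ai.usage.output_tokens", outputTokens),
	)
	span.End()
}
```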
Step 2: Director HandleRequest orchestrates the request lifecycle
This process:
- Parses request details.
- Calls admitRequest for admission control.
- Calls Scheduler.Schedule if the request is approved.
- Calls prepareRequest to populate the RequestContext with the result and call the PreRequest plugins.
Span Name: gateway.request_orchestration
Attributes:
- target_model: The resolved target model name.
- request_criticality: The criticality of the request.
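Since HandleRequest runs with the context returned by the root span, starting a span from that context makes it a child automatically. A minimal sketch under the same assumptions (the Director logic itself is elided):

```go
package epp

import (
	"context"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/attribute"
)

// handleRequest sketches the orchestration span; the real Director logic
// (admitRequest, Scheduler.Schedule, prepareRequest) is elided.
func handleRequest(ctx context.Context, targetModel, criticality string) error {
	ctx, span := otel.Tracer("gateway-api-inference-extension").
		Start(ctx, "gateway.request_orchestration")
	defer span.End()

	span.SetAttributes(
		attribute.String("target_model", targetModel),
		attribute.String("request_criticality", criticality),
	)
	_ = ctx // the derived ctx would be passed down to the scheduling layer
	return nil
}
```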
[WIP] Step 3: Queueing/fairness layer
Under development, more details will be added.
Span Name: gateway.queueing_fairness
Attributes: TBD
Step 4: Scheduling (plugin)
This span provides detailed insights into the Scheduling Subsystem of EPP.
Span Name: gateway.scheduling
Attributes:
- gateway.scheduling.pod: The inference server pod under the InferencePool that is scheduled to serve the request.
- gateway.scheduling.profile: The selected profile to schedule the request.
- server.address: GenAI server address of the selected pod.
An example plugins ConfigMap:
```yaml
apiVersion: v1
kind: ConfigMap
data:
  default-plugins.yaml: |
    apiVersion: inference.networking.x-k8s.io/v1alpha1
    kind: EndpointPickerConfig
    plugins:
    - type: queue-scorer
    - type: kv-cache-utilization-scorer
    - type: prefix-cache-scorer
    schedulingProfiles:
    - name: default
      plugins:
      - pluginRef: queue-scorer
      - pluginRef: kv-cache-utilization-scorer
      - pluginRef: prefix-cache-scorer
```
The scheduling subsystem's architecture allows for pluggable scheduling algorithms. The EPP should pass the context carrying tracing metadata down to the scheduling code. The Scheduler iterates over the SchedulerProfiles, and each scheduler plugin execution should be treated as a separate child span under the same gateway.scheduling parent span. A plugin can be either a Filter, a Score, or a Pick plugin.
I propose the following naming convention for plugin span names:
scheduling_plugin_<plugin name>
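As an illustration of this convention, each plugin execution could be wrapped in its own child span; a minimal sketch assuming the OTel Go SDK and a hypothetical plugin interface (the real Filter/Score/Pick interfaces differ):

```go
package epp

import (
	"context"

	"go.opentelemetry.io/otel"
)

// schedulingPlugin is a hypothetical stand-in for the Filter/Score/Pick
// plugin interfaces in the scheduling subsystem.
type schedulingPlugin interface {
	Name() string
	Run(ctx context.Context) error
}

// runPlugins wraps every plugin execution in a child span of the
// gateway.scheduling span carried by ctx.
func runPlugins(ctx context.Context, plugins []schedulingPlugin) error {
	tracer := otel.Tracer("gateway-api-inference-extension")
	for _, p := range plugins {
		pluginCtx, span := tracer.Start(ctx, "scheduling_plugin_"+p.Name())
		err := p.Run(pluginCtx)
		if err != nil {
			span.RecordError(err)
		}
		span.End()
		if err != nil {
			return err
		}
	}
	return nil
}
```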
Step 5: Post response
Post response happens after receiving the response from model servers (Code link). It also contributes to the end-to-end latency of the request because it is invoked under the ext_proc HandleResponseXXX functions. Similar to the scheduling plugins, the EPP iterates over all registered plugins, so each plugin's execution will be in a separate span. If model server tracing is enabled, this span will be appended after the model server's spans.
Span Name: gateway.post_response
Child Span Name: post_response_plugin_<plugin name>
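The response path could follow the same pattern; a hedged sketch with a hypothetical PostResponse plugin interface:

```go
package epp

import (
	"context"

	"go.opentelemetry.io/otel"
)

// postResponsePlugin is a hypothetical stand-in for the registered
// PostResponse plugin interface.
type postResponsePlugin interface {
	Name() string
	PostResponse(ctx context.Context) error
}

// runPostResponse opens gateway.post_response and one child span per plugin.
func runPostResponse(ctx context.Context, plugins []postResponsePlugin) {
	tracer := otel.Tracer("gateway-api-inference-extension")
	ctx, parent := tracer.Start(ctx, "gateway.post_response")
	defer parent.End()

	for _, p := range plugins {
		pluginCtx, span := tracer.Start(ctx, "post_response_plugin_"+p.Name())
		if err := p.PostResponse(pluginCtx); err != nil {
			span.RecordError(err)
		}
		span.End()
	}
}
```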
Detailed breakdown of span for BBR (Body based routing)
Common attributes
- service.name: gateway-api-inference-extension.
- service.version: Release version of the Inference Gateway.
- gen_ai.request.model: Model name in the request.
- gateway.inferencepool: The name of the InferencePool that received the forwarded traffic.
- gateway.streaming: Boolean value indicating whether the request is a streaming request.
Step 1: Ext_proc starts processing the request
Similar to the EPP, this is the entry point of the request and the root span of the trace.
Span Name: gateway.ext_proc.bbr.request
Attributes:
- gen_ai.usage.input_tokens: The number of tokens used in the GenAI input (prompt).
- gen_ai.usage.output_tokens: The number of tokens used in the GenAI response (completion).
BBR is a thinner layer compared to the EPP, so a single root span with all required attributes and status should be sufficient to start with.
Prompt/response logging
This feature will provide events (logs) correlated with the spans collected above. It follows the GenAI Events Semantic Convention: https://opentelemetry.io/docs/specs/semconv/gen-ai/gen-ai-events/.
When we use the OTel SDK to collect trace spans, we also capture the input and output from the IGW, which many backends can use for correlation. For example:
https://grafana.com/docs/grafana/latest/datasources/tempo/traces-in-grafana/trace-correlations/
In addition to signal correlation, the Prompt/response events can be used for auditing, evaluation, and compliance purposes.
Required Attributes:
- gen_ai.input.messages: GenAI input (prompt).
- gen_ai.output.messages: GenAI response (completion).
Common attributes like service name and server endpoint can also be added to the event if needed.
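As one possible shape, the content could be attached to the active span as a span event guarded by the capture flag; the event name below is a placeholder, and the convention also permits emitting these as log records:

```go
package epp

import (
	"os"

	"go.opentelemetry.io/otel/attribute"
	"go.opentelemetry.io/otel/trace"
)

// recordMessages attaches prompt/response content to the active span as a
// span event, only when content capture is explicitly enabled.
func recordMessages(span trace.Span, inputJSON, outputJSON string) {
	if os.Getenv("OTEL_INSTRUMENTATION_GENAI_CAPTURE_MESSAGE_CONTENT") != "true" {
		return
	}
	span.AddEvent("gen_ai.content", // placeholder event name
		trace.WithAttributes(
			attribute.String("gen_ai.input.messages", inputJSON),
			attribute.String("gen_ai.output.messages", outputJSON),
		))
}
```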
Connection with llm-d
Collaborate with the llm-d community on how to populate these attributes. In addition to populating the required attributes, the IGW should also be responsible for context propagation to preserve end-to-end trace continuity.
To prevent duplicate trace values/attributes, attributes that originate from the IGW should be defined in the IGW. We can use the OTel SDK environment variables (e.g., OTEL_RESOURCE_ATTRIBUTES) to inject customized attributes if needed.
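A sketch of what that propagation could look like when the EPP forwards a request to a model server, assuming the standard W3C TraceContext propagator:

```go
package epp

import (
	"context"
	"net/http"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/propagation"
)

func init() {
	// Register the W3C traceparent/baggage propagator (assumed default here).
	otel.SetTextMapPropagator(propagation.NewCompositeTextMapPropagator(
		propagation.TraceContext{}, propagation.Baggage{}))
}

// injectTraceContext copies the active trace context into the outbound
// request headers so model-server spans join the same trace.
func injectTraceContext(ctx context.Context, header http.Header) {
	otel.GetTextMapPropagator().Inject(ctx, propagation.HeaderCarrier(header))
}
```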
Trace and Prompt/response events enablement
The OTel SDK initialization will watch for the environment variable OTEL_EXPORTER_OTLP_ENDPOINT as the signal to set up the trace and event providers.
For prompt/response events, the IGW will watch for another environment variable, OTEL_INSTRUMENTATION_GENAI_CAPTURE_MESSAGE_CONTENT, which has become the common env var for content recording in OpenTelemetry.
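A minimal sketch of this gated initialization with the OTel Go SDK (exporter options and shutdown handling omitted):

```go
package epp

import (
	"context"
	"os"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc"
	sdktrace "go.opentelemetry.io/otel/sdk/trace"
)

// initTracing sets up the tracer provider only when an OTLP endpoint is
// configured; otherwise tracing stays disabled (the no-op default).
func initTracing(ctx context.Context) (*sdktrace.TracerProvider, error) {
	if os.Getenv("OTEL_EXPORTER_OTLP_ENDPOINT") == "" {
		return nil, nil // tracing not enabled
	}
	exporter, err := otlptracegrpc.New(ctx) // reads the endpoint from the env var
	if err != nil {
		return nil, err
	}
	tp := sdktrace.NewTracerProvider(sdktrace.WithBatcher(exporter))
	otel.SetTracerProvider(tp)
	return tp, nil
}
```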
Other flags
Sampling
By default, the parentbased_traceidratio sampler will be selected. The IGW will read the standard environment variable OTEL_TRACES_SAMPLER_ARG to set the sampling rate.
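A sketch of constructing that sampler explicitly; note the Go SDK can also pick these values up on its own via OTEL_TRACES_SAMPLER/OTEL_TRACES_SAMPLER_ARG, so the manual parsing here is illustrative:

```go
package epp

import (
	"os"
	"strconv"

	sdktrace "go.opentelemetry.io/otel/sdk/trace"
)

// newSampler builds the parentbased_traceidratio sampler, reading the ratio
// from OTEL_TRACES_SAMPLER_ARG and defaulting to 1.0 (sample everything).
func newSampler() sdktrace.Sampler {
	ratio := 1.0
	if arg := os.Getenv("OTEL_TRACES_SAMPLER_ARG"); arg != "" {
		if parsed, err := strconv.ParseFloat(arg, 64); err == nil {
			ratio = parsed
		}
	}
	return sdktrace.ParentBased(sdktrace.TraceIDRatioBased(ratio))
}
```

The sampler would then be passed to the tracer provider via sdktrace.WithSampler(newSampler()).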