Objective
To provide detailed and insightful traces & events in the IGW, aligned with the llm-d distributed tracing proposal.
Requirements
- Aligns with the llm-d distributed tracing proposal.
- The trace & event solution should follow the OSS standard, namely the OpenTelemetry Semantic Conventions for GenAI spans: https://opentelemetry.io/docs/specs/semconv/gen-ai/gen-ai-spans/.
- The trace spans should cover the critical steps of the request handling logic in the IGW, with each critical step in its own span and useful information in attributes. Not every internal function call and parameter will be covered.
- Vendor Agnostic.
- Request tracing and prompt/response logging are opt-in.
Definition
The trace will track the end-to-end lifecycle of a request in the IGW, where end-to-end is defined as:
from
“Inference Gateway external processor starts processing the request”
to
“Inference Gateway external processor returns the response to the user”.
Out of scope
- Multi-modal support is not in scope; it can be a follow-up to the project's support for multi-modal servers.
Request Flow
Reference https://gateway-api-inference-extension.sigs.k8s.io/ for the request flow.

This proposal covers steps 2, 3, and 4.
Detailed breakdown of span (with attributes) in EPP
The OpenTelemetry community recently merged a change that updated the span and event attributes for GenAI. The proposal below follows the latest semantic convention. The order of the steps reflects the parent-child hierarchy of the spans.
Common attributes
- service.name: gateway-api-inference-extension.
- service.version: Release version of the Inference Gateway.
- gen_ai.request.model: Model name in the request.
- gateway.inferenceobjective: The InferenceObjective associated with the request; a default one is set if none is found. The default InferenceObjective is named default_inferenceObjective.
- gateway.inferencepool: The name of the InferencePool that received the forwarded traffic.
- gateway.streaming: Boolean value indicating whether the request is a streaming request.
- gateway.model.rewrite: Value of the x-gateway-model-name-rewrite request header.
Step 1: Ext_proc starts processing the request
```go
func (s *StreamingServer) Process(srv extProcPb.ExternalProcessor_ProcessServer) error
```
This will be the root span of the request handling process, as the Process function is the entry point of all requests in EPP.
Span Name: gateway.ext_proc.epp.request
Attributes:
- gen_ai.usage.input_tokens: The number of tokens used in the GenAI input (prompt).
- gen_ai.usage.output_tokens: The number of tokens used in the GenAI response (completion).
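For illustration, a minimal sketch of opening and closing this root span with the OpenTelemetry Go SDK; the tracer name and helper functions are assumptions, not the actual EPP wiring:

```go
package epp

import (
	"context"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/attribute"
	"go.opentelemetry.io/otel/trace"
)

var tracer = otel.Tracer("gateway-api-inference-extension")

// startRequestSpan opens the root span for one ext_proc request.
// Hypothetical helper; names and wiring are illustrative only.
func startRequestSpan(ctx context.Context, model string, streaming bool) (context.Context, trace.Span) {
	return tracer.Start(ctx, "gateway.ext_proc.epp.request",
		trace.WithAttributes(
			attribute.String("gen_ai.request.model", model),
			attribute.Bool("gateway.streaming", streaming),
		))
}

// finishRequestSpan records token usage once the response is complete.
func finishRequestSpan(span trace.Span, inputTokens, outputTokens int) {
	span.SetAttributes(
		attribute.Int("gen_ai.usage.input_tokens", inputTokens),
		attribute.Int("gen_ai.usage.output_tokens", outputTokens),
	)
	span.End()
}
```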
Step 2: Director HandleRequest orchestrates the request lifecycle
This process:
- Parses request details.
- Calls admitRequest for admission control.
- Calls Scheduler.Schedule if the request is approved.
- Calls prepareRequest to populate the RequestContext with the result and call the PreRequest plugins.
Span Name: gateway.request_orchestration
Attributes:
- target_model: The resolved target model name.
- request_criticality: The criticality of the request.
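Since HandleRequest runs with the context returned by the root span, starting a span from that context makes it a child automatically. A minimal sketch under the same assumptions (the Director logic itself is elided):

```go
package epp

import (
	"context"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/attribute"
)

// handleRequest sketches the orchestration span; the real Director logic
// (admitRequest, Scheduler.Schedule, prepareRequest) is elided.
func handleRequest(ctx context.Context, targetModel, criticality string) error {
	ctx, span := otel.Tracer("gateway-api-inference-extension").
		Start(ctx, "gateway.request_orchestration")
	defer span.End()

	span.SetAttributes(
		attribute.String("target_model", targetModel),
		attribute.String("request_criticality", criticality),
	)
	_ = ctx // the derived ctx would be passed down to the scheduling layer
	return nil
}
```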
[WIP] Step 3: Queueing/fairness layer
Under development, more details will be added.
Span Name: gateway.queueing_fairness
Attributes: TBD
Step 4: Scheduling (plugin)
This span provides detailed insights into the Scheduling Subsystem of EPP.
Span Name: gateway.scheduling
Attributes:
- gateway.scheduling.pod: The inference server pod under the InferencePool that is scheduled to serve the request.
- gateway.scheduling.profile: The selected profile to schedule the request.
- server.address: GenAI server address of the selected pod.
An example plugins ConfigMap:
```yaml
apiVersion: v1
kind: ConfigMap
data:
  default-plugins.yaml: |
    apiVersion: inference.networking.x-k8s.io/v1alpha1
    kind: EndpointPickerConfig
    plugins:
    - type: queue-scorer
    - type: kv-cache-utilization-scorer
    - type: prefix-cache-scorer
    schedulingProfiles:
    - name: default
      plugins:
      - pluginRef: queue-scorer
      - pluginRef: kv-cache-utilization-scorer
      - pluginRef: prefix-cache-scorer
```
The scheduling subsystem's architecture allows for pluggable scheduling algorithms. The EPP should pass the context carrying tracing metadata down to the scheduling code. The Scheduler iterates over the SchedulerProfiles, and each scheduler plugin execution should be treated as a separate child span under the same gateway.scheduling parent span. A plugin can be either a Filter, a Score, or a Pick plugin.
I propose the following naming convention for plugin span names:
scheduling_plugin_<plugin name>
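As an illustration of this convention, each plugin execution could be wrapped in its own child span; a minimal sketch assuming the OTel Go SDK and a hypothetical plugin interface (the real Filter/Score/Pick interfaces differ):

```go
package epp

import (
	"context"

	"go.opentelemetry.io/otel"
)

// schedulingPlugin is a hypothetical stand-in for the Filter/Score/Pick
// plugin interfaces in the scheduling subsystem.
type schedulingPlugin interface {
	Name() string
	Run(ctx context.Context) error
}

// runPlugins wraps every plugin execution in a child span of the
// gateway.scheduling span carried by ctx.
func runPlugins(ctx context.Context, plugins []schedulingPlugin) error {
	tracer := otel.Tracer("gateway-api-inference-extension")
	for _, p := range plugins {
		pluginCtx, span := tracer.Start(ctx, "scheduling_plugin_"+p.Name())
		err := p.Run(pluginCtx)
		if err != nil {
			span.RecordError(err)
		}
		span.End()
		if err != nil {
			return err
		}
	}
	return nil
}
```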
Step 5: Post response
Post response happens after receiving the response from model servers (Code link). It also contributes to the end-to-end latency of the request because it is invoked under the ext_proc HandleResponseXXX functions. Similar to the scheduling plugins, the EPP iterates over all registered plugins, so each plugin's execution will be in a separate span. If model server tracing is enabled, this span will be appended after the model server's spans.
Span Name: gateway.post_response
Child Span Name: post_response_plugin_<plugin name>
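The response path could follow the same pattern; a hedged sketch with a hypothetical PostResponse plugin interface:

```go
package epp

import (
	"context"

	"go.opentelemetry.io/otel"
)

// postResponsePlugin is a hypothetical stand-in for the registered
// PostResponse plugin interface.
type postResponsePlugin interface {
	Name() string
	PostResponse(ctx context.Context) error
}

// runPostResponse opens gateway.post_response and one child span per plugin.
func runPostResponse(ctx context.Context, plugins []postResponsePlugin) {
	tracer := otel.Tracer("gateway-api-inference-extension")
	ctx, parent := tracer.Start(ctx, "gateway.post_response")
	defer parent.End()

	for _, p := range plugins {
		pluginCtx, span := tracer.Start(ctx, "post_response_plugin_"+p.Name())
		if err := p.PostResponse(pluginCtx); err != nil {
			span.RecordError(err)
		}
		span.End()
	}
}
```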
Detailed breakdown of span for BBR (Body based routing)
Common attributes
- service.name: gateway-api-inference-extension.
- service.version: Release version of the Inference Gateway.
- gen_ai.request.model: Model name in the request.
- gateway.inferencepool: The name of the InferencePool that received the forwarded traffic.
- gateway.streaming: Boolean value indicating whether the request is a streaming request.
Step 1: Ext_proc starts processing the request
Similar to the EPP, this is the entry point of the request and the root span of the trace.
Span Name: gateway.ext_proc.bbr.request
Attributes:
- gen_ai.usage.input_tokens: The number of tokens used in the GenAI input (prompt).
- gen_ai.usage.output_tokens: The number of tokens used in the GenAI response (completion).
BBR is a thinner layer compared to the EPP, so a single root span with all required attributes and status should be sufficient to start with.
Prompt/response logging
This feature will provide events (logs) correlated with the spans collected above. It follows the GenAI Events Semantic Convention: https://opentelemetry.io/docs/specs/semconv/gen-ai/gen-ai-events/.
When we use the OTel SDK to collect trace spans, we also capture the input and output from the IGW, which many backends can use for correlation. For example:
https://grafana.com/docs/grafana/latest/datasources/tempo/traces-in-grafana/trace-correlations/
In addition to signal correlation, the Prompt/response events can be used for auditing, evaluation, and compliance purposes.
Required Attributes:
- gen_ai.input.messages: GenAI input (prompt).
- gen_ai.output.messages: GenAI response (completion).
Common attributes like service name and server endpoint can also be added to the event if needed.
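As one possible shape, the content could be attached to the active span as a span event guarded by the capture flag; the event name below is a placeholder, and the convention also permits emitting these as log records:

```go
package epp

import (
	"os"

	"go.opentelemetry.io/otel/attribute"
	"go.opentelemetry.io/otel/trace"
)

// recordMessages attaches prompt/response content to the active span as a
// span event, only when content capture is explicitly enabled.
func recordMessages(span trace.Span, inputJSON, outputJSON string) {
	if os.Getenv("OTEL_INSTRUMENTATION_GENAI_CAPTURE_MESSAGE_CONTENT") != "true" {
		return
	}
	span.AddEvent("gen_ai.content", // placeholder event name
		trace.WithAttributes(
			attribute.String("gen_ai.input.messages", inputJSON),
			attribute.String("gen_ai.output.messages", outputJSON),
		))
}
```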
Connection with llm-d
Collaborate with the llm-d community on how to populate these attributes. In addition to populating the required attributes, the IGW should also be responsible for context propagation to preserve end-to-end trace continuity.
To prevent duplicate trace values/attributes, attributes that originate from the IGW should be defined in the IGW. We can use the OTel SDK environment variables (e.g., OTEL_RESOURCE_ATTRIBUTES) to inject customized attributes if needed.
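A sketch of what that propagation could look like when the EPP forwards a request to a model server, assuming the standard W3C TraceContext propagator:

```go
package epp

import (
	"context"
	"net/http"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/propagation"
)

func init() {
	// Register the W3C traceparent/baggage propagator (assumed default here).
	otel.SetTextMapPropagator(propagation.NewCompositeTextMapPropagator(
		propagation.TraceContext{}, propagation.Baggage{}))
}

// injectTraceContext copies the active trace context into the outbound
// request headers so model-server spans join the same trace.
func injectTraceContext(ctx context.Context, header http.Header) {
	otel.GetTextMapPropagator().Inject(ctx, propagation.HeaderCarrier(header))
}
```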
Trace and Prompt/response events enablement
The OTel SDK initialization will watch for the environment variable OTEL_EXPORTER_OTLP_ENDPOINT as the signal to set up the trace and event providers.
For prompt/response events, the IGW will watch for another environment variable, OTEL_INSTRUMENTATION_GENAI_CAPTURE_MESSAGE_CONTENT, which has become the common env var for content recording in OpenTelemetry.
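A minimal sketch of this gated initialization with the OTel Go SDK (exporter options and shutdown handling omitted):

```go
package epp

import (
	"context"
	"os"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc"
	sdktrace "go.opentelemetry.io/otel/sdk/trace"
)

// initTracing sets up the tracer provider only when an OTLP endpoint is
// configured; otherwise tracing stays disabled (the no-op default).
func initTracing(ctx context.Context) (*sdktrace.TracerProvider, error) {
	if os.Getenv("OTEL_EXPORTER_OTLP_ENDPOINT") == "" {
		return nil, nil // tracing not enabled
	}
	exporter, err := otlptracegrpc.New(ctx) // reads the endpoint from the env var
	if err != nil {
		return nil, err
	}
	tp := sdktrace.NewTracerProvider(sdktrace.WithBatcher(exporter))
	otel.SetTracerProvider(tp)
	return tp, nil
}
```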
Other flags
Sampling
By default, the parentbased_traceidratio sampler will be selected. The IGW will read the standard environment variable OTEL_TRACES_SAMPLER_ARG to set the sampling rate.
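A sketch of constructing that sampler explicitly; note the Go SDK can also pick these values up on its own via OTEL_TRACES_SAMPLER/OTEL_TRACES_SAMPLER_ARG, so the manual parsing here is illustrative:

```go
package epp

import (
	"os"
	"strconv"

	sdktrace "go.opentelemetry.io/otel/sdk/trace"
)

// newSampler builds the parentbased_traceidratio sampler, reading the ratio
// from OTEL_TRACES_SAMPLER_ARG and defaulting to 1.0 (sample everything).
func newSampler() sdktrace.Sampler {
	ratio := 1.0
	if arg := os.Getenv("OTEL_TRACES_SAMPLER_ARG"); arg != "" {
		if parsed, err := strconv.ParseFloat(arg, 64); err == nil {
			ratio = parsed
		}
	}
	return sdktrace.ParentBased(sdktrace.TraceIDRatioBased(ratio))
}
```

The sampler would then be passed to the tracer provider via sdktrace.WithSampler(newSampler()).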