
Proposal: Manual instrumentation for traces and logs with OTLP #1520

@JeffLuoo

Description

Objective

To provide detailed and insightful traces & events in the IGW (Inference Gateway) that are aligned with the llm-d distributed tracing proposal.

Requirements

  • Aligns with the llm-d distributed tracing proposal.
  • The trace & event solution should follow the OSS standard, namely the OpenTelemetry Semantic Conventions for GenAI spans: https://opentelemetry.io/docs/specs/semconv/gen-ai/gen-ai-spans/.
  • The trace spans should cover the critical steps of the request handling logic in the IGW, with each critical step in an individual span and useful information in its attributes. Not every internal function call and parameter will be covered.
  • Vendor agnostic.
  • Request tracing and prompt/response logging are opt-in.

Definition

The trace will track the end-to-end lifecycle of a request in the IGW, where end-to-end is defined as:

From

“Inference Gateway external processor starts processing the request”

to

“Inference Gateway external processor returns the response to users”.

Out of scope

  • Multi-modal support is out of scope; it can be a follow-up once the project supports multi-modal servers.

Request Flow

Reference https://gateway-api-inference-extension.sigs.k8s.io/ for the request flow.

[Request flow diagram]

This proposal covers steps 2, 3, and 4.

Detailed breakdown of span (with attributes) in EPP

The OpenTelemetry community recently merged a change that updated the span and event attributes for GenAI. The proposal below follows the latest semantic conventions. The order of the steps reflects the parent-child hierarchy of the spans.

Common attributes

  • service.name: gateway-api-inference-extension.
  • service.version: Release version of the Inference Gateway.
  • gen_ai.request.model: Model name in the request.
  • gateway.inferenceobjective: The InferenceObjective associated with the request; a default is set if none is found. The default InferenceObjective is named default_inferenceObjective.
  • gateway.inferencepool: The name of the InferencePool that received the forwarded traffic.
  • gateway.streaming: Boolean indicating whether the request is streaming.
  • gateway.model.rewrite: Value of the x-gateway-model-name-rewrite request header.

Step 1: Ext_proc starts processing the request

func (s *StreamingServer) Process(srv extProcPb.ExternalProcessor_ProcessServer) error

This will be the root span of the request handling process, as the Process function is the entry point of all requests in EPP.

Span Name: gateway.ext_proc.epp.request

Attributes:

  • gen_ai.usage.input_tokens: The number of tokens used in the GenAI input (prompt).
  • gen_ai.usage.output_tokens: The number of tokens used in the GenAI response (completion).

Step 2: Director HandleRequest orchestrates the request lifecycle

This process:

  1. Parses request details.
  2. Calls admitRequest for admission control.
  3. Calls Scheduler.Schedule if the request is approved.
  4. Calls prepareRequest to populate the RequestContext with the result and to call the PreRequest plugins.

Span Name: gateway.request_orchestration

Attributes:

  • target_model: The resolved target model name.
  • request_criticality: The criticality of the request.

[WIP] Step 3: Queueing/fairness layer

Under development, more details will be added.

Span Name: gateway.queueing_fairness

Attributes: TBD

Step 4: Scheduling (plugin)

This span provides detailed insights into the Scheduling Subsystem of EPP.

Span Name: gateway.scheduling

Attributes:

  • gateway.scheduling.pod: The Inference Server pod under inferencepool that is scheduled to serve the request.
  • gateway.scheduling.profile: The selected profile to schedule the request.
  • server.address: GenAI server address of the selected pod.

An example of a plugins ConfigMap:

apiVersion: v1
data:
  default-plugins.yaml: |
    apiVersion: inference.networking.x-k8s.io/v1alpha1
    kind: EndpointPickerConfig
    plugins:
    - type: queue-scorer
    - type: kv-cache-utilization-scorer
    - type: prefix-cache-scorer
    schedulingProfiles:
    - name: default
      plugins:
      - pluginRef: queue-scorer
      - pluginRef: kv-cache-utilization-scorer
      - pluginRef: prefix-cache-scorer
kind: ConfigMap

The scheduling subsystem has an architecture that allows for pluggable scheduling algorithms. The EPP should pass the context with tracing metadata down to the child operations. The Scheduler iterates over Scheduler Profiles, and each scheduler plugin execution should be treated as a separate child span connected under the same parent span, gateway.scheduling. A plugin can be a Filter, Score, or Pick plugin.

I propose the following naming convention for plugin span names:

scheduling_plugin_<plugin name>

Step 5: Post response

Post response happens after the response is received from the model servers (code link). It also contributes to the end-to-end latency of the request because it is invoked under the ext_proc HandleResponseXXX functions. Similar to the scheduling plugins, the EPP iterates over all registered plugins, so each plugin's execution gets its own span. If model server tracing is enabled, this span will appear after the model server's spans.

Span Name: gateway.post_response

Child Span Name: post_response_plugin_<plugin name>

Detailed breakdown of span for BBR (Body based routing)

Common attributes

  • service.name: gateway-api-inference-extension.
  • service.version: Release version of the Inference Gateway.
  • gen_ai.request.model: Model name in the request.
  • gateway.inferencepool: The name of inferencepool received the forwarded traffic.
  • gateway.streaming: Boolean value that indicates the streaming request.

Step 1: Ext_proc starts processing the request

Similar to EPP, this is the entrypoint of the request and the root span of the trace.

Span Name: gateway.ext_proc.bbr.request

Attributes:

  • gen_ai.usage.input_tokens: The number of tokens used in the GenAI input (prompt).
  • gen_ai.usage.output_tokens: The number of tokens used in the GenAI response (completion).

BBR is a thinner layer than the EPP; a single root span with all required attributes and status should be sufficient to start with.

Prompt/response logging

This feature provides events (logs) correlated with the spans collected above. It follows the GenAI Events semantic convention: https://opentelemetry.io/docs/specs/semconv/gen-ai/gen-ai-events/.

When we use the OTel SDK to collect trace spans, we also capture the input and output from the IGW, which many backends can use for correlation. For example:
https://grafana.com/docs/grafana/latest/datasources/tempo/traces-in-grafana/trace-correlations/

In addition to signal correlation, the Prompt/response events can be used for auditing, evaluation, and compliance purposes.

Required Attributes:

  • gen_ai.input.messages: GenAI input (prompt).
  • gen_ai.output.messages: GenAI response (completion).

Common attributes like service name and server endpoint can also be added to the event if needed.

Connection with llm-d

We will collaborate with the llm-d community on how to populate such attributes. In addition to populating the required attributes, the IGW should also be responsible for context propagation to preserve end-to-end trace continuity.

To prevent duplicate trace values/attributes, attributes that originate from the IGW should be defined in the IGW. We can use the OTel SDK environment variable OTEL_RESOURCE_ATTRIBUTES to inject customized attributes if needed.

Trace and Prompt/response events enablement

The OTel SDK initialization will watch for the environment variable OTEL_EXPORTER_OTLP_ENDPOINT as the signal to set up the trace and event providers.

For prompt/response events, the IGW will watch for another environment variable, OTEL_INSTRUMENTATION_GENAI_CAPTURE_MESSAGE_CONTENT, which is becoming the common environment variable for content recording in OpenTelemetry.

Other flags

Sampling

By default, the parentbased_traceidratio sampler will be selected. The IGW will read the standard environment variable OTEL_TRACES_SAMPLER_ARG to set the sampling rate.
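For example, to sample 10% of new traces while honoring the sampling decision of the incoming parent span (both variables are standard OTel SDK configuration):

```shell
export OTEL_TRACES_SAMPLER=parentbased_traceidratio
export OTEL_TRACES_SAMPLER_ARG=0.1
```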
