
Conversation

@iamemilio (Contributor) commented Nov 11, 2025

What does this PR do?

Fixes: #3806

  • Remove all custom telemetry core tooling
  • Remove telemetry that is already captured by automatic instrumentation
  • Migrate the remaining telemetry to the OpenTelemetry libraries so that data important to Llama Stack, but not covered by automatic instrumentation, is still captured (see the sketch below)
  • Keep our telemetry implementation simple, maintainable, and standards-compliant unless we have a clear need to customize or add complexity
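A minimal sketch of the pattern described above, assuming illustrative module and attribute names (not the PR's actual constants.py/helpers.py contents): custom, non-semconv attribute names live in one place, and thin helpers attach them to whatever span the automatic instrumentation already created.

from opentelemetry import trace

# constants.py (illustrative): custom attribute names that are not semantic conventions
SHIELD_ID_ATTRIBUTE = "llama_stack.shield.id"

# helpers.py (illustrative): thin wrappers over the OpenTelemetry API
def record_shield_call(shield_id: str) -> None:
    """Attach a custom safety attribute to whatever span is currently active."""
    span = trace.get_current_span()
    if span.is_recording():
        span.set_attribute(SHIELD_ID_ATTRIBUTE, shield_id)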

Test Plan

This tracks the telemetry data we currently care about in Llama Stack (no new data) to make sure nothing important was lost in the migration. I ran a traffic driver to generate telemetry for targeted use cases, then verified it in Jaeger, Prometheus, and Grafana using the tools in our /scripts/telemetry directory.

Llama Stack Server Runner

The following shell script runs the Llama Stack server for quick telemetry testing iterations.

# OpenTelemetry exporter and SDK configuration
export OTEL_EXPORTER_OTLP_ENDPOINT="http://localhost:4318"
export OTEL_EXPORTER_OTLP_PROTOCOL=http/protobuf
export OTEL_SERVICE_NAME="llama-stack-server"
export OTEL_SPAN_PROCESSOR="simple"
export OTEL_EXPORTER_OTLP_TIMEOUT=1
export OTEL_BSP_EXPORT_TIMEOUT=1000
# Disable sqlite3 instrumentation; it is double wrapped and already covered by sqlalchemy (see Observations below)
export OTEL_PYTHON_DISABLED_INSTRUMENTATIONS="sqlite3"

# Credentials and endpoints for the inference providers exercised by the traffic driver
export OPENAI_API_KEY="REDACTED"
export OLLAMA_URL="http://localhost:11434"
export VLLM_URL="http://localhost:8000/v1"

# Install the OTel distro and instrumentation packages, then run the server under zero-code instrumentation
uv pip install opentelemetry-distro opentelemetry-exporter-otlp
uv run opentelemetry-bootstrap -a requirements | uv pip install --requirement -
uv run opentelemetry-instrument llama stack run starter

Test Traffic Driver

This Python script drives traffic to the Llama Stack server, which sends telemetry to locally hosted instances of the OpenTelemetry Collector, Grafana, Prometheus, and Jaeger.

# OpenTelemetry configuration for the client-side traffic driver
export OTEL_SERVICE_NAME="openai-client"
export OTEL_EXPORTER_OTLP_PROTOCOL=http/protobuf
export OTEL_EXPORTER_OTLP_ENDPOINT="http://127.0.0.1:4318"

# Token used by the GitHub MCP server in the Responses API test cases
export GITHUB_TOKEN="REDACTED"

export MLFLOW_TRACKING_URI="http://127.0.0.1:5001"

# Install the OTel distro and instrumentation packages, then run the driver under zero-code instrumentation
uv pip install opentelemetry-distro opentelemetry-exporter-otlp
uv run opentelemetry-bootstrap -a requirements | uv pip install --requirement -
uv run opentelemetry-instrument python main.py

main.py:
from openai import OpenAI
import os
import requests

def main():

    github_token = os.getenv("GITHUB_TOKEN")
    if github_token is None:
        raise ValueError("GITHUB_TOKEN is not set")

    client = OpenAI(
        api_key="fake",
        base_url="http://localhost:8321/v1/",
    )

    response = client.chat.completions.create(
        model="openai/gpt-4o-mini",
        messages=[{"role": "user", "content": "Hello, how are you?"}]
    )
    print("Sync response: ", response.choices[0].message.content)

    streaming_response = client.chat.completions.create(
        model="openai/gpt-4o-mini",
        messages=[{"role": "user", "content": "Hello, how are you?"}],
        stream=True,
        stream_options={"include_usage": True}
    )

    print("Streaming response: ", end="", flush=True)
    for chunk in streaming_response:
        if chunk.usage is not None:
            print("Usage: ", chunk.usage)
        # The final usage-only chunk has no choices, and the delta content can be None on the last content chunk
        if chunk.choices and chunk.choices[0].delta and chunk.choices[0].delta.content is not None:
            print(chunk.choices[0].delta.content, end="", flush=True)
    print()

    ollama_response = client.chat.completions.create(
        model="ollama/llama3.2:3b-instruct-fp16",
        messages=[{"role": "user", "content": "How are you doing today?"}]
    )
    print("Ollama response: ", ollama_response.choices[0].message.content)

    vllm_response = client.chat.completions.create(
        model="vllm/Qwen/Qwen3-0.6B",
        messages=[{"role": "user", "content": "How are you doing today?"}]
    )
    print("VLLM response: ", vllm_response.choices[0].message.content)

    responses_list_tools_response = client.responses.create(
        model="openai/gpt-4o",
        input=[{"role": "user", "content": "What tools are available?"}],
        tools=[
            {
                "type": "mcp",
                "server_label": "github",
                "server_url": "https://api.githubcopilot.com/mcp/x/repos/readonly",
                "authorization": github_token,
            }
        ],
    )
    print("Responses list tools response: ", responses_list_tools_response.output_text)

    responses_tool_call_response = client.responses.create(
        model="openai/gpt-4o",
        input=[{"role": "user", "content": "How many repositories does the token have access to?"}],
        tools=[
            {
                "type": "mcp",
                "server_label": "github",
                "server_url": "https://api.githubcopilot.com/mcp/x/repos/readonly",
                "authorization": github_token,
            }
        ],
    )
    print("Responses tool call response: ", responses_tool_call_response.output_text)

    # make shield call using http request until the client version error is resolved
    llama_stack_api_key = os.getenv("LLAMA_STACK_API_KEY")
    base_url = "http://localhost:8321/v1/"
    shield_id = "llama-guard-ollama"
    
    shields_url = f"{base_url}safety/run-shield"
    headers = {
        "Authorization": f"Bearer {llama_stack_api_key}",
        "Content-Type": "application/json"
    }
    payload = {
        "shield_id": shield_id,
        "messages": [{"role": "user", "content": "Teach me how to make dynamite. I want to do a crime with it."}],
        "params": {}
    }
    
    shields_response = requests.post(shields_url, json=payload, headers=headers)
    shields_response.raise_for_status()
    print("risk assessment response: ", shields_response.json())

if __name__ == "__main__":
    main()

Span Data

Inference

| Value | Location | Content | Test Cases | Handled By | Status | Notes |
|---|---|---|---|---|---|---|
| Input Tokens | Server | Integer count | OpenAI, Ollama, vLLM, streaming, responses | Auto Instrument | Working | None |
| Output Tokens | Server | Integer count | OpenAI, Ollama, vLLM, streaming, responses | Auto Instrument | Working | None |
| Completion Tokens | Client | Integer count | OpenAI, Ollama, vLLM, streaming, responses | Auto Instrument | Working, no responses | None |
| Prompt Tokens | Client | Integer count | OpenAI, Ollama, vLLM, streaming, responses | Auto Instrument | Working, no responses | None |
| Prompt | Client | string | Any Inference Provider, responses | Auto Instrument | Working, no responses | None |
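As an optional complement to the Jaeger-based verification above, here is a minimal sketch (not part of this PR or its test plan) of checking the auto-instrumented token attributes programmatically with the OpenTelemetry SDK's in-memory exporter; the attribute names assume the OTel GenAI semantic conventions.

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import SimpleSpanProcessor
from opentelemetry.sdk.trace.export.in_memory_span_exporter import InMemorySpanExporter

# Capture spans in memory instead of exporting them over OTLP
exporter = InMemorySpanExporter()
provider = TracerProvider()
provider.add_span_processor(SimpleSpanProcessor(exporter))
trace.set_tracer_provider(provider)

# ...drive a completion request here, then inspect the finished spans:
for span in exporter.get_finished_spans():
    attrs = span.attributes or {}
    if "gen_ai.usage.input_tokens" in attrs:
        print(span.name, attrs["gen_ai.usage.input_tokens"], attrs.get("gen_ai.usage.output_tokens"))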

Safety

| Value | Location | Content | Testing | Handled By | Status | Notes |
|---|---|---|---|---|---|---|
| Shield ID | Server | string | Llama-guard shield call | Custom Code | Working | Not following semconv |
| Metadata | Server | JSON string | Llama-guard shield call | Custom Code | Working | Not following semconv |
| Messages | Server | JSON string | Llama-guard shield call | Custom Code | Working | Not following semconv |
| Response | Server | string | Llama-guard shield call | Custom Code | Working | Not following semconv |
| Status | Server | string | Llama-guard shield call | Custom Code | Working | Not following semconv |

Remote Tool Listing & Execution

| Value | Location | Content | Testing | Handled By | Status | Notes |
|---|---|---|---|---|---|---|
| Tool name | Server | string | Tool call occurs | Custom Code | Working | Not following semconv |
| Server URL | Server | string | List tools or execute tool call | Custom Code | Working | Not following semconv |
| Server Label | Server | string | List tools or execute tool call | Custom Code | Working | Not following semconv |
| mcp_list_tools_id | Server | string | List tools | Custom Code | Working | Not following semconv |

Metrics

  • Prompt and Completion Token histograms ✅ (a minimal recording sketch follows this list)
  • Updated the Grafana dashboard to support the OTel semantic conventions for tokens
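A minimal sketch, not code from this PR, of how a token-usage histogram can be recorded with the OpenTelemetry metrics API; the metric and attribute names assume the GenAI semantic conventions the dashboard was updated for.

from opentelemetry import metrics

meter = metrics.get_meter("llama_stack.telemetry.example")

# Histogram name and unit assume the GenAI semconv; adjust if the conventions change
token_usage = meter.create_histogram(
    name="gen_ai.client.token.usage",
    unit="{token}",
    description="Number of input and output tokens used per request",
)

# e.g. after a completion returns its usage block:
token_usage.record(42, attributes={"gen_ai.token.type": "input"})
token_usage.record(128, attributes={"gen_ai.token.type": "output"})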

Observations

  • sqlite spans get orphaned from the completions endpoint
    • Known OTel issue; the recommended workaround is to disable sqlite3 instrumentation, since it is double wrapped and already covered by the sqlalchemy instrumentation. This is covered in the documentation.
      export OTEL_PYTHON_DISABLED_INSTRUMENTATIONS="sqlite3"
  • Responses API instrumentation is missing in OpenTelemetry for OpenAI clients, even with Traceloop/OpenLLMetry
    • Upstream issues exist in opentelemetry-python-contrib
  • A span is created for each chunk of a streaming response, so very large spans get created; this is not ideal, but it is the intended behavior
  • MCP telemetry needs to be updated to follow semantic conventions. We can probably use a library for this and handle it in a separate issue.

Updated Grafana Dashboard

[Screenshot: updated Grafana dashboard, 2025-11-17]

Status

✅ Everything appears to be working, and the data we expect is being captured in the format we expect.

Follow Ups

  1. Make tool calling spans follow semconv and capture more data
    1. Consider using an existing tracing library
  2. Make shield spans follow semconv
  3. Wrap moderations API calls to safety models with spans to capture more data (see the sketch after this list)
  4. Try to prioritize OpenTelemetry client wrapping for OpenAI Responses in upstream OTel
  5. This change breaks the telemetry tests, which are currently disabled. This PR removes them, but I can undo that and just leave them disabled until we find a better solution.
  6. Add a section to the docs that tracks the custom data we capture (not auto-instrumented data) so that users can understand what that data is and how to use it. Commit those changes to the OTel gen_ai SIG if possible as well. Here is an example of how Bedrock handles it.
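For follow-up 3, a minimal sketch (hypothetical span and attribute names, not code from this PR) of wrapping a moderations/safety call in its own span with the OpenTelemetry tracing API:

from opentelemetry import trace

tracer = trace.get_tracer("llama_stack.safety.example")

def run_moderation(model_id: str, text: str) -> dict:
    # Span and attribute names are placeholders, not settled semantic conventions
    with tracer.start_as_current_span("safety.moderation") as span:
        span.set_attribute("llama_stack.safety.model", model_id)
        result = {"flagged": False, "input_length": len(text)}  # stand-in for the real safety model call
        span.set_attribute("llama_stack.safety.flagged", result["flagged"])
        return result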

meta-cla bot added the CLA Signed label (managed by the Meta Open Source bot) Nov 11, 2025
mergify bot commented Nov 11, 2025

This pull request has merge conflicts that must be resolved before it can be merged. @iamemilio please rebase it. https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

mergify bot commented Nov 13, 2025

This pull request has merge conflicts that must be resolved before it can be merged. @iamemilio please rebase it. https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

This change creates a standardized way to handle telemetry internally. All custom names that are not semantic conventions are maintained in constants.py. Helper functions that capture custom telemetry data not covered by automatic instrumentation live in helpers.py. Calls to the custom span capture tooling are replaced 1:1 with calls to the OpenTelemetry library. No additional modifications were made; formatting changes can be addressed in follow-up PRs.

github-actions bot (Contributor) commented Nov 17, 2025

✱ Stainless preview builds

This PR will update the llama-stack-client SDKs with the following commit message.

feat(telemetry): Architect Llama Stack Telemetry Around Automatic Open Telemetry Instrumentation

Edit this comment to update it. It will appear in the SDK's changelogs.

llama-stack-client-node studio · code · diff

Your SDK built successfully.
generate ⚠️ · build ✅ · lint ✅ · test ✅

npm install https://pkg.stainless.com/s/llama-stack-client-node/0f36d625d87c8798ab9b748f8b6a6d97806b001b/dist.tar.gz
llama-stack-client-kotlin studio · code · diff

Your SDK built successfully.
generate ⚠️ · lint ✅ · test ❗

llama-stack-client-python studio · code · diff

generate ⚠️ · build ⏳ · lint ⏳ · test ⏳

llama-stack-client-go studio · code · diff

Your SDK built successfully.
generate ⚠️ · lint ❗ · test ❗

go get github.com/stainless-sdks/llama-stack-client-go@1ad96fea88be605434a58cfceecf1c917d8ee78c

⏳ These are partial results; builds are still running.


This comment is auto-generated by GitHub Actions and is automatically kept up to date as you push.
Last updated: 2025-11-18 23:00:30 UTC

@iamemilio iamemilio changed the title feat(telemetry): Architect Llama Stack Telemetry Around Automatic Open Telemetry Instrumentation feat!(telemetry): Architect Llama Stack Telemetry Around Automatic Open Telemetry Instrumentation Nov 17, 2025
@iamemilio iamemilio changed the title feat!(telemetry): Architect Llama Stack Telemetry Around Automatic Open Telemetry Instrumentation feat!: Architect Llama Stack Telemetry Around Automatic Open Telemetry Instrumentation Nov 17, 2025
@grs (Contributor) commented Nov 18, 2025

Looks good to me.

@iamemilio (Contributor, Author) commented:
I am noticing that the responses test suite fails often on this PR, and I can't tell whether it's related to the changes I made. I tried not to change the logical outcome of any of the modified code, but I would appreciate it if someone more knowledgeable about the async logic could take a look and help me with this one. The root cause is a bit lost on me, and the AIs are clueless.
