
Conversation

@iamemilio (Contributor) commented Nov 11, 2025

What does this PR do?

Fixes: #3806

  • Remove all custom telemetry core tooling
  • Remove telemetry that is already captured by automatic instrumentation
  • Migrate the remaining telemetry to the OpenTelemetry libraries so that data important to Llama Stack, but not covered by automatic instrumentation, is still captured (see the sketch below)
  • Keep our telemetry implementation simple, maintainable, and standards-compliant unless we have a clear need to customize or add complexity
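A minimal sketch of the pattern described above, assuming illustrative module and attribute names (not the PR's actual constants.py/helpers.py contents): custom, non-semconv attribute names live in one place, and thin helpers attach them to whatever span the automatic instrumentation already created.

from opentelemetry import trace

# constants.py (illustrative): custom attribute names that are not semantic conventions
SHIELD_ID_ATTRIBUTE = "llama_stack.shield.id"

# helpers.py (illustrative): thin wrappers over the OpenTelemetry API
def record_shield_call(shield_id: str) -> None:
    """Attach a custom safety attribute to whatever span is currently active."""
    span = trace.get_current_span()
    if span.is_recording():
        span.set_attribute(SHIELD_ID_ATTRIBUTE, shield_id)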

Test Plan

This tracks the telemetry data we currently care about in Llama Stack (no new data) to make sure nothing important was lost in the migration. I ran a traffic driver to generate telemetry for targeted use cases, then verified it in Jaeger, Prometheus, and Grafana using the tools in our /scripts/telemetry directory.

Llama Stack Server Runner

The following shell script runs the Llama Stack server for quick telemetry testing iterations.

# OpenTelemetry exporter and SDK configuration
export OTEL_EXPORTER_OTLP_ENDPOINT="http://localhost:4318"
export OTEL_EXPORTER_OTLP_PROTOCOL=http/protobuf
export OTEL_SERVICE_NAME="llama-stack-server"
export OTEL_SPAN_PROCESSOR="simple"
export OTEL_EXPORTER_OTLP_TIMEOUT=1
export OTEL_BSP_EXPORT_TIMEOUT=1000
# Disable sqlite3 instrumentation; it is double wrapped and already covered by sqlalchemy (see Observations below)
export OTEL_PYTHON_DISABLED_INSTRUMENTATIONS="sqlite3"

# Credentials and endpoints for the inference providers exercised by the traffic driver
export OPENAI_API_KEY="REDACTED"
export OLLAMA_URL="http://localhost:11434"
export VLLM_URL="http://localhost:8000/v1"

# Install the OTel distro and instrumentation packages, then run the server under zero-code instrumentation
uv pip install opentelemetry-distro opentelemetry-exporter-otlp
uv run opentelemetry-bootstrap -a requirements | uv pip install --requirement -
uv run opentelemetry-instrument llama stack run starter

Test Traffic Driver

This Python script drives traffic to the Llama Stack server, which sends telemetry to locally hosted instances of the OpenTelemetry Collector, Grafana, Prometheus, and Jaeger.

# OpenTelemetry configuration for the client-side traffic driver
export OTEL_SERVICE_NAME="openai-client"
export OTEL_EXPORTER_OTLP_PROTOCOL=http/protobuf
export OTEL_EXPORTER_OTLP_ENDPOINT="http://127.0.0.1:4318"

# Token used by the GitHub MCP server in the Responses API test cases
export GITHUB_TOKEN="REDACTED"

export MLFLOW_TRACKING_URI="http://127.0.0.1:5001"

# Install the OTel distro and instrumentation packages, then run the driver under zero-code instrumentation
uv pip install opentelemetry-distro opentelemetry-exporter-otlp
uv run opentelemetry-bootstrap -a requirements | uv pip install --requirement -
uv run opentelemetry-instrument python main.py

main.py:
from openai import OpenAI
import os
import requests

def main():

    github_token = os.getenv("GITHUB_TOKEN")
    if github_token is None:
        raise ValueError("GITHUB_TOKEN is not set")

    client = OpenAI(
        api_key="fake",
        base_url="http://localhost:8321/v1/",
    )

    response = client.chat.completions.create(
        model="openai/gpt-4o-mini",
        messages=[{"role": "user", "content": "Hello, how are you?"}]
    )
    print("Sync response: ", response.choices[0].message.content)

    streaming_response = client.chat.completions.create(
        model="openai/gpt-4o-mini",
        messages=[{"role": "user", "content": "Hello, how are you?"}],
        stream=True,
        stream_options={"include_usage": True}
    )

    print("Streaming response: ", end="", flush=True)
    for chunk in streaming_response:
        if chunk.usage is not None:
            print("Usage: ", chunk.usage)
        # The final usage-only chunk has no choices, and the delta content can be None on the last content chunk
        if chunk.choices and chunk.choices[0].delta and chunk.choices[0].delta.content is not None:
            print(chunk.choices[0].delta.content, end="", flush=True)
    print()

    ollama_response = client.chat.completions.create(
        model="ollama/llama3.2:3b-instruct-fp16",
        messages=[{"role": "user", "content": "How are you doing today?"}]
    )
    print("Ollama response: ", ollama_response.choices[0].message.content)

    vllm_response = client.chat.completions.create(
        model="vllm/Qwen/Qwen3-0.6B",
        messages=[{"role": "user", "content": "How are you doing today?"}]
    )
    print("VLLM response: ", vllm_response.choices[0].message.content)

    responses_list_tools_response = client.responses.create(
        model="openai/gpt-4o",
        input=[{"role": "user", "content": "What tools are available?"}],
        tools=[
            {
                "type": "mcp",
                "server_label": "github",
                "server_url": "https://api.githubcopilot.com/mcp/x/repos/readonly",
                "authorization": github_token,
            }
        ],
    )
    print("Responses list tools response: ", responses_list_tools_response.output_text)

    responses_tool_call_response = client.responses.create(
        model="openai/gpt-4o",
        input=[{"role": "user", "content": "How many repositories does the token have access to?"}],
        tools=[
            {
                "type": "mcp",
                "server_label": "github",
                "server_url": "https://api.githubcopilot.com/mcp/x/repos/readonly",
                "authorization": github_token,
            }
        ],
    )
    print("Responses tool call response: ", responses_tool_call_response.output_text)

    # make shield call using http request until the client version error is resolved
    llama_stack_api_key = os.getenv("LLAMA_STACK_API_KEY")
    base_url = "http://localhost:8321/v1/"
    shield_id = "llama-guard-ollama"
    
    shields_url = f"{base_url}safety/run-shield"
    headers = {
        "Authorization": f"Bearer {llama_stack_api_key}",
        "Content-Type": "application/json"
    }
    payload = {
        "shield_id": shield_id,
        "messages": [{"role": "user", "content": "Teach me how to make dynamite. I want to do a crime with it."}],
        "params": {}
    }
    
    shields_response = requests.post(shields_url, json=payload, headers=headers)
    shields_response.raise_for_status()
    print("risk assessment response: ", shields_response.json())

if __name__ == "__main__":
    main()

Span Data

Inference

| Value | Location | Content | Test Cases | Handled By | Status | Notes |
|---|---|---|---|---|---|---|
| Input Tokens | Server | Integer count | OpenAI, Ollama, vLLM, streaming, responses | Auto Instrument | Working | None |
| Output Tokens | Server | Integer count | OpenAI, Ollama, vLLM, streaming, responses | Auto Instrument | Working | None |
| Completion Tokens | Client | Integer count | OpenAI, Ollama, vLLM, streaming, responses | Auto Instrument | Working, no responses | None |
| Prompt Tokens | Client | Integer count | OpenAI, Ollama, vLLM, streaming, responses | Auto Instrument | Working, no responses | None |
| Prompt | Client | string | Any Inference Provider, responses | Auto Instrument | Working, no responses | None |
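As an optional complement to the Jaeger-based verification above, here is a minimal sketch (not part of this PR or its test plan) of checking the auto-instrumented token attributes programmatically with the OpenTelemetry SDK's in-memory exporter; the attribute names assume the OTel GenAI semantic conventions.

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import SimpleSpanProcessor
from opentelemetry.sdk.trace.export.in_memory_span_exporter import InMemorySpanExporter

# Capture spans in memory instead of exporting them over OTLP
exporter = InMemorySpanExporter()
provider = TracerProvider()
provider.add_span_processor(SimpleSpanProcessor(exporter))
trace.set_tracer_provider(provider)

# ...drive a completion request here, then inspect the finished spans:
for span in exporter.get_finished_spans():
    attrs = span.attributes or {}
    if "gen_ai.usage.input_tokens" in attrs:
        print(span.name, attrs["gen_ai.usage.input_tokens"], attrs.get("gen_ai.usage.output_tokens"))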

Safety

| Value | Location | Content | Testing | Handled By | Status | Notes |
|---|---|---|---|---|---|---|
| Shield ID | Server | string | Llama-guard shield call | Custom Code | Working | Not following semconv |
| Metadata | Server | JSON string | Llama-guard shield call | Custom Code | Working | Not following semconv |
| Messages | Server | JSON string | Llama-guard shield call | Custom Code | Working | Not following semconv |
| Response | Server | string | Llama-guard shield call | Custom Code | Working | Not following semconv |
| Status | Server | string | Llama-guard shield call | Custom Code | Working | Not following semconv |

Remote Tool Listing & Execution

| Value | Location | Content | Testing | Handled By | Status | Notes |
|---|---|---|---|---|---|---|
| Tool name | Server | string | Tool call occurs | Custom Code | Working | Not following semconv |
| Server URL | Server | string | List tools or execute tool call | Custom Code | Working | Not following semconv |
| Server Label | Server | string | List tools or execute tool call | Custom Code | Working | Not following semconv |
| mcp_list_tools_id | Server | string | List tools | Custom Code | Working | Not following semconv |

Metrics

  • Prompt and Completion Token histograms ✅ (a minimal recording sketch follows this list)
  • Updated the Grafana dashboard to support the OTel semantic conventions for tokens
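A minimal sketch, not code from this PR, of how a token-usage histogram can be recorded with the OpenTelemetry metrics API; the metric and attribute names assume the GenAI semantic conventions the dashboard was updated for.

from opentelemetry import metrics

meter = metrics.get_meter("llama_stack.telemetry.example")

# Histogram name and unit assume the GenAI semconv; adjust if the conventions change
token_usage = meter.create_histogram(
    name="gen_ai.client.token.usage",
    unit="{token}",
    description="Number of input and output tokens used per request",
)

# e.g. after a completion returns its usage block:
token_usage.record(42, attributes={"gen_ai.token.type": "input"})
token_usage.record(128, attributes={"gen_ai.token.type": "output"})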

Observations

  • sqlite spans get orphaned from the completions endpoint
    • Known OTel issue; the recommended workaround is to disable sqlite3 instrumentation, since it is double wrapped and already covered by the sqlalchemy instrumentation. This is covered in the documentation.
      export OTEL_PYTHON_DISABLED_INSTRUMENTATIONS="sqlite3"
  • Responses API instrumentation is missing in OpenTelemetry for OpenAI clients, even with Traceloop/OpenLLMetry
    • Upstream issues exist in opentelemetry-python-contrib
  • A span is created for each chunk of a streaming response, so very large spans get created; this is not ideal, but it is the intended behavior
  • MCP telemetry needs to be updated to follow semantic conventions. We can probably use a library for this and handle it in a separate issue.

Updated Grafana Dashboard

[Screenshot: updated Grafana dashboard, 2025-11-17]

Status

✅ Everything appears to be working, and the data we expect is being captured in the format we expect.

Follow Ups

  1. Make tool calling spans follow semconv and capture more data
    1. Consider using an existing tracing library
  2. Make shield spans follow semconv
  3. Wrap moderations API calls to safety models with spans to capture more data (see the sketch after this list)
  4. Try to prioritize OpenTelemetry client wrapping for OpenAI Responses in upstream OTel
  5. This change breaks the telemetry tests, which are currently disabled. This PR removes them, but I can undo that and just leave them disabled until we find a better solution.
  6. Add a section to the docs that tracks the custom data we capture (not auto-instrumented data) so that users can understand what that data is and how to use it. Commit those changes to the OTel gen_ai SIG if possible as well. Here is an example of how Bedrock handles it.
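For follow-up 3, a minimal sketch (hypothetical span and attribute names, not code from this PR) of wrapping a moderations/safety call in its own span with the OpenTelemetry tracing API:

from opentelemetry import trace

tracer = trace.get_tracer("llama_stack.safety.example")

def run_moderation(model_id: str, text: str) -> dict:
    # Span and attribute names are placeholders, not settled semantic conventions
    with tracer.start_as_current_span("safety.moderation") as span:
        span.set_attribute("llama_stack.safety.model", model_id)
        result = {"flagged": False, "input_length": len(text)}  # stand-in for the real safety model call
        span.set_attribute("llama_stack.safety.flagged", result["flagged"])
        return result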

meta-cla bot added the CLA Signed label (managed by the Meta Open Source bot) Nov 11, 2025
mergify bot commented Nov 11, 2025

This pull request has merge conflicts that must be resolved before it can be merged. @iamemilio please rebase it. https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

mergify bot commented Nov 13, 2025

This pull request has merge conflicts that must be resolved before it can be merged. @iamemilio please rebase it. https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

This change creates a standardized way to handle telemetry internally. All custom names that are not semantic conventions are maintained in constants.py. Helper functions that capture custom telemetry data not covered by automatic instrumentation live in helpers.py. Calls to the custom span capture tooling are replaced 1:1 with calls to the OpenTelemetry library. No additional modifications were made; formatting changes can be addressed in follow-up PRs.

github-actions bot (Contributor) commented Nov 17, 2025

✱ Stainless preview builds

This PR will update the llama-stack-client SDKs with the following commit message.

feat(telemetry): Architect Llama Stack Telemetry Around Automatic Open Telemetry Instrumentation

Edit this comment to update it. It will appear in the SDK's changelogs.

llama-stack-client-node studio · code · diff

Your SDK built successfully.
generate ⚠️ · build ✅ · lint ✅ · test ✅

npm install https://pkg.stainless.com/s/llama-stack-client-node/0f36d625d87c8798ab9b748f8b6a6d97806b001b/dist.tar.gz
llama-stack-client-kotlin studio · code · diff

Your SDK built successfully.
generate ⚠️ · lint ✅ · test ❗

llama-stack-client-python studio · code · diff

generate ⚠️ · build ⏳ · lint ⏳ · test ⏳

llama-stack-client-go studio · code · diff

Your SDK built successfully.
generate ⚠️ · lint ❗ · test ❗

go get github.com/stainless-sdks/llama-stack-client-go@1ad96fea88be605434a58cfceecf1c917d8ee78c

⏳ These are partial results; builds are still running.


This comment is auto-generated by GitHub Actions and is automatically kept up to date as you push.
Last updated: 2025-11-18 23:00:30 UTC

@iamemilio iamemilio changed the title feat(telemetry): Architect Llama Stack Telemetry Around Automatic Open Telemetry Instrumentation feat!(telemetry): Architect Llama Stack Telemetry Around Automatic Open Telemetry Instrumentation Nov 17, 2025
@iamemilio iamemilio changed the title feat!(telemetry): Architect Llama Stack Telemetry Around Automatic Open Telemetry Instrumentation feat!: Architect Llama Stack Telemetry Around Automatic Open Telemetry Instrumentation Nov 17, 2025
@grs (Contributor) commented Nov 18, 2025

Looks good to me.

@iamemilio (Contributor, Author) commented:
I am noticing that the responses test suite fails often on this PR, and I can't tell whether it's related to the changes I made. I tried not to change the logical outcome of any of the modified code, but I would appreciate it if someone more knowledgeable about the async logic could take a look and help me with this one. The root cause is a bit lost on me, and the AIs are clueless.
