feat(telemetry): add OpenTelemetry instrumentation with Aspire Dashboard support #6629
base: dev
Conversation
…andard attribute names
…r gRPC trace export
Change experimental.openTelemetry config from boolean to union type supporting both boolean and object with enabled/endpoint fields. This allows users to configure custom OTLP endpoints for Aspire Dashboard integration while maintaining backward compatibility with boolean config.
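For illustration, the union shape described above might look roughly like this (type and field names here are a sketch, not the actual config schema):

// Sketch only: a plain boolean toggle, or an object with an explicit OTLP endpoint.
type OpenTelemetryConfig =
  | boolean
  | {
      enabled?: boolean
      // e.g. a custom OTLP endpoint such as the Aspire Dashboard's gRPC port
      endpoint?: string
    }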
…tion
Add telemetry module with:
- Config interface and resolveConfig() for endpoint resolution
- init() function with NodeSDK, LoggerProvider, trace/log exporters
- shutdown() for graceful cleanup
- withSpan() helper for span creation with error handling
- isEnabled(), getTracer(), getLogger() utility functions
- SeverityMap for log level mapping
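As a rough idea of the withSpan() helper mentioned above, a minimal sketch against @opentelemetry/api (the tracer name and error-handling details are assumptions, not the actual module):

import { trace, SpanStatusCode } from "@opentelemetry/api"

// Run a callback inside an active span, record any exception, and always end the span.
async function withSpan<T>(name: string, fn: () => Promise<T>): Promise<T> {
  const tracer = trace.getTracer("opencode")
  return tracer.startActiveSpan(name, async (span) => {
    try {
      return await fn()
    } catch (err) {
      span.recordException(err as Error)
      span.setStatus({ code: SpanStatusCode.ERROR })
      throw err
    } finally {
      span.end()
    }
  })
}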
Integrate OpenTelemetry log emission into the Log module. When telemetry is enabled, all log messages (debug/info/warn/error) are emitted to the OTLP endpoint alongside file-based logging.
- Lazy-load telemetry module to avoid circular dependency
- Guard against recursive calls during module initialization
- Emit logs with proper severity levels using Telemetry.SeverityMap
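A hedged sketch of the severity mapping and log emission using the @opentelemetry/api-logs Logs API (only SeverityMap and the level names come from the commit above; the rest is illustrative):

import { logs, SeverityNumber } from "@opentelemetry/api-logs"

// Map the Log module's levels onto OTel severity numbers.
const SeverityMap = {
  debug: SeverityNumber.DEBUG,
  info: SeverityNumber.INFO,
  warn: SeverityNumber.WARN,
  error: SeverityNumber.ERROR,
} as const

// Emit one record to whatever LoggerProvider was registered globally during init().
function emit(level: keyof typeof SeverityMap, message: string, attributes?: Record<string, string>) {
  logs.getLogger("opencode").emit({
    severityNumber: SeverityMap[level],
    severityText: level,
    body: message,
    attributes,
  })
}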
- Initialize telemetry in yargs middleware after Log.init()
- Check OTEL_EXPORTER_OTLP_ENDPOINT env var or config.experimental.openTelemetry
- Register SIGTERM and SIGINT handlers for graceful shutdown
- Call Telemetry.shutdown() in finally block before process.exit()
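The generic shape of that lifecycle, as a sketch with a bare NodeSDK standing in for the real Telemetry module (the actual middleware wiring differs):

import { NodeSDK } from "@opentelemetry/sdk-node"

const sdk = new NodeSDK()
sdk.start()

// Flush exporters if the process is signalled, mirroring the SIGTERM/SIGINT handlers above.
for (const signal of ["SIGTERM", "SIGINT"] as const) {
  process.on(signal, () => {
    sdk.shutdown().finally(() => process.exit(0))
  })
}

async function run(work: () => Promise<void>) {
  try {
    await work()
  } finally {
    // Graceful shutdown before the normal exit path.
    await sdk.shutdown()
  }
}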
# Conflicts:
#	bun.lock
#	packages/opencode/package.json
Add the standard OpenTelemetry endpoint environment variable to the Flag namespace for use in config loading to consolidate telemetry enablement checks.
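Presumably something as small as exposing the env var from one place (illustrative; the real Flag namespace may look different):

export namespace Flag {
  // Standard OTLP endpoint env var, read once so config loading and telemetry agree.
  export const OTEL_EXPORTER_OTLP_ENDPOINT = process.env["OTEL_EXPORTER_OTLP_ENDPOINT"]
}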
… var checks
Since OTEL_EXPORTER_OTLP_ENDPOINT is now applied at config load time (Phase 2), the CLI entry points no longer need to check the env var directly. This removes the conditional that skipped config loading when the env var was set.
…isEnabled()
- Replace inline env var and config check with Telemetry.isEnabled() helper
- Remove unused Config import since telemetry config is now consolidated
- This ensures consistent telemetry enablement logic via single source of truth
The OTEL_EXPORTER_OTLP_ENDPOINT env var is now applied to config at load time (in config/config.ts), so resolveConfig no longer needs to check it directly. This simplifies the function to only handle the config object.
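For illustration only, the simplified resolution might reduce to something like this (the default endpoint comes from the PR description below; the actual function may differ):

type OpenTelemetrySetting = boolean | { enabled?: boolean; endpoint?: string }

// Only the config value is inspected; OTEL_EXPORTER_OTLP_ENDPOINT is already merged at load time.
function resolveConfig(value?: OpenTelemetrySetting) {
  if (!value) return { enabled: false }
  if (value === true) return { enabled: true, endpoint: "http://localhost:4317" }
  return { enabled: value.enabled ?? true, endpoint: value.endpoint ?? "http://localhost:4317" }
}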
Add unit tests for Telemetry.resolveConfig and config loading behavior:
- Test resolveConfig handles boolean/object/undefined inputs correctly
- Test config loading from file with boolean and object openTelemetry config
- Test openTelemetry defaults to undefined when not configured
- Test OTEL_EXPORTER_OTLP_ENDPOINT env var override behavior
Update plan.md to mark testing task as completed.
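Such a test could look roughly like this with bun:test, assuming the resolveConfig sketch above (the expected values are illustrative, not the committed assertions):

import { describe, expect, test } from "bun:test"

describe("Telemetry.resolveConfig", () => {
  test("true enables telemetry with the default endpoint", () => {
    expect(resolveConfig(true)).toEqual({ enabled: true, endpoint: "http://localhost:4317" })
  })

  test("undefined leaves telemetry disabled", () => {
    expect(resolveConfig(undefined)).toEqual({ enabled: false })
  })

  test("object form can override the endpoint", () => {
    expect(resolveConfig({ endpoint: "http://localhost:4318" })).toEqual({
      enabled: true,
      endpoint: "http://localhost:4318",
    })
  })
})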
  })
  logs.setGlobalLoggerProvider(loggerProvider)

  sdk = new NodeSDK({
My recommendation (and what's somewhat conventional) is that the app implementation is responsible for setting up this exporter machinery, and then, if the app is using a library that has existing otel instrumentation, you enable that. For example, ai-sdk provides otel instrumentation. If you use the openai sdk or claude sdk directly, you'd leverage that instrumentation.
The main issue I could imagine (assuming you don't want to be churning on setting all the attributes to work well across vendors) is that the attribute naming for llm-related spans is still a bit of a mess, as everyone is trying to figure out how to consistently name all these attributes.
If you're just collecting traces for performance's sake and don't care about llm/eval, then all these traces will show up just fine in any trace viewer with the span operation names you've defined. The use case I mostly care about is shipping the signals to a tool like langfuse. Those tools expect specific names to show things like sessionID, llm generation, tool call, etc.
The different vendors are working on making the core attributes more uniform, but it's not there yet, so personally I'd try to punt most of that churn onto something like the ai sdk.
Here's a quick snapshot of the landscape of attribute definitions:
- Openllmetry has a reasonable collection here https://github.com/traceloop/openllmetry/blob/main/packages/opentelemetry-semantic-conventions-ai/opentelemetry/semconv_ai/__init__.py
- Here's what ai-sdk collects https://ai-sdk.dev/docs/ai-sdk-core/telemetry#collected-data
- Langfuse span operation types https://langfuse.com/docs/observability/features/observation-types and trace attributes https://langfuse.com/integrations/native/opentelemetry#trace-level-attributes
- The emerging OTEL semantic conventions (still very incomplete)
- https://opentelemetry.io/docs/specs/semconv/gen-ai/
- https://opentelemetry.io/docs/specs/semconv/gen-ai/gen-ai-agent-spans/ (e.g. operation types still don't have standard naming here yet)
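To make the attribute question concrete, here is a tiny example using a few of the emerging gen_ai.* attributes from the OTel semantic conventions (the model name and token counts are made up for illustration):

import { trace } from "@opentelemetry/api"

// Span name follows the draft "{operation} {model}" convention.
const span = trace.getTracer("example").startSpan("chat claude-sonnet-4")
span.setAttributes({
  "gen_ai.operation.name": "chat",
  "gen_ai.request.model": "claude-sonnet-4",
  "gen_ai.usage.input_tokens": 1200,
  "gen_ai.usage.output_tokens": 85,
})
span.end()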
agreed on the running application chooses the exporter.
i'll check more but running opencode cli should configure the exporter.
also tbh this otel work is just to support development/profiling/debugging for now.
i'll check those emerging standards for attributes to see if I can consolidate.
for now any span/attribute is good and we can easily rename later.
I'll check your other comments later but I'm sure you saw I pushed a big refactor to clean up the implementation to be more like a decorator/using pattern to remove heaps of noise.
yep, saw the cleanup / refactor. This comment summarizes whatever still applies after your refactor.
agreed on the running application chooses the exporter.
i'll check more but running opencode cli should configure the exporter.
Yep, I think we're saying the same thing here. The current experimental_telemetry just enables the ai sdk instrumentation but doesn't start an exporter. So mainly calling out that the biggest missing piece is that something needs to start the exporter.
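For reference, enabling the ai-sdk instrumentation per call looks roughly like this (model and functionId are placeholder values; spans only flow somewhere once an exporter has been started, e.g. via NodeSDK as in the example further down):

import { generateText } from "ai"
import { openai } from "@ai-sdk/openai"

// Per-call telemetry flag from the ai-sdk; see the telemetry docs linked above.
const result = await generateText({
  model: openai("gpt-4o-mini"),
  prompt: "Summarize this diff",
  experimental_telemetry: { isEnabled: true, functionId: "opencode.chat" },
})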
also tbh this otel work is just to support development/profiling/debugging for now.
That makes sense. Mostly calling out that if you retain the ai-sdk enabling, you don't need to re-instrument llm calls, tool calls etc., since those are already done for you and will have the most up-to-date evolving attributes, so the same traces become useful for building agentic engineering evals or workflow review (the part I'm actually more excited about).
That obviously doesn't prevent wrapping those ai SDK calls in your own spans to get even finer-grained instrumentation.
That said, if you're mostly interested in instrumentation for performance profiling, I'd definitely consider setting up the node.js otel auto-instrumentation and metrics. E.g. you'll probably find at least having metrics around GC stats like runs and pauses useful.
crude example:

import { NodeSDK } from "@opentelemetry/sdk-node";
import { ConsoleSpanExporter } from "@opentelemetry/sdk-trace-node";
import { PeriodicExportingMetricReader, ConsoleMetricExporter } from "@opentelemetry/sdk-metrics";
import { getNodeAutoInstrumentations } from "@opentelemetry/auto-instrumentations-node";

const sdk = new NodeSDK({
  traceExporter: new ConsoleSpanExporter(),
  metricReader: new PeriodicExportingMetricReader({
    exporter: new ConsoleMetricExporter(),
  }),
  instrumentations: [
    getNodeAutoInstrumentations()
  ],
});
sdk.start();
I've got some more work to make it perfect - but agree on your points.
i'll double check the ai sdk vs my custom spans.
the aspire dashboard looked pretty nice but i'll check if there are exact double-ups
yup I'll add the typical instrumentation for node, even seeing if our underlying opentui/zig stuff can have instrumentation.
I'll see what makes the first cut vs what gets added to the PR later. the team will check this out over the next few weeks
FWIW, if you wanted to spot check stuff against another otel collector and trace view, you can run a langfuse stack fully locally
git clone https://github.com/langfuse/langfuse.git
cd langfuse
docker compose up
everything (web app and otel collector endpoint) is available on localhost:3000.
OpenTelemetry support for OpenCode is pending upstream approval.

Link to tracking PR: anomalyco/opencode#6629

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Adds experimental OpenTelemetry support for debugging and observability.
What
- bun run dev:otel
- opencode-cli vs opencode-server
- key=value
- context + exception stack traces

Enabling OpenTelemetry

~/.config/opencode/opencode.json:

{ "experimental": { "openTelemetry": true } }

cd packages/opencode
bun run dev:otel

The OTEL_EXPORTER_OTLP_ENDPOINT env var controls the endpoint (defaults to http://localhost:4317).

Images