Use server-timing for trace context response #560

Draft: 2 commits to be merged into base `main`

@dyladan (Member) commented Feb 27, 2024

This is a first draft of the server-timing response header. It is not meant to be a final version, but a place to start discussion. It is mostly a direct translation with the following exceptions:

Version is optional

In the original version, all fields were required in order to simplify parsing. Now that the trace context is one small component of a more complex header, that simplicity is lost, and requiring every field no longer makes parsing easier.

Flags are optional

The reasoning is the same as above. Additionally, some servers may not want to reveal information like the sampled flag to an untrusted client in order to avoid abuse.

Terminology Fixes

Changed ambiguous or nonstandard terms like "callee" and "caller" to established terms like "server" and "client."



### traceresponse Header Field Values
Metric name: `trace`

As someone who hasn't attended the meetings (where I suppose this was discussed), this is a bit obscure. What does this name mean? Here, it's trace, but should it always be that? What would be other good values?

Member

server-timing is a set of metrics. This is one of them and we are trying to reserve the name trace.

Alternative naming proposal is here: #561


This section describes the binding of the distributed trace context to the `traceresponse` HTTP header.
This section describes the binding of the distributed trace context to a metric in the Server Timing HTTP header.
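
For illustration, here is a minimal Python sketch of how a client might pull the `trace` metric out of a Server-Timing response header value. It assumes the general Server Timing syntax (comma-separated metrics, each followed by semicolon-separated `name=value` parameters) and deliberately does not handle commas or semicolons inside quoted strings. The function name `parse_server_timing` is hypothetical, and the `desc` parameter in the usage line is only a placeholder, since this excerpt does not show which parameter carries the trace context.

```python
from typing import Dict


def parse_server_timing(header_value: str) -> Dict[str, Dict[str, str]]:
    """Split a Server-Timing header value into {metric_name: {param: value}}.

    Simplified sketch: quoted strings containing ',' or ';' are not supported.
    """
    metrics: Dict[str, Dict[str, str]] = {}
    for entry in header_value.split(","):
        parts = [part.strip() for part in entry.split(";")]
        name = parts[0]
        if not name:
            continue  # skip empty entries such as a trailing comma
        params: Dict[str, str] = {}
        for param in parts[1:]:
            if not param:
                continue  # skip empty parameters such as a trailing semicolon
            key, _, value = param.partition("=")
            params[key.strip().lower()] = value.strip().strip('"')
        # Keep the first occurrence if a metric name is repeated.
        metrics.setdefault(name, params)
    return metrics


# Usage: look up the reserved metric name `trace` (placeholder payload).
metrics = parse_server_timing('db;dur=53.2, trace;desc="example", cache;desc="hit"')
print(metrics.get("trace"))  # {'desc': 'example'}
```

A client that finds no `trace` entry in the result would simply treat the response as carrying no trace context.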

I'd suggest HTTP response header (adding "response") to be super-clear.


#### child-id

This is the ID of the operation of the callee (in some tracing systems, this is known as the `span-id`, where a `span` is the execution of a client request) and is used to uniquely identify an operation within a trace. It is represented as an 8-byte array, for example, `00f067aa0ba902b7`. All bytes as zero (`0000000000000000`) is considered an invalid value.
This is the span ID of the server operation. It is represented as an 8-byte array, for example, `00f067aa0ba902b7`. An all-zero child ID (`0000000000000000`) is an invalid value. Tracing systems MUST ignore the trace context metric when the child id is invalid (for example, if it contains non-lowercase hex characters).
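
The validity rule for `child-id` in the new text can be sketched as a small Python check. It assumes the value arrives as a string of hex characters; `is_valid_child_id` is a hypothetical helper name, not something defined by the spec.

```python
import re

# 8 bytes encoded as exactly 16 lowercase hex characters.
_CHILD_ID_RE = re.compile(r"[0-9a-f]{16}")
_ALL_ZERO_CHILD_ID = "0" * 16


def is_valid_child_id(child_id: str) -> bool:
    """Return True if child_id is 16 lowercase hex characters and not all zeros."""
    if _CHILD_ID_RE.fullmatch(child_id) is None:
        return False
    return child_id != _ALL_ZERO_CHILD_ID
```

Under the proposed wording, a tracing system would ignore the whole trace context metric whenever this check fails.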

I don't normally like to bikeshed names too much, but to me this descriptive text suggests that the name of this field should be changed to span-id. It would read much more cleanly and sidestep any questions of semantics ("child of.... what?")

1. An untrusted callee may be able to abuse a tracing system by setting these flags maliciously.
2. A callee may have a bug which causes the tracing system to have a problem.
3. Different load between calling and called services might force one or more participants to discard part or all of a trace.
1. An untrusted server may be able to abuse a tracing system by setting these flags maliciously.

I like these examples of thinking through security implications. I might add that simply exposing the trace id (and span id) itself to clients may present a risk in some scenarios. Do we need text explicitly stating that using the response header (server-timing: trace) is optional for all participants and/or that compliant software SHOULD make emitting it configurable?

- If a component deferred or delayed the decision and only a subset of telemetry will be recorded, the `sampled` flag from the incoming `traceparent` header should be used if it is available. It should be set to `0` as the default option when the trace is initiated by this component.
- If a component receives a `0` for the `sampled` flag on an incoming request, it may still decide to record a trace. In this case it SHOULD return a `sampled` flag `1` on the response so that the caller can update its sampling decision if required.
- If the server deferred or delayed the decision and only a subset of telemetry will be recorded, the `sampled` flag from the incoming `traceparent` header should be used if it is available. It should be set to `0` as the default option when the trace is initiated by this server.
- If the server receives a `0` for the `sampled` flag on an incoming request, it may still decide to record a trace. In this case it SHOULD return a `sampled` flag `1` on the response so that the client can update its sampling decision if required.
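
The two rules above can be sketched as a single decision function. This is a rough Python illustration, not spec text: the function and parameter names are hypothetical, and the non-deferred branch (report what the server actually did) is an assumption that covers the "received `0` but recorded anyway" case.

```python
from typing import Optional


def response_sampled_flag(
    incoming_sampled: Optional[bool],
    decision_deferred: bool,
    server_records: bool,
) -> bool:
    """Pick the `sampled` flag to return to the client.

    incoming_sampled: flag from the incoming `traceparent`, or None if the
        request carried no trace context (the server initiated the trace).
    decision_deferred: True if the server deferred or delayed its decision.
    server_records: True if the server decided to record this trace.
    """
    if decision_deferred:
        # Echo the incoming flag when available; default to 0 otherwise.
        return incoming_sampled if incoming_sampled is not None else False
    # Assumption: with a definitive decision, report what the server did,
    # so a client that sent 0 can learn that the server recorded anyway.
    return server_records
```

For example, `response_sampled_flag(False, False, True)` returns `True`, which lets a client that sent `0` update its own sampling decision.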

I like that this embodies the "I'm going to tell you, the client, how I actually traced this, insofar as I know it" spirit.


A participant that continues a trace started downstream — that is, if the participant uses the `trace-id` value from a `traceresponse` header it has received — MUST set the `random-trace-id` flag in its own `traceresponse` header to the same value that was found in the `traceresponse` header from which the `trace-id` was taken.
A participant that continues a trace started downstream — that is, if the participant uses the `trace-id` value from a trace context server timing metric it has received — MUST set the `random-trace-id` flag in its own trace context server timing metric to the same value that was found in the trace context server timing metric from which the `trace-id` was taken.
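
A short Python sketch of the propagation rule for `random-trace-id`. The bit positions are an assumption borrowed from the `trace-flags` layout of the `traceparent` header (`sampled` as the least significant bit, `random-trace-id` as the next one); this excerpt does not define the flag encoding for the response metric. `continue_flags` is a hypothetical helper name.

```python
# Assumed bit layout, mirroring traceparent trace-flags.
SAMPLED = 0x01
RANDOM_TRACE_ID = 0x02


def continue_flags(received_flags: int, own_sampled: bool) -> int:
    """Build this participant's flags when it reuses a received trace-id.

    The random-trace-id bit is copied unchanged from the received flags;
    the sampled bit reflects this participant's own decision.
    """
    flags = received_flags & RANDOM_TRACE_ID
    if own_sampled:
        flags |= SAMPLED
    return flags
```

The point of the rule is that the randomness claim belongs to whoever generated the `trace-id`, so a participant that merely reuses the ID passes the bit through rather than setting it itself.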

@johnbley commented Apr 25, 2025

Perhaps add a Client Interpretation section describing what participating clients may do with the response? A client may "extend" the trace by producing data with the same trace-id (as some browser instrumentation might do for page load), link the two traces in a system-appropriate way (e.g., as OTel span links or as simple Zipkin tags like linked.traceId), ignore the response, or log or otherwise react when its intended trace propagation was not honored ("I sent you a traceparent but your response shows you started a new trace").
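
To make the suggestion concrete, here is a rough Python sketch of the comparison a client would need before choosing among those options. Everything in it is hypothetical (the `TraceRelation` enum, the `classify_response` name, and the three categories); it only compares the trace-id the client sent with the one found in the response.

```python
from enum import Enum
from typing import Optional


class TraceRelation(Enum):
    NO_RESPONSE_CONTEXT = "no-response-context"  # nothing to interpret
    SAME_TRACE = "same-trace"                    # client may extend the trace
    NEW_TRACE = "new-trace"                      # client may link, log, or ignore


def classify_response(
    sent_trace_id: Optional[str],
    response_trace_id: Optional[str],
) -> TraceRelation:
    """Relate the trace-id a client sent to the one the server returned."""
    if response_trace_id is None:
        return TraceRelation.NO_RESPONSE_CONTEXT
    if sent_trace_id is not None and sent_trace_id == response_trace_id:
        return TraceRelation.SAME_TRACE
    # Either the client sent no traceparent, or the server did not honor it
    # and started a trace of its own.
    return TraceRelation.NEW_TRACE
```

A `NEW_TRACE` result when the client did send a `traceparent` is the "propagation was not honored" case, where linking the two traces or logging the mismatch would apply.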

5 participants