Use server-timing for trace context response #560
### traceresponse Header Field Values | ||
Metric name: `trace` |
As someone who hasn't attended the meetings (where I suppose this was discussed), this is a bit obscure. What does this name mean? Here, it's `trace`, but should it always be that? What would be other good values?
`server-timing` is a set of metrics. This is one of them, and we are trying to reserve the name `trace`.
Alternative naming proposal is here: #561
This section describes the binding of the distributed trace context to the `traceresponse` HTTP header. | ||
This section describes the binding of the distributed trace context to a metric in the Server Timing HTTP header. |
I'd suggest "HTTP response header" (adding "response") to be super-clear.
#### child-id | ||
This is the ID of the operation of the callee (in some tracing systems, this is known as the `span-id`, where a `span` is the execution of a client request) and is used to uniquely identify an operation within a trace. It is represented as an 8-byte array, for example, `00f067aa0ba902b7`. All bytes as zero (`0000000000000000`) is considered an invalid value. | ||
This is the span ID of the server operation. It is represented as an 8-byte array, for example, `00f067aa0ba902b7`. An all-zero child ID (`0000000000000000`) is an invalid value. Tracing systems MUST ignore the trace context metric when the child id is invalid (for example, if it contains non-lowercase hex characters). |
I don't normally like to bikeshed names too much, but to me this descriptive text suggests that the name of this field should be changed to `span-id`. It would read much more cleanly and sidestep any questions of semantics ("child of.... what?")
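For illustration, the validity rule in the proposed text (16 lowercase hex characters, not all zeros) could be checked roughly as follows. This is a minimal sketch; the function name `is_valid_child_id` is hypothetical and not part of the PR.

```python
import re

# An 8-byte ID is written as exactly 16 lowercase hex characters.
_CHILD_ID_RE = re.compile(r"[0-9a-f]{16}")


def is_valid_child_id(child_id: str) -> bool:
    """Return True if child_id is 16 lowercase hex chars and not all zeros.

    Hypothetical helper: the name and signature are assumptions made
    for this example only.
    """
    if _CHILD_ID_RE.fullmatch(child_id) is None:
        # Wrong length, uppercase hex, or non-hex characters.
        return False
    # The all-zero ID is explicitly called out as invalid.
    return child_id != "0" * 16
```

A tracing system following the proposed MUST would drop the whole trace context metric whenever this check fails.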
1. An untrusted callee may be able to abuse a tracing system by setting these flags maliciously. | ||
2. A callee may have a bug which causes the tracing system to have a problem. | ||
3. Different load between calling and called services might force one or more participants to discard part or all of a trace. | ||
1. An untrusted server may be able to abuse a tracing system by setting these flags maliciously. |
I like these examples of thinking through security implications. I might add that simply exposing the trace id (and span id) itself to clients may present a risk in some scenarios. Do we need text explicitly stating that using the response header (`server-timing: trace`) is optional for all participants and/or that compliant software SHOULD make emitting it configurable?
- If a component deferred or delayed the decision and only a subset of telemetry will be recorded, the `sampled` flag from the incoming `traceparent` header should be used if it is available. It should be set to `0` as the default option when the trace is initiated by this component. | ||
- If a component receives a `0` for the `sampled` flag on an incoming request, it may still decide to record a trace. In this case it SHOULD return a `sampled` flag `1` on the response so that the caller can update its sampling decision if required. | ||
- If the server deferred or delayed the decision and only a subset of telemetry will be recorded, the `sampled` flag from the incoming `traceparent` header should be used if it is available. It should be set to `0` as the default option when the trace is initiated by this server. | ||
- If the server receives a `0` for the `sampled` flag on an incoming request, it may still decide to record a trace. In this case it SHOULD return a `sampled` flag `1` on the response so that the client can update its sampling decision if required. |
I like that this embodies the "I'm going to tell you, the client, how I actually traced this, insofar as I know it" spirit.
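The two bullets above can be sketched as a small decision function. This is only an illustration: the function name and its parameters are hypothetical, and the branch for a definitive decision (report what the server actually did) is an assumption extrapolated from the second bullet rather than text quoted in this hunk.

```python
from typing import Optional


def response_sampled_flag(recorded: Optional[bool],
                          incoming_sampled: Optional[bool]) -> int:
    """Pick the `sampled` flag a server reports on the response.

    recorded: True/False for a definitive recording decision,
              None if the decision was deferred or delayed.
    incoming_sampled: `sampled` flag from the incoming `traceparent`,
              None if the server initiated the trace itself.
    Both parameters are hypothetical names for this sketch.
    """
    if recorded is None:
        # Deferred decision: echo the incoming flag if there is one,
        # otherwise default to 0.
        return int(incoming_sampled) if incoming_sampled is not None else 0
    # Definitive decision (assumption): report what actually happened,
    # e.g. 1 when the server recorded despite an incoming 0.
    return 1 if recorded else 0
```

With this shape, a server that receives `sampled=0` but records anyway returns `1`, which lets the client update its own sampling decision.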
A participant that continues a trace started downstream — that is, if the participant uses the `trace-id` value from a `traceresponse` header it has received — MUST set the `random-trace-id` flag in its own `traceresponse` header to the same value that was found in the `traceresponse` header from which the `trace-id` was taken. | ||
A participant that continues a trace started downstream — that is, if the participant uses the `trace-id` value from a trace context server timing metric it has received — MUST set the `random-trace-id` flag in its own trace context server timing metric to the same value that was found in the trace context server timing metric from which the `trace-id` was taken. | ||
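As a rough sketch of that rule, a participant could copy only the `random-trace-id` bit from the downstream flags while keeping its own remaining bits. The bit positions below (`0x01` for `sampled`, `0x02` for `random-trace-id`) are assumed from the Trace Context Level 2 trace-flags layout, and the function name is hypothetical.

```python
SAMPLED = 0x01          # bit 0 of trace-flags
RANDOM_TRACE_ID = 0x02  # bit 1 of trace-flags (assumed bit position)


def flags_for_continued_trace(own_flags: int, downstream_flags: int) -> int:
    """Return own_flags with the random-trace-id bit taken from downstream.

    Hypothetical helper: used when the participant adopts the trace-id
    it received from a downstream response.
    """
    # Clear our own random-trace-id bit, keep everything else.
    cleared = own_flags & ~RANDOM_TRACE_ID
    # Copy the bit exactly as it was reported downstream.
    return cleared | (downstream_flags & RANDOM_TRACE_ID)
```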
Perhaps a "Client Interpretation" section on how participating clients may choose to "extend" the trace by producing data with the same `trace-id` (as some browser instrumentation might do for page load), may choose to link the two traces in a system-appropriate way (e.g., as OTel span links or as simple Zipkin tags like `linked.traceId`), may choose to ignore it, or may choose to log or otherwise respond when its intended trace propagation was not honored ("I sent you a `traceparent` but your response shows you started a new trace")?
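To make the last case concrete, a client could compare the `trace-id` it sent with the one reported in the response. The sketch below assumes the response `trace-id` has already been extracted from the response metric; the function name and the returned labels are hypothetical, and what the client does next (extend, link, ignore, log) is left to the caller.

```python
def classify_response(sent_traceparent: str, response_trace_id: str) -> str:
    """Compare the sent trace-id with the one the server reported.

    Hypothetical helper. `sent_traceparent` uses the W3C `traceparent`
    layout: version-traceid-parentid-flags.
    """
    parts = sent_traceparent.split("-")
    if len(parts) < 4:
        # Not a well-formed traceparent, so nothing to compare against.
        return "unknown"
    sent_trace_id = parts[1]
    if response_trace_id == sent_trace_id:
        # The server honored our propagation and continued the trace.
        return "continued"
    # The server reported a different trace: it started a new one.
    return "restarted"
```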
This is a first draft of the server-timing response header. It is not meant to be a final version, but a place to start discussion. It is mostly a direct translation with the following exceptions:
Version is optional
In the original version, all fields were required in order to simplify parsing. Now that we are a small component of a more complex header, that simplicity is lost, and the parsing simplification is no longer beneficial.
Flags are optional
The reasoning is the same as above. Additionally, some servers may not want to reveal information like the sampled flag to an untrusted client in order to avoid abuse.
Terminology Fixes
Changed ambiguous or nonstandard terms like "callee" and "caller" to established terms like "server" and "client."
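To illustrate why the trace context is now "a small component of a more complex header", here is a minimal sketch of locating one named metric inside a Server-Timing value and collecting its parameters, so that missing parameters simply do not appear. It does not assume how this PR encodes the version, IDs, or flags; the function name is hypothetical, and the naive splitting does not handle commas or semicolons inside quoted strings.

```python
from typing import Dict, Optional


def find_metric(header_value: str, name: str = "trace") -> Optional[Dict[str, str]]:
    """Return the parameters of the first metric called `name`, or None.

    Hypothetical helper. Simplification: splits on "," and ";" without
    honoring quoted strings, which a real parser would need to do.
    """
    for metric in header_value.split(","):
        pieces = [p.strip() for p in metric.split(";")]
        if pieces[0] != name:
            continue
        params: Dict[str, str] = {}
        for piece in pieces[1:]:
            key, sep, value = piece.partition("=")
            if sep:
                # Normalize key case and drop surrounding quotes.
                params[key.strip().lower()] = value.strip().strip('"')
        return params
    return None
```

For example, `find_metric('db;dur=53, trace;desc="abc"')` returns only the parameters attached to `trace`, and a caller can treat any absent parameter as "not provided".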