Request log tracing #206

j-baker · 2019-07-30T10:25:48Z

At the moment, we send trace ids and span ids with our requests. This is
excellent, and provides some coarse grained information. That said, it's
very limited - you can merely know that some number of requests were
triggered by the same underlying operation.

Here, when a trace is originated with a parent span, we keep track of
that parent (called an 'originating' span id) and also send it with
requests. This means that we can use request logs to piece together
accurate service level tracing as well as trace logs.

In other words, if I have a call graph which looks like:

service a/foo:
   check auth:
     call auth service:
       auth service/check auth:
   do work:
     call work service:
       work service/do operation:
         call auth service:
           auth service/check auth:

then I will see that in the trace logs (provided the request is sampled) but I can derive

service a/foo:
   auth service/check auth:
   work service/do operation:
     auth service/check auth:

from the request logs on all requests. Currently this is possible
when requests are single-threaded, but not when people call services in
parallel.

The possible downsides are extra network overhead by communicating a new header over the wire. However, this is likely to be very small (28 bytes per request), and if there is a problem with this we can always stop communicating the parent span id, which is almost surely useless as a request header.

changelog-app · 2019-07-30T10:25:52Z

Generate changelog in `changelog/@unreleased`

Type

Description

Add originating span id to requests in order to enable request log tracing

Check the box to generate changelog(s)

Generate changelog entry

At the moment, we send trace ids and span ids with our requests. This is excellent, and provides some coarse grained information. That said, it's very limited - you can merely know that some number of requests were triggered by the same underlying operation. Here, when a trace is originated with a parent span, we keep track of that parent (called an 'originating' span id) and also send it with requests. This means that we can use request logs to piece together accurate service level tracing as well as trace logs. In other words, if I have a call graph which looks like: ```yml service a/foo: check auth: call auth service: auth service/check auth: do work: call work service: work service/do operation: call auth service: auth service/check auth: ``` then I will see that in the trace logs (sampled) but I can derive ```yml service a/foo: auth service/check auth: work service/do operation: auth service/check auth: ``` from the request logs on all requests. Currently this is possible when requests are single-threaded, but not when people call services in parallel.

iamdanfox · 2019-07-30T11:42:30Z

tracing-api/src/main/java/com/palantir/tracing/api/TraceHttpHeaders.java

+     * information to reconstruct a request-level trace. For service-internal tracing, the typical trace logs
+     * (with sampling) are still required.
+     */
+    String ORIGINATING_SPAN_ID = "X-OrigSpanId";


Could we also make sure this header name is mentioned in the README so it's vaguely googleable? This is an extra piece of information that we've made up that's not in ZipKin, so would be good to make sure it's super obvious!

Capturing the naming discussion - "incoming span id" doesn't quite make sense for the first time this gets emitted as a header, because the first value sent was just made up inside the service.

Does this need to resemble a header? If we have the ability to push non-span parameters through all of the glue, it'd be nice if it were instead something like _rootSpan or _origSpanId to match with the other "special" parameters that we have.

it is actually sent over the wire as a header, if that makes a difference?

Ah gotcha - I was thinking this was just tracked separately on each server.

tracing-api/src/main/java/com/palantir/tracing/api/OpenSpan.java

iamdanfox · 2019-07-30T11:46:18Z

@sfackler for SA just in case you wanted to do something similar in https://github.com/palantir/rust-zipkin?

tracing-api/src/main/java/com/palantir/tracing/api/TraceHttpHeaders.java

sfackler · 2019-07-30T15:49:00Z

Nice! I'd been thinking about doing something like this as well.

tracing-api/src/main/java/com/palantir/tracing/api/TraceHttpHeaders.java

tracing-okhttp3/src/main/java/com/palantir/tracing/okhttp3/OkhttpTraceInterceptor.java

ellisjoe · 2019-07-30T17:46:42Z

No real opposition here, but have we also considered just logging all trace logs that are associated with outbound or incoming requests? That would be O(request logs) that we'd end up logging and ingesting, but would allow us to only rely on the trace log format for tracing analysis.

ellisjoe · 2019-07-30T17:48:02Z

I guess you still need to plumb this through internally, but think you can get away without pushing it across the wire in that case.

j-baker · 2019-07-31T08:03:05Z

Considered that, but:

We'd then need to augment the trace log format to allow specification of this, redefine intermediate storage formats and the indexing systems in order to provide access.
We'd need to augment the systems receiving the trace logs to provide separate indexing and search capabilities.
By doing it this way, the information can be used by default in internal data pipelines, whereas additional investment would be required if this is in the trace log files.

ellisjoe · 2019-07-31T16:54:05Z

I think 1 and 2 are handled just by saying the parentSpan of any outbound request becomes the span generated when a request is initiated, in the same way you're doing the parenting in this PR. You're just squashing the span stack and reparenting as you do so.

j-baker · 2019-08-01T18:16:27Z

I'm maybe not understanding what you're looking for, but isn't that lossy? - like repurposing the parent span can't help but break other things, right?

ellisjoe · 2019-08-01T18:58:59Z

All I'm saying is that any of the internal spans generated get removed and treated like they don't exist. So in a service that has a span stack like:

{name: "request-received", id: "1", parent: "0", traceId: "X"}
{name: "doing-work", id: "2", parent: "1", traceId: "X"}
{name: "request-sent", id: "3", parent: "2", traceId: "X"}

if the trace isn't supposed to be logged we could instead just log:

{name: "request-received", id: "1", parent: "0", traceId: "X"}
{name: "request-sent", id: "3", parent: "1", traceId: "X"}

notice the doing-work span is dropped and span 3 is now parented by 1, just as if the doing-work span was removed from the code path.

Because we're only logging the spans associated with remote calls, this ends up producing the same order of magnitude of logs as requests logs do. This really only seems nicer to me because we don't need to tweak any information we're sending and the existing tracing infra would continue to work with these logs (though we probably want to signal that the trace was downsampled).

to zipkin spec

This reverts commit 8040762.

…main closer" This reverts commit 9fe7ece.

j-baker · 2019-08-08T13:30:26Z

My concerns with that approach are roughly:

Logging is really expensive - currently logging represents about 40% of CPU load across the larger installations I am responsible for. This is obviously broken, but it's a fact of life and I am not empowered to fix this.
Tracing is throughput limited, and on larger services you already expect to be constantly dropping trace logs - I would not be comfortable adding extra redundant spans there, when we cannot currently log at the right rates.
In the world you describe, it's unclear precisely which spans are observed - the client_outgoing ones? The server incoming ones (we could log on span init, i guess)? Feels weird to change observability behaviour based on the span type.
Our trace logs aren't really integrated with our infra to the same quality as our request logs, notably I cannot query them (which is what I want this for).
We already have a fallback to request logs in the case your trace wasn't sampled, so the work to surface this stuff via request logs is already done.
Lastly, with the analysis I want to do, I am unable to do anything with sampled traces if the parent span is the property used. This represents about 5% of traffic, and 100% of traffic originating from some services (low traffic services which trace everything), which feels wrong.

svc-autorelease · 2019-08-08T13:43:37Z

Released 3.1.0

markelliot · 2019-08-08T14:40:07Z

The approach Joe describes basically already functions today because Carter fixed tracing to not create new spans for unsampled traces. If request lots simply contained the parentSpanId that'd let you reconstruct traces fully without also logging traces and/or propagating more headers on every request.

j-baker requested a review from a team as a code owner July 30, 2019 10:25

j-baker added 2 commits July 30, 2019 11:26

Add generated changelog entries

65f3cf4

j-baker force-pushed the jbaker/parent_rpc_span_id branch from f0a9029 to 72a26f0 Compare July 30, 2019 10:27

fix test

d015b35

j-baker added autorelease merge when ready labels Jul 30, 2019

iamdanfox reviewed Jul 30, 2019

View reviewed changes

ferozco reviewed Jul 30, 2019

View reviewed changes

tracing-api/src/main/java/com/palantir/tracing/api/OpenSpan.java Show resolved Hide resolved

iamdanfox reviewed Jul 30, 2019

View reviewed changes

tracing-api/src/main/java/com/palantir/tracing/api/TraceHttpHeaders.java Outdated Show resolved Hide resolved

iamdanfox reviewed Jul 30, 2019

View reviewed changes

tracing-api/src/main/java/com/palantir/tracing/api/TraceHttpHeaders.java Outdated Show resolved Hide resolved

j-baker added 2 commits July 30, 2019 14:36

PR comments

68e1018

fix accidental break

2d705c3

sfackler reviewed Jul 30, 2019

View reviewed changes

tracing-api/src/main/java/com/palantir/tracing/api/TraceHttpHeaders.java Outdated Show resolved Hide resolved

carterkozak reviewed Jul 30, 2019

View reviewed changes

tracing-okhttp3/src/main/java/com/palantir/tracing/okhttp3/OkhttpTraceInterceptor.java Show resolved Hide resolved

Update TraceHttpHeaders.java

51d90b5

j-baker added 5 commits August 8, 2019 14:16

Merge branch 'develop' into jbaker/parent_rpc_span_id

6b7b833

Use parent span id instead of originating span in order to remain closer

9fe7ece

to zipkin spec

Fast

8040762

Revert "Fast"

f4557ef

This reverts commit 8040762.

Revert "Use parent span id instead of originating span in order to re…

f459a3c

…main closer" This reverts commit 9fe7ece.

dansanduleac approved these changes Aug 8, 2019

View reviewed changes

bulldozer-bot bot merged commit 42d3119 into develop Aug 8, 2019

bulldozer-bot bot deleted the jbaker/parent_rpc_span_id branch August 8, 2019 13:43

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Request log tracing #206

Request log tracing #206

j-baker commented Jul 30, 2019 •

edited

Loading

changelog-app bot commented Jul 30, 2019 •

edited by j-baker

Loading

iamdanfox Jul 30, 2019

iamdanfox Jul 30, 2019

j-baker Jul 30, 2019

sfackler Jul 30, 2019

j-baker Jul 30, 2019

sfackler Jul 30, 2019

iamdanfox commented Jul 30, 2019

sfackler commented Jul 30, 2019

ellisjoe commented Jul 30, 2019

ellisjoe commented Jul 30, 2019

j-baker commented Jul 31, 2019

ellisjoe commented Jul 31, 2019

j-baker commented Aug 1, 2019 •

edited

Loading

ellisjoe commented Aug 1, 2019 •

edited

Loading

j-baker commented Aug 8, 2019

svc-autorelease commented Aug 8, 2019

markelliot commented Aug 8, 2019

Request log tracing #206

Request log tracing #206

Conversation

j-baker commented Jul 30, 2019 • edited Loading

changelog-app bot commented Jul 30, 2019 • edited by j-baker Loading

Generate changelog in changelog/@unreleased

iamdanfox Jul 30, 2019

Choose a reason for hiding this comment

iamdanfox Jul 30, 2019

Choose a reason for hiding this comment

j-baker Jul 30, 2019

Choose a reason for hiding this comment

sfackler Jul 30, 2019

Choose a reason for hiding this comment

j-baker Jul 30, 2019

Choose a reason for hiding this comment

sfackler Jul 30, 2019

Choose a reason for hiding this comment

iamdanfox commented Jul 30, 2019

sfackler commented Jul 30, 2019

ellisjoe commented Jul 30, 2019

ellisjoe commented Jul 30, 2019

j-baker commented Jul 31, 2019

ellisjoe commented Jul 31, 2019

j-baker commented Aug 1, 2019 • edited Loading

ellisjoe commented Aug 1, 2019 • edited Loading

j-baker commented Aug 8, 2019

svc-autorelease commented Aug 8, 2019

markelliot commented Aug 8, 2019

j-baker commented Jul 30, 2019 •

edited

Loading

changelog-app bot commented Jul 30, 2019 •

edited by j-baker

Loading

Generate changelog in `changelog/@unreleased`

j-baker commented Aug 1, 2019 •

edited

Loading

ellisjoe commented Aug 1, 2019 •

edited

Loading