Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Request log tracing #206

Merged
merged 11 commits into from
Aug 8, 2019
Merged

Request log tracing #206

merged 11 commits into from
Aug 8, 2019

Conversation

j-baker
Copy link
Contributor

@j-baker j-baker commented Jul 30, 2019

At the moment, we send trace ids and span ids with our requests. This is
excellent, and provides some coarse grained information. That said, it's
very limited - you can merely know that some number of requests were
triggered by the same underlying operation.

Here, when a trace is originated with a parent span, we keep track of
that parent (called an 'originating' span id) and also send it with
requests. This means that we can use request logs to piece together
accurate service level tracing as well as trace logs.

In other words, if I have a call graph which looks like:

service a/foo:
   check auth:
     call auth service:
       auth service/check auth:
   do work:
     call work service:
       work service/do operation:
         call auth service:
           auth service/check auth:

then I will see that in the trace logs (provided the request is sampled) but I can derive

service a/foo:
   auth service/check auth:
   work service/do operation:
     auth service/check auth:

from the request logs on all requests. Currently this is possible
when requests are single-threaded, but not when people call services in
parallel.

The possible downsides are extra network overhead by communicating a new header over the wire. However, this is likely to be very small (28 bytes per request), and if there is a problem with this we can always stop communicating the parent span id, which is almost surely useless as a request header.

@j-baker j-baker requested a review from a team as a code owner July 30, 2019 10:25
@changelog-app
Copy link

changelog-app bot commented Jul 30, 2019

Generate changelog in changelog/@unreleased

Type

  • Feature
  • Improvement
  • Fix
  • Break
  • Deprecation
  • Manual task
  • Migration

Description

Add originating span id to requests in order to enable request log tracing

Check the box to generate changelog(s)

  • Generate changelog entry

j-baker added 2 commits July 30, 2019 11:26
At the moment, we send trace ids and span ids with our requests. This is
excellent, and provides some coarse grained information. That said, it's
very limited - you can merely know that some number of requests were
triggered by the same underlying operation.

Here, when a trace is originated with a parent span, we keep track of
that parent (called an 'originating' span id) and also send it with
requests. This means that we can use request logs to piece together
accurate service level tracing as well as trace logs.

In other words, if I have a call graph which looks like:

```yml
service a/foo:
   check auth:
     call auth service:
       auth service/check auth:
   do work:
     call work service:
       work service/do operation:
         call auth service:
           auth service/check auth:
```

then I will see that in the trace logs (sampled) but I can derive

```yml
service a/foo:
   auth service/check auth:
   work service/do operation:
     auth service/check auth:
```

from the request logs on all requests. Currently this is possible
when requests are single-threaded, but not when people call services in
parallel.
@j-baker j-baker force-pushed the jbaker/parent_rpc_span_id branch from f0a9029 to 72a26f0 Compare July 30, 2019 10:27
* information to reconstruct a request-level trace. For service-internal tracing, the typical trace logs
* (with sampling) are still required.
*/
String ORIGINATING_SPAN_ID = "X-OrigSpanId";
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we also make sure this header name is mentioned in the README so it's vaguely googleable? This is an extra piece of information that we've made up that's not in ZipKin, so would be good to make sure it's super obvious!

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Capturing the naming discussion - "incoming span id" doesn't quite make sense for the first time this gets emitted as a header, because the first value sent was just made up inside the service.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this need to resemble a header? If we have the ability to push non-span parameters through all of the glue, it'd be nice if it were instead something like _rootSpan or _origSpanId to match with the other "special" parameters that we have.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it is actually sent over the wire as a header, if that makes a difference?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah gotcha - I was thinking this was just tracked separately on each server.

@iamdanfox
Copy link
Contributor

@sfackler for SA just in case you wanted to do something similar in https://github.com/palantir/rust-zipkin?

@sfackler
Copy link
Member

Nice! I'd been thinking about doing something like this as well.

@ellisjoe
Copy link
Contributor

No real opposition here, but have we also considered just logging all trace logs that are associated with outbound or incoming requests? That would be O(request logs) that we'd end up logging and ingesting, but would allow us to only rely on the trace log format for tracing analysis.

@ellisjoe
Copy link
Contributor

I guess you still need to plumb this through internally, but think you can get away without pushing it across the wire in that case.

@j-baker
Copy link
Contributor Author

j-baker commented Jul 31, 2019

Considered that, but:

  1. We'd then need to augment the trace log format to allow specification of this, redefine intermediate storage formats and the indexing systems in order to provide access.
  2. We'd need to augment the systems receiving the trace logs to provide separate indexing and search capabilities.
  3. By doing it this way, the information can be used by default in internal data pipelines, whereas additional investment would be required if this is in the trace log files.

@ellisjoe
Copy link
Contributor

I think 1 and 2 are handled just by saying the parentSpan of any outbound request becomes the span generated when a request is initiated, in the same way you're doing the parenting in this PR. You're just squashing the span stack and reparenting as you do so.

@j-baker
Copy link
Contributor Author

j-baker commented Aug 1, 2019

I'm maybe not understanding what you're looking for, but isn't that lossy? - like repurposing the parent span can't help but break other things, right?

@ellisjoe
Copy link
Contributor

ellisjoe commented Aug 1, 2019

All I'm saying is that any of the internal spans generated get removed and treated like they don't exist. So in a service that has a span stack like:

{name: "request-received", id: "1", parent: "0", traceId: "X"}
{name: "doing-work", id: "2", parent: "1", traceId: "X"}
{name: "request-sent", id: "3", parent: "2", traceId: "X"}

if the trace isn't supposed to be logged we could instead just log:

{name: "request-received", id: "1", parent: "0", traceId: "X"}
{name: "request-sent", id: "3", parent: "1", traceId: "X"}

notice the doing-work span is dropped and span 3 is now parented by 1, just as if the doing-work span was removed from the code path.

Because we're only logging the spans associated with remote calls, this ends up producing the same order of magnitude of logs as requests logs do. This really only seems nicer to me because we don't need to tweak any information we're sending and the existing tracing infra would continue to work with these logs (though we probably want to signal that the trace was downsampled).

@j-baker
Copy link
Contributor Author

j-baker commented Aug 8, 2019

My concerns with that approach are roughly:

  1. Logging is really expensive - currently logging represents about 40% of CPU load across the larger installations I am responsible for. This is obviously broken, but it's a fact of life and I am not empowered to fix this.
  2. Tracing is throughput limited, and on larger services you already expect to be constantly dropping trace logs - I would not be comfortable adding extra redundant spans there, when we cannot currently log at the right rates.
  3. In the world you describe, it's unclear precisely which spans are observed - the client_outgoing ones? The server incoming ones (we could log on span init, i guess)? Feels weird to change observability behaviour based on the span type.
  4. Our trace logs aren't really integrated with our infra to the same quality as our request logs, notably I cannot query them (which is what I want this for).
  5. We already have a fallback to request logs in the case your trace wasn't sampled, so the work to surface this stuff via request logs is already done.
  6. Lastly, with the analysis I want to do, I am unable to do anything with sampled traces if the parent span is the property used. This represents about 5% of traffic, and 100% of traffic originating from some services (low traffic services which trace everything), which feels wrong.

@bulldozer-bot bulldozer-bot bot merged commit 42d3119 into develop Aug 8, 2019
@bulldozer-bot bulldozer-bot bot deleted the jbaker/parent_rpc_span_id branch August 8, 2019 13:43
@svc-autorelease
Copy link
Collaborator

Released 3.1.0

@markelliot
Copy link
Contributor

The approach Joe describes basically already functions today because Carter fixed tracing to not create new spans for unsampled traces. If request lots simply contained the parentSpanId that'd let you reconstruct traces fully without also logging traces and/or propagating more headers on every request.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

9 participants