How to record causal relationships / sequencing between sibling spans #142
Comments
Idea A: How about a … If we reference the X-Trace paper: … or, not as an image: …
Wherever the …
Idea B: We could instead follow the X-Trace model more directly and create something like a … Thoughts?
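The X-Trace model referenced in both ideas propagates causality along with the request: each operation hands metadata to the next one, and the receiving operation knows which operation it came from. Below is a simplified sketch of that idea, loosely following the paper's pushDown/pushNext primitives. The field names (`task_id`, `op_id`, `parent_op_id`, `edge_type`) and the `push_down`/`push_next` functions are illustrative stand-ins, not X-Trace's actual library API.

```python
import uuid
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class XTraceMetadata:
    """Simplified X-Trace-style metadata carried with a request (illustrative)."""
    task_id: str
    op_id: str
    parent_op_id: Optional[str] = None
    edge_type: Optional[str] = None  # "DOWN" or "NEXT"


def _new_op_id() -> str:
    return uuid.uuid4().hex[:16]


def _propagate(md: XTraceMetadata, edge_type: str) -> XTraceMetadata:
    # The new operation gets a fresh ID and remembers which operation it came
    # from, so the causal edge travels with the request itself.
    return XTraceMetadata(md.task_id, _new_op_id(), md.op_id, edge_type)


def push_down(md: XTraceMetadata) -> XTraceMetadata:
    """Hand metadata to the layer below (e.g. app request -> first proxy hop)."""
    return _propagate(md, "DOWN")


def push_next(md: XTraceMetadata) -> XTraceMetadata:
    """Hand metadata to the next operation in the same layer (hop -> next hop)."""
    return _propagate(md, "NEXT")


root = XTraceMetadata(task_id=_new_op_id(), op_id=_new_op_id())
hop1 = push_down(root)
hop2 = push_next(hop1)
server = push_next(hop2)
for md in (hop1, hop2, server):
    print(md.parent_op_id, "->", md.op_id, md.edge_type)
```

The NEXT edges between `hop1`, `hop2`, and `server` carry exactly the sibling ordering that plain parent/child links leave out.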
Meta-comment on RFCs: let's use real things! For example, we learned from Zipkin that people rarely understood anything. How about revamping your example to use things tons of people understand? For example, instead of client-span -> hop1 -> hop2 -> server, use browser -> elastic load balancer -> haproxy -> tomcat. You can then refer to these in your example. It will help, as you can establish common ground with people who are not used to span jargon yet :) For example, X-Forwarded-For can help guide discussion. One thing I learned in Zipkin is that few people know how social-company RPC stuff like Finagle or autobahn works, so using these as examples actually creates cognitive distance rather than shortening it. Favoring the "EC2 crowd" will lower the barrier to entry for discussions like this, opening them up beyond folks who already know Zipkin etc. to the wider group who were left behind. /me ends meta-comment. As for a real comment: I'm too distracted to think deeply about a solution right now, except to say this looks like a quite valid concern.
@bensigelman I think something like this is doable, although the first span, the one that should be tagged with HappensBeforeSpan, cannot know the next span's ID. But it does know that it has fully finished before the next span starts. So it can emit an equivalent of the pushNext annotation (or "finish-to-start" for people familiar with MS Project dependencies) and preserve its own span ID in the trace context, so that the next sibling can emit HappensAfterSpan=sid yet still register itself as a child of the original parent span. So the remaining question is how we want to capture this in the API.
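A minimal in-process sketch of the finish-to-start idea. It assumes a context object that carries the parent span ID plus the ID of the most recently finished child. `TraceContext`, `Span`, `start_child`, `finish_child`, and the `HappensAfterSpan` tag key are hypothetical names for illustration, not OpenTracing API.

```python
from dataclasses import dataclass, field
from itertools import count
from typing import Dict, Optional

_ids = count(1)  # hypothetical span-ID generator for this sketch


@dataclass
class TraceContext:
    """Hypothetical context: the current parent plus its last finished child."""
    parent_id: str
    last_finished_child: Optional[str] = None


@dataclass
class Span:
    span_id: str
    parent_id: str
    tags: Dict[str, str] = field(default_factory=dict)


def start_child(ctx: TraceContext) -> Span:
    # Every sibling is still registered as a child of the original parent.
    span = Span(span_id=f"s{next(_ids)}", parent_id=ctx.parent_id)
    if ctx.last_finished_child is not None:
        # The earlier sibling could not know this span's ID, but this span
        # can point back at it: "HappensAfterSpan=sid".
        span.tags["HappensAfterSpan"] = ctx.last_finished_child
    return span


def finish_child(ctx: TraceContext, span: Span) -> None:
    # Equivalent of pushNext / "finish-to-start": remember the finished span
    # so the next sibling started under this parent can point back at it.
    ctx.last_finished_child = span.span_id


ctx = TraceContext(parent_id="root")
auth = start_child(ctx)
finish_child(ctx, auth)
handler = start_child(ctx)
print(auth.tags, handler.tags)  # {} {'HappensAfterSpan': 's1'}
```

Across process boundaries the same ID would have to travel in the propagated trace context rather than in a local object.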
Re the X-Trace paper: my understanding was that the top-left operation would record just the one piece of metadata, and that the … In any case, the most important question is the one you end with: how best to represent this in the programmer-facing API? The safest thing (IMO) would be to start with a lower-level API and leave it at that until we have more evidence about the particular data model. By "lower-level," I mean just some simple function calls that abstract away the particular names we choose for "Happens-Before" tag keys, etc. While I consider this topic important in the long term, I don't want it to stumble into a lot of complexity that distracts us from the more pressing matter of getting publishable APIs out in go+py+js+java (or whatever else we decide to prioritize early on). Thoughts?
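To illustrate the "lower-level API" suggestion, here is a sketch of two helper calls that keep the tag key out of instrumentation code, so the key can be renamed later without touching callers. The only assumption about the span is that it exposes `set_tag(key, value)`, as OpenTracing spans do. The helper names, the key string, and `DictSpan` are hypothetical.

```python
from typing import Any, Dict, Optional, Protocol


class TaggableSpan(Protocol):
    def set_tag(self, key: str, value: Any) -> Any: ...


# Hypothetical key; callers never spell it out, so it can change later.
_HAPPENS_AFTER_KEY = "HappensAfterSpan"


def record_happens_after(span: TaggableSpan, earlier_span_id: str) -> None:
    """Declare that `span` started only after `earlier_span_id` finished."""
    span.set_tag(_HAPPENS_AFTER_KEY, earlier_span_id)


def happens_after(tags: Dict[str, Any]) -> Optional[str]:
    """Read the relationship back from a span's tags (e.g. in a collector or UI)."""
    return tags.get(_HAPPENS_AFTER_KEY)


class DictSpan:
    """Stand-in span for the example; real tracers provide their own."""

    def __init__(self) -> None:
        self.tags: Dict[str, Any] = {}

    def set_tag(self, key: str, value: Any) -> "DictSpan":
        self.tags[key] = value
        return self


span = DictSpan()
record_happens_after(span, "a1b2")
print(happens_after(span.tags))  # a1b2
```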
Some background on Zipkin: this scenario is supported by the shared span model. In Zipkin, multiple endpoints participate in the same span. This lets you see the server and client on the same line, and also any proxies on that same line. Here's an example:
A decision to squash proxies is highly subjective, i.e. a matter of policy. Could a presentation layer be taught, via some policy, to collapse proxies that share the same span ID? In Zipkin, the "real" destination is annotated with the tag "sa" (server address). Using this, you could implement a policy that squashes the hops between the client and the server. Would something like this not work? The happens-before question (relating to clock skew) seems a separate, albeit related, issue.
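A sketch of such a presentation-layer policy for a shared span. It keeps only the annotations logged by the client and by the service named as the real destination. The data model is a simplified stand-in: Zipkin records "sa" as an address annotation with an endpoint, whereas here it is just a service-name string. `Annotation`, `SharedSpan`, and `squash_proxies` are hypothetical names.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Annotation:
    timestamp_us: int
    value: str    # e.g. "cs", "sr", "ss", "cr"
    service: str  # endpoint that logged the annotation


@dataclass
class SharedSpan:
    span_id: str
    annotations: List[Annotation]
    server_address: str  # service named by the "sa" tag, i.e. the "real" destination


def squash_proxies(span: SharedSpan) -> SharedSpan:
    """Policy: keep only annotations logged by the client and the real destination."""
    # Assumes the client's "cs" is the first "cs" in the list.
    client = next(a.service for a in span.annotations if a.value == "cs")
    keep = {client, span.server_address}
    kept = sorted((a for a in span.annotations if a.service in keep),
                  key=lambda a: a.timestamp_us)
    return SharedSpan(span.span_id, kept, span.server_address)


span = SharedSpan(
    span_id="abc",
    annotations=[
        Annotation(0, "cs", "browser"),
        Annotation(2_000, "sr", "haproxy"),
        Annotation(3_000, "sr", "tomcat"),
        Annotation(8_000, "ss", "tomcat"),
        Annotation(9_000, "ss", "haproxy"),
        Annotation(11_000, "cr", "browser"),
    ],
    server_address="tomcat",
)
print([(a.value, a.service) for a in squash_proxies(span).annotations])
# [('cs', 'browser'), ('sr', 'tomcat'), ('ss', 'tomcat'), ('cr', 'browser')]
```

Because the squashing happens only at display time, the stored span keeps the haproxy annotations, and a different policy could show them.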
Also, assuming we aren't doing shared spans (which is OK by me), we could still define a span type for proxies, similar to this: https://cloud.google.com/cloud-trace/api/reference/rest/v1/projects.traces#SpanKind. That would also let the presentation tier choose to squash them without larger model changes. Thoughts?
@adriancole I agree completely that RPCs can, and probably should, be rendered as a single row in a conventional Zipkin/Dapper-style UI. Yet from a data-modeling perspective there is still a strong case to be made for multiple spans per RPC. And the … PS: I don't think (?) the Dapper paper addressed this, but in older versions of Stubby (Google's RPC subsystem) there were sometimes user-space queuing issues in high-throughput processes. As a result, the trace UI showed server time and the full end-to-end client time, but also the queueing delay on both the client and server sides, which was sometimes significant in terms of the global critical path. I would suggest we model things like those enqueue/dequeue events via …
@adriancole capturing the firewall hop as a Log in the shared span doesn't seem useful because of clock skew. The only thing it tells you is "yep, we passed through the firewall", since the timestamp cannot be reasoned about without a lot of additional alignment logic. We (at Uber) decided to model proxy/router hops, such as haproxy, as nested spans (per the initial post). It makes the dependencies-graph job a bit harder, but not impossible, since the job just needs to know that the "haproxy" service is middleware and treat it as pass-through when deriving service-to-service dependencies. I haven't gotten around to implementing it yet; there will be a patch to zipkin-dependencies. Agree on the … I suggest we keep this issue open until we have a good proposal for the happens-before use case, since it's the one I primarily had in mind. I think we have a general idea and just need to come up with a concrete proposal. I agree with @bensigelman that it's not a very pressing issue.
Yuri, what you've said makes sense. How about we rename this issue? What confused me was that I misunderstood your goal: I thought it was to collapse middleware. It seems we are most interested in sequence, aka happens-before, right? Let's make that the issue title, since you don't want this closed until we have support for it.
This topic of happens-before is one that comes up quite often, and usually … If we are to repurpose this issue to solve that, we'd be best using the … Frame granularity and Sequencing (aka Local Spans)
I suppose another way to address this is to add a task list. For example, we have at least sequencing, if not typing (SpanKind), right? Then, once … sg?
@yurishkuro Has this issue been fixed? I noticed that my middleware authentication and authorization spans seem to finish only at the end of a trace, with subsequent spans shown nested under them even though they are sibling spans, not child spans.
It has not been fixed. It should also be moved to the Specification repo.
Joining this conversation from jaegertracing/jaeger-ui#390: it looks like a solution is needed to express this kind of sequence/sibling relationship so that visualizers like Jaeger-UI can reduce staircasing. The discussion here is quite old, so how much of it still holds, and what needs to happen next? @adriancole mentioned defining a to-do list with the dependencies, but what actually are those dependencies right now?
EDIT: decided below to refocus this issue on the sequencing of sibling spans. The original title was "Provide example of reporting 'middleware' hops".
Suppose we have an RPC call from Service A to Service B. In classic Zipkin, service A starts a "client" span, and service B joins that span as "server". This results in a single span in storage, demarcated by cs->sr->ss->cr annotations. The new OpenTracing API advocates using separate spans for client and server, but that's beside the point.
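To make the cs->sr->ss->cr demarcation concrete, here is a sketch that derives the usual durations from one shared span. cs/cr are stamped on the client host and sr/ss on the server host, so only same-host differences are safe from clock skew. The function name and dictionary shape are hypothetical; timestamps are assumed to be in microseconds.

```python
from typing import Dict


def rpc_timings(annotations: Dict[str, int]) -> Dict[str, int]:
    """Derive durations (microseconds) from one shared span's cs/sr/ss/cr timestamps."""
    client = annotations["cr"] - annotations["cs"]  # client clock only
    server = annotations["ss"] - annotations["sr"]  # server clock only
    return {
        "client_us": client,
        "server_us": server,
        # Time spent outside the server handler: network, queueing, and any
        # middleware hops in between. Skew-free, since it subtracts durations.
        "outside_server_us": client - server,
    }


print(rpc_timings({"cs": 0, "sr": 1_500, "ss": 9_500, "cr": 12_000}))
# {'client_us': 12000, 'server_us': 8000, 'outside_server_us': 4000}
```

Any middleware between A and B disappears into `outside_server_us`, which is why the question below is how such hops should be represented.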
The question is what happens if there is some middleware between A and B that can also enrich the trace (for example, haproxy or Hyperbahn). There may also be more than one hop through the middleware before the request reaches service B. There are two ways to represent this in the span-based tracing model:
Nested spans
Issue 1: when building the service dependency graph, this trace will produce the dependency chain A->MW->MW->B. If there are many dependencies like this, the diagram will make it look as if everything depends on MW and MW talks to everything, while the A->B dependency is lost.
Possible solution: mark the "hop" spans with a special attribute indicating middleware, and handle them specially when building the dependency diagram.
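A sketch of that special handling. Spans flagged as middleware never start or end an edge, and a callee's caller is found by walking up past any middleware ancestors. It assumes one span per service in the chain, which simplifies the nested-spans picture. `SpanRecord`, the `middleware` flag, and `dependency_edges` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Set, Tuple


@dataclass
class SpanRecord:
    span_id: str
    parent_id: Optional[str]
    service: str
    middleware: bool = False  # hypothetical attribute marking a "hop" span


def dependency_edges(spans: List[SpanRecord],
                     skip_middleware: bool = True) -> Set[Tuple[str, str]]:
    """Derive caller->callee service edges from parent/child span links."""
    by_id: Dict[str, SpanRecord] = {s.span_id: s for s in spans}

    def parent_of(span: SpanRecord) -> Optional[SpanRecord]:
        return by_id.get(span.parent_id) if span.parent_id is not None else None

    edges: Set[Tuple[str, str]] = set()
    for span in spans:
        if skip_middleware and span.middleware:
            continue  # hops are pass-through: they never start or end an edge
        caller = parent_of(span)
        # Walk up past middleware hops to the nearest "real" caller.
        while skip_middleware and caller is not None and caller.middleware:
            caller = parent_of(caller)
        # Skip self-edges such as MW->MW between consecutive hops.
        if caller is not None and caller.service != span.service:
            edges.add((caller.service, span.service))
    return edges


trace = [
    SpanRecord("a", None, "A"),
    SpanRecord("m1", "a", "MW", middleware=True),
    SpanRecord("m2", "m1", "MW", middleware=True),
    SpanRecord("b", "m2", "B"),
]
print(sorted(dependency_edges(trace, skip_middleware=False)))  # [('A', 'MW'), ('MW', 'B')]
print(sorted(dependency_edges(trace)))                         # [('A', 'B')]
```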
Issue 2: if the middleware is implemented as a proxy, it makes sense that a "hop" span does not complete until the server span is complete. However, if the middleware is implemented as a messaging system, the above trace does not make sense; it should look like the one below.
Stacked sibling spans
Issue: in order to display the trace as shown above, especially in light of clock skew, the UI needs to know that there is a strong happened-before relationship between the spans. The current DCP API does not capture that relationship, and it's not clear whether it can be captured via span annotations, since each stacked span knows nothing about its siblings. In contrast, the X-Trace API explicitly captured these relationships by means of its pushDown and pushNext operations.
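A sketch of how a UI could use such relationships if they were captured. It assumes each span may carry a single reference to the earlier sibling it follows. Siblings are ordered by a topological sort of those edges, and the (possibly skewed) start timestamps only break ties. The function name and input shapes are hypothetical. This is Kahn's algorithm with a min-heap, not the behavior of any existing UI.

```python
import heapq
from typing import Dict, List, Tuple


def order_siblings(start_ts: Dict[str, int],
                   happens_after: Dict[str, str]) -> List[str]:
    """Order sibling spans so that every happens-after edge is respected.

    start_ts: span ID -> reported start time (may be skewed across hosts).
    happens_after: span ID -> ID of the sibling it must follow.
    Timestamps only break ties between spans with no ordering constraint.
    """
    successors: Dict[str, List[str]] = {s: [] for s in start_ts}
    indegree: Dict[str, int] = {s: 0 for s in start_ts}
    for later, earlier in happens_after.items():
        successors[earlier].append(later)
        indegree[later] += 1

    # Kahn's algorithm, popping the earliest-reported span among those ready.
    ready: List[Tuple[int, str]] = [(start_ts[s], s) for s, d in indegree.items() if d == 0]
    heapq.heapify(ready)
    order: List[str] = []
    while ready:
        _, span = heapq.heappop(ready)
        order.append(span)
        for nxt in successors[span]:
            indegree[nxt] -= 1
            if indegree[nxt] == 0:
                heapq.heappush(ready, (start_ts[nxt], nxt))

    if len(order) != len(start_ts):
        raise ValueError("happens-after edges form a cycle")
    return order


# The authz host's clock runs behind, so its start time looks earlier than authn's.
starts = {"authn": 100, "authz": 90, "handler": 95, "audit": 50}
edges = {"authz": "authn", "handler": "authz"}
print(sorted(starts, key=starts.get))  # ['audit', 'authz', 'handler', 'authn']
print(order_siblings(starts, edges))   # ['audit', 'authn', 'authz', 'handler']
```

Sorting by timestamp alone would place authz before authn. The edges restore the causal order, while the unconstrained `audit` span still falls back to its timestamp.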