feat: add basic tracing for responses #291
Conversation
The test failures are all from the original tracing tests not understanding these new traces, but they're instructive to look at to see what's been added, e.g.: TestGraphsyncRoundTrip:
TestGraphsyncRoundTripRequestBudgetRequestor:
TestNetworkDisconnect:
And TestGraphsyncRoundTripIgnoreNBlocks has lots of
@rvagg I think we had a bit of confusion here. Yes, we should not make the message span the parent span. That will create a LOT of complexity and that's not the intention. In fact, we can ignore "message spans" entirely.

BTW, the context we get on ReceiveMessage would have more value if it had existing spans associated with it, but it has nothing: https://github.com/ipfs/go-graphsync/blob/main/network/libp2p_impl.go#L149-L151

We should track only individual spans for requests, at least for now, and they should simply start off ResponseManager.ctx, inside of ResponseManager.processRequests, and only when it's a new request. We do need a better approach eventually to link this up to data transfer spans, but that's a different ticket. I'll submit an issue shortly with how we can deal with this aspect for the time being.
Essentially, this should start with no message span at all: go-graphsync/responsemanager/server.go Line 153 in ef2bac9
and it should move here: go-graphsync/responsemanager/server.go Line 171 in ef2bac9
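As a rough sketch of that suggestion (illustrative only, not the PR's code: the receiver, field, helper, and span names are assumptions, and only the otel/attribute/trace calls are the real APIs), the per-request span would be rooted on the manager's long-lived context rather than on any message span:

// assumed imports: "context", "go.opentelemetry.io/otel",
// "go.opentelemetry.io/otel/attribute", "go.opentelemetry.io/otel/trace",
// gsmsg "github.com/ipfs/go-graphsync/message"

// startResponseSpan is a hypothetical helper called from processRequests,
// only when the incoming request is new (not a cancel or an update).
func (rm *ResponseManager) startResponseSpan(request gsmsg.GraphSyncRequest) (context.Context, trace.Span) {
	// rm.ctx is the manager's long-lived context, so the span outlives the
	// processRequests call and is ended only when the response completes.
	return otel.Tracer("graphsync").Start(rm.ctx, "response",
		trace.WithAttributes(
			attribute.Int("id", int(request.ID())),
			attribute.String("root", request.Root().String()),
		))
}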
BTW, if we do want to do message spans, we should simply use "Links" to connect them to request spans, rather than creating the request spans off of them. I'm not sure whether this is something we should tackle now, and I really haven't messed with OT Links just yet. But according to the docs, this is exactly what you are supposed to use Links for: https://pkg.go.dev/go.opentelemetry.io/otel/trace#Link
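For reference, a minimal sketch of what that could look like (illustrative only: trace.LinkFromContext and trace.WithLinks are the real otel APIs, everything else here is an assumption):

// assumed imports: "context", "go.opentelemetry.io/otel", "go.opentelemetry.io/otel/trace"

// startLinkedRequestSpan roots the request span on the manager context but
// records a Link back to whatever span (if any) is carried by msgCtx,
// instead of making the message span the parent.
func startLinkedRequestSpan(managerCtx, msgCtx context.Context) (context.Context, trace.Span) {
	return otel.Tracer("graphsync").Start(managerCtx, "request",
		trace.WithLinks(trace.LinkFromContext(msgCtx)))
}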
See comments
I'm going to have to abandon this for the evening; sorry I didn't get it done by EOD in time for a release @hannahhoward, but it's really close and the traces are already pretty interesting (see impl/graphsync_test.go for summaries).

The main hang-up now is the variability of traces when observing from the tests. I'm pretty sure it's reasonable variability, getting one or more events queuing depending on the run, and I don't think I can control some of these things, but I need to account for that variability in the tests. I already have one test doing this where I know the variability comes from the network layer triggering multiple error events; I just need to walk through the other ones that are variable to make sure I understand why, so I'm not ignoring dodgy conditions.

I also think I need to be recording pause events on a span somewhere when they happen, and maybe unpause too. The TODOs in the code represent my outstanding tasks.
Hey, various questions. Again, where the request actually ends is a bit complicated, and some of the calls are in the wrong place. See the issue for reference: #288
Force-pushed from 86fc465 to 6a28a13
@hannahhoward I'm slightly stuck on this now. Your response-end work has been helpful, but now I'm running into a lot of variability on the receiver side in some of the tests. Sometimes the spans show up and sometimes they don't. I don't know whether this is because those paths are not executed or because I'm capturing traces before everything's closed down and the traces have ended. I'll keep on trying to debug this to work out why, but if the answers are obvious to you then give me some hints. I'm suspecting network-related variability that I need to accept, but it could also be synchronization issues and perhaps even actual bugs. So I'd like to figure that out to know whether I need to tighten something up, or just give up on those assertions. The concerning tests with TODOs in them are:
In CI now, I can see these failing fairly reliably, not even recording a response at all, so I suppose that's a speed thing. And TestPauseResumeViaUpdateOnBlockHook is in the list too. 🤦 There's also an outstanding TODO in TestGraphsyncRoundTripAlternatePersistenceAndNodes where the recorded error on the request trace can either be a SkipMe or a ContextCancelError, which is a little concerning.
A couple of comments, but this is looking good. I will take a look at the testing issues now.
span.RecordError(err)
span.SetStatus(codes.Error, err.Error())
response.span.RecordError(err)
response.span.SetStatus(codes.Error, err.Error())
I don't know what good span error conventions are, but should we record this twice at two different levels?
Yeah, I don't know and wanted to discuss this with you. This particular one is a bit tricky; others are a little clearer, I think, in that "this error belongs to the response". But here there's a span for something called "abort", and as a user I'd expect to see what that abort was about on that span, yet the error is also responsible for closing the request.
I could go either way on this, and threw up my hands and said why not both? 🌮
all good by me :)
Force-pushed from 218092a to b219119
I think this is in a decent state now for a proper review @hannahhoward when you have time.
A few small nits to consider fixing before merge, but not blocking enough that I want to force a request-changes cycle :)
responsemanager/server.go (Outdated)
attribute.Int("id", int(update.ID())), | ||
attribute.Int("priority", int(update.Priority())), | ||
attribute.String("root", update.Root().String()), | ||
attribute.Bool("isCancel", update.IsCancel()), |
Are these needed? Getting to this code path requires that isCancel is always false and isUpdate is always true... it seems unnecessary to record a value that is always the same.
Also, all of these values EXCEPT the extensions should be contained in the parent span. Basically, an update request ONLY contains extensions to update the request, plus an id to identify the request (plus isUpdate to identify that it's an update).
This is actually a pretty confusing part of the current protobuf format:
new, cancel, and update share a struct type: Request
However, Cancel is basically:
{
id: id,
isCancel: true
}
and update is:
{
id: id,
isUpdate: true,
extensions: extensions
}
All the other fields are only set for new requests.
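To make those three shapes concrete, here's a hedged sketch of how a handler might branch on them. The accessor names (ID, IsCancel, IsUpdate) follow the fields discussed above but may not match the actual message API exactly, and the handler functions are hypothetical:

// assumed import: gsmsg "github.com/ipfs/go-graphsync/message"

// dispatch routes a request based on the three shapes the shared Request
// struct can take: cancel, update, or new.
func dispatch(request gsmsg.GraphSyncRequest) {
	switch {
	case request.IsCancel():
		// cancel: only the id is meaningful
		handleCancel(request.ID())
	case request.IsUpdate():
		// update: id plus extensions
		handleUpdate(request)
	default:
		// new request: all the other fields (priority, root, selector, ...) are set
		handleNew(request)
	}
}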
Only mentioning in case we decide we want to make this clearer, given that we're about to do big protocol changes.
OK, all very good to know. What I've done: removed all but "extensions" from the processUpdate span, and also removed "isCancel" and "isUpdate" from the request span.

It looks like we can assume an "isUpdate" if we get a "processUpdate" span, but I don't know if we can assume an "isCancel" as clearly, since abortRequest() can be called from both CloseWithNetworkError() and CancelResponse() as well as from the plain processRequests() path. I think that an ipldutil.ContextCancelError in the error of the span should differentiate the different paths and let you assume that it's an "isCancel".

But also, maybe this is a task for later, when we get some tracing closer up to the message layer where we can lay out more specifics about the chatter.
Force-pushed from b219119 to eff0a0a
fix(impl): remove accidental legacy field (#310)
feat: add basic tracing for responses (#291)
* feat: add basic tracing for responses
* fix: add assertions for response completion
* fix: improve test synchronization/drainage
* fix(responsemanager): remove uninformative span flags
Co-authored-by: hannahhoward <hannah@hannahhoward.net>
Oh, you already did it! I didn't get to the bottom of this issue before I tried to push some changes.
* feat(channels): add missing cids field
* feat(graphsync): handle missings cids via events, integration test
* style(lint): mod tidy
* feat(channels): put retrievals with missing CIDs in "partially complete"
* style(graphsync): add explanatory comment
* feat(deps): update to v0.12.0 graphsync
* ci(circle): update to go 1.16
* build(go.mod): update to go 1.16
This works so far for the responseManager.ProcessRequests() path, but it's pretty complicated, with two main problems (building on from what I mentioned @ https://github.com/ipfs/go-graphsync/pull/290/files#r760950285 about the branching inside GraphSync#ReceiveMessage() and wanting to have a base span that lives for the entirety of an incoming message there):

1. Even taking the responseManager path in isolation, each "message" can have multiple requests, each of which becomes detached from the others for processing, so making a "message" span live across all of these is going to take some tricky tracking, or a special wrapper that can decrement calls to "End" and call a true "End" when it reaches zero (see the sketch below).
2. The branching into one path to responseManager and one to requestManager, with the latter having a fairly complicated journey, doubles the pain of having a parent "message" span with a discrete end.

Perhaps a parent "message" span isn't what we want, and we should just have a parent "request" span for each of the requests that an individual message can trigger? Which is a bit unfortunate because then we can't have a span that lives for the lifetime of a call through the top-level GraphSync call, but maybe that's OK?

Currently I have a parent "message" span live until all child "request" spans have been started, so there'll be a parent->child relationship, but the parent won't live much longer after the child is started. I bet different trace viewing tools will deal with that in their own unique way, but I reckon it'll still retain the hierarchy.
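A minimal sketch of that "End"-counting wrapper idea, purely illustrative and not part of the PR (the type and function names are made up; only trace.Span is the real otel type):

// assumed imports: "sync/atomic", "go.opentelemetry.io/otel/trace"

// refCountedSpan shares one "message" span between the requests spawned by
// that message; each consumer calls End once, and the wrapped span is only
// ended when the last consumer has finished.
type refCountedSpan struct {
	span trace.Span
	refs int32
}

func newRefCountedSpan(span trace.Span, consumers int32) *refCountedSpan {
	return &refCountedSpan{span: span, refs: consumers}
}

func (s *refCountedSpan) End() {
	if atomic.AddInt32(&s.refs, -1) == 0 {
		s.span.End()
	}
}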