feat: add basic tracing for responses #291
Conversation
The test failures are all from the original tracing tests not understanding these new traces, but they're instructive to look at to see what's been added, e.g.: TestGraphsyncRoundTrip:
TestGraphsyncRoundTripRequestBudgetRequestor:
TestNetworkDisconnect:
And TestGraphsyncRoundTripIgnoreNBlocks has lots of
@rvagg I think we had a bit of confusion here. Yes, we should not make the message span the parent span. That will create a LOT of complexity and that's not the intention. In fact, we can ignore "message spans" entirely.

BTW, the context we get on ReceiveMessage would have more value if it had existing spans associated with it, but it has nothing: https://github.com/ipfs/go-graphsync/blob/main/network/libp2p_impl.go#L149-L151

We should track only individual spans for requests, at least for now, and they should simply start off ResponseManager.ctx, inside of ResponseManager.processRequests, and only when it's a new request. We do need a better approach eventually to link this up to data transfer spans, but that's a different ticket. I'll submit an issue shortly with how we can deal with this aspect for the time being.
Essentially, this should start with no message span at all: go-graphsync/responsemanager/server.go Line 153 in ef2bac9
and it should move here: go-graphsync/responsemanager/server.go Line 171 in ef2bac9
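As a rough sketch of that suggestion (illustrative only, not the PR's code: the receiver, field, helper, and span names are assumptions, and only the otel/attribute/trace calls are the real APIs), the per-request span would be rooted on the manager's long-lived context rather than on any message span:

// assumed imports: "context", "go.opentelemetry.io/otel",
// "go.opentelemetry.io/otel/attribute", "go.opentelemetry.io/otel/trace",
// gsmsg "github.com/ipfs/go-graphsync/message"

// startResponseSpan is a hypothetical helper called from processRequests,
// only when the incoming request is new (not a cancel or an update).
func (rm *ResponseManager) startResponseSpan(request gsmsg.GraphSyncRequest) (context.Context, trace.Span) {
	// rm.ctx is the manager's long-lived context, so the span outlives the
	// processRequests call and is ended only when the response completes.
	return otel.Tracer("graphsync").Start(rm.ctx, "response",
		trace.WithAttributes(
			attribute.Int("id", int(request.ID())),
			attribute.String("root", request.Root().String()),
		))
}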
BTW, if we do want to do message spans, we should simply use "Links" to connect them to request spans, rather than creating the request spans off of them. I'm not sure whether this is something we should tackle now, and I really haven't messed with OT Links just yet. But according to the docs, this is exactly what you are supposed to use Links for: https://pkg.go.dev/go.opentelemetry.io/otel/trace#Link
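For reference, a minimal sketch of what that could look like (illustrative only: trace.LinkFromContext and trace.WithLinks are the real otel APIs, everything else here is an assumption):

// assumed imports: "context", "go.opentelemetry.io/otel", "go.opentelemetry.io/otel/trace"

// startLinkedRequestSpan roots the request span on the manager context but
// records a Link back to whatever span (if any) is carried by msgCtx,
// instead of making the message span the parent.
func startLinkedRequestSpan(managerCtx, msgCtx context.Context) (context.Context, trace.Span) {
	return otel.Tracer("graphsync").Start(managerCtx, "request",
		trace.WithLinks(trace.LinkFromContext(msgCtx)))
}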
See comments
I'm going to have to abandon this for the evening; sorry I didn't get it done by EOD in time for a release @hannahhoward, but it's really close and the traces are already pretty interesting (see impl/graphsync_test.go for summaries).

The main hang-up now is the variability of traces when observing from the tests. I'm pretty sure it's reasonable variability, getting one or more events queuing depending on the run, and I don't think I can control some of these things, but I need to account for that variability in the tests. I already have one test doing this where I know the variability comes from the network layer triggering multiple error events; I just need to walk through the other ones that are variable to make sure I understand why, so I'm not ignoring dodgy conditions.

I also think I need to be recording pause events on a span somewhere when they happen, and maybe unpause too. The TODOs in the code represent my outstanding tasks.
Hey, various questions. Again, where the request actually ends is a bit complicated, and some of the calls are in the wrong place. See the issue for reference: #288
Force-pushed from 86fc465 to 6a28a13
@hannahhoward I'm slightly stuck on this now. Your response-end work has been helpful, but now I'm running into a lot of variability on the receiver side in some of the tests. Sometimes the spans show up and sometimes they don't. I don't know whether this is because those paths are not executed or because I'm capturing traces before everything's closed down and the traces have ended. I'll keep on trying to debug this to work out why, but if the answers are obvious to you then give me some hints. I'm suspecting network-related variability that I need to accept, but it could also be synchronization issues and perhaps even actual bugs. So I'd like to figure that out to know whether I need to tighten something up, or just give up on those assertions. The concerning tests with TODOs in them are:
In CI now, I can see these failing fairly reliably, not even recording a response at all, so I suppose that's a speed thing. And TestPauseResumeViaUpdateOnBlockHook is in the list too. 🤦 There's also an outstanding TODO in TestGraphsyncRoundTripAlternatePersistenceAndNodes where the recorded error on the request trace can either be a SkipMe or a ContextCancelError, which is a little concerning.
A couple of comments, but this is looking good. I will take a look at the testing issues now.
span.RecordError(err)
span.SetStatus(codes.Error, err.Error())
response.span.RecordError(err)
response.span.SetStatus(codes.Error, err.Error())
I don't know what good span error conventions are, but should we record this twice at two different levels?
Yeah, I don't know and wanted to discuss this with you. This particular one is a bit tricky; others are a little clearer, I think, in that "this error belongs to the response". But here there's a span for something called "abort", and as a user I'd expect to see what that abort was about on that span, yet the error is also responsible for closing the request.
I could go either way on this, and threw up my hands and said why not both? 🌮
all good by me :)
Force-pushed from 218092a to b219119
I think this is in a decent state now for a proper review @hannahhoward when you have time.
A few small nits to consider fixing before merge, but not blocking enough that I want to force a request-changes cycle :)
responsemanager/server.go (Outdated)
attribute.Int("id", int(update.ID())), | ||
attribute.Int("priority", int(update.Priority())), | ||
attribute.String("root", update.Root().String()), | ||
attribute.Bool("isCancel", update.IsCancel()), |
Are these needed? Getting to this code path requires that isCancel is always false and isUpdate is always true... it seems unnecessary to record a value that is always the same.
Also, all of these values EXCEPT the extensions should be contained in the parent span. Basically, an update request ONLY contains extensions to update the request, plus an id to identify the request (plus isUpdate to identify that it's an update).
This is actually a pretty confusing part of the current protobuf format:
new, cancel, and update share a struct type: Request
However, Cancel is basically:
{
id: id,
isCancel: true
}
and update is:
{
id: id,
isUpdate: true,
extensions: extensions
}
All the other fields are only set for new requests.
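To make those three shapes concrete, here's a hedged sketch of how a handler might branch on them. The accessor names (ID, IsCancel, IsUpdate) follow the fields discussed above but may not match the actual message API exactly, and the handler functions are hypothetical:

// assumed import: gsmsg "github.com/ipfs/go-graphsync/message"

// dispatch routes a request based on the three shapes the shared Request
// struct can take: cancel, update, or new.
func dispatch(request gsmsg.GraphSyncRequest) {
	switch {
	case request.IsCancel():
		// cancel: only the id is meaningful
		handleCancel(request.ID())
	case request.IsUpdate():
		// update: id plus extensions
		handleUpdate(request)
	default:
		// new request: all the other fields (priority, root, selector, ...) are set
		handleNew(request)
	}
}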
Only mentioning in case we decide we want to make this clearer, given that we're about to do big protocol changes.
OK, all very good to know. What I've done: removed all but "extensions" from the processUpdate span, and also removed "isCancel" and "isUpdate" from the request span.

It looks like we can assume an "isUpdate" if we get a "processUpdate" span, but I don't know if we can assume an "isCancel" as clearly, since abortRequest() can be called from both CloseWithNetworkError() and CancelResponse() as well as from the plain processRequests() path. I think that an ipldutil.ContextCancelError in the error of the span should differentiate the different paths and let you assume that it's an "isCancel".

But also, maybe this is a task for later, when we get some tracing closer up to the message layer where we can lay out more specifics about the chatter.
Force-pushed from b219119 to eff0a0a
fix(impl): remove accidental legacy field (#310)
feat: add basic tracing for responses (#291)
* feat: add basic tracing for responses
* fix: add assertions for response completion
* fix: improve test synchronization/drainage
* fix(responsemanager): remove uninformative span flags
Co-authored-by: hannahhoward <hannah@hannahhoward.net>
Oh, you already did it! I didn't get to the bottom of this issue before I tried to push some changes.
* feat(channels): add missing cids field
* feat(graphsync): handle missings cids via events, integration test
* style(lint): mod tidy
* feat(channels): put retrievals with missing CIDs in "partially complete"
* style(graphsync): add explanatory comment
* feat(deps): update to v0.12.0 graphsync
* ci(circle): update to go 1.16
* build(go.mod): update to go 1.16
This works so far for the responseManager.ProcessRequests() path, but it's pretty complicated, with two main problems (building on from what I mentioned @ https://github.com/ipfs/go-graphsync/pull/290/files#r760950285 about the branching inside GraphSync#ReceiveMessage() and wanting to have a base span that lives for the entirety of an incoming message there):

1. Even taking the responseManager path in isolation, each "message" can have multiple requests, each of which becomes detached from the others for processing, so making a "message" span live across all of these is going to take some tricky tracking, or a special wrapper that can decrement calls to "End" and call a true "End" when it reaches zero (see the sketch below).
2. The branching into one path to responseManager and one to requestManager, with the latter having a fairly complicated journey, doubles the pain of having a parent "message" span with a discrete end.

Perhaps a parent "message" span isn't what we want, and we should just have a parent "request" span for each of the requests that an individual message can trigger? Which is a bit unfortunate because then we can't have a span that lives for the lifetime of a call through the top-level GraphSync call, but maybe that's OK?

Currently I have a parent "message" span live until all child "request" spans have been started, so there'll be a parent->child relationship, but the parent won't live much longer after the child is started. I bet different trace viewing tools will deal with that in their own unique way, but I reckon it'll still retain the hierarchy.
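A minimal sketch of that "End"-counting wrapper idea, purely illustrative and not part of the PR (the type and function names are made up; only trace.Span is the real otel type):

// assumed imports: "sync/atomic", "go.opentelemetry.io/otel/trace"

// refCountedSpan shares one "message" span between the requests spawned by
// that message; each consumer calls End once, and the wrapped span is only
// ended when the last consumer has finished.
type refCountedSpan struct {
	span trace.Span
	refs int32
}

func newRefCountedSpan(span trace.Span, consumers int32) *refCountedSpan {
	return &refCountedSpan{span: span, refs: consumers}
}

func (s *refCountedSpan) End() {
	if atomic.AddInt32(&s.refs, -1) == 0 {
		s.span.End()
	}
}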