feat(requestmanager): add tracing for response messages & block processing #322
Conversation
processUpdateSpan := tracing.FindSpanByTraceString("response(0)")
require.Equal(t, int64(0), testutil.AttributeValueInTraceSpan(t, *processUpdateSpan, "priority").AsInt64())
require.Equal(t, []string{string(td.extensionName)}, testutil.AttributeValueInTraceSpan(t, *processUpdateSpan, "extensions").AsStringSlice())

// each verifyBlock span should link to a cacheProcess span that stored it
This section verifies that the links properly exist between verifyBlock and cacheProcess spans.
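A minimal sketch of what that link check could look like, assuming the spans are collected as `tracetest.SpanStub` values from OpenTelemetry's in-memory exporter; `findSpan` and `requireLinked` are illustrative helpers, not the repo's actual test utilities:

```go
package testutil

import (
	"testing"

	"github.com/stretchr/testify/require"
	"go.opentelemetry.io/otel/sdk/trace/tracetest"
)

// findSpan is a hypothetical helper that locates a collected span by name.
func findSpan(stubs tracetest.SpanStubs, name string) *tracetest.SpanStub {
	for i := range stubs {
		if stubs[i].Name == name {
			return &stubs[i]
		}
	}
	return nil
}

// requireLinked asserts that a verifyBlock span carries a link whose
// SpanContext points at the cacheProcess span that stored the block.
func requireLinked(t *testing.T, verifyBlock, cacheProcess *tracetest.SpanStub) {
	t.Helper()
	for _, link := range verifyBlock.Links {
		if link.SpanContext.SpanID() == cacheProcess.SpanContext.SpanID() {
			return
		}
	}
	require.Fail(t, "verifyBlock span is not linked to its cacheProcess span")
}
```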
I've introduced a flake here that I don't yet understand; occasionally I get a trace that's missing the expected spans. For (a), it could be that the message contains zero blocks.
cids = append(cids, blk.Cid().String())
}
ctx, span := otel.Tracer("graphsync").Start(ctx, "cacheProcess", trace.WithAttributes(
attribute.StringSlice("blocks", cids),
I'm not sure this is the best idea: it could be a long list of CIDs, and that may not be helpful for tracing (maybe for logging, but maybe not for tracing). I could switch this to a "blockCount".
Yeah, sometimes the CID lists can get long. We should just do block counts.
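For reference, a sketch of the count-only variant against the OpenTelemetry API; the package, function wrapper, and the `blockCount` attribute name are illustrative (the diff above only shows the `attribute.StringSlice` version):

```go
package example

import (
	"context"

	blocks "github.com/ipfs/go-block-format"
	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/attribute"
	"go.opentelemetry.io/otel/trace"
)

// startCacheProcessSpan records only how many blocks this cacheProcess call
// is storing, rather than attaching the full (potentially long) CID list.
func startCacheProcessSpan(ctx context.Context, blks []blocks.Block) (context.Context, trace.Span) {
	return otel.Tracer("graphsync").Start(ctx, "cacheProcess", trace.WithAttributes(
		attribute.Int("blockCount", len(blks)),
	))
}
```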
There is definitely a scenario where loaderProcess gets called without cacheProcess, at least in the current code. If the final response status comes in on its own, this could definitely happen. Also, I think we may be sending blank metadata extensions as well -- we shouldn't, but we do, so maybe that's something to fix.
But it's weird that it would happen on the first block? It's still possible, I think, maybe?
Ah, I see it is the last responseMessage, and when I look at the failure in CI, I can see what's happening. Essentially, because responseMessage(last-1) contains all the data for the request to complete, the request side completely finishes processing the request BEFORE it gets the final success code from the responder. That ultimately results in the last message getting filtered out in ProcessResponses in the request manager. It's not really an error per se. I do think it's perhaps a bit weird on the response side to send the last block and the final code in separate messages on the wire, though the protocol doesn't specify that as a behavioral requirement. It'd be a bit tricky to fix, though I suspect you could -- you basically have to make sure they all happen in the same transaction. As it is, the current code doesn't do that.
I'm gonna merge this, probably by just dropping RepeatTraceStrings to check responseCount-1, and then let's discuss the proper fix for this in the long term, as I think we've ultimately revealed a sort of pseudo-bug/undefined behavior that I'd like to talk through solutions with you on.
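A rough sketch of the interim workaround, assuming `RepeatTraceStrings` simply expands a template into n numbered trace strings (the real helper's signature may differ):

```go
package testutil

import "fmt"

// RepeatTraceStrings (hypothetical reconstruction) expands a template such as
// "response(%d)" into n numbered trace strings: response(0), response(1), ...
func RepeatTraceStrings(tmpl string, n int) []string {
	out := make([]string, 0, n)
	for i := 0; i < n; i++ {
		out = append(out, fmt.Sprintf(tmpl, i))
	}
	return out
}
```

The workaround would then assert against `RepeatTraceStrings("response(%d)", responseCount-1)` instead of the full `responseCount`, tolerating the final status-only message being filtered out.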
Force-pushed from 3a30e8d to 8d87afc:

feat(requestmanager): add tracing for response messages & block processing -- Trace synchronous responseMessage->loaderProcess->cacheProcess block storage and link those to asynchronous request->verifyBlock traces for the same blocks. Closes: #317
remove cid list from span and replace with simple block count
uh oh, first one's missing the
That's a bit weird. I took a guess that we're getting the spans in a non-sequential order (the numbering in the strings is something we're generating based on the slice we're provided by otel), so I've applied sorting based on start time to the spans as they're collected. But sadly, the same failure: https://github.com/ipfs/go-graphsync/runs/4568079119?check_suite_focus=true

Is there something about TestGraphsyncRoundTripRequestBudgetResponder that might be causing the first message to not contain any blocks? This is flaky, so it's not consistently behaving that way (I can't repro locally).
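The start-time sorting described above could look something like this, assuming the collected spans are `tracetest.SpanStub` values (a sketch, not the actual collector code):

```go
package testutil

import (
	"sort"

	"go.opentelemetry.io/otel/sdk/trace/tracetest"
)

// sortSpansByStartTime orders collected spans chronologically so that the
// generated "name(N)" numbering is stable regardless of the order in which
// the exporter returned them.
func sortSpansByStartTime(stubs tracetest.SpanStubs) {
	sort.Slice(stubs, func(i, j int) bool {
		return stubs[i].StartTime.Before(stubs[j].StartTime)
	})
}
```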
@rvagg well, so this one sucked. Essentially we're the victim of Go's concurrent test running and OpenTelemetry's use of globals. I was able to replicate the failure (though I needed 1000x runs to do it). I did some printlns and discovered the problem trace comes from the previous test run -- the peer on the incoming response is different for the problematic trace than all the others (including on success). The reason, I believe, is that Go runs tests in a goroutine, and while context cancellation shuts down most of the machinery of a test, it's also possible for the previous test's infrastructure to not be completely shut down when SetupTracing in the next test gets called. This can cause the global TracerProvider to get overridden, meaning that it can collect 1-2 traces from the last test. My solution for the time being is to make calling the "CollectTracing" function that produces the collector also cancel the context and therefore all the machinery for GraphSync. I'm not sure if this is the best solution -- it's A solution. There's a whole other issue about GraphSync shutdown behavior and making it actually imperative and synchronous (#220).
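A sketch of the shape of that workaround, with hypothetical names (`Collector`, `Collect`): having the collection call cancel the test context and flush/shut down the provider means the previous test's GraphSync machinery is torn down before the next test registers a new global tracer provider:

```go
package testutil

import (
	"context"

	sdktrace "go.opentelemetry.io/otel/sdk/trace"
	"go.opentelemetry.io/otel/sdk/trace/tracetest"
)

// Collector is a hypothetical wrapper around the in-memory span exporter
// installed by the test's tracing setup.
type Collector struct {
	cancel   context.CancelFunc // cancels the test's GraphSync context
	exporter *tracetest.InMemoryExporter
	provider *sdktrace.TracerProvider
}

// Collect cancels the test context (tearing down GraphSync), then flushes and
// shuts down the tracer provider before returning the spans, so a lingering
// test cannot write into the next test's globally registered provider.
func (c *Collector) Collect(ctx context.Context) tracetest.SpanStubs {
	c.cancel()
	_ = c.provider.ForceFlush(ctx)
	_ = c.provider.Shutdown(ctx)
	return c.exporter.GetSpans()
}
```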
Trace synchronous responseMessage->loaderProcess->cacheProcess block storage and link those to asynchronous request->verifyBlock traces for the same blocks.
Closes: #317
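As background on how the two traces can be tied together, a minimal sketch of linking the asynchronous verifyBlock span to the cacheProcess span's context using OpenTelemetry span links (function and parameter names are illustrative):

```go
package example

import (
	"context"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/trace"
)

// startVerifyBlockSpan starts the asynchronous verifyBlock span and links it
// to the cacheProcess span (captured in the responseMessage trace) that
// stored the same block, so the two traces can be correlated per block.
func startVerifyBlockSpan(ctx context.Context, cacheProcessCtx trace.SpanContext) (context.Context, trace.Span) {
	return otel.Tracer("graphsync").Start(ctx, "verifyBlock",
		trace.WithLinks(trace.Link{SpanContext: cacheProcessCtx}),
	)
}
```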