
feat(requestmanager): add tracing for response messages & block processing #322

Merged: 6 commits merged into main from rvagg/tracing on Dec 22, 2021

Conversation

rvagg (Member) commented Dec 17, 2021

Trace synchronous responseMessage->loaderProcess->cacheProcess block storage and link those to asynchronous request->verifyBlock traces for the same blocks.

Closes: #317
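
As a rough illustration of the linking mechanism this refers to (illustrative call sites only, not the exact go-graphsync code), a verifyBlock span in the asynchronous request trace can carry an OpenTelemetry link back to the cacheProcess span that stored the same block:

package main

import (
	"context"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/trace"
)

func main() {
	ctx := context.Background()

	// Synchronous response side: store the block under a cacheProcess span
	// and remember that span's context.
	_, cacheSpan := otel.Tracer("graphsync").Start(ctx, "cacheProcess")
	storedBlockSpanCtx := cacheSpan.SpanContext()
	cacheSpan.End()

	// Asynchronous request side: verifyBlock runs in a different trace, so
	// link it back to the cacheProcess span for the same block.
	_, verifySpan := otel.Tracer("graphsync").Start(ctx, "verifyBlock",
		trace.WithLinks(trace.Link{SpanContext: storedBlockSpanCtx}))
	verifySpan.End()
}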

processUpdateSpan := tracing.FindSpanByTraceString("response(0)")
require.Equal(t, int64(0), testutil.AttributeValueInTraceSpan(t, *processUpdateSpan, "priority").AsInt64())
require.Equal(t, []string{string(td.extensionName)}, testutil.AttributeValueInTraceSpan(t, *processUpdateSpan, "extensions").AsStringSlice())

// each verifyBlock span should link to a cacheProcess span that stored it
rvagg (Member, Author) commented:

this section verifies that the links properly exist between verifyBlock and cacheProcess spans
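
A minimal check along those lines, written against the otel tracetest span stubs directly (the helper name is mine, not the project's testutil API):

import (
	"testing"

	"go.opentelemetry.io/otel/sdk/trace/tracetest"
)

// assertLinkedToCacheProcess fails the test unless the verifyBlock span
// carries a link whose SpanContext matches the given cacheProcess span.
func assertLinkedToCacheProcess(t *testing.T, verifyBlock, cacheProcess tracetest.SpanStub) {
	for _, link := range verifyBlock.Links {
		if link.SpanContext.SpanID() == cacheProcess.SpanContext.SpanID() {
			return
		}
	}
	t.Fatalf("verifyBlock span has no link to cacheProcess span %s", cacheProcess.SpanContext.SpanID())
}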

rvagg (Member, Author) commented Dec 17, 2021

I've introduced a flake here that I don't yet understand: occasionally I get a trace that's just responseMessage(x)->loaderProcess(0), whereas it should be responseMessage(x)->loaderProcess(0)->cacheProcess(0). I don't know why no cacheProcess gets captured. It would suggest that either (a) it doesn't get called or (b) the cacheProcess span isn't ended by the time the test runs.

For (a), the only way I can see it not being called is if the message contains zero responses. I think I need to record the RequestIDs on the loaderProcess span and find out.

	cids = append(cids, blk.Cid().String())
}
ctx, span := otel.Tracer("graphsync").Start(ctx, "cacheProcess", trace.WithAttributes(
	attribute.StringSlice("blocks", cids),
))
rvagg (Member, Author) commented:
I'm not sure this is the best idea: it could be a long list of CIDs, and that may not be helpful for tracing (maybe for logging, but maybe not for tracing). I could switch this to a "blockCount".

hannahhoward (Collaborator) commented:
Yeah, sometimes the CID lists can get long. We should just do block counts.
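
A sketch of that change, reusing the otel attribute API from the snippet above (blks stands in for whatever block slice is in scope at that point):

import (
	"context"

	blocks "github.com/ipfs/go-block-format"
	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/attribute"
	"go.opentelemetry.io/otel/trace"
)

// startCacheProcessSpan records only a count of the blocks being stored,
// rather than the full list of CIDs.
func startCacheProcessSpan(ctx context.Context, blks []blocks.Block) (context.Context, trace.Span) {
	return otel.Tracer("graphsync").Start(ctx, "cacheProcess", trace.WithAttributes(
		attribute.Int("blockCount", len(blks)),
	))
}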

hannahhoward (Collaborator) commented:
There is definitely a scenario where loaderProcess gets called without cacheProcess, at least in the current code. If the final response status comes in on its own, this could definitely happen. Also, I think we may be sending blank metadata extensions as well -- we shouldn't, but we do, so maybe that's something to fix.

hannahhoward (Collaborator) commented:
But it's weird that it would happen on the first block? Still possible, I think, maybe?

hannahhoward (Collaborator) commented Dec 18, 2021

ah, I see: it is the last responseMessage, and when I look at the failure in CI, I can see what's happening.

Essentially:
responseMessage(last-1) contains the final blocks
responseMessage(last) simply contains the success code

Because responseMessage(last-1) contains all the data needed for the request to complete, the request side completely finishes processing the request BEFORE it gets the final success code from the responder.

That ultimately results in the last message getting filtered out in ProcessResponses in the request manager.

It's not really an error per se?

I do think it's perhaps a bit weird on the response side to send the last block and the final code in separate messages on the wire? Though the protocol doesn't specify that as a behavioral requirement. It'd be a bit tricky to fix, though I suspect you could -- you'd basically have to make sure they all happen in the same transaction. As it is, the current code doesn't do that.
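
To make the filtering concrete, here is a hypothetical simplification (made-up local types, not go-graphsync's actual ProcessResponses code): once a request has finished locally, later response messages for that request ID are dropped before any loaderProcess/cacheProcess work happens, so no cacheProcess span is ever started for them.

type responseMsg struct {
	requestID  int32
	hasBlocks  bool
	statusCode int
}

// filterResponses drops responses whose request is no longer in progress.
func filterResponses(inProgress map[int32]bool, responses []responseMsg) []responseMsg {
	kept := responses[:0]
	for _, res := range responses {
		if inProgress[res.requestID] {
			kept = append(kept, res)
		}
	}
	return kept
}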

hannahhoward (Collaborator) commented:
I'm gonna merge this, probably by just dropping RepeatTraceStrings to check responseCount-1, and then let's discuss the proper fix for this in the long term, as I think we've ultimately revealed a sort of pseudo-bug/undefined behavior that I'd like to talk through solutions with you on.

rvagg and others added 4 commits December 18, 2021 16:48

feat(requestmanager): add tracing for response messages & block processing

Trace synchronous responseMessage->loaderProcess->cacheProcess block storage
and link those to asynchronous request->verifyBlock traces for the same blocks.

Closes: #317

remove cid list from span and replace with simple block count
rvagg (Member, Author) commented Dec 18, 2021

uh oh, the first one's missing the cacheProcess in the failure @ https://github.com/ipfs/go-graphsync/runs/4567123258

"responseMessage(0)->loaderProcess(0)",
"responseMessage(1)->loaderProcess(0)->cacheProcess(0)",
"responseMessage(2)->loaderProcess(0)->cacheProcess(0)",
"response(0)->executeTask(0)",
"responseMessage(3)->loaderProcess(0)->cacheProcess(0)",

That's a bit weird.

I took a guess that we're getting the spans in a non-sequential order (the numbering in the strings is something we're generating based on the slice we're provided by otel), so I've applied sorting based on start-time to the spans as they're collected.
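
The sorting is roughly this (a sketch against the otel tracetest types; the helper name is mine):

import (
	"sort"

	"go.opentelemetry.io/otel/sdk/trace/tracetest"
)

// sortSpansByStartTime orders the collected spans by start time so the
// generated trace strings get stable, sequential numbering.
func sortSpansByStartTime(spans tracetest.SpanStubs) {
	sort.Slice(spans, func(i, j int) bool {
		return spans[i].StartTime.Before(spans[j].StartTime)
	})
}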

But sadly, the same failure: https://github.com/ipfs/go-graphsync/runs/4568079119?check_suite_focus=true

Is there something about TestGraphsyncRoundTripRequestBudgetResponder that might be causing the first message to not contain any blocks? This is flaky, so it's not consistently behaving that way (I can't repro locally).

hannahhoward (Collaborator) commented:
@rvagg well, so this one sucked. Essentially we're the victims of Go's concurrent test running and OpenTelemetry's use of globals.

I was able to replicate the failure (though I needed 1000x runs to do it). I did some printlns and discovered the problem trace comes from the previous test run -- the peer on the incoming response is different for the problematic trace than for all the others (including on success).

The reason, I believe, is that Go runs tests in a goroutine, and while context cancellation shuts down most of the machinery of a test, it's also possible for the previous test's infrastructure to not be completely shut down when SetupTracing in the next test gets called. This can cause the TracingProvider global to get overridden, meaning that it can collect 1-2 traces from the last test.

My solution for the time being is to make calling the "CollectTracing" function that produces the collector also cancel the context, and therefore all the machinery for GraphSync. I'm not sure if this is the best solution -- it's A solution. There's a whole other issue about GraphSync shutdown behavior and making it actually imperative and synchronous (#220).
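
A sketch of the shape of that fix, with hypothetical signatures (SetupTracing/CollectTracing here stand in for the test-harness helpers being described): each test installs a fresh global TracerProvider backed by an in-memory exporter, and collecting the spans cancels the test context first so leftover machinery stops emitting into the new global.

import (
	"context"

	"go.opentelemetry.io/otel"
	sdktrace "go.opentelemetry.io/otel/sdk/trace"
	"go.opentelemetry.io/otel/sdk/trace/tracetest"
)

// setupTracing installs a fresh global TracerProvider for a test. Because the
// provider is process-wide, machinery left over from a previous test can still
// record a span or two into it.
func setupTracing() *tracetest.InMemoryExporter {
	exporter := tracetest.NewInMemoryExporter()
	otel.SetTracerProvider(sdktrace.NewTracerProvider(sdktrace.WithSyncer(exporter)))
	return exporter
}

// collectTracing cancels the test context first, shutting down the GraphSync
// machinery, and only then snapshots the recorded spans.
func collectTracing(cancel context.CancelFunc, exporter *tracetest.InMemoryExporter) tracetest.SpanStubs {
	cancel()
	return exporter.GetSpans()
}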

hannahhoward merged commit f49a26c into main on Dec 22, 2021
rvagg deleted the rvagg/tracing branch on January 6, 2022
Successfully merging this pull request may close these issues.

Tracing: looking at how incoming messages move through the system