
feat(requestmanager): add tracing for response messages & block processing #322

Merged: 6 commits merged into main from rvagg/tracing on Dec 22, 2021

Conversation

rvagg (Member) commented Dec 17, 2021

Trace synchronous responseMessage->loaderProcess->cacheProcess block storage and link those to asynchronous request->verifyBlock traces for the same blocks.

Closes: #317
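
As a rough illustration of the linking mechanism this refers to (illustrative call sites only, not the exact go-graphsync code), a verifyBlock span in the asynchronous request trace can carry an OpenTelemetry link back to the cacheProcess span that stored the same block:

package main

import (
	"context"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/trace"
)

func main() {
	ctx := context.Background()

	// Synchronous response side: store the block under a cacheProcess span
	// and remember that span's context.
	_, cacheSpan := otel.Tracer("graphsync").Start(ctx, "cacheProcess")
	storedBlockSpanCtx := cacheSpan.SpanContext()
	cacheSpan.End()

	// Asynchronous request side: verifyBlock runs in a different trace, so
	// link it back to the cacheProcess span for the same block.
	_, verifySpan := otel.Tracer("graphsync").Start(ctx, "verifyBlock",
		trace.WithLinks(trace.Link{SpanContext: storedBlockSpanCtx}))
	verifySpan.End()
}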

processUpdateSpan := tracing.FindSpanByTraceString("response(0)")
require.Equal(t, int64(0), testutil.AttributeValueInTraceSpan(t, *processUpdateSpan, "priority").AsInt64())
require.Equal(t, []string{string(td.extensionName)}, testutil.AttributeValueInTraceSpan(t, *processUpdateSpan, "extensions").AsStringSlice())

// each verifyBlock span should link to a cacheProcess span that stored it
rvagg (Member, Author) commented:

this section verifies that the links properly exist between verifyBlock and cacheProcess spans
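
A minimal check along those lines, written against the otel tracetest span stubs directly (the helper name is mine, not the project's testutil API):

import (
	"testing"

	"go.opentelemetry.io/otel/sdk/trace/tracetest"
)

// assertLinkedToCacheProcess fails the test unless the verifyBlock span
// carries a link whose SpanContext matches the given cacheProcess span.
func assertLinkedToCacheProcess(t *testing.T, verifyBlock, cacheProcess tracetest.SpanStub) {
	for _, link := range verifyBlock.Links {
		if link.SpanContext.SpanID() == cacheProcess.SpanContext.SpanID() {
			return
		}
	}
	t.Fatalf("verifyBlock span has no link to cacheProcess span %s", cacheProcess.SpanContext.SpanID())
}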

rvagg (Member, Author) commented Dec 17, 2021

I've introduced a flake here that I don't yet understand: occasionally I get a trace that's just responseMessage(x)->loaderProcess(0), whereas it should be responseMessage(x)->loaderProcess(0)->cacheProcess(0). I don't know why no cacheProcess gets captured. It would suggest that either (a) it doesn't get called or (b) the cacheProcess span isn't ended by the time the test runs.

For (a), the only way I can see it not being called is if the message contains zero responses. I think I need to record the RequestIDs on the loaderProcess span and find out.

	cids = append(cids, blk.Cid().String())
}
ctx, span := otel.Tracer("graphsync").Start(ctx, "cacheProcess", trace.WithAttributes(
	attribute.StringSlice("blocks", cids),
))
rvagg (Member, Author) commented:
I'm not sure this is the best idea: it could be a long list of CIDs, and that may not be helpful for tracing (maybe for logging, but maybe not for tracing). I could switch this to a "blockCount".

hannahhoward (Collaborator) commented:
Yeah, sometimes the CID lists can get long. We should just do block counts.
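
A sketch of that change, reusing the otel attribute API from the snippet above (blks stands in for whatever block slice is in scope at that point):

import (
	"context"

	blocks "github.com/ipfs/go-block-format"
	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/attribute"
	"go.opentelemetry.io/otel/trace"
)

// startCacheProcessSpan records only a count of the blocks being stored,
// rather than the full list of CIDs.
func startCacheProcessSpan(ctx context.Context, blks []blocks.Block) (context.Context, trace.Span) {
	return otel.Tracer("graphsync").Start(ctx, "cacheProcess", trace.WithAttributes(
		attribute.Int("blockCount", len(blks)),
	))
}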

hannahhoward (Collaborator) commented:
There is definitely a scenario where loaderProcess gets called without cacheProcess, at least in the current code. If the final response status comes in on its own, this could definitely happen. Also, I think we may be sending blank metadata extensions as well -- we shouldn't, but we do, so maybe that's something to fix.

hannahhoward (Collaborator) commented:
But it's weird that it would happen on the first block? Still possible, I think, maybe?

hannahhoward (Collaborator) commented Dec 18, 2021

ah, I see: it is the last responseMessage, and when I look at the failure in CI, I can see what's happening.

Essentially:
responseMessage(last-1) contains the final blocks
responseMessage(last) simply contains the success code

Because responseMessage(last-1) contains all the data needed for the request to complete, the request side completely finishes processing the request BEFORE it gets the final success code from the responder.

That ultimately results in the last message getting filtered out in ProcessResponses in the request manager.

It's not really an error per se?

I do think it's perhaps a bit weird on the response side to send the last block and the final code in separate messages on the wire? Though the protocol doesn't specify that as a behavioral requirement. It'd be a bit tricky to fix, though I suspect you could -- you'd basically have to make sure they all happen in the same transaction. As it is, the current code doesn't do that.
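
To make the filtering concrete, here is a hypothetical simplification (made-up local types, not go-graphsync's actual ProcessResponses code): once a request has finished locally, later response messages for that request ID are dropped before any loaderProcess/cacheProcess work happens, so no cacheProcess span is ever started for them.

type responseMsg struct {
	requestID  int32
	hasBlocks  bool
	statusCode int
}

// filterResponses drops responses whose request is no longer in progress.
func filterResponses(inProgress map[int32]bool, responses []responseMsg) []responseMsg {
	kept := responses[:0]
	for _, res := range responses {
		if inProgress[res.requestID] {
			kept = append(kept, res)
		}
	}
	return kept
}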

hannahhoward (Collaborator) commented:
I'm gonna merge this, probably by just dropping RepeatTraceStrings to check responseCount-1, and then let's discuss the proper fix for this in the long term, as I think we've ultimately revealed a sort of pseudo-bug/undefined behavior that I'd like to talk through solutions with you on.

rvagg and others added 4 commits December 18, 2021 16:48

feat(requestmanager): add tracing for response messages & block processing

Trace synchronous responseMessage->loaderProcess->cacheProcess block storage
and link those to asynchronous request->verifyBlock traces for the same blocks.

Closes: #317

remove cid list from span and replace with simple block count
rvagg (Member, Author) commented Dec 18, 2021

uh oh, the first one's missing the cacheProcess in the failure @ https://github.com/ipfs/go-graphsync/runs/4567123258

"responseMessage(0)->loaderProcess(0)",
"responseMessage(1)->loaderProcess(0)->cacheProcess(0)",
"responseMessage(2)->loaderProcess(0)->cacheProcess(0)",
"response(0)->executeTask(0)",
"responseMessage(3)->loaderProcess(0)->cacheProcess(0)",

That's a bit weird.

I took a guess that we're getting the spans in a non-sequential order (the numbering in the strings is something we're generating based on the slice we're provided by otel), so I've applied sorting based on start-time to the spans as they're collected.
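
The sorting is roughly this (a sketch against the otel tracetest types; the helper name is mine):

import (
	"sort"

	"go.opentelemetry.io/otel/sdk/trace/tracetest"
)

// sortSpansByStartTime orders the collected spans by start time so the
// generated trace strings get stable, sequential numbering.
func sortSpansByStartTime(spans tracetest.SpanStubs) {
	sort.Slice(spans, func(i, j int) bool {
		return spans[i].StartTime.Before(spans[j].StartTime)
	})
}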

But sadly, the same failure: https://github.com/ipfs/go-graphsync/runs/4568079119?check_suite_focus=true

Is there something about TestGraphsyncRoundTripRequestBudgetResponder that might be causing the first message to not contain any blocks? This is flaky, so it's not consistently behaving that way (I can't repro locally).

hannahhoward (Collaborator) commented:
@rvagg well, so this one sucked. Essentially we're the victims of Go's concurrent test running and OpenTelemetry's use of globals.

I was able to replicate the failure (though I needed 1000x runs to do it). I did some printlns and discovered the problem trace comes from the previous test run -- the peer on the incoming response is different for the problematic trace than for all the others (including on success).

The reason, I believe, is that Go runs tests in a goroutine, and while context cancellation shuts down most of the machinery of a test, it's also possible for the previous test's infrastructure to not be completely shut down when SetupTracing in the next test gets called. This can cause the TracingProvider global to get overridden, meaning that it can collect 1-2 traces from the last test.

My solution for the time being is to make calling the "CollectTracing" function that produces the collector also cancel the context, and therefore all the machinery for GraphSync. I'm not sure if this is the best solution -- it's A solution. There's a whole other issue about GraphSync shutdown behavior and making it actually imperative and synchronous (#220).
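
A sketch of the shape of that fix, with hypothetical signatures (SetupTracing/CollectTracing here stand in for the test-harness helpers being described): each test installs a fresh global TracerProvider backed by an in-memory exporter, and collecting the spans cancels the test context first so leftover machinery stops emitting into the new global.

import (
	"context"

	"go.opentelemetry.io/otel"
	sdktrace "go.opentelemetry.io/otel/sdk/trace"
	"go.opentelemetry.io/otel/sdk/trace/tracetest"
)

// setupTracing installs a fresh global TracerProvider for a test. Because the
// provider is process-wide, machinery left over from a previous test can still
// record a span or two into it.
func setupTracing() *tracetest.InMemoryExporter {
	exporter := tracetest.NewInMemoryExporter()
	otel.SetTracerProvider(sdktrace.NewTracerProvider(sdktrace.WithSyncer(exporter)))
	return exporter
}

// collectTracing cancels the test context first, shutting down the GraphSync
// machinery, and only then snapshots the recorded spans.
func collectTracing(cancel context.CancelFunc, exporter *tracetest.InMemoryExporter) tracetest.SpanStubs {
	cancel()
	return exporter.GetSpans()
}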

hannahhoward merged commit f49a26c into main on Dec 22, 2021
rvagg deleted the rvagg/tracing branch on January 6, 2022
Successfully merging this pull request may close these issues.

Tracing: looking at how incoming messages move through the system