
backupccl: add some span recordings to the backup resumer #65725

Merged 1 commit into cockroachdb:master on Jun 29, 2021

Conversation

adityamaru
Contributor

@adityamaru adityamaru commented May 26, 2021

This change adds some initial span recordings to the backup
resumer that might be interesting when inspecting an executing
backup. The processor and KV level requests will be dealt
with in subsequent commits.

Informs: #64992

Release note: None

@cockroach-teamcity
Member

This change is Reviewable

@adityamaru
Contributor Author

Just wanted to get some eyes on this general approach before I went any further. Primarily, I'm wondering whether there are better ideas on how we would like to organize the trace recording protobufs. I went with splitting backup-specific protos from common ones that restore and import can use as well.

Also welcoming any high-level pointers on whether we want to be more structured about where we add recordings. At the moment I've just used my own discretion to identify interesting phases in the resumer.

@miretskiy
Contributor

Are we trying to be too structured? For example:

resumerSpan.RecordStructured(&jobspb.BulkJobTraceEvent{Message: "writing backup manifest"})

We already have a way of tracing string values, don't we? Just thinking out loud: what are we getting for having these strongly typed structured traces? Are we planning on doing some sort of analysis at some point (like how many traces have x.y > z)? Or is it more of a log message type (just strings, possibly formatted)?

@adityamaru
Contributor Author

> We already have a way of tracing string values, don't we? Just thinking out loud: what are we getting for having these strongly typed structured traces?

We do, but if we only rely on structured recordings then we don't need the span to be verbose. Only verbose spans accumulate free-form logs, and knz had recommended sticking to lightweight structured protos to avoid perf impacts.

re: whether we should have these kinds of recordings at all, I believe they would serve as visual cues when pulling a job's trace, to help narrow down the "phase" in which a job is stuck?
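For readers skimming the thread, here is a minimal sketch of the distinction being made, assuming the pkg/util/tracing API of this era (Record for free-form logs that only verbose spans keep, RecordStructured for proto payloads) and the jobspb.BulkJobTraceEvent message quoted above, which may not have survived to the merged version:

import (
	"context"

	"github.com/cockroachdb/cockroach/pkg/jobs/jobspb"
	"github.com/cockroachdb/cockroach/pkg/util/tracing"
)

func recordManifestPhase(ctx context.Context) {
	sp := tracing.SpanFromContext(ctx)
	if sp == nil {
		return
	}

	// Free-form messages are only accumulated when the span is verbose, and
	// making the span verbose adds overhead to everything traced under it.
	sp.Record("writing backup manifest")

	// Structured payloads are retained even on non-verbose spans, which is why
	// the resumer sticks to lightweight protos.
	sp.RecordStructured(&jobspb.BulkJobTraceEvent{Message: "writing backup manifest"})
}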

@miretskiy
Contributor

I meant types.StringValue -- it's a proto, and it has a "Value" in it. What do we get when using bulk-specific proto messages?

@adityamaru
Contributor Author

> I meant types.StringValue -- it's a proto, and it has a "Value" in it. What do we get when using bulk-specific proto messages?

Oh yes, that I'm 100% onboard with! I made it its own proto for the time being because I figured we might need more bulk-specific fields in it. Will change it to types.StringValue if that is not the case.
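Concretely, the suggestion amounts to something like this sketch (the helper name is made up; types is the gogo/protobuf wrappers package):

import (
	"github.com/cockroachdb/cockroach/pkg/util/tracing"
	"github.com/gogo/protobuf/types"
)

// Reuse the generic string wrapper instead of a bulk-specific proto: it is
// still a structured payload, so it is kept on non-verbose spans; it just
// carries no backup-specific fields.
func recordString(sp *tracing.Span, msg string) {
	if sp == nil {
		return
	}
	sp.RecordStructured(&types.StringValue{Value: msg})
}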

message Spans {
  repeated roachpb.Span spans = 1 [(gogoproto.nullable) = false];
}
map<int32, Spans> node_to_spans = 1 [(gogoproto.nullable) = false];
Contributor

One thing I'm curious about is how large we can/want these protos to be? Something like node_to_spans would grow O(ranges). Is this okay?

Contributor Author

Yeah, this crossed my mind while I was adding this proto. I don't have a very good intuition about what serializing/deserializing an O(ranges) proto costs us. I believe we will need to serialize/deserialize it if we pull the trace from a node other than the coordinator node, as well as when we use the existing payloads_for_span generator:

payload, err := protoreflect.MessageToJSON(item, true /* emitDefaults */)

I thought it would be valuable to know which span is being executed where, but maybe we can get away with adding the span to the trace at the ExportRequest level, when the node actually serves the export request?
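To illustrate the alternative being floated, a rough sketch with hypothetical names: each node records only the span it is actually exporting, at the point it serves the ExportRequest, so every individual recording stays O(1) instead of one O(ranges) proto living on the coordinator.

import (
	"context"
	"fmt"

	"github.com/cockroachdb/cockroach/pkg/roachpb"
	"github.com/cockroachdb/cockroach/pkg/util/tracing"
	"github.com/gogo/protobuf/types"
)

// recordServedSpan is a hypothetical hook in the ExportRequest serving path.
// The per-node picture would then be reassembled from the trace recordings
// rather than carried in a single node_to_spans map.
func recordServedSpan(ctx context.Context, exportSpan roachpb.Span) {
	if sp := tracing.SpanFromContext(ctx); sp != nil {
		sp.RecordStructured(&types.StringValue{
			Value: fmt.Sprintf("serving ExportRequest for span %s", exportSpan),
		})
	}
}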

@adityamaru
Contributor Author

adityamaru commented Jun 3, 2021

The first 4 commits are from the refactor in #65576, so the last two commits are where the changes have been made. I'm considering adding a separate PR with trace recordings for the external storage implementations.

@adityamaru adityamaru requested a review from pbardea June 3, 2021 00:02
@adityamaru adityamaru marked this pull request as ready for review June 3, 2021 00:40
@adityamaru adityamaru requested review from a team and removed request for a team June 3, 2021 00:40
  string retryable_error = 5;
}
message Response {
  util.hlc.Timestamp resp_received_time = 1 [(gogoproto.nullable) = false];
Contributor Author

maybe duration would be more useful here?

Contributor Author

changed to duration since that seems more useful than a timestamp.
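For the record, the switch amounts to measuring elapsed time around the request rather than stamping when the response arrived; a tiny sketch with made-up names, using the util/timeutil helpers:

import (
	"context"
	"time"

	"github.com/cockroachdb/cockroach/pkg/util/timeutil"
)

// timeRequest measures how long fn takes. A duration is meaningful on its own
// in a trace recording, whereas a response-received timestamp is only useful
// next to the corresponding send time.
func timeRequest(ctx context.Context, fn func(context.Context) error) (time.Duration, error) {
	begin := timeutil.Now()
	err := fn(ctx)
	return timeutil.Since(begin), err
}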

@pbardea pbardea left a comment

The events LGTM; same comments from the other tracing PR also apply to this (re: redaction and possible truncation of the error messages)

@@ -170,3 +170,15 @@ message RestoreProgress {
  int64 progressIdx = 2;
  roachpb.Span dataSpan = 3 [(gogoproto.nullable) = false];
}

message BackupPlanningTraceEvent {
Contributor

nit: perhaps BackupProcessorPlanningTraceEvent?

Contributor Author

done.

@adityamaru adityamaru force-pushed the add-backup-tracing branch 2 times, most recently from aa94417 to 0620495 on June 28, 2021 19:10
@adityamaru
Contributor Author

> The events LGTM; same comments from the other tracing PR also apply to this (re: redaction and possible truncation of the error messages)

Yup, moved the helper to a utils file in tracing, and passed errors through that.

@adityamaru adityamaru requested a review from pbardea June 28, 2021 21:59
@adityamaru
Contributor Author

(not sure if you wanted to have another look at this)

@pbardea pbardea left a comment

LGTM with naming nit


// RedactAndTruncateErrorForTracing redacts the error and truncates the string
// representation of the error to a fixed length.
func RedactAndTruncateErrorForTracing(err error) string {
Contributor

nit: Could we just call this RedactAndTruncateError since it's already in the tracing package (so callers will already see tracing.RedactAndTruncateError)?

Contributor Author

done!
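For context, a minimal sketch of what such a helper can look like, assuming the cockroachdb/redact package; the length limit and the exact redaction calls are illustrative, not necessarily the merged implementation:

import (
	"github.com/cockroachdb/redact"
)

// maxRecordedErrLength is an illustrative cap, not the PR's actual constant.
const maxRecordedErrLength = 250

// RedactAndTruncateError renders the error with redaction applied, strips the
// redaction markers, and truncates the result so a single recording cannot
// bloat the trace. Truncation here is by bytes, so it may cut a multi-byte rune.
func RedactAndTruncateError(err error) string {
	s := redact.Sprint(err).Redact().StripMarkers()
	if len(s) > maxRecordedErrLength {
		s = s[:maxRecordedErrLength]
	}
	return s
}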

This change adds trace recordings to noteworthy places during backup execution.

Release note: None
@adityamaru
Contributor Author

Bazel CI is red because of a macOS cross-build failure. When I trigger a run manually it passes, but that doesn't seem to be reflected here. Going to merge since I don't believe it is related to this PR; will bring it up with dev-inf.

bors r=pbardea

@craig
Contributor

craig bot commented Jun 29, 2021

Build succeeded:

@craig craig bot merged commit d4627db into cockroachdb:master Jun 29, 2021
@adityamaru adityamaru deleted the add-backup-tracing branch June 29, 2021 23:52