Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[serverless] Inject trace context into SQS/SNS/EventBridge #2917

Merged

Conversation

nhulston
Copy link
Contributor

@nhulston nhulston commented Oct 7, 2024

What does this PR do?

This PR injects trace context into:

  • SQS SendMessage and SendMessageBatch messages
  • SNS Publish and PublishBatch messages
  • EventBridge PutEvents events

This allows the Lambda extension to use this trace context to create inferred spans and combine these spans into a single trace. (See PR 1 and PR 2 for these changes)

Motivation

SNS and SQS are already supported in Java and .NET. However, this Go tracer did not have support for SQS, SNS, or EventBridge.

I am also working on a similar PR in dd-trace-java and dd-trace-dotnet to add EventBridge trace context injection. This PR differs because it adds support for SQS, SNS, and EventBridge.

Notes for reviewers

  • I organized the PR commit by commit, so you can review that way if you want.
  • I tried to structure the code in a way that makes it easy to (1) add "monkey patching" for new services, and (2) add "monkey patching" to new operations to these existing services in the future. Hence, in eventbridge.go, we have a switch statement with a single case right now. But this will make it easier for future operations to be added.

Testing

I added unit tests, which can be ran with the commands:

go test ./contrib/aws/aws-sdk-go-v2/aws -v
go test ./contrib/aws/internal/sqs -v
go test ./contrib/aws/internal/sns -v
go test ./contrib/aws/internal/eventbridge -v 

I also did manual tests. Here are what the flame graphs look like after these changes:
Screenshot 2024-10-07 at 3 52 34 PM

System tests covering SNS/SQS: DataDog/system-tests#3204

Reviewer's Checklist

  • Changed code has unit tests for its functionality at or near 100% coverage.
  • System-Tests covering this feature have been added and enabled with the va.b.c-dev version tag.
  • There is a benchmark for any new code, or changes to existing code.
  • If this interacts with the agent in a new way, a system test has been added.
  • Add an appropriate team label so this PR gets put in the right place for the release notes.
  • Non-trivial go.mod changes, e.g. adding new modules, are reviewed by @DataDog/dd-trace-go-guild.

Unsure? Have a question? Request a review!

@github-actions github-actions bot added the apm:ecosystem contrib/* related feature requests or bugs label Oct 7, 2024
@nhulston nhulston changed the title [serverless] Inject trace context into SQS/SNS/EventeBridge [serverless] Inject trace context into SQS/SNS/EventBridge Oct 7, 2024
@pr-commenter
Copy link

pr-commenter bot commented Oct 7, 2024

Benchmarks

Benchmark execution time: 2024-10-16 21:48:16

Comparing candidate commit edb0584 in PR branch nicholas.hulston/sqs-sns-eventbridge-inject-trace-context with baseline commit 51abc7b in branch main.

Found 0 performance improvements and 0 performance regressions! Performance is the same for 58 metrics, 1 unstable metrics.

@nhulston nhulston force-pushed the nicholas.hulston/sqs-sns-eventbridge-inject-trace-context branch from 5920275 to 7237e7f Compare October 8, 2024 15:27
Copy link
Contributor

@rarguelloF rarguelloF left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

some minor changes to make sure we don't expose packages if its not necessary 🙏

contrib/aws/aws-sdk-go-v2/aws/eventbridge/eventbridge.go Outdated Show resolved Hide resolved
contrib/aws/aws-sdk-go-v2/aws/eventbridge/eventbridge.go Outdated Show resolved Hide resolved
contrib/aws/aws-sdk-go-v2/aws/sns/sns.go Outdated Show resolved Hide resolved
contrib/aws/aws-sdk-go-v2/aws/sns/sns.go Outdated Show resolved Hide resolved
contrib/aws/aws-sdk-go-v2/aws/sqs/sqs.go Outdated Show resolved Hide resolved
contrib/aws/aws-sdk-go-v2/aws/sqs/sqs.go Outdated Show resolved Hide resolved
@nhulston nhulston force-pushed the nicholas.hulston/sqs-sns-eventbridge-inject-trace-context branch 2 times, most recently from 73ba8f2 to f917961 Compare October 8, 2024 19:57
@nhulston nhulston marked this pull request as ready for review October 9, 2024 21:27
@nhulston nhulston requested review from a team as code owners October 9, 2024 21:27
@nhulston nhulston force-pushed the nicholas.hulston/sqs-sns-eventbridge-inject-trace-context branch from a76e6a2 to f95440c Compare October 10, 2024 14:46
@nhulston nhulston force-pushed the nicholas.hulston/sqs-sns-eventbridge-inject-trace-context branch from 8479f85 to 80be4e3 Compare October 10, 2024 15:42
Copy link
Contributor

@purple4reina purple4reina left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Apologies if it seems like I am nit picking you a lot here. Let me explain.

Customers care a lot about the performance of their apps. Since the instrumentation you are writing here gets run for every message send performed, it's important that we keep the overhead as low as possible.

Since our customers are expecting low overhead, this can sometimes lead to code that is more complex than otherwise necessary.

…race context bytes only once and reusing it for batch requests
… it for SNS/SQS, and (2) checking new eventbridge payload size before marshaling the updated detail
Copy link
Contributor

@rarguelloF rarguelloF left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall LGTM, just echoing @purple4reina's comments about extracting the repeated logic out of the for loop.

contrib/aws/internal/eventbridge/eventbridge.go Outdated Show resolved Hide resolved
if err != nil {
log.Debug("Unable to get trace context: %s", err.Error())
return
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit.

I think you could simplify this out even further. Since both handlePublish and handlePublishBatch call getTraceContext, you could make the getTraceContext call just once, outside of the handle functions and then pass the context objects into the handle functions.

This makes no functional change, but it would DRY up the code a bit. That would be an improvement, since if we need to update the getTraceContext call in the future, we'd right now need to update it in two locations. Making the getTraceContext call just once would mean updating code in a single location.

Copy link
Contributor Author

@nhulston nhulston Oct 16, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, this is an easy fix and I'm open to doing it. But I'm trying to structure the EnrichOperation func in a way that allows other teams to modify other operations in SQS/SNS. I think there could be other operations where they won't need the trace context, so it might make more sense to keep the getTraceContext() calls in the handle funcs. For example, this in the java tracer:
https://github.com/DataDog/dd-trace-java/blob/master/dd-java-agent/instrumentation/aws-java-sqs-2.0/src/main/java/datadog/trace/instrumentation/aws/v2/sqs/SqsInterceptor.java#L84-L90

What do you think?

Copy link
Contributor

@purple4reina purple4reina left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you do in fact have an opportunity to pull a lot of work in eventbridge out of the for loop so it is only performed a single time.

Additionally, you did a great job with the java and dotnet tracers to inspect json payloads as strings and adding trace context manually. I think you can do the same thing here too.

@rarguelloF rarguelloF removed their request for review October 16, 2024 08:42

// Prepare the reused trace context string
startTimeMillis := time.Now().UnixMilli()
reusedTraceContext := fmt.Sprintf(`%s,"%s":"%d"`, carrierJSON[:len(carrierJSON)-1], startTimeKey, startTimeMillis)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add a test that sets the trace prop style to none? I'm concerned in that case we'd end up with invalid json, like {,"start":123}.

I think the env var you want is DD_TRACE_PROPAGATION_STYLE=none.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And actually, could probably avoid any such issues if you manually add the start time key to the carrier.

carrier[startTimeKey] = strconv.Itoa(startTimeMillis)
carrierJSON, err := json.Marshal(carrier)
etc...

This way you're always guaranteed at least one item in the map.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good idea

Copy link
Contributor

@rarguelloF rarguelloF left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

Copy link
Contributor

@purple4reina purple4reina left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is dope

@darccio darccio enabled auto-merge (squash) October 22, 2024 08:29
@darccio darccio merged commit a118199 into main Oct 22, 2024
144 of 145 checks passed
@darccio darccio deleted the nicholas.hulston/sqs-sns-eventbridge-inject-trace-context branch October 22, 2024 08:36
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
apm:ecosystem contrib/* related feature requests or bugs serverless
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants