Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

aws-sdk SQS propagation #5167

Closed
blumamir opened this issue Jan 18, 2022 · 7 comments
Closed

aws-sdk SQS propagation #5167

blumamir opened this issue Jan 18, 2022 · 7 comments
Labels
enhancement New feature or request

Comments

@blumamir
Copy link
Member

Hello,
I am trying to make aws-sdk SQS context propagation work between applications written in Java (scala actually), and nodejs / ruby.
In order for it to work, both the sender and receiver should agree on how to inject and extract the context onto the messages.

Unfortunately, Java and node \ ruby behave differently which creates broken traces for end-users who are sending messages across those systems.

NodeJS \ Ruby

These implementations use the "OpenTelemetry" approach. They use the propagators registered in the otel API (w3c \ b3 \ custom \ etc) to inject and extract context from the message attributes

Cons

There are quotas on SQS messages which the context propagation consumes: (https://docs.aws.amazon.com/AWSSimpleQueueService/latest/SQSDeveloperGuide/quotas-messages.html) which means:

  1. User is charged for this payload which includes the message attributes. It is not expected to be expensive as "Each 64 KB chunk of a payload is billed as 1 request", but still it's a few dozen bytes that are charged.
  2. If the message is just a bit below the 256K hard limit, then adding the context data can potentially cross this limit and reject the message.
  3. Message attributes are limited to 10 values total, including the user values, which means if he already used the available amount, then instrumentation has no more space to inject propagation headers as well.

Java

The jave implementation is using the X-Amzn-Trace-Id header, which does not consume quotas like the Nodejs \ Ruby implementation, but has the following cons:

Cons

  1. If X-Ray is enabled on a service, it might inject additional spans into the trace as the request is passing via the AWS x-ray enabled services. These spans are only exported to X-Ray. It means that otel users which are not using X-Ray, will have missing spans in the trace, and thus the goal to have async messaging visibility is lost.
  2. For some services, if X-Ray is disabled, the service will still create a new trace, flag it as a non-sampled trace and propagate this information downstream. If the application is configured with parent-based sampler, then x-ray effectivly turns off tracing for the application. See this issue in node for reference and more info.
  3. X-Ray propagator does not support baggage, which means baggage values are not propagated via AWS services even if user registers the W3C Baggage propagator.

Action Items

Considering the above Pros and Cons, I want to suggest adding a second propagation style into the aws-sdk instrumentations, which is compatible with nodejs / ruby and allow the user to bypass the x-ray propagator so the above compatibility issues will not affect his application.

I'll be very happy to get more insights and ideas on this issue. Do you think it makes sense adding this new propagation style? Maybe AWS has plans to solve the compatibility problems in the near future?

@blumamir blumamir added the enhancement New feature or request label Jan 18, 2022
@anuraaga
Copy link
Contributor

Hi @blumamir - currently the spec is only in the context of Lambda, but the concept is similar to any handling of an SQS message I think.

https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/trace/semantic_conventions/instrumentation/aws-lambda.md#sqs-message

For some services, if X-Ray is disabled, the service will still create a new trace, flag it as a non-sampled trace and propagate this information downstream. If the application is configured with parent-based sampler, then x-ray effectivly turns off tracing for the application. See this issue in node for reference and more info.

IIRC, this only happens when API gateway is involved as it is actually the one creating a trace in that issue. When disabled, it should otherwise be an invalid context with the value "Sampled=0" and no IDs. But there are some bugs in API gateway that cause problems unfortunately :(

X-Ray propagator does not support baggage, which means baggage values are not propagated via AWS services even if user registers the W3C Baggage propagator.

For reference, there is current work on supporting baggage and w3c headers within AWS services, though it will take a while to make it to existing services like SQS.

The biggest problem with avoiding AWSTraceHeader is to lose connectivity between AWS services - for example, it is the only way to trace a request from client (AWS SDK) -> S3 -> SQS. In general, there will be less problems using AWS propagation because it's the only one really recognized by the AWS service.

To work around the API gateway bug and baggage support, adding a setting here to opt-in to OTel propagation would be OK. But if your use case doesn't actually involve these, then actually it will be better served by having the other languages have an option, which should be the default similar to the spec for Lambda (we should probably update the spec to move that section to the AWS SDK instrumentation spec to indicate it's not only for Lambda). This is the only way to ensure the stablest propagation across various AWS services, S3 event bridge appsync right now, in the future we hope to have more middle-node services covered too.

Let me know if anything's not clear.

@blumamir
Copy link
Member Author

Updating here that I am working on adding an option to use the otel propagators via message attributes in SQS, to support context propagation between nodejs / ruby <-> java.

@awelless
Copy link
Contributor

I have also encountered the problem with propagating baggage via sns/sqs. I also put baggage into message attributes, only manually. Having this implemented in the instrumentation side would be great.
@blumamir, did you have time to work on this?

@blumamir
Copy link
Member Author

I have also encountered the problem with propagating baggage via sns/sqs. I also put baggage into message attributes, only manually. Having this implemented in the instrumentation side would be great. @blumamir, did you have time to work on this?

I started working on it back ago but it needs a refactor greater than I planned. I don't remember all the details, but I think the instrumentation uses some mechanism only to handle span at request time and response time, where the injection needs to happen after span creation (so we have the span id to inject) but before the request is sent on the wire (so it can include the message attributes).

@awelless
Copy link
Contributor

awelless commented Feb 21, 2023

I started working on it back ago but it needs a refactor greater than I planned. I don't remember all the details, but I think the instrumentation uses some mechanism only to handle span at request time and response time, where the injection needs to happen after span creation (so we have the span id to inject) but before the request is sent on the wire (so it can include the message attributes).

You're right. Span creation takes place in afterMarshalling method when request body is not easy to modify. However, I reckon that it should be possible to create span in beforeExecution- it's the first hook and the docs say it's guaranteed that this method is executing on the same thread preserving all thread locals.
I could try to refactor this and move all span creation logic into another method, so it will allow us to have an access to all tracing data during request modification.
What do you think?

@tylerbenson
Copy link
Member

I think this issue can be closed as Java implements this according to the spec:
https://opentelemetry.io/docs/reference/specification/trace/semantic_conventions/instrumentation/aws-lambda/#sqs-message

The main recent addition is #7970 which removes the AWS X-Ray environment variable from the propagation chain and allows the regular propagators to be applied.

Other languages may still require additional changes to implement the correct propagation.
See also open-telemetry/opentelemetry-specification#3212.

@mateuszrzeszutek
Copy link
Member

Closing. Thanks!

laurit pushed a commit that referenced this issue Mar 16, 2023
…7970)

open-telemetry/opentelemetry-specification#3166

Per discussion in the FAAS SIG, we decided that the AWS X-Ray
environment variable should be moved to a span link to avoid interfering
with the configured propagators.

Also related to #5167.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

5 participants