Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add OpenTelemetry support #633

Merged
merged 14 commits into from
Aug 11, 2022
Merged

Add OpenTelemetry support #633

merged 14 commits into from
Aug 11, 2022

Conversation

alesj
Copy link
Contributor

@alesj alesj commented Jun 29, 2022

This PR adds OpenTelemetry support.
It abstract previous OpenTracing / Jaeger usage -- adding thin abstraction layer to hide actual implementation usage.
Most of the old code was used as it is, only poll or consume was changed away from interceptor usage - as we couldn't model trace/spans as expected.
Async send needed some extra work for OTel, due to its ThreadLocal current span propagation.

Current bridge.tracing property must be set to opentelemetry to use OTel tracing.

@alesj alesj requested a review from ppatierno June 29, 2022 14:04
@alesj alesj mentioned this pull request Jun 29, 2022
@ppatierno ppatierno changed the title Fix how send/poll and its messages are referenced Add OpenTelemetry support Jun 30, 2022
@tombentley
Copy link
Member

@alesj please could you provide a description for this PR, and an issue reference if there is one? Thanks.

@alesj
Copy link
Contributor Author

alesj commented Jul 5, 2022

@alesj please could you provide a description for this PR, and an issue reference if there is one? Thanks.

Short description provided.
No issue ref afaik.

@ppatierno ppatierno added this to the 0.22.0 milestone Jul 6, 2022
@ppatierno
Copy link
Member

I created a corresponding issue for tracking #636

@scholzj scholzj modified the milestones: 0.21.6, 0.22.0 Jul 6, 2022
Copy link
Member

@tombentley tombentley left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @alesj! I left a few comments and questions.

pom.xml Outdated Show resolved Hide resolved
@BeforeEach
public void setUp() {
assumeServer(String.format("http://%s:%s", Urls.BRIDGE_HOST, Urls.BRIDGE_PORT)); // bridge
assumeServer("http://localhost:16686"); // jaeger
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it would be better if we could use testcontainers to spin up a server. I admit I've no idea how much work that would really, but maybe it's not so much?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With testcontainers it's chicken-n-egg problem ... you would first need to build latest Bridge image, only to be used in tests then.
Perhaps I can easily make it work with running Application::main with the right args ...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, now I know why I added that service.name property to Bridge config ... otherwise you can only enable OpenTracing via env vars ... which makes it hard (if not impossible) to test.

Copy link
Contributor Author

@alesj alesj Jul 8, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Something like this is then required - custom generic containers:

I can give it a shot if it's really needed ...

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking at Ales's branch I'm inclined to thiat bridge support for the test-clients would be a great addition. @ppatierno @scholzj wdyt?

@alesj alesj force-pushed the otel_refs_fix branch 2 times, most recently from 7eee00a to 36240b5 Compare July 8, 2022 14:15
@alesj alesj force-pushed the otel_refs_fix branch 2 times, most recently from 9bd61e1 to 65d6e10 Compare July 12, 2022 09:05
protected abstract TracingOptions tracingOptions();

@Test
public void testSmoke(VertxTestContext context) throws Exception {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure what we are testing here. Actually we are using the vertx-tracing support just for enabling tracing on the client side (because the Vertx instance we are creating is used for the web client part) but then we are sending/receiving messages but when actually the tracing part is tested? Tbh I don't think we had anything before as well, but this test is not adding much more.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, atm the 'test' only adds simple automation of send+receive -- so you don't have to do it completely manually (via curl, postman, etc).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tombentley do you think we should keep it anyway or just remove? Maybe opening an issue about adding proper tracing tests later?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe opening an issue about adding proper tracing tests later?

+1

@ppatierno
Copy link
Member

With the current status of the PR we are avoiding the hack we had before about tracking spans with a UUID for using the right tracing context in the Kafka producer interceptor due to the async nature of Vert.x and different threads usage.
We are now using the VertxContextStorageProvider which an implementation of the context storage for OpenTelemetry in the Vert.x land which takes into account bringing OpenTelemetry context via Vert.x context.

Unluckily the current solution has a problem when an HTTP client uses the "keep-alive" connection enabled.
In this scenario, when the bridge is used to send a message, on the first request the HTTP send span and the underneath Kafka send message span are related correctly (the latter is CHILD OF the former).
All the subsequent HTTP send requests are not traced correctly because the above two spans are totally independent.
After a long investigation, it turned out that it is due to the Vert.x Kafka client producer on the bridge being the same and reused on all HTTP requests when "keep-alive" connection is enabled.
The issue is that the Vert.x Kafka client is taking the initial Vert.x context when it's created for the first time (on the first HTTP request and connection established), and then it uses this one for executing any other blocking call (see the Kafka send).
Of course, each incoming HTTP request has its own context (bringing tracing) instead.
So the first time everything works as expected because the two contexts are actually the same. From the second request, the Kafka client uses the first "old" context (with no tracing anymore) and not the right "current" context (bringing tracing).

Of course with no "keep-alive" everything works fine each time, because a new Kafka client is created for every HTTP request and the context is always the "right" one.
We are in touch with Vert.x people to get advice and about the behaviour of the Vert.x Kafka client.

Meanwhile we already have the "workaround" which is actually a middle solution (pushed by Ales during these days to avoid the above mentioned hack) about using an executor service which is wrapped through the Context (in OpenTelemetry) in order to use the right context in Vert.x.
This solution was suggested by OpenTelemetry people and it seems to be the "official" one when you don't have the right context storage (which we have but driving to the above problem).

This is more or less the status of the art.

If there is no trick for the Vert.x Kafka client or we would need a change on the client so waiting for a new release, I would move forward with this PR by using the last "workaround" and then tracking an issue to come back using the storage later.

Last consideration is that Vert.x seems to have an official support for tracing now, both for OpenTelemetry and OpenTracing but seems to be still in TP and anyway using it would need a more bigger refactoring on the bridge because such support handle the HTTP part of the tracing for us our of box, so we should remove our code. I am against this solution in the near future.

I would like to know what the other maintainers think about the overall picture of this. @tombentley @scholzj ?

@ppatierno
Copy link
Member

Thanks to Julien from the Vert.x team, we have got a fix in the Vert.x Kafka client (vert-x3/vertx-kafka-client#226) which I tested and seems to work. Of course, this will be available in the next Vert.x 4.3.3 release. An ETA seems to be August for having this release.
At this point we have two choices: updating this PR with the workaround of wrapping the Vert.x executor context while waiting for the Vert.x 4.3.3 release with a proper fix or ... waiting for the Vert.x 4.3.3 release and not merging this PR yet.
I have no strong opinion. Waiting for the Vert.x release could mean even another month, but I don't think we are going to have another bridge release soon.

@tombentley @scholzj what's your feeling on this?

Signed-off-by: Ales Justin <ales.justin@gmail.com>
Signed-off-by: Ales Justin <ales.justin@gmail.com>
Signed-off-by: Ales Justin <ales.justin@gmail.com>
Remove OTel span propagation hack.
Simplify executor service adapt.

Signed-off-by: Ales Justin <ales.justin@gmail.com>
Signed-off-by: Ales Justin <ales.justin@gmail.com>
Signed-off-by: Ales Justin <ales.justin@gmail.com>
Signed-off-by: Ales Justin <ales.justin@gmail.com>
pom.xml Outdated Show resolved Hide resolved
Signed-off-by: Ales Justin <ales.justin@gmail.com>
@ppatierno
Copy link
Member

Finally Vert.x 4.3.3 is out! @alesj could you update the PR by using this release and test again that everything works fine please?
I am ok with the PR but I will have another final pass.
@scholzj @tombentley could you review please, it would be great to finally close this addition after so long time :-)

Signed-off-by: Ales Justin <ales.justin@gmail.com>
Copy link
Member

@ppatierno ppatierno left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@alesj I left some comments mostly around dependencies that can be safely removed

pom.xml Outdated Show resolved Hide resolved
pom.xml Outdated Show resolved Hide resolved
pom.xml Show resolved Hide resolved
pom.xml Outdated Show resolved Hide resolved
pom.xml Outdated Show resolved Hide resolved
pom.xml Outdated Show resolved Hide resolved
Signed-off-by: Ales Justin <ales.justin@gmail.com>
Signed-off-by: Ales Justin <ales.justin@gmail.com>
@@ -96,70 +85,45 @@ public void handle(Endpoint<?> endpoint) {

boolean isAsync = Boolean.parseBoolean(routingContext.queryParams().get("async"));

Tracer tracer = GlobalTracer.get();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not for this PR I guess, but this handle() method is pretty complex. We should try to refactor it into smaller methods.

Signed-off-by: Ales Justin <ales.justin@gmail.com>
Signed-off-by: Ales Justin <ales.justin@gmail.com>
Signed-off-by: Ales Justin <ales.justin@gmail.com>
Copy link
Member

@ppatierno ppatierno left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! @alesj thanks for the effort!

Copy link
Member

@tombentley tombentley left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks Ales!

@ppatierno ppatierno merged commit ad42485 into strimzi:main Aug 11, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants