Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

OpenTelemetry Scope not closed correctly #25102

Open
fabrii opened this issue Apr 22, 2022 · 23 comments
Open

OpenTelemetry Scope not closed correctly #25102

fabrii opened this issue Apr 22, 2022 · 23 comments
Labels
area/tracing kind/bug Something isn't working

Comments

@fabrii
Copy link
Contributor

fabrii commented Apr 22, 2022

Describe the bug

This issue originated on the following discussion: #23882
I am opening it to keep track here.

When using Baggage propagation, the Scope is not closed correctly.

Baggage.current()
            .toBuilder()
            .put("system", "systemX")
            .build()
            .makeCurrent();

In the log, we can see the following warning:

WARN [io.qua.ope.run.QuarkusContextStorage] (vert.x-eventloop-thread-7) Context in storage not the expected context, Scope.close was not called correctly.

Expected behavior

Not showing the warning. If it is harmless, as a workaround I think we can just ignore the log using quarkus.log.category.

Actual behavior

No response

How to Reproduce?

Reproducer: https://github.com/fabrii/quarkus-playground/tree/opentelemetry

Output of uname -a or ver

No response

Output of java -version

No response

GraalVM version (if different from Java)

No response

Quarkus version or git rev

2.7.1

Build tool (ie. output of mvnw --version or gradlew --version)

No response

Additional information

No response

@fabrii fabrii added the kind/bug Something isn't working label Apr 22, 2022
@quarkus-bot
Copy link

quarkus-bot bot commented Apr 22, 2022

/cc @radcortez

@radcortez radcortez self-assigned this Apr 26, 2022
@geoand
Copy link
Contributor

geoand commented Oct 19, 2022

Is this still an issue or has it been addressed by the various fixes that have occurred over time?

@mihaipoenaru
Copy link

mihaipoenaru commented Oct 27, 2022

@geoand Google brought me to this issue because I got the same warning as above. I'm not sure if it's relevant or not.

I made a small reproducer to force the Quarkus instance to use one thread, and then I triggered a context switch for that thread to see if the traceID propagates correctly:

@Path("/hello")
class ExampleResource {

    @GET
    @Produces(MediaType.TEXT_PLAIN)
    suspend fun hello(): String {
        println("${LocalTime.now()} ${Thread.currentThread().name}: ${Span.current().spanContext.traceId}")
        delay(3000)
        println("${LocalTime.now()} ${Thread.currentThread().name}: ${Span.current().spanContext.traceId}")
        return "hello"
    }
}

This produces the following, after I called it twice in quick succession:

17:41:21.171259400 vert.x-eventloop-thread-0: 77f541d67a9a3bda6e1b96ec97ff805b
17:41:21.698196200 vert.x-eventloop-thread-0: 330def074a9c74332334f91283c94f1a
17:41:24.184476200 vert.x-eventloop-thread-0: 77f541d67a9a3bda6e1b96ec97ff805b
2022-10-27 17:41:24,234 WARN  [io.qua.ope.run.QuarkusContextStorage] (vert.x-eventloop-thread-0) Context in storage not the expected context, Scope.close was not called correctly
17:41:24.715261 vert.x-eventloop-thread-0: 330def074a9c74332334f91283c94f1a

Fortunately, the context trace id is switched appropriately. I had some issues on my project because the old, "soon-to-be-deprecated" OpenTracing failed to do the context switch and would always keep the latest traceId.

I'm not sure if this helps with the issue at hand.

@brunobat
Copy link
Contributor

This last issue is related with the OTel context storage and restore. This is not specific to Baggage.
As mentioned here: #23882 (reply in thread)
It should be harmless.

@radcortez radcortez removed their assignment Nov 22, 2022
@geoand
Copy link
Contributor

geoand commented Jul 4, 2023

Is there anything we need to do for this one or can it be closed?

@famod
Copy link
Member

famod commented Sep 8, 2023

So I'm seeing this during a ui regression test run with Quarkus 3.3.2., even after I managed to get rid of the OTLP export timeout issues (see #35686).
I don't see any other exceptions or similar.
Any notes on debugging this?

@brunobat
Copy link
Contributor

brunobat commented Sep 11, 2023

What were you expecting to see, @famod, this message? -> WARN [io.qua.ope.run.QuarkusContextStorage] (vert.x-eventloop-thread-7) Context in storage not the expected context, Scope.close was not called correctly.

@famod
Copy link
Member

famod commented Sep 11, 2023

@brunobat I am seeing this warning multiple times but I don't see any hints of exceptions.
I'm referring to what @radcortez said in the discussion:

The OpenTelemetry instrumentation is provided by filters, but certain exceptions that break the execution flow may happen outside filters, which may cause the OpenTelemetry Scope not to be properly closed at the right time (hence the warning).

So I basically want to get rid of those warnings.

@brunobat
Copy link
Contributor

We should create a set of tests that replicate the scope closing issue and go from there...
There also some other work needed, like propagate scope along with the span and make sure it is closed with the span.

@famod
Copy link
Member

famod commented Sep 11, 2023

In the meantime, would it make sense to add more info to that warning? Something to correlate which context wasn't closed correctly?

@brunobat
Copy link
Contributor

yes, @famod. Will do

@brunobat
Copy link
Contributor

brunobat commented Sep 12, 2023

Here you have it @famod !
#35886

@famod
Copy link
Member

famod commented Oct 5, 2023

Thanks @brunobat!

I started looking at those now extended warning messages and so far, all logged "previous" traceids point to perfectly valid, non-erroneous REST calls. So for now I don't see what the problem is.

@brunobat
Copy link
Contributor

brunobat commented Oct 6, 2023

Do you have an idea of the percentage of spans that get the warning?
Can be trace back to a particular code flow or are the random?

@kdubb
Copy link
Contributor

kdubb commented Nov 25, 2023

@brunobat I see these randomly in our tests using 3.5.3. I've just cleaned our build to filter all expected error and warning logs and these are left behind; unfortunately I can't say when they began.

They seem to be very random, with hundreds of @QuarkusTest classes and many thousands of tests... I don't even see them from every complete build.

Here's an example:

Context in storage not the expected context, Scope.close was not called correctly. Details: OTel context before: {spanId=8695a6782afb78ff, traceId=d22e83c1ed0f9f64abd2baf22d449508, sampled=true}. OTel context toAttach: {spanId=9e70ae76f4e97f48, traceId=d22e83c1ed0f9f64abd2baf22d449508, sampled=true, parentId=8695a6782afb78ff}

@nicolasard
Copy link

nicolasard commented Dec 11, 2023

I have the same issue, found out that I have thise errors when the client abort the connection after sending the request.

@computerlove
Copy link
Contributor

A reproducer I created for another issue reproduces this issue.

@brunobat
Copy link
Contributor

brunobat commented Jan 3, 2024

Thanks @computerlove, Will take a look

@brunobat
Copy link
Contributor

The fix for this issue will take while due to bandwidth. Will reduce the log from info to warn because it's causing issues in prod systems.

@geoand
Copy link
Contributor

geoand commented Feb 20, 2024

I did run into this while trying to reproduce #35686

@brunobat
Copy link
Contributor

This happens when multiple spans are open and they don't close in the right order. This requires an overhaul of the way we handle span scopes and will be handled while we introduce the Observation API, @geoand

@geoand
Copy link
Contributor

geoand commented Feb 20, 2024

Understood, thanks.

@AlanMasciangelo
Copy link

Possibly another useful data point, I consistently see this after agroal does some work. Logs show jumping from executor-thread-n to agroal-n and then back and it happens immediately, I assume when the non-blocking query returns.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/tracing kind/bug Something isn't working
Projects
Status: Todo
Development

No branches or pull requests

10 participants