
ThreadLocals not cleared with Hooks.enableAutomaticContextPropagation(), same TraceId for every request #363

Closed · kzander91 opened this issue Sep 25, 2023 · 8 comments
Labels: bug (A general bug) · Milestone: 1.1.6


kzander91 commented Sep 25, 2023

I originally opened this issue in reactor: reactor/reactor-core#3584, but it turned out that this is an issue in micrometer-tracing instead.
Worked in: 1.1.4
Broken in: 1.1.5, 1.1.6-SNAPSHOT, 1.2.0-SNAPSHOT
Also related: #356


Expected Behavior

Automatic context propagation (enabled with Hooks.enableAutomaticContextPropagation()) should restore thread-locals around each request and clear them once the request completes.
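
For context, the hook builds on the context-propagation library's ThreadLocalAccessor contract: Reactor snapshots registered thread-local values into the Context, restores them on whatever thread runs a signal, and is supposed to reset them afterwards. A minimal sketch of that contract, with an illustrative key and thread-local (not Micrometer Tracing's actual accessor):

import io.micrometer.context.ContextRegistry;

public class TraceIdAccessorDemo {

    private static final ThreadLocal<String> TRACE_ID = new ThreadLocal<>();

    public static void register() {
        ContextRegistry.getInstance().registerThreadLocalAccessor(
                "demo.traceId",    // key under which the value travels in the Context
                TRACE_ID::get,     // snapshot the value on the originating thread
                TRACE_ID::set,     // restore it on whatever thread runs the signal
                TRACE_ID::remove); // reset it afterwards; the step that stops happening in this bug
    }

}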

Actual Behavior

After the first request handled by a particular thread, every subsequent request handled by that thread will reuse the same traceId.

Steps to Reproduce

Demo project: reproducer.zip

import io.micrometer.tracing.Tracer;
import lombok.RequiredArgsConstructor;
import org.springframework.boot.SpringApplication;
import org.springframework.boot.autoconfigure.SpringBootApplication;
import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.RestController;
import reactor.core.publisher.Hooks;

@SpringBootApplication
public class Demo1Application {

    public static void main(String[] args) {
        // Pin Reactor Netty to a single event-loop thread so the traceId reuse is obvious.
        System.setProperty("reactor.netty.ioWorkerCount", "1");
        Hooks.enableAutomaticContextPropagation();
        SpringApplication.run(Demo1Application.class, args);
    }

}

@RestController
@RequiredArgsConstructor
class MyController {

    private final Tracer tracer;

    @GetMapping("/")
    public String traceId() {
        // Echo the handling thread and the current traceId so reuse is visible.
        return Thread.currentThread().getName() + '-' + tracer.currentSpan().context().traceId();
    }

}

Run it and issue a few requests to localhost:8080; the output will look something like this:

With reactor.netty.ioWorkerCount=1:

C:\>curl localhost:8080
reactor-http-nio-1-65102dd571ea1cb67a25a401c93412b0
C:\>curl localhost:8080
reactor-http-nio-1-65102dd571ea1cb67a25a401c93412b0
C:\>curl localhost:8080
reactor-http-nio-1-65102dd571ea1cb67a25a401c93412b0
C:\>curl localhost:8080
reactor-http-nio-1-65102dd571ea1cb67a25a401c93412b0
C:\>curl localhost:8080
reactor-http-nio-1-65102dd571ea1cb67a25a401c93412b0
C:\>curl localhost:8080
reactor-http-nio-1-65102dd571ea1cb67a25a401c93412b0

With reactor.netty.ioWorkerCount=2:

C:\>curl localhost:8080
reactor-http-nio-1-65102d16b3d992cb7131386bf683cd9e
C:\>curl localhost:8080
reactor-http-nio-2-65102d0dc6d2b48f307d67a6db6fde3a
C:\>curl localhost:8080
reactor-http-nio-1-65102d16b3d992cb7131386bf683cd9e
C:\>curl localhost:8080
reactor-http-nio-2-65102d0dc6d2b48f307d67a6db6fde3a
C:\>curl localhost:8080
reactor-http-nio-1-65102d16b3d992cb7131386bf683cd9e
C:\>curl localhost:8080
reactor-http-nio-2-65102d0dc6d2b48f307d67a6db6fde3a

With reactor.netty.ioWorkerCount=3:

C:\>curl localhost:8080
reactor-http-nio-2-65102dfc81d58e1d99c2f64353b410a6
C:\>curl localhost:8080
reactor-http-nio-3-65102dfc7b1ff1d38702f4b2e4cb94a6
C:\>curl localhost:8080
reactor-http-nio-1-65102dfde7c20d2604248dd885469458
C:\>curl localhost:8080
reactor-http-nio-2-65102dfc81d58e1d99c2f64353b410a6
C:\>curl localhost:8080
reactor-http-nio-3-65102dfc7b1ff1d38702f4b2e4cb94a6
C:\>curl localhost:8080
reactor-http-nio-1-65102dfde7c20d2604248dd885469458

As we can see, requests handled by a given thread always reuse the same traceId, indicating that the thread-locals populated by automatic context propagation are not cleared/reset after each request.

Without automatic context propagation, tracer.currentSpan() returns null.
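
(As an aside, that also means the demo endpoint above throws a NullPointerException without the hook, since it dereferences currentSpan() unconditionally. A null-safe variant of the handler, sketch only, using io.micrometer.tracing.Span:)

@GetMapping("/")
public String traceId() {
    // currentSpan() may be null when no span is active on this thread.
    Span span = tracer.currentSpan();
    return Thread.currentThread().getName() + '-'
            + (span != null ? span.context().traceId() : "<no current span>");
}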

Your Environment

  • Reactor version(s) used: 3.5.10
  • Other relevant libraries' versions (e.g. Netty, ...):
    • Spring Boot 3.1.4
    • Spring Framework 6.0.12
    • Micrometer Tracing: 1.1.5 (Also broken in 1.1.6-SNAPSHOT, 1.2.0-SNAPSHOT)
    • Micrometer Observation: 1.11.4
  • JVM version (java -version):
openjdk version "17.0.8.1" 2023-08-24
OpenJDK Runtime Environment Temurin-17.0.8.1+1 (build 17.0.8.1+1)
OpenJDK 64-Bit Server VM Temurin-17.0.8.1+1 (build 17.0.8.1+1, mixed mode, sharing)
xsreality commented:

I think I am also hitting this issue after upgrading to SB 3.1.4 (Micrometer Tracing 1.1.5). The traces for actuator endpoints are using the same trace ID and becoming child traces instead of independent traces.

[screenshot: actuator endpoint traces sharing a single trace ID]

marcingrzejszczak added this to the 1.1.6 milestone Sep 26, 2023
marcingrzejszczak added the bug (A general bug) label Sep 26, 2023
marcingrzejszczak added a commit that referenced this issue Sep 26, 2023
Without this change we clear the scopes map when spans end. The problem is that the scoping hierarchy gets broken: even though the OTLA works well (scopes are properly opened and closed), the spans are not.

With this change we no longer remove the scopes when a TraceContext is cleared.

Fixes gh-363
marcingrzejszczak (Contributor) commented:

Fixed via 5a750bb


vpavic commented Oct 9, 2023

Looking at the commit that resolved this, it doesn't appear there's anything preventing a regression like this from sneaking back in.
As this is quite a fundamental thing, could you ensure it's properly tested?

shakuzen (Member) commented:

Yes, we should definitely do that. I've opened #397 to track it. Pull requests welcome.


nathandeamer commented Oct 15, 2023

> I think I am also hitting this issue after upgrading to SB 3.1.4 (Micrometer Tracing 1.1.5). The traces for actuator endpoints are using the same trace ID and becoming child traces instead of independent traces.

I think I'm also seeing this, but with Spring Cloud Gateway and the latest Micrometer Tracing 1.1.6: #410


chemicL commented Oct 31, 2023

@vpavic #421 is my proposed test for this case. @shakuzen @jonatan-ivanov I tested it against 1.1.5 and it fails due to the scope invalidation following an Observation stop signal.


vpavic commented Nov 14, 2023

Thanks for the ping @chemicL.

The test case in your PR is focused on the Reactor scenario, and while this issue is too, the root cause behind it isn't really tied to Reactor, and it affected a broader set of users (we're not using Reactor in any of our projects).

This isn't a critique of your PR by any means, just an observation that the tests covering this should probably also include something that isn't specific to a particular runtime.


chemicL commented Nov 14, 2023

@vpavic understood :) The interesting thing here is the unique setup that Reactor creates with the context-propagation support we introduced. It was hard to replicate without Reactor in the equation, though not impossible. Have you experienced this bug without Reactor? I'd argue that using Reactor is both a means to an end (creating an environment in which we can reliably replicate the issue) and validation that the place where the bug surfaced is also covered by the tests.

If you have an idea that avoids using Reactor in the test for this feature, I think it would be valuable to add.
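
For illustration, a runtime-agnostic test could take roughly this shape, driving an Observation scope directly instead of going through Reactor. This is only a sketch: it assumes micrometer-tracing-test's SimpleTracer wired to a DefaultTracingObservationHandler behaves like the library's existing fixtures, and whether this alone would have tripped the 1.1.5 regression without Reactor in the loop is exactly the open question:

import static org.assertj.core.api.Assertions.assertThat;

import io.micrometer.observation.Observation;
import io.micrometer.observation.ObservationRegistry;
import io.micrometer.tracing.handler.DefaultTracingObservationHandler;
import io.micrometer.tracing.test.simple.SimpleTracer;
import org.junit.jupiter.api.Test;

class ThreadLocalCleanupTest {

    SimpleTracer tracer = new SimpleTracer();
    ObservationRegistry registry = ObservationRegistry.create();

    @Test
    void threadLocalsAreClearedAfterObservationStops() {
        registry.observationConfig()
                .observationHandler(new DefaultTracingObservationHandler(tracer));

        Observation observation = Observation.start("request", registry);
        try (Observation.Scope scope = observation.openScope()) {
            // Inside the scope a span must be current on this thread.
            assertThat(tracer.currentSpan()).isNotNull();
        }
        observation.stop();

        // The regression: after stop, the thread kept the previous span as
        // "current", so the next request on that thread reused its traceId.
        assertThat(tracer.currentSpan()).isNull();
    }

}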
