Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Application hangs on startup after upgrading to Spring Boot 3.2.1 with spring-data-redis and micrometer-prometheus dependencies #32996

Closed
Zernov-A opened this issue Jan 18, 2024 · 23 comments
Labels
for: external-project Needs a fix in external project in: core Issues in core modules (aop, beans, core, context, expression)

Comments

@Zernov-A
Copy link

After upgrading spring-boot to 3.2.1, I encountered an issue where the application hangs on startup. This occurs specifically when using spring-data-redis with enabled keyspace events on startup, in conjunction with the micrometer-prometheus dependencies.

Steps to Reproduce

I have prepared minimal code to reproduce this issue. Below is a simplified version of the build.gradle and DemoApp.java files:

build.gradle

plugins {
    id 'java'
    id 'org.springframework.boot' version "3.2.1"
}

repositories {
    mavenLocal()
    mavenCentral()
}

apply plugin: 'io.spring.dependency-management'

dependencies {
    implementation 'org.springframework.boot:spring-boot-starter-data-redis'
    runtimeOnly 'org.springframework.boot:spring-boot-starter-actuator'
    runtimeOnly 'io.micrometer:micrometer-registry-prometheus'
    runtimeOnly 'io.micrometer:micrometer-tracing-bridge-brave'
}

wrapper {
    gradleVersion = "8.5"
}

DemoApp.java

@SpringBootApplication
@EnableRedisRepositories(enableKeyspaceEvents = RedisKeyValueAdapter.EnableKeyspaceEvents.ON_STARTUP)
public class DemoApp {

    public static void main(String[] args) {
        new SpringApplication(DemoApp.class).run(args);
        System.out.println("Complete!");
    }
}

Observations

  • The issue persists across both Spring Boot versions 3.2.1 and 3.2.0.
  • The application starts normally without hanging when using Spring Boot version 3.1.7 and below.

Additional Information

In a typical scenario, the project includes entity and repository classes, along with other dependencies. However, the provided code is a minimal representation.

  • Removing enableKeyspaceEvents = RedisKeyValueAdapter.EnableKeyspaceEvents.ON_STARTUP or changing its value to ON_DEMAND or OFF allows the application to start correctly.

  • Removing any of the dependencies results in the application starting correctly.

  • Disabling Prometheus metrics export by setting management.prometheus.metrics.export.enabled=false in application.properties also allows the application to start correctly.

@spring-projects-issues spring-projects-issues added the status: waiting-for-triage An issue we've not yet triaged or decided on label Jan 18, 2024
@scottfrederick
Copy link
Contributor

This appears to be a duplicate of spring-projects/spring-data-redis#2814, which the Micrometer team is trying to get to the bottom of. @Zernov-A can you do a thread dump when the application hangs and confirm that it looks similar to the one in that issue?

@scottfrederick scottfrederick added the status: waiting-for-feedback We need additional information before we can continue label Jan 18, 2024
@Zernov-A
Copy link
Author

Thank you. It's really very similar.
Here is the thread dump:
threaddump.txt

@spring-projects-issues spring-projects-issues added status: feedback-provided Feedback has been provided and removed status: waiting-for-feedback We need additional information before we can continue labels Jan 18, 2024
@scottfrederick
Copy link
Contributor

Thanks @Zernov-A. The thread dumps are very similar, with the same libraries in the call stack. I'll close this issue in favor of the Spring Data Redis issue that is already being investigated. Please follow that issue for updates.

@scottfrederick scottfrederick closed this as not planned Won't fix, can't repro, duplicate, stale Jan 18, 2024
@scottfrederick scottfrederick added for: external-project Needs a fix in external project and removed status: waiting-for-triage An issue we've not yet triaged or decided on status: feedback-provided Feedback has been provided labels Jan 18, 2024
@jhengy
Copy link

jhengy commented Jun 7, 2024

@scottfrederick My team is facing a similar issue in production. It seems that currently both this issue and spring-projects/spring-data-redis#2814 were closed and labelled as for external-project. May I know if there's any team working on this bug? If so, can you point me to that team please? Thank you

@wilkinsona
Copy link
Member

@jhengy, please capture a thread dump from the hung application and we should then be able to point you in the right direction.

@jhengy
Copy link

jhengy commented Jun 7, 2024

Hi @wilkinsona, thanks for the prompt reply, I've created a simple reproducer and also captured the thread dump.

Note that for our case, unlike the above, we did not enable keyspace events for redis on startup, what triggered this issue happens to be a bean definition for ReactiveRedisMessageListenerContainer.

@wilkinsona wilkinsona reopened this Jun 7, 2024
@wilkinsona wilkinsona added status: waiting-for-triage An issue we've not yet triaged or decided on and removed for: external-project Needs a fix in external project labels Jun 7, 2024
@wilkinsona
Copy link
Member

Thanks, @jhengy.

As you've seen, you can work around this issue with things like @Lazy, however, for a more fundamental fix, I think some changes in Spring Framework will be required. Framework currently calls third-party code while holding a lock and doing so is always at risk of deadlock or livelock in a multi-threaded environment.

We'll transfer this to the Framework team so that they can investigate further.

Tagging @mp911de as well as Lettuce is heavily involved here. I think what's happening is that creating the Redis message listener container on the main thread while Framework's singleton lock is held results in some work being done on lettuce-nioEventLoop-4-1 to establish the connection. The lock will not be released until the listener has been created. Due to instrumentation of Lettuce, establishing the connection requires a bean that's accessed through an ObjectProvider. Providing the object requires the singleton lock which is held by main. We're stuck at this point as the lock won't be released until the connection to Redis has been established and the listener container has been created and the connection won't be fully established until the lock has been released and the bean can be provided by the object provider.

@bclozel bclozel transferred this issue from spring-projects/spring-boot Jun 10, 2024
@bclozel bclozel added the in: core Issues in core modules (aop, beans, core, context, expression) label Jun 10, 2024
@jhoeller
Copy link
Contributor

Given that we revised the core container towards lenient locking for 6.2 already (#23501), does this scenario work against 6.2.0-M3? I'm afraid there is nothing we can do in 6.1.x as a quick measure, I rather recommend working around this through not waiting for the result of any asynchronous steps in the bean initialization method itself. I recommend doing so wherever possible even when running against 6.2, actually.

@wilkinsona
Copy link
Member

does this scenario work against 6.2.0-M3?

I'm not sure. Upgrading the sample to 6.2.0-M3 results in the following failure:

Caused by: org.springframework.context.ApplicationContextException: Component scan could not be used with conditions in REGISTER_BEAN phase: [org.springframework.boot.autoconfigure.condition.OnBeanCondition@629ae7e]
        at org.springframework.context.annotation.ConfigurationClassParser.doProcessConfigurationClass(ConfigurationClassParser.java:325)
        at org.springframework.context.annotation.ConfigurationClassParser.processConfigurationClass(ConfigurationClassParser.java:267)
        at org.springframework.context.annotation.ConfigurationClassParser.processMemberClasses(ConfigurationClassParser.java:409)
        at org.springframework.context.annotation.ConfigurationClassParser.doProcessConfigurationClass(ConfigurationClassParser.java:294)
        at org.springframework.context.annotation.ConfigurationClassParser.processConfigurationClass(ConfigurationClassParser.java:267)
        at org.springframework.context.annotation.ConfigurationClassParser.processImports(ConfigurationClassParser.java:599)
        ... 18 common frames omitted

@snicoll may be able to give us a feel for it with his branch of Spring Boot that contains the necessary changes for Framework 6.2 compatibility. I say "a feel for it", as whether or not the problem occurs is quite delicate. For example, it does not occur in the supplied sample when the seemingly unrelated org.springframework.cloud:spring-cloud-stream-binder-kafka dependency is removed.

not waiting for the result of any asynchronous steps in the bean initialization method itself

This would require a change in Spring Data Redis where creating the instance of ReactiveRedisMessageListenerContainer gets a connection from Lettuce which blocks the main thread until the connection has been established on a Netty-managed thread.

@jhoeller
Copy link
Contributor

This would require a change in Spring Data Redis where creating the instance of ReactiveRedisMessageListenerContainer gets a connection from Lettuce which blocks the main thread until the connection has been established on a Netty-managed thread.

Alternatively, the arrangement may also enforce all of the required singleton beans to be initialized beforehand. Accessing an existing singleton bean does not require the singleton lock, that's just necessary for creating it on demand.

@wilkinsona
Copy link
Member

enforce all of the required singleton beans to be initialized beforehand

That would require changes in Micrometer (Tracing). Boot currently defers the creation of the Tracer using ObjectProvider and a singleton supplier as there's a cycle when Prometheus Examplars are involved because the MeterRegistry can depend on the Tracer (Exemplars) and the Tracer can depend on the MeterRegistry (recording metrics).

/cc @jonatan-ivanov

@snicoll
Copy link
Member

snicoll commented Jun 11, 2024

I can confirm the sample fails as Andy described with Spring Boot 3.3.1-SNAPSHOT and the modifications I've added to make it build cleanly with Spring Framework 6.2.0-SNAPSHOT. The exception is thrown from an area that was changed in 6.2.x so it might not be totally related and perhaps the lock still holds once we're past that.

The problem is in org.springframework.cloud.function.context.config.ContextFunctionCatalogAutoConfiguration.PlainFunctionScanConfiguration that does component scan while also checking if a FunctionCatalog bean is present. I've created spring-cloud/spring-cloud-function#1149

@snicoll
Copy link
Member

snicoll commented Jun 11, 2024

@jhoeller with ContextFunctionCatalogAutoConfiguration excluded, the application starts cleanly with Spring Framework 6.2.0-SNAPSHOT.

@jhoeller
Copy link
Contributor

Thanks, @snicoll! From a Spring Framework perspective, we seem to have a strategic solution for these kinds of scenarios as of 6.2 then.

Against 6.1.x, it will have to be on the workarounds suggested above, either in Spring Data or in Micrometer. As mentioned, it is generally recommendable to relax such asynchronous init method arrangements and not rely on the container's locking behavior there. Any such change made in Spring Data or Micrometer will make sense against 6.2 as well, even if not strictly necessary there.

@wilkinsona
Copy link
Member

@jonatan-ivanov @mp911de, can you please take a look at this? With spring-projects/spring-data-redis#2814 having previously been closed in favor of a Boot issue, we're in danger of going round in circles without identifying a fix.

@mp911de
Copy link
Member

mp911de commented Jun 11, 2024

Spring Data Redis isn't involved much. Tracing is configured on the Lettuce driver through MicrometerCommandLatencyRecorder that holds an instance of an object that circles back into the Spring context.
My envisioned fix would be moving observability bits into an earlier phase and a facility, that isn't related to context bootstrapping, similar to ApplicationStartup.

Any asynchronous database activity can suffer from that arrangement; Database activity during the container bootstrap happens on a regular basis and the number of drivers utilizing netty or any other async approach is growing.

From that perspective, I would welcome anything that removes the need for asynchronously fetching beans from a container that is being started.

@wilkinsona
Copy link
Member

Until a fix in Framework is available, I think we may have to make changes in several areas to get things working fully.

Maybe we can move key expiration listener initialization from bean initialization step (in RedisKeyValueAdapter) to phase when application context will be fully configured? This should prevent deadlock. The current scenario, if we use the EnableKeyspaceEvents.ON_STARTUP option, leads to the creation of the key expiration listener and the initiation of requests to Redis during bean initialization, and as a result we have deadlock with lazy Tracer initializing.

Is this suggestion from spring-projects/spring-data-redis#2814 not viable? Similarly, avoiding the connection retrieval in the constructor of ReactiveRedisMessageListenerContainer would also help I think.

@wilkinsona
Copy link
Member

From a Framework perspective, I think this issue can be closed. There's a fix coming in 6.2 and nothing, AIUI, that can be done in 6.1.x.

We've got spring-projects/spring-boot#40972 that's tracking the problem on the Boot side (although it may need to move to Micrometer (Tracing) as Boot's code is just trying to work around the MeterRegistry <-> Tracer cycle), and I'd welcome the re-opening of spring-projects/spring-data-redis#2814 to track adoption of the suggestion to "relax such asynchronous init method arrangements and not rely on the container's locking behavior there".

@snicoll
Copy link
Member

snicoll commented Jun 11, 2024

Thanks for the summary, Andy.

@snicoll snicoll closed this as not planned Won't fix, can't repro, duplicate, stale Jun 11, 2024
@snicoll snicoll added for: external-project Needs a fix in external project and removed status: waiting-for-triage An issue we've not yet triaged or decided on labels Jun 11, 2024
@wilkinsona
Copy link
Member

@jhengy, judging by your reproducer, you may be able to work around the problem by defining an eager SpanContextSupplier:

	@Bean
	TracerSpanContextSupplier spanContextSuppler(Tracer tracer) {
		return new TracerSpanContextSupplier(tracer);
	}
	
	static class TracerSpanContextSupplier implements SpanContextSupplier {

		private final Tracer tracer;

		TracerSpanContextSupplier(Tracer tracer) {
			this.tracer = tracer;
		}

		@Override
		public String getTraceId() {
			Span currentSpan = currentSpan();
			return (currentSpan != null) ? currentSpan.context().traceId() : null;
		}

		@Override
		public String getSpanId() {
			Span currentSpan = currentSpan();
			return (currentSpan != null) ? currentSpan.context().spanId() : null;
		}

		@Override
		public boolean isSampled() {
			Span currentSpan = currentSpan();
			if (currentSpan == null) {
				return false;
			}
			Boolean sampled = currentSpan.context().sampled();
			return sampled != null && sampled;
		}

		private Span currentSpan() {
			return this.tracer.currentSpan();
		}

	}

It starts up successfully with this bean in place.

@rethab
Copy link

rethab commented Aug 21, 2024

For spring boot 3.3, since the SpanContextSupplier bean was replaced with a SpanContext bean, it looks like we need to do the same trick, but then with the SpanContext?

This seems to work for me:

    @Bean
    SpanContext spanContext(Tracer tracer) {
        return new TracingSpanContext(tracer);
    }

    static class TracingSpanContext implements SpanContext {

        private final Tracer tracer;

        TracingSpanContext(Tracer tracer) {
            this.tracer = tracer;
        }

        @Override
        public String getCurrentTraceId() {
            Span currentSpan = currentSpan();
            return (currentSpan != null) ? currentSpan.context().traceId() : null;
        }

        @Override
        public String getCurrentSpanId() {
            Span currentSpan = currentSpan();
            return (currentSpan != null) ? currentSpan.context().spanId() : null;
        }

        @Override
        public boolean isCurrentSpanSampled() {
            Span currentSpan = currentSpan();
            if (currentSpan == null) {
                return false;
            }
            Boolean sampled = currentSpan.context().sampled();
            return sampled != null && sampled;
        }

        @Override
        public void markCurrentSpanAsExemplar() {
        }

        private Span currentSpan() {
            return this.tracer.currentSpan();
        }
    }

@rethab
Copy link

rethab commented Aug 23, 2024

Would it be possible to allow disabling exemplar support entirely @jonatan-ivanov ?

I see PrometheusExemplarsAutoConfiguration has the @ConditionalOnClass({ Tracer.class, SpanContext.class }), which made me think I could just exclude the dependency io.prometheus:prometheus-metrics-tracer-common, but then PrometheusMetricsExportAutoConfiguration fails.

@jonatan-ivanov
Copy link
Member

jonatan-ivanov commented Aug 23, 2024

In Micrometer, if you don't provide SpanContext it will not sample Exemplars. I guess the feature you are asking could be a Boot property to not auto-configure it.

You can also create your own @Bean of PrometheusMeterRegistry where you don't inject any SpanContext which disables Exemplars.

You can also create your own @Bean of SpanContext where you return false in isCurrentSpanSampled and null in the getters.

It's a hack but Micrometer and Boot also supports some of the Prometheus Client Properties, if you set sampleIntervalMilliseconds high enough you effectively disable Exemplars. Also, this will not solve the bean creation issue mentioned above since auto-configuration will do the same.

If you want to exclude dependencies, you need to exclude Micrometer Tracing otherwise auto-configuration will try to create a SpanContext for it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
for: external-project Needs a fix in external project in: core Issues in core modules (aop, beans, core, context, expression)
Projects
None yet
Development

No branches or pull requests