
[BUG] Spring Boot application sometimes does not shut down after closing client and Spring context #43345

Open
nateha1984 opened this issue Dec 11, 2024 · 13 comments
Labels: azure-spring, azure-spring-servicebus, customer-reported, question

@nateha1984

Describe the bug
Occasionally the Service Bus connection prevents a Spring Boot application from shutting down after the client is closed. We have an application that is started by a KEDA event to consume one message from a Service Bus queue. Most of the time, the application/job starts in AKS, consumes the message, and then we close the client and the Spring context, which shuts down the app and thus stops the container. At times, something appears to fail in the client close code, preventing the Spring app (and thus the container) from stopping, so the job keeps running until we manually kill the pod.

When this happens, we see the messages below in the logs, but no exception stack trace is logged (just the Azure SDK JSON messages).

Exception or Stack Trace

2024-12-11T17:14:14.985Z  WARN 1 --- [           main] c.a.c.a.i.AmqpChannelProcessor           : {"az.sdk.message":"Cannot subscribe. Processor is already terminated.","exception":"Cannot subscribe. Processor is already terminated.","connectionId":"MF_b743c7_1733937252434","entityPath":"$cbs"}
2024-12-11T17:14:14.987Z  INFO 1 --- [           main] c.a.c.a.i.ReactorConnectionCache         : {"az.sdk.message":"connection close finished with error.","exception":"Cannot subscribe. Processor is already terminated.","entityPath":"N/A","connectionId":"MF_b743c7_1733937252434"}
2024-12-11T17:19:14.927Z  INFO 1 --- [ctor-executor-1] c.a.c.a.i.handler.ConnectionHandler      : {"az.sdk.message":"onConnectionRemoteClose","connectionId":"MF_b743c7_1733937252434","errorCondition":"amqp:connection:forced","errorDescription":"The connection was closed by container '7e9fe2d10e6941f592878859b2be6e61_G6' because it did not have any active links in the past 300000 milliseconds. TrackingId:7e9fe2d10e6941f592878859b2be6e61_G6, SystemTracker:gateway5, Timestamp:2024-12-11T17:19:14","hostName":"xxx-xxx-xxx.servicebus.windows.net"}

When the application DOES shutdown successfully, we see these messages at the end:

2024-12-11T16:30:16.364Z  INFO 1 --- [ctor-executor-1] c.a.c.a.implementation.ReactorExecutor   : {"az.sdk.message":"Processing all pending tasks and closing old reactor.","connectionId":"MF_31f861_1733934549999"}
2024-12-11T16:30:16.364Z  INFO 1 --- [ctor-executor-1] c.a.c.a.i.ReactorDispatcher              : {"az.sdk.message":"Reactor selectable is being disposed.","connectionId":"MF_31f861_1733934549999"}
2024-12-11T16:30:16.364Z  INFO 1 --- [ctor-executor-1] c.a.c.a.i.ReactorConnection              : {"az.sdk.message":"onConnectionShutdown. Shutting down.","connectionId":"MF_31f861_1733934549999","isTransient":false,"isInitiatedByClient":false,"shutdownMessage":"connectionId[MF_31f861_1733934549999] Reactor selectable is disposed.","namespace":"xxx-xxx-xxx.servicebus.windows.net"}
2024-12-11T16:30:16.366Z  INFO 1 --- [ctor-executor-1] c.a.c.a.i.ReactorConnectionCache         : {"az.sdk.message":"connection close finished.","entityPath":"N/A","connectionId":"MF_31f861_1733934549999"}
2024-12-11T16:30:16.366Z  INFO 1 --- [ctor-executor-1] c.a.c.a.i.ReactorConnection              : {"az.sdk.message":"onConnectionShutdown. Shutting down.","connectionId":"MF_31f861_1733934549999","isTransient":false,"isInitiatedByClient":false,"shutdownMessage":"Finished processing pending tasks.","namespace":"xxxx-xxxx-xxxx.servicebus.windows.net"}

To Reproduce
Steps to reproduce the behavior:

Use a Spring Boot CommandLineRunner application to receive a message from an Azure Service Bus queue, close the ServiceBusReceiverClient once the message has been received, then close the context. The error does NOT happen every time.

Code Snippet

    // Abbreviated from the application: processorClient is the injected
    // ServiceBusReceiverClient and processMessage(...) is the business logic.
    public static void main(String[] args) {
        ConfigurableApplicationContext context = SpringApplication.run(Application.class, args);
        context.close();
    }

    @Override
    public void run(String... args) {
        receiveMessage();
    }

    void receiveMessage() {

        log.info("Starting the processor");
        // Receive a single message, then settle it based on the processing result.
        processorClient.receiveMessages(1).forEach(m -> {
            boolean success = processMessage(m);

            if (success) {
                log.info("Completing the message");
                processorClient.complete(m);
            } else {
                log.warn("Abandoning the message");
                processorClient.abandon(m);
            }
        });

        log.info("Stopping and closing the processor");
        processorClient.close();
        log.info("Processor stopped and closed");
    }

Expected behavior
Application shuts down cleanly

Setup (please complete the following information):

  • OS: Chainguard container image for JDK 23
  • Library/Libraries: com.azure.spring:spring-cloud-azure-starter-servicebus:5.16.0
  • Java version: 23
  • App Server/Environment: Azure K8s containers, Spring Boot embedded Netty
  • Frameworks: Spring Boot

Additional context
This looks to be the recommended way to shut down a Spring Boot app in our situation, but we're open to other suggestions that might allow things to close more cleanly/consistently.
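
For reference, one commonly used alternative for short-lived batch-style apps is to exit through SpringApplication.exit(...), so that any ExitCodeGenerator beans run and the context is closed before the JVM terminates. A minimal sketch (not from the original report):

    // Sketch of an alternative shutdown for a run-once app: SpringApplication.exit(...)
    // closes the context and computes an exit code before the JVM terminates.
    public static void main(String[] args) {
        ConfigurableApplicationContext context = SpringApplication.run(Application.class, args);
        System.exit(SpringApplication.exit(context));
    }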

Information Checklist
Kindly make sure that you have added all the following information above and checked off the required fields; otherwise we will treat the issue as an incomplete report.

  • Bug Description Added
  • Repro Steps Added
  • Setup information Added
@joshfree (Member)

@yiliuTo @saragluna could you please follow up?

/cc @anuchandy as fyi

@saragluna (Member)

@anuchandy, have you seen similar error logs before?

@anuchandy (Member)

Hi @saragluna, the first log from AmqpChannelProcessor happens when the connection times out because there is no activity on it. The origin of this log, AmqpChannelProcessor, is deprecated and has been replaced with the more reliable RequestResponseChannelCache in later versions. Has a Spring starter using azure-messaging-servicebus version 7.17.6 or 7.17.7 been released yet? Version 7.17.6+ uses RequestResponseChannelCache by default.

Also, in the usage code there is no try/catch around complete()/abandon(), so if either API throws, control never reaches close() (see the sketch below). Additionally, close() may take up to a minute to shut down the internal state machine.
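
A minimal sketch of the guarded shape being suggested, reusing the names from the snippet above (processMessage(...) and the logging behavior are assumptions carried over from the report):

    // Guard the settlement calls so close() always runs, even when
    // complete()/abandon() throws (e.g., on a lost message lock).
    void receiveMessage() {
        try {
            processorClient.receiveMessages(1).forEach(m -> {
                try {
                    if (processMessage(m)) {
                        processorClient.complete(m);
                    } else {
                        processorClient.abandon(m);
                    }
                } catch (Exception e) {
                    // Swallow settlement failures; the broker redelivers after lock expiry.
                    log.warn("Settling the message failed", e);
                }
            });
        } finally {
            // Always dispose the client; note this can take up to a minute.
            processorClient.close();
        }
    }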

@nateha1984 (Author)

Hi @anuchandy, are you suggesting that the error is coming from the complete()/abandon() calls? The rest of the logs seem to show the close() method getting called, and no errors being thrown from anywhere else (I didn't include all of them because they're so verbose, so I only included the lines that are different). We also don't see any stack traces logged that would indicate an exception was thrown. We will add the catch you mentioned to see if that makes a difference.

Here are the rest of the logs between the abandon call and the differing logs that show up:

2024-12-11T17:14:14.889Z  WARN 1 --- [           main] c.a.c.p.s.f.x.l.QueueMessageListener     : Abandoning the message
2024-12-11T17:14:14.915Z  INFO 1 --- [           main] c.a.c.p.s.f.x.l.QueueMessageListener     : Stopping and closing the processor
2024-12-11T17:14:14.915Z  WARN 1 --- [           main] c.a.c.amqp.implementation.MessageFlux    : {"az.sdk.message":"Downstream cancellation signal arrived at MessageFlux.","messageFlux":"mf_a84048_1733937252404","connectionId":"MF_b743c7_1733937252434","linkName":"xxxx_6677fd_1733937252398","entityPath":"xxxx"}
2024-12-11T17:14:14.919Z  INFO 1 --- [           main] c.a.m.s.ServiceBusReceiverAsyncClient    : Removing receiver links.
2024-12-11T17:14:14.920Z  INFO 1 --- [ctor-executor-1] c.a.c.a.i.handler.ReceiveLinkHandler2    : {"az.sdk.message":"onLinkRemoteClose","connectionId":"MF_b743c7_1733937252434","linkName":"xxxx_6677fd_1733937252398","entityPath":"xxxx"}
2024-12-11T17:14:14.920Z  INFO 1 --- [           main] c.a.m.s.ServiceBusClientBuilder          : {"az.sdk.message":"Closing a client using shared connection cache.","numberOfOpenClients":1}
2024-12-11T17:14:14.920Z  INFO 1 --- [           main] c.a.c.p.s.f.x.l.QueueMessageListener     : Processor stopped and closed
2024-12-11T17:14:14.923Z  INFO 1 --- [           main] c.a.m.s.ServiceBusReceiverAsyncClient    : Removing receiver links.
2024-12-11T17:14:14.923Z  INFO 1 --- [           main] c.a.m.s.ServiceBusClientBuilder          : {"az.sdk.message":"Closing a client using shared connection cache.","numberOfOpenClients":0}
2024-12-11T17:14:14.923Z  INFO 1 --- [           main] c.a.m.s.ServiceBusClientBuilder          : No more open clients, closing shared connection cache.
2024-12-11T17:14:14.923Z  INFO 1 --- [           main] c.a.c.a.i.ReactorConnectionCache         : {"az.sdk.message":"closing connection (Terminating the connection recovery support.).","entityPath":"N/A","connectionId":"MF_b743c7_1733937252434"}
2024-12-11T17:14:14.923Z  INFO 1 --- [           main] c.a.c.a.i.ReactorConnection              : {"az.sdk.message":"Disposing of ReactorConnection.","connectionId":"MF_b743c7_1733937252434","isTransient":false,"isInitiatedByClient":true,"shutdownMessage":"Disposed by client."}
2024-12-11T17:14:14.924Z  INFO 1 --- [ctor-executor-1] c.a.c.a.i.handler.ReceiveLinkHandler2    : {"az.sdk.message":"onLinkFinal","connectionId":"MF_b743c7_1733937252434","linkName":"xxxx_6677fd_1733937252398","entityPath":"xxxx"}
2024-12-11T17:14:14.928Z  INFO 1 --- [ctor-executor-1] c.a.c.a.i.handler.SendLinkHandler        : {"az.sdk.message":"onLinkRemoteClose","connectionId":"MF_b743c7_1733937252434","linkName":"cbs:sender","entityPath":"$cbs"}
2024-12-11T17:14:14.929Z  INFO 1 --- [ctor-executor-1] c.a.c.a.i.handler.ReceiveLinkHandler2    : {"az.sdk.message":"onLinkRemoteClose","connectionId":"MF_b743c7_1733937252434","linkName":"cbs:receiver","entityPath":"$cbs"}
2024-12-11T17:14:14.929Z  INFO 1 --- [ctor-executor-1] c.a.c.a.i.AmqpChannelProcessor           : {"az.sdk.message":"Channel is closed. Requesting upstream.","connectionId":"MF_b743c7_1733937252434","entityPath":"$cbs"}
2024-12-11T17:14:14.929Z  INFO 1 --- [ctor-executor-1] c.a.c.a.i.AmqpChannelProcessor           : {"az.sdk.message":"Connection not requested, yet. Requesting one.","connectionId":"MF_b743c7_1733937252434","entityPath":"$cbs"}
2024-12-11T17:14:14.930Z  INFO 1 --- [ctor-executor-1] c.a.c.a.i.ReactorConnection              : {"az.sdk.message":"Emitting new response channel.","connectionId":"MF_b743c7_1733937252434","entityPath":"$cbs","linkName":"cbs"}
2024-12-11T17:14:14.930Z  INFO 1 --- [ctor-executor-1] c.a.c.a.i.AmqpChannelProcessor           : {"az.sdk.message":"Setting next AMQP channel.","connectionId":"MF_b743c7_1733937252434","entityPath":"$cbs"}
2024-12-11T17:14:14.930Z  INFO 1 --- [ctor-executor-1] c.a.c.a.i.AmqpChannelProcessor           : {"az.sdk.message":"Upstream connection publisher was completed. Terminating processor.","connectionId":"MF_b743c7_1733937252434","entityPath":"$cbs"}
2024-12-11T17:14:14.985Z  INFO 1 --- [ctor-executor-1] c.a.c.a.i.handler.SendLinkHandler        : {"az.sdk.message":"Sender link was never active. Closing endpoint states.","connectionId":"MF_b743c7_1733937252434","linkName":"cbs","entityPath":"$cbs"}

@anuchandy (Member)

Hello @nateha1984, my main note was that we should ensure close() is always called when the app decides to tear down.

For a run where the container remains without exiting, could you share 15 minutes of debug logs before the app calls close() and 5 minutes following the close() call, i.e., -15/+5 around the close call of an "impacted" container? Also, 10 minutes after calling close() on the same impacted container, please take a heap dump and share the names of all active threads.
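
If taking a full heap dump is inconvenient, the live thread names can also be listed from within the JVM using standard Java APIs; a small sketch:

    // Print the names of all live threads (an alternative to reading them
    // out of a heap dump).
    Thread.getAllStackTraces().keySet()
            .forEach(t -> System.out.println(t.getName()));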

@nateha1984 (Author)

Hi @anuchandy, we're trying to reproduce the issue in our environment but have had limited success (the one time it did happen, someone had turned off debug logging...). I can say that we won't have 15 minutes of run time before the app calls close(). It's a very short-lived app and usually completes within a minute or less.

Just wanted to touch base so you know we're working on getting this info for you.

@nateha1984 (Author)

We got a heap dump; here are the names of the threads:

reactor-executor-1
reactor-http-epoll-1
oneagentperiodiceventsmanaged
reactor-http-epoll-2
reactor-http-epoll-3
reactor-http-epoll-4
oneagentsoftwarecomponentextraction
parallel-1
boundedElastic-evictor-1
oneagentsubpathsender 1.303.62.20241129-131342
oneagentallocationprofiling
Signal Dispatcher
C2 CompilerThread0
C1 CompilerThread0
oneagentperiodicurls
oneagentperiodicrequests
Notification Thread
oneagentautosensor
Attach Listener
DestroyJavaVM
SpringApplicationShutdownHook
Cleaner-0
Finalizer
Logging-Cleaner
Reference Handler
oneagentjavashutdown
Common-Cleaner

@anuchandy (Member)

Hello @nateha1984, thank you for the update. Was the heap dump taken from a container that didn't exit 10+ minutes after calling close()? The reactor-executor-1 thread is the IO thread for all Service Bus calls and is supposed to be disposed eventually.

Also, I just noticed that the Spring team released spring-cloud-azure-starter-servicebus:5.19.0. This version of the Spring package uses azure-messaging-servicebus:7.17.7, which has many reliability improvements enabled by default (aka the v2 stack). Could you upgrade to this version?

If version 7.17.7 is loaded correctly (without any older version affecting the runtime), there should not be any logs from AmqpChannelProcessor; logs from RequestResponseChannelCache should appear instead.
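
One hedged way to confirm at runtime which azure-messaging-servicebus jar was actually loaded (this relies on the jar's Implementation-Version manifest entry being populated):

    // Log the version of the azure-messaging-servicebus jar on the classpath.
    String version = com.azure.messaging.servicebus.ServiceBusReceiverClient.class
            .getPackage().getImplementationVersion();
    log.info("azure-messaging-servicebus version: {}", version);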

@nateha1984 (Author)

Hi @anuchandy, we did the upgrade, and I no longer see the reactor-executor-1 thread as active; the logs also aren't showing errors from AmqpChannelProcessor. Unfortunately we are still seeing hanging pods, though we've since added the spring-cloud-azure-stream-binder-eventhubs:5.19.0 dependency as we're actively building out this application. I do see errors related to AmqpChannelProcessor, but they're coming from a different thread, reactor-executor-2, which appears to be related to the Event Hubs connection.

The addition of the EH dependency seems to be causing the hang pretty consistently. In this case, we have pods still running Monday morning that were instantiated Friday afternoon. We're using the reactive Sinks method for sending the EH messages instead of the Azure EH client directly (see the sketch below), so we're not managing the connection manually. Do we need to handle the EH connection a different way?
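
For context, a minimal sketch of the Sinks-based send pattern being described, assuming Spring Cloud Stream function bindings (the function name and payload type here are illustrative, not from the report):

    // The application emits into a buffered sink; the Event Hubs binder
    // subscribes to the Flux exposed by the Supplier and forwards emissions.
    @Bean
    public Sinks.Many<Message<String>> many() {
        return Sinks.many().unicast().onBackpressureBuffer();
    }

    @Bean
    public Supplier<Flux<Message<String>>> supply(Sinks.Many<Message<String>> many) {
        return many::asFlux;
    }

    void send(Sinks.Many<Message<String>> many, String payload) {
        // FAIL_FAST surfaces emission failures instead of retrying.
        many.emitNext(MessageBuilder.withPayload(payload).build(),
                Sinks.EmitFailureHandler.FAIL_FAST);
    }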

Here are the active threads we see in the dump now:

reactor-executor-2
reactor-http-epoll-1
reactor-http-epoll-2
parallel-1
oneagentperiodiceventsmanaged
reactor-http-epoll-3
reactor-http-epoll-4
oneagentsoftwarecomponentextraction
boundedElastic-evictor-1
oneagentsubpathsender 1.303.62.20241129-131342
oneagentallocationprofiling
Signal Dispatcher
C2 CompilerThread0
C1 CompilerThread0
oneagentperiodicrequests
oneagentperiodicurls
Notification Thread
oneagentautosensor
DestroyJavaVM
Attach Listener
SpringApplicationShutdownHook
Cleaner-0
Finalizer
Logging-Cleaner
Reference Handler
oneagentjavashutdown
Common-Cleaner

@anuchandy (Member) commented Jan 7, 2025

Hi @nateha1984, good to hear that the IO thread is being cleaned up with the new azure-messaging-servicebus. We are in the process of releasing an azure-messaging-eventhubs that avoids using the AmqpChannelProcessor type by default. Once azure-messaging-eventhubs (and the other SDKs) are released, the Spring team will release spring-cloud-azure-stream-binder-eventhubs. The releases should be completed in the next 2-3 weeks, based on the January SDK release cadence.

To answer your question: I believe the Spring library spring-cloud-azure-stream-binder-eventhubs, which abstracts the Event Hubs client behind the Sink API, will handle closing the client automatically, but I will let @saragluna confirm this, as I'm not familiar with the Spring internals.

@saragluna (Member)

Yes, the binder will close the processor clients (when consuming messages):

    @Override
    public void stop() {
        LOG.debug("Stopping container with name {}", getBeanName());
        synchronized (this.lifecycleMonitor) {
            this.running = false;
            this.lifecycleMonitor.notifyAll();
        }
        doStop();
    }

But for the sending side, we don't register a stop method (a possible application-level workaround is sketched below).
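
Until the binder registers a producer-side stop, one hypothetical application-level workaround is to close the producer from a Spring lifecycle callback; this assumes the application can obtain the producer client as a bean, which may not match the binder's internals:

    // Hypothetical workaround: dispose the Event Hubs producer when the
    // Spring context shuts down. The bean wiring here is an assumption.
    @Component
    class EventHubProducerCloser implements DisposableBean {

        private final EventHubProducerAsyncClient producer;

        EventHubProducerCloser(EventHubProducerAsyncClient producer) {
            this.producer = producer;
        }

        @Override
        public void destroy() {
            producer.close(); // releases the underlying AMQP connection
        }
    }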

@nateha1984 (Author)

OK, so if we're only sending to an event hub and need to close the client, we cannot use the binders/Sink API? Should we use the Event Hubs client directly instead?

Are there plans to shut down the sending side of the client with the Sink API as well? To be honest, I don't understand why only one direction is shut down this way.

@saragluna (Member)

I think we could fix this.
