
[BUG] Spring Boot application sometimes does not shut down after closing client and Spring context #43345

Open
nateha1984 opened this issue Dec 11, 2024 · 13 comments
Labels: azure-spring, azure-spring-servicebus, customer-reported, question

@nateha1984

Describe the bug
Occasionally the Service Bus connection prevents a Spring Boot application from shutting down after the client is closed. We have an application that is started by a KEDA event to consume one message from a Service Bus queue. Most of the time, the application/job starts in AKS, consumes the message, and then we close the client and the Spring context, which shuts down the app and thus stops the container. At times, something appears to fail in the client close code, preventing the Spring app (and thus the container) from stopping, so the job keeps running until we manually kill the pod.

When this happens, we see the messages below in the logs, but no exception stack trace is logged (just the Azure SDK JSON messages).

Exception or Stack Trace

2024-12-11T17:14:14.985Z  WARN 1 --- [           main] c.a.c.a.i.AmqpChannelProcessor           : {"az.sdk.message":"Cannot subscribe. Processor is already terminated.","exception":"Cannot subscribe. Processor is already terminated.","connectionId":"MF_b743c7_1733937252434","entityPath":"$cbs"}
2024-12-11T17:14:14.987Z  INFO 1 --- [           main] c.a.c.a.i.ReactorConnectionCache         : {"az.sdk.message":"connection close finished with error.","exception":"Cannot subscribe. Processor is already terminated.","entityPath":"N/A","connectionId":"MF_b743c7_1733937252434"}
2024-12-11T17:19:14.927Z  INFO 1 --- [ctor-executor-1] c.a.c.a.i.handler.ConnectionHandler      : {"az.sdk.message":"onConnectionRemoteClose","connectionId":"MF_b743c7_1733937252434","errorCondition":"amqp:connection:forced","errorDescription":"The connection was closed by container '7e9fe2d10e6941f592878859b2be6e61_G6' because it did not have any active links in the past 300000 milliseconds. TrackingId:7e9fe2d10e6941f592878859b2be6e61_G6, SystemTracker:gateway5, Timestamp:2024-12-11T17:19:14","hostName":"xxx-xxx-xxx.servicebus.windows.net"}

When the application DOES shutdown successfully, we see these messages at the end:

2024-12-11T16:30:16.364Z  INFO 1 --- [ctor-executor-1] c.a.c.a.implementation.ReactorExecutor   : {"az.sdk.message":"Processing all pending tasks and closing old reactor.","connectionId":"MF_31f861_1733934549999"}
2024-12-11T16:30:16.364Z  INFO 1 --- [ctor-executor-1] c.a.c.a.i.ReactorDispatcher              : {"az.sdk.message":"Reactor selectable is being disposed.","connectionId":"MF_31f861_1733934549999"}
2024-12-11T16:30:16.364Z  INFO 1 --- [ctor-executor-1] c.a.c.a.i.ReactorConnection              : {"az.sdk.message":"onConnectionShutdown. Shutting down.","connectionId":"MF_31f861_1733934549999","isTransient":false,"isInitiatedByClient":false,"shutdownMessage":"connectionId[MF_31f861_1733934549999] Reactor selectable is disposed.","namespace":"xxx-xxx-xxx.servicebus.windows.net"}
2024-12-11T16:30:16.366Z  INFO 1 --- [ctor-executor-1] c.a.c.a.i.ReactorConnectionCache         : {"az.sdk.message":"connection close finished.","entityPath":"N/A","connectionId":"MF_31f861_1733934549999"}
2024-12-11T16:30:16.366Z  INFO 1 --- [ctor-executor-1] c.a.c.a.i.ReactorConnection              : {"az.sdk.message":"onConnectionShutdown. Shutting down.","connectionId":"MF_31f861_1733934549999","isTransient":false,"isInitiatedByClient":false,"shutdownMessage":"Finished processing pending tasks.","namespace":"xxxx-xxxx-xxxx.servicebus.windows.net"}

To Reproduce
Steps to reproduce the behavior:

Use a Spring Boot CommandLineRunner application to receive a message from an Azure Service Bus queue, close the ServiceBusReceiverClient once the message has been received, then close the context. The error does NOT happen every time.

Code Snippet

    // Abbreviated from the application: processorClient is the injected
    // ServiceBusReceiverClient and processMessage(...) is the business logic.
    public static void main(String[] args) {
        ConfigurableApplicationContext context = SpringApplication.run(Application.class, args);
        context.close();
    }

    @Override
    public void run(String... args) {
        receiveMessage();
    }

    void receiveMessage() {

        log.info("Starting the processor");
        // Receive a single message, then settle it based on the processing result.
        processorClient.receiveMessages(1).forEach(m -> {
            boolean success = processMessage(m);

            if (success) {
                log.info("Completing the message");
                processorClient.complete(m);
            } else {
                log.warn("Abandoning the message");
                processorClient.abandon(m);
            }
        });

        log.info("Stopping and closing the processor");
        processorClient.close();
        log.info("Processor stopped and closed");
    }

Expected behavior
Application shuts down cleanly

Setup (please complete the following information):

  • OS: Chainguard container image for JDK 23
  • Library/Libraries: com.azure.spring:spring-cloud-azure-starter-servicebus:5.16.0
  • Java version: 23
  • App Server/Environment: Azure K8s containers, Spring Boot embedded Netty
  • Frameworks: Spring Boot

Additional context
This looks to be the recommended way to shut down a Spring Boot app in our situation, but we're open to other suggestions that might allow things to close more cleanly/consistently.
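
For reference, one commonly used alternative for short-lived batch-style apps is to exit through SpringApplication.exit(...), so that any ExitCodeGenerator beans run and the context is closed before the JVM terminates. A minimal sketch (not from the original report):

    // Sketch of an alternative shutdown for a run-once app: SpringApplication.exit(...)
    // closes the context and computes an exit code before the JVM terminates.
    public static void main(String[] args) {
        ConfigurableApplicationContext context = SpringApplication.run(Application.class, args);
        System.exit(SpringApplication.exit(context));
    }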

Information Checklist
Kindly make sure that you have added all the following information above and checked off the required fields; otherwise we will treat the issue as an incomplete report.

  • Bug Description Added
  • Repro Steps Added
  • Setup information Added
@joshfree (Member)

@yiliuTo @saragluna could you please follow up?

/cc @anuchandy as fyi

@saragluna (Member)

@anuchandy, have you seen similar error logs before?

@anuchandy (Member)

Hi @saragluna, the first log from AmqpChannelProcessor happens when the connection times out because there is no activity on it. The origin of this log, AmqpChannelProcessor, is deprecated and has been replaced with the more reliable RequestResponseChannelCache in later versions. Has a Spring starter using azure-messaging-servicebus version 7.17.6 or 7.17.7 been released yet? Version 7.17.6+ uses RequestResponseChannelCache by default.

Also, in the usage code there is no try/catch around complete()/abandon(), so if either API throws, control never reaches close() (see the sketch below). Additionally, close() may take up to a minute to shut down the internal state machine.
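
A minimal sketch of the guarded shape being suggested, reusing the names from the snippet above (processMessage(...) and the logging behavior are assumptions carried over from the report):

    // Guard the settlement calls so close() always runs, even when
    // complete()/abandon() throws (e.g., on a lost message lock).
    void receiveMessage() {
        try {
            processorClient.receiveMessages(1).forEach(m -> {
                try {
                    if (processMessage(m)) {
                        processorClient.complete(m);
                    } else {
                        processorClient.abandon(m);
                    }
                } catch (Exception e) {
                    // Swallow settlement failures; the broker redelivers after lock expiry.
                    log.warn("Settling the message failed", e);
                }
            });
        } finally {
            // Always dispose the client; note this can take up to a minute.
            processorClient.close();
        }
    }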

@nateha1984 (Author)

Hi @anuchandy, are you suggesting that the error is coming from the complete()/abandon() calls? The rest of the logs seem to show the close() method getting called, and no errors being thrown from anywhere else (I didn't include all of them because they're so verbose, so I only included the lines that are different). We also don't see any stack traces logged that would indicate an exception was thrown. We will add the catch you mentioned to see if that makes a difference.

Here are the rest of the logs between the abandon call and the differing logs that show up:

2024-12-11T17:14:14.889Z  WARN 1 --- [           main] c.a.c.p.s.f.x.l.QueueMessageListener     : Abandoning the message
2024-12-11T17:14:14.915Z  INFO 1 --- [           main] c.a.c.p.s.f.x.l.QueueMessageListener     : Stopping and closing the processor
2024-12-11T17:14:14.915Z  WARN 1 --- [           main] c.a.c.amqp.implementation.MessageFlux    : {"az.sdk.message":"Downstream cancellation signal arrived at MessageFlux.","messageFlux":"mf_a84048_1733937252404","connectionId":"MF_b743c7_1733937252434","linkName":"xxxx_6677fd_1733937252398","entityPath":"xxxx"}
2024-12-11T17:14:14.919Z  INFO 1 --- [           main] c.a.m.s.ServiceBusReceiverAsyncClient    : Removing receiver links.
2024-12-11T17:14:14.920Z  INFO 1 --- [ctor-executor-1] c.a.c.a.i.handler.ReceiveLinkHandler2    : {"az.sdk.message":"onLinkRemoteClose","connectionId":"MF_b743c7_1733937252434","linkName":"xxxx_6677fd_1733937252398","entityPath":"xxxx"}
2024-12-11T17:14:14.920Z  INFO 1 --- [           main] c.a.m.s.ServiceBusClientBuilder          : {"az.sdk.message":"Closing a client using shared connection cache.","numberOfOpenClients":1}
2024-12-11T17:14:14.920Z  INFO 1 --- [           main] c.a.c.p.s.f.x.l.QueueMessageListener     : Processor stopped and closed
2024-12-11T17:14:14.923Z  INFO 1 --- [           main] c.a.m.s.ServiceBusReceiverAsyncClient    : Removing receiver links.
2024-12-11T17:14:14.923Z  INFO 1 --- [           main] c.a.m.s.ServiceBusClientBuilder          : {"az.sdk.message":"Closing a client using shared connection cache.","numberOfOpenClients":0}
2024-12-11T17:14:14.923Z  INFO 1 --- [           main] c.a.m.s.ServiceBusClientBuilder          : No more open clients, closing shared connection cache.
2024-12-11T17:14:14.923Z  INFO 1 --- [           main] c.a.c.a.i.ReactorConnectionCache         : {"az.sdk.message":"closing connection (Terminating the connection recovery support.).","entityPath":"N/A","connectionId":"MF_b743c7_1733937252434"}
2024-12-11T17:14:14.923Z  INFO 1 --- [           main] c.a.c.a.i.ReactorConnection              : {"az.sdk.message":"Disposing of ReactorConnection.","connectionId":"MF_b743c7_1733937252434","isTransient":false,"isInitiatedByClient":true,"shutdownMessage":"Disposed by client."}
2024-12-11T17:14:14.924Z  INFO 1 --- [ctor-executor-1] c.a.c.a.i.handler.ReceiveLinkHandler2    : {"az.sdk.message":"onLinkFinal","connectionId":"MF_b743c7_1733937252434","linkName":"xxxx_6677fd_1733937252398","entityPath":"xxxx"}
2024-12-11T17:14:14.928Z  INFO 1 --- [ctor-executor-1] c.a.c.a.i.handler.SendLinkHandler        : {"az.sdk.message":"onLinkRemoteClose","connectionId":"MF_b743c7_1733937252434","linkName":"cbs:sender","entityPath":"$cbs"}
2024-12-11T17:14:14.929Z  INFO 1 --- [ctor-executor-1] c.a.c.a.i.handler.ReceiveLinkHandler2    : {"az.sdk.message":"onLinkRemoteClose","connectionId":"MF_b743c7_1733937252434","linkName":"cbs:receiver","entityPath":"$cbs"}
2024-12-11T17:14:14.929Z  INFO 1 --- [ctor-executor-1] c.a.c.a.i.AmqpChannelProcessor           : {"az.sdk.message":"Channel is closed. Requesting upstream.","connectionId":"MF_b743c7_1733937252434","entityPath":"$cbs"}
2024-12-11T17:14:14.929Z  INFO 1 --- [ctor-executor-1] c.a.c.a.i.AmqpChannelProcessor           : {"az.sdk.message":"Connection not requested, yet. Requesting one.","connectionId":"MF_b743c7_1733937252434","entityPath":"$cbs"}
2024-12-11T17:14:14.930Z  INFO 1 --- [ctor-executor-1] c.a.c.a.i.ReactorConnection              : {"az.sdk.message":"Emitting new response channel.","connectionId":"MF_b743c7_1733937252434","entityPath":"$cbs","linkName":"cbs"}
2024-12-11T17:14:14.930Z  INFO 1 --- [ctor-executor-1] c.a.c.a.i.AmqpChannelProcessor           : {"az.sdk.message":"Setting next AMQP channel.","connectionId":"MF_b743c7_1733937252434","entityPath":"$cbs"}
2024-12-11T17:14:14.930Z  INFO 1 --- [ctor-executor-1] c.a.c.a.i.AmqpChannelProcessor           : {"az.sdk.message":"Upstream connection publisher was completed. Terminating processor.","connectionId":"MF_b743c7_1733937252434","entityPath":"$cbs"}
2024-12-11T17:14:14.985Z  INFO 1 --- [ctor-executor-1] c.a.c.a.i.handler.SendLinkHandler        : {"az.sdk.message":"Sender link was never active. Closing endpoint states.","connectionId":"MF_b743c7_1733937252434","linkName":"cbs","entityPath":"$cbs"}

@anuchandy (Member)

Hello @nateha1984, my main note was that we should ensure close() is always called when the app decides to tear down.

For a run where the container remains without exiting, could you share 15 minutes of debug logs before the app calls close() and 5 minutes following the close() call, i.e., -15/+5 around the close call of an "impacted" container? Also, 10 minutes after calling close() on the same impacted container, please take a heap dump and share the names of all active threads.
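
If taking a full heap dump is inconvenient, the live thread names can also be listed from within the JVM using standard Java APIs; a small sketch:

    // Print the names of all live threads (an alternative to reading them
    // out of a heap dump).
    Thread.getAllStackTraces().keySet()
            .forEach(t -> System.out.println(t.getName()));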

@nateha1984 (Author)

Hi @anuchandy, we're trying to reproduce the issue in our environment but have had limited success (the one time it did happen, someone had turned off debug logging...). I can say that we won't have 15 minutes of run time before the app calls close(). It's a very short-lived app and usually completes within a minute or less.

Just wanted to touch base so you know we're working on getting this info for you.

@nateha1984 (Author)

We got a heap dump; here are the names of the threads:

reactor-executor-1
reactor-http-epoll-1
oneagentperiodiceventsmanaged
reactor-http-epoll-2
reactor-http-epoll-3
reactor-http-epoll-4
oneagentsoftwarecomponentextraction
parallel-1
boundedElastic-evictor-1
oneagentsubpathsender 1.303.62.20241129-131342
oneagentallocationprofiling
Signal Dispatcher
C2 CompilerThread0
C1 CompilerThread0
oneagentperiodicurls
oneagentperiodicrequests
Notification Thread
oneagentautosensor
Attach Listener
DestroyJavaVM
SpringApplicationShutdownHook
Cleaner-0
Finalizer
Logging-Cleaner
Reference Handler
oneagentjavashutdown
Common-Cleaner

@anuchandy (Member)

Hello @nateha1984, thank you for the update. Was the heap dump taken from a container that didn't exit 10+ minutes after calling close()? The reactor-executor-1 thread is the IO thread for all Service Bus calls and is supposed to be disposed eventually.

Also, I just noticed that the Spring team released spring-cloud-azure-starter-servicebus:5.19.0. This version of the Spring package uses azure-messaging-servicebus:7.17.7, which has many reliability improvements enabled by default (aka the v2 stack). Could you upgrade to this version?

If version 7.17.7 is loaded correctly (without any older version affecting the runtime), there should not be any logs from AmqpChannelProcessor; logs from RequestResponseChannelCache should appear instead.
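
One hedged way to confirm at runtime which azure-messaging-servicebus jar was actually loaded (this relies on the jar's Implementation-Version manifest entry being populated):

    // Log the version of the azure-messaging-servicebus jar on the classpath.
    String version = com.azure.messaging.servicebus.ServiceBusReceiverClient.class
            .getPackage().getImplementationVersion();
    log.info("azure-messaging-servicebus version: {}", version);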

@nateha1984 (Author)

Hi @anuchandy, we did the upgrade, and I no longer see the reactor-executor-1 thread as active; the logs also aren't showing errors from AmqpChannelProcessor. Unfortunately we are still seeing hanging pods, though we've since added the spring-cloud-azure-stream-binder-eventhubs:5.19.0 dependency as we're actively building out this application. I do see errors related to AmqpChannelProcessor, but they're coming from a different thread, reactor-executor-2, which appears to be related to the Event Hubs connection.

The addition of the EH dependency seems to be causing the hang pretty consistently. In this case, we have pods still running Monday morning that were instantiated Friday afternoon. We're using the reactive Sinks method for sending the EH messages instead of the Azure EH client directly (see the sketch below), so we're not managing the connection manually. Do we need to handle the EH connection a different way?
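
For context, a minimal sketch of the Sinks-based send pattern being described, assuming Spring Cloud Stream function bindings (the function name and payload type here are illustrative, not from the report):

    // The application emits into a buffered sink; the Event Hubs binder
    // subscribes to the Flux exposed by the Supplier and forwards emissions.
    @Bean
    public Sinks.Many<Message<String>> many() {
        return Sinks.many().unicast().onBackpressureBuffer();
    }

    @Bean
    public Supplier<Flux<Message<String>>> supply(Sinks.Many<Message<String>> many) {
        return many::asFlux;
    }

    void send(Sinks.Many<Message<String>> many, String payload) {
        // FAIL_FAST surfaces emission failures instead of retrying.
        many.emitNext(MessageBuilder.withPayload(payload).build(),
                Sinks.EmitFailureHandler.FAIL_FAST);
    }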

Here are the active threads we see in the dump now:

reactor-executor-2
reactor-http-epoll-1
reactor-http-epoll-2
parallel-1
oneagentperiodiceventsmanaged
reactor-http-epoll-3
reactor-http-epoll-4
oneagentsoftwarecomponentextraction
boundedElastic-evictor-1
oneagentsubpathsender 1.303.62.20241129-131342
oneagentallocationprofiling
Signal Dispatcher
C2 CompilerThread0
C1 CompilerThread0
oneagentperiodicrequests
oneagentperiodicurls
Notification Thread
oneagentautosensor
DestroyJavaVM
Attach Listener
SpringApplicationShutdownHook
Cleaner-0
Finalizer
Logging-Cleaner
Reference Handler
oneagentjavashutdown
Common-Cleaner

@anuchandy (Member) commented Jan 7, 2025

Hi @nateha1984, good to hear that the IO thread is being cleaned up with the new azure-messaging-servicebus. We are in the process of releasing an azure-messaging-eventhubs that avoids using the AmqpChannelProcessor type by default. Once azure-messaging-eventhubs (and the other SDKs) are released, the Spring team will release spring-cloud-azure-stream-binder-eventhubs. The releases should be completed in the next 2-3 weeks, based on the January SDK release cadence.

To answer your question: I believe the Spring library spring-cloud-azure-stream-binder-eventhubs, which abstracts the Event Hubs client behind the Sink API, will handle closing the client automatically, but I will let @saragluna confirm this, as I'm not familiar with the Spring internals.

@saragluna (Member)

Yes, the binder will close the processor clients (when consuming messages):

    @Override
    public void stop() {
        LOG.debug("Stopping container with name {}", getBeanName());
        synchronized (this.lifecycleMonitor) {
            this.running = false;
            this.lifecycleMonitor.notifyAll();
        }
        doStop();
    }

But for the sending side, we don't register a stop method (a possible application-level workaround is sketched below).
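
Until the binder registers a producer-side stop, one hypothetical application-level workaround is to close the producer from a Spring lifecycle callback; this assumes the application can obtain the producer client as a bean, which may not match the binder's internals:

    // Hypothetical workaround: dispose the Event Hubs producer when the
    // Spring context shuts down. The bean wiring here is an assumption.
    @Component
    class EventHubProducerCloser implements DisposableBean {

        private final EventHubProducerAsyncClient producer;

        EventHubProducerCloser(EventHubProducerAsyncClient producer) {
            this.producer = producer;
        }

        @Override
        public void destroy() {
            producer.close(); // releases the underlying AMQP connection
        }
    }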

@nateha1984 (Author)

OK, so if we're only sending to an event hub and need to close the client, we cannot use the binders/Sink API? Should we use the Event Hubs client directly instead?

Are there plans to shut down the sending side of the client with the Sink API as well? To be honest, I don't understand why only one direction is shut down this way.

@saragluna (Member)

I think we could fix this.
