Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Option to not send a connection_close on SIGTERM #170

Open
mruell opened this issue Aug 22, 2024 · 6 comments · May be fixed by #175
Open

Option to not send a connection_close on SIGTERM #170

mruell opened this issue Aug 22, 2024 · 6 comments · May be fixed by #175

Comments

@mruell
Copy link

mruell commented Aug 22, 2024

Hi there!

The way we use AMQProxy, we depend on it being able to perform a rolling restart (in a kubernetes cluster).
For that we send a SIGTERM to the AMQProxy.

This commit changed the SIGTERM behaviour though: 98c0bc7 (The else in L81 moved in amqproxy.cr)

Now the shutdown is not graceful anymore, since it does not only wait for all clients to disconnect voluntary, but sends a connection close message proactively to the clients.

According to the AMQP reference, that means:

After sending this method, any received methods except Close and Close-OK MUST be discarded. The response to receiving a Close after sending Close must be to send Close-Ok.

Since this is the official reference, it is also implemented like this in most AMQP client libraries and one can't ignore a connection close message. (See for example in the most popular PHP AMQP client library)

That means one is not able to e.g. finish processing the current message before shutting down, but instead it's possible, that a worker is forced to shutdown while processing a message. This can lead to messages being processed multiple times.

As a quick fix (since we build from source anyway) we use this patch on the file amqproxy/src/amqproxy.cr

--- amqproxy_original.cr	2024-08-16 16:28:59
+++ amqproxy.cr	2024-08-16 16:36:12
@@ -90,14 +90,17 @@
       if first_shutdown
         first_shutdown = false
         server.stop_accepting_clients
-        server.disconnect_clients
         if @term_timeout >= 0
           spawn do
             sleep @term_timeout
+            server.disconnect_clients
+            sleep 5
             abort "Exiting with #{server.client_connections} client connections still open"
           end
         end
       else
+        server.disconnect_clients
+        sleep 5
         abort "Exiting with #{server.client_connections} client connections still open"
       end
     end

The solutions that come to my mind would be one of these:

  • The changes above to be merged (which would be a breaking change though)
  • Like other servers do, there could be multiple signals - e.g. a SIGTERM could send the connection close, while maybe SIGINT could be really graceful by not sending the connection close
  • A CLI option, to change the SIGTERM behaviour of whether to send a connection close or not

Would be happy to hear what you think about this :)

Thanks a lot in advance and best,
Marvin

@carlhoerberg
Copy link
Member

Ok, thank you for your feedback! It makes sense. We will look at this.

@orlandothoeny
Copy link
Contributor

We experience the same issue.
As mentioned in #112 (comment) & #112 (comment)

@spuun spuun linked a pull request Sep 25, 2024 that will close this issue
@spuun
Copy link
Member

spuun commented Sep 27, 2024

@orlandothoeny @mruell would #175 be a good solution for you?

@orlandothoeny
Copy link
Contributor

@spuun Yes that looks like it would to the job

@mruell
Copy link
Author

mruell commented Sep 27, 2024

Hi @orlandothoeny

that solution looks like it will work!

The way I understand the changes, there will be a new option, to wait after not accepting new connections anymore until we forcefully close them by sending a close message.

That's pretty much the same logic as the terminationGracePeriodSeconds property for kubernetes pods.
We use 4 hours for that.

So if we set the AMQProxy grace period just a little lower than the pod grace period, the workers should have enough time to finish their work and reconnect to a not-terminating pod (they are reconnecting every 15 minutes in-between processing messages, to reconnect to a server accepting new connections).

The only downside is, that the worker will not know, that the AMQProxy tries to terminate (without additional monitors), so we have to always reconnect every 15 minutes, just in case. But looks to me that the AMQP protocol just has no way of "softly" notifying a worker/consumer that it wants to shutdown.

But the reconnects do not really impact the performance in any meaningful way, so this is the best way to do it I think.

Thanks for that and best,
Marvin

@intxcc
Copy link

intxcc commented Sep 27, 2024

sorry, last comment was supposed to mention @spuun

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants