
Feature: Termination Timeout option #112

Closed
orlandothoeny opened this issue Mar 24, 2023 · 9 comments · Fixed by #147
Comments

@orlandothoeny
Contributor

A configuration option that adds a configurable wait time before the proxy shuts down would be useful in Kubernetes environments.

Use Case

Our concrete case:

Kubernetes Pod

  • Supervisor container
  • Ensures that a PHP RabbitMQ consumer command is always running; the PHP command exits after x seconds, then Supervisor restarts it
  • AMQProxy sidecar container
    • Used to connect to CloudAMQP RabbitMQ instance

Problem
A new application version gets deployed. The old Pod is stopped and a new one is started.
When the old Pod is stopped, all containers of that Pod receive a SIGTERM signal:

  • AMQProxy sidecar container stops immediately
  • Supervisor waits until PHP consumer command exits. The command might still need a RabbitMQ connection to finish processing its current batch of messages. It raises errors when trying to interact with the RabbitMQ server because the proxy already stopped.

This could be avoided if the AMQProxy sidecar container waited long enough for the Supervisor container to stop first.

Current solution/workaround
Using a preStop hook for the AMQProxy sidecar container

lifecycle:
  preStop:
    exec:
      command: ["sleep", "30"]
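One caveat with this workaround: Kubernetes counts the time spent in the preStop hook against the Pod's terminationGracePeriodSeconds (30 seconds by default), so the grace period should be raised above the sleep. A sketch of the relevant Pod spec fragment (container name is illustrative):

```yaml
spec:
  terminationGracePeriodSeconds: 60  # must exceed the preStop sleep below
  containers:
    - name: amqproxy
      lifecycle:
        preStop:
          exec:
            command: ["sleep", "30"]
```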

Feature

  • Configuration option: Termination timeout, in seconds
    • ENV variable + CLI option
  • When the proxy receives a SIGTERM signal and the option is configured, it waits for x seconds before stopping
    • Current connections are not closed
    • New connections can still be opened

--

Google's CloudSQL Proxy has something like this:

$ docker run gcr.io/cloudsql-docker/gce-proxy /cloud_sql_proxy -help
...
-term_timeout
    When set, the proxy will wait for existing connections to close before
    terminating. Any connections that haven't closed after the timeout will be
    dropped
@carlhoerberg
Member

Are you sure about "New connections can still be opened"? Right now the behavior is to close the listening socket at the first TERM, send Connection::Close frame to all client connections and wait indefinitely for them to close the sockets. On the second TERM all client sockets are forcefully closed.

@orlandothoeny
Contributor Author

Hi @carlhoerberg,

For our particular sidecar case, it would have been the desired behaviour, since the proxy instance has no clients other than the "main" container inside the Kubernetes Pod that consumes RabbitMQ messages. When the Pod gets terminated, so is the "main" container. After it is dead, nobody opens new proxy connections, and the proxy can stop gracefully without having to forcefully terminate any open connections.

However, I can see that this might not be desirable in a setup where a proxy is used by multiple independent clients, for example in a separate Kubernetes Deployment (that's what we switched to from our initial sidecar setup, mainly because of #108 though). In that case the proxy would still accept new connections and kill them when the second TERM is received. To avoid this, the proxy would have to provide a dedicated health check endpoint that can be used for its Kubernetes readiness probe. (Although I'm not 100% sure if Kubernetes stops routing traffic to a Pod that is being terminated anyway.)

@orlandothoeny
Contributor Author

Just checked the Kubernetes Service behavior when a Pod gets terminated. As far as I can see, it won't receive new traffic anymore. Regardless of the readiness probe status: https://kubernetes.io/docs/tutorials/services/pods-and-endpoint-termination-flow/

@carlhoerberg
Member

Ok, so we might as well stop the listener too at the first TERM; it won't affect k8s users but makes the behavior more logical for others, right?

@carlhoerberg
Member

btw, #108 is fixed, sorry for the long delay, should have been fixed way earlier...

@orlandothoeny
Contributor Author

orlandothoeny commented Feb 22, 2024

Ok, so we might as well stop the listener too at the first TERM; it won't affect k8s users but makes the behavior more logical for others, right?

You mean stop accepting new connections? I guess it depends on the topology:

  • Sidecar: There you could still run into the case where the consumer container tries to open a new connection if the timing between the consumer stopping to consume and the proxy stopping to accept connections is a bit off. Another possible case: while the consumer finishes processing its current message, the code tries to publish a RabbitMQ message
  • Separate Deployment with a Service: There I think it should not matter, since the Service stops routing traffic on termination. Not sure whether there might be timing issues though (maybe the Service takes a few ms to stop routing traffic while the proxy has already stopped accepting new connections)

@orlandothoeny
Contributor Author

orlandothoeny commented Mar 11, 2024

Just tested a little and managed to get AMQProxy shutdown(0, 0) errors from our consumers while deploying a new version of our Kubernetes AMQProxy Deployment (rolling update).

So this would be a nice feature for the case where AMQProxy is deployed as a separate component (e.g. Kubernetes Deployment).

@orlandothoeny
Contributor Author

@carlhoerberg Just upgraded to the newest version (2.0.1) and tested the shutdown behavior. I configured the new --term-timeout option (set to 5 mins).

In our case, we still receive errors while the proxy is terminating, since - as far as I understand - it immediately closes all client connections when it receives the TERM signal.
Thus, the connection is taken away from the client right away, with the message AMQProxy shutdown(0, 0) (https://github.com/cloudamqp/amqproxy/blob/main/src/amqproxy/client.cr#L195-L200).
If I understand the code correctly, this is done by this line: https://github.com/cloudamqp/amqproxy/pull/147/files#diff-60939064684e0ee6ef504d4b72e12d6cf0febd05f0426f4189b799c522816399R93

Would there be negative consequences when the client connections are only closed after the timeout has passed (after the sleep https://github.com/cloudamqp/amqproxy/blob/main/src/amqproxy.cr#L96)?

@orlandothoeny
Contributor Author

A nice-to-have feature would be for the proxy to check whether all clients have closed their connections on their own, and to terminate early in that case - e.g. by looping over all connections (until term-timeout is reached) and checking whether each one is still open, or something like that.

All connections might already be gone before the --term-timeout value is reached, making the proxy wait unnecessarily before exiting.
