-
Notifications
You must be signed in to change notification settings - Fork 54
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Gracefully stop blueprint containers before committing them #363
Conversation
Yeah, it's absolutely not - we need to rearrange things so that postgres is run by supervisord, so that the SIGTERM gets propagated. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this lgtm, but we should probably ask @kegsay what he thinks.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems to be causing a large delay when deploying blueprints:
federation_room_ban_test.go:16: Deploy times: 1m21.771543808s blueprints, 2.870558152s containers
vs
federation_room_ban_test.go:16: Deploy times: 32.279497517s blueprints, 8.70062934s containers
Likely because the docker container isn't quitting via SIGTERM. Please can we reduce the timeout to something like 10s instead of 30s to reduce the impact of this whilst people update their complement images.
} | ||
|
||
if !containerInfo.State.Running { | ||
// The container isn't running anyway, so no need to kill it. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We are now stopping the containers so I don't see how containerInfo.State.Running
can ever be true
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't have a great deal of Go experience, but I view this defer as a last-resort clean-up mechanism, like a finally block in Java/etc. Anecdotally, I've had docker stop leave a container running before, but frankly this may well have been a bug (I'm not sure). I'm happy to remove it if you think that's best, but it really was just intended as a last resort 'stopping gracefully is nice, but littering the host with containers is to be avoided if at all possible'.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you've seen this in the wild then let's leave it in. Docker is async command wise (you'll note several places we repeat the same instruction until timeout/success) so I can believe it not stopping the container when you ask for it to stop.
This PR arises from a conversation with @richvdh, who was presumably working on getting Synapse tested in Complement with Postgres and workers.
The blueprint container gets paused and then committed, which leads to the Postgres database being corrupted in the image that gets created. This then leads to slow startup for that container, since Postgres has to do some kind of recovery process.
It seems like it would be better to gracefully stop the container before committing it, so that Postgres doesn't get corrupted in the process.
Docker stops a container by SIGTERMing it (so the process inside the container can theoretically react to that and shut down gracefully), then SIGKILLing if it didn't stop after a timeout.
This PR will use the equivalent of
docker stop
to shut down the container before committing it. A well-written container could then react to the SIGTERM and shut down the database gracefully.I'm not sure the container image in question is set up to handle SIGTERM properly, though — the container seems to be taking 30 s (= timeout) to stop so I presume it's still getting killed forcefully.
At least this gives us the option to implement that properly...
I used
COMPLEMENT_DEBUG=1 WORKERS=1 COMPLEMENT_ALWAYS_PRINT_SERVER_LOGS=1 COMPLEMENT_DIR=
pwd/../complement ./scripts-dev/complement.sh -run TestInboundFederationKeys 2>&1 | tee log
in a Synapse checkout, as Rich suggested.