Fix race condition in service.running on systemd #55540
Conversation
@terminalmage I'm not sure what to do about the merge conflict, can you take a look when you have time?
How do services notify systemd? Would it be possible for Salt to either listen for this same sort of message, or have systemd then notify Salt?
This state will check if the service is running after it runs a `service.start` on it. However, systemd has more potential states than just up or down: it also has the concept of `activating` and `deactivating` states. An example is a service that uses `Type=notify` and relies on the service itself to tell systemd that it is up. For these services, `systemctl start` (or `stop`) will exit (ceding control back to Salt) and systemd will await the service's notification. In that interim, the service will be in either the `activating` or `deactivating` state.

Importantly, Salt uses `systemctl is-active service_name` to check whether the service is up, and any state other than `active` results in a nonzero exit code, which Salt interprets as the service being down. So, if the notification doesn't arrive quickly enough, then when Salt checks on the service's status post-start, the service will appear to Salt to be down when it is actually in the `activating` state.

This commit modifies the `systemd_service` module so that, when the status is `activating` or `deactivating`, the `systemctl is-active` check is periodically retried, up to a tunable amount of time (by default 3 seconds), until some result other than `activating` or `deactivating` is returned (or the timeout is reached, at which point the service is assumed to be down). This keeps services from being misinterpreted as dead when they just took a little longer than normal to start.

I realize that there is already an `init_delay` argument for this state, but that _always_ sleeps for that period of time, and it also applies to all `service` modules. The idea behind making the changes in the `systemd_service` module is to catch issues like this _before_ you have to start troubleshooting why a service is being identified as dead when it's not actually dead. I'm open to suggestions.
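For illustration, here is a minimal sketch of the retry approach described above. It is not the exact code in this PR; `is_active_with_retry`, `timeout`, and `interval` are hypothetical names.

```python
import subprocess
import time

def is_active_with_retry(name, timeout=3, interval=0.25):
    """
    Poll ``systemctl is-active`` until the unit leaves the transient
    ``activating``/``deactivating`` states, or until ``timeout`` seconds
    have elapsed. Returns True only if the final state is ``active``.
    """
    deadline = time.monotonic() + timeout
    while True:
        proc = subprocess.run(
            ["systemctl", "is-active", name],
            capture_output=True,
            text=True,
        )
        state = proc.stdout.strip()
        # Transient states: keep polling until the deadline passes.
        if state in ("activating", "deactivating") and time.monotonic() < deadline:
            time.sleep(interval)
            continue
        # Any settled state (or a timeout) ends the loop; only
        # ``active`` (exit code 0) counts as the service being up.
        return proc.returncode == 0
```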
Force-pushed from ba76029 to 4136ca7 (Compare)
Here's a good example: https://github.com/saltstack/salt/blob/d60060c/salt/utils/process.py#L161-L191
Possible?... Maybe? Feasible?... Arguably not.
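For anyone following along: a `Type=notify` service sends a message like `READY=1` to a Unix socket that systemd exposes via the `NOTIFY_SOCKET` environment variable; the linked `salt/utils/process.py` code is a fuller example of this. A minimal sketch using the python-systemd bindings follows (the `systemd` package and its availability are assumptions, not part of this PR):

```python
# Requires the python-systemd bindings (commonly packaged as
# ``python3-systemd``); systemd only listens for this message when the
# unit is declared with Type=notify.
from systemd import daemon

def report_ready():
    # Tell systemd the service has finished starting. Until this is
    # sent, ``systemctl is-active`` reports ``activating``.
    daemon.notify("READY=1")
```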
@terminalmage Thanks for resolving the conflicts.
```diff
@@ -475,7 +487,7 @@ def running(name,
         time.sleep(init_delay)

     # only force a change state if we have explicitly detected them
-    after_toggle_status = __salt__['service.status'](name, sig)
+    after_toggle_status = __salt__['service.status'](name, sig, **kwargs)
```
There is a bug here when using `service.running` with the `reload` keyword:
```
[ERROR   ] An exception occurred in this state: Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/salt/state.py", line 1981, in call
    **cdata['kwargs'])
  File "/usr/lib/python3/dist-packages/salt/loader.py", line 1977, in wrapper
    return f(*args, **kwargs)
  File "/usr/lib/python3/dist-packages/salt/states/service.py", line 490, in running
    after_toggle_status = __salt__['service.status'](name, sig, **kwargs)
TypeError: status() got an unexpected keyword argument 'reload'
```
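One defensive pattern for this class of bug (a hypothetical sketch, not necessarily the fix that eventually landed) is to drop keyword arguments that the target `status()` implementation does not accept:

```python
import inspect

def call_with_supported_kwargs(func, *args, **kwargs):
    """
    Invoke ``func`` with only the keyword arguments its signature
    accepts, silently dropping the rest (e.g. ``reload`` when the
    underlying ``status()`` implementation does not take it).
    """
    params = inspect.signature(func).parameters
    # If the function already accepts **kwargs, pass everything through.
    if any(p.kind is inspect.Parameter.VAR_KEYWORD for p in params.values()):
        return func(*args, **kwargs)
    supported = {k: v for k, v in kwargs.items() if k in params}
    return func(*args, **supported)
```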