Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix race condition in service.running on systemd #55540

Merged
merged 2 commits into from
Jan 4, 2020

Conversation

terminalmage
Copy link
Contributor

@terminalmage terminalmage commented Dec 5, 2019

This state will check if the service is running after it runs a
service.start on it. However, systemd has more potential states than
just up or down. It also has the concept of activating and
deactivating states. An example of this are services which use
Type=notify, and rely on the service to tell systemd that it is up.
For these services, systemctl start (or stop) will exit (ceding control back to
Salt) and systemd will await its notification. In that interim, the
service will be in either the activating or deactivating state.

Importantly, Salt uses systemctl is-active service_name to check if
the service is up, and any state other than active results in a
nonzero exit code, which Salt interprets as the service being down.

So, if the notification doesn't come quick enough, then when Salt checks
on the service's status post-start, it will appear to Salt to be down
when it is actually in the activating state.

This commit modifies the systemd_service module such that, when the
status is activating or deactivating, the systemctl is-active will
be periodically retried, up to a tunable amount of time (by default 3
seconds), until some result other than activating or deactivating is
returned (or the timeout is reached, at which time the service will be
assumed to be down). This will keep services from being misinterpreted
as being dead when it just took a little longer than normal to start.

I realize that there is already an init_delay argument for this state,
but this always sleeps for that period of time, and also applies to
all service modules. The idea behind making changes to the
systemd_service module is to catch many issues like this before you
have to start troubleshooting why it's being identified as dead when
it's not actually dead. I'm open to suggestions.

@terminalmage terminalmage requested a review from a team as a code owner December 5, 2019 22:06
@ghost ghost requested a review from waynew December 5, 2019 22:07
@dwoz
Copy link
Contributor

dwoz commented Dec 23, 2019

@terminalmage I'm not sure what to do about the merge conflict, can you take a look when you have time?

@waynew
Copy link
Contributor

waynew commented Dec 27, 2019

How do services notify systemd? Would it be possible for Salt to either listen for this same sort of message, or have systemd then notify Salt?

This state will check if the service is running after it runs a
`service.start` on it. However, systemd has more potential states than
just up or down. It also has the concept of `activating` and
`deactivating` states. An example of this are services which use
`Type=notify`, and rely on the service to tell systemd that it is up.
For these services, `systemctl start` will exit (ceding control back to
Salt) and systemd will await its notification. In that interim, the
service will be in either the `activating` or `deactivating` state.

Importantly, Salt uses `systemctl is-active service_name` to check if
the service is up, and any state other than `active` results in a
nonzero exit code, which Salt interprets as the service being down.

So, if the notification doesn't come quick enough, then when Salt checks
on the service's status post-start, it will appear to Salt to be down
when it is actually in the `activating` state.

This commit modifies the `systemd_service` module such that, when the
status is `activating` or `deactivating`, the `systemctl is-active` will
be periodically retried, up to a tunable amount of time (by default 3
seconds), until some result other than `activating` or `deactivating` is
returned (or the timeout is reached, at which time the service will be
assumed to be down). This will keep services from being misinterpreted
as being dead when it just took a little longer than normal to start.

I realize that there is already an `init_delay` argument for this state,
but this _always_ sleeps for that period of time, and also applies to
all `service` modules. The idea behind making changes to the
`systemd_service` module is to catch many issues like this _before_ you
have to start troubleshooting why it's being identified as dead when
it's not actually dead. I'm open to suggestions.
@terminalmage terminalmage force-pushed the fix-systemd-race-condition branch from ba76029 to 4136ca7 Compare January 2, 2020 15:01
@terminalmage
Copy link
Contributor Author

How do services notify systemd?

Here's a good example: https://github.com/saltstack/salt/blob/d60060c/salt/utils/process.py#L161-L191

Would it be possible for Salt to either listen for this same sort of message, or have systemd then notify Salt?

Possible?... Maybe?

Feasible?... Arguably not.

@dwoz
Copy link
Contributor

dwoz commented Jan 2, 2020

@terminalmage Thanks for resolving the conflicts.

@dwoz dwoz merged commit 069fd09 into saltstack:master Jan 4, 2020
@@ -475,7 +487,7 @@ def running(name,
time.sleep(init_delay)

# only force a change state if we have explicitly detected them
after_toggle_status = __salt__['service.status'](name, sig)
after_toggle_status = __salt__['service.status'](name, sig, **kwargs)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is a bug here when using service.running with reload keyword:

       [ERROR   ] An exception occurred in this state: Traceback (most recent call last):
         File "/usr/lib/python3/dist-packages/salt/state.py", line 1981, in call
           **cdata['kwargs'])
         File "/usr/lib/python3/dist-packages/salt/loader.py", line 1977, in wrapper
           return f(*args, **kwargs)
         File "/usr/lib/python3/dist-packages/salt/states/service.py", line 490, in running
           after_toggle_status = __salt__['service.status'](name, sig, **kwargs)
       TypeError: status() got an unexpected keyword argument 'reload'

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants