Fix race condition in service.running on systemd #55540

terminalmage · 2019-12-05T22:06:50Z

This state will check if the service is running after it runs a
service.start on it. However, systemd has more potential states than
just up or down. It also has the concept of activating and
deactivating states. An example of this are services which use
Type=notify, and rely on the service to tell systemd that it is up.
For these services, systemctl start (or stop) will exit (ceding control back to
Salt) and systemd will await its notification. In that interim, the
service will be in either the activating or deactivating state.

Importantly, Salt uses systemctl is-active service_name to check if
the service is up, and any state other than active results in a
nonzero exit code, which Salt interprets as the service being down.

So, if the notification doesn't come quick enough, then when Salt checks
on the service's status post-start, it will appear to Salt to be down
when it is actually in the activating state.

This commit modifies the systemd_service module such that, when the
status is activating or deactivating, the systemctl is-active will
be periodically retried, up to a tunable amount of time (by default 3
seconds), until some result other than activating or deactivating is
returned (or the timeout is reached, at which time the service will be
assumed to be down). This will keep services from being misinterpreted
as being dead when it just took a little longer than normal to start.

I realize that there is already an init_delay argument for this state,
but this always sleeps for that period of time, and also applies to
all service modules. The idea behind making changes to the
systemd_service module is to catch many issues like this before you
have to start troubleshooting why it's being identified as dead when
it's not actually dead. I'm open to suggestions.

dwoz · 2019-12-23T18:27:41Z

@terminalmage I'm not sure what to do about the merge conflict, can you take a look when you have time?

waynew · 2019-12-27T18:22:20Z

How do services notify systemd? Would it be possible for Salt to either listen for this same sort of message, or have systemd then notify Salt?

This state will check if the service is running after it runs a `service.start` on it. However, systemd has more potential states than just up or down. It also has the concept of `activating` and `deactivating` states. An example of this are services which use `Type=notify`, and rely on the service to tell systemd that it is up. For these services, `systemctl start` will exit (ceding control back to Salt) and systemd will await its notification. In that interim, the service will be in either the `activating` or `deactivating` state. Importantly, Salt uses `systemctl is-active service_name` to check if the service is up, and any state other than `active` results in a nonzero exit code, which Salt interprets as the service being down. So, if the notification doesn't come quick enough, then when Salt checks on the service's status post-start, it will appear to Salt to be down when it is actually in the `activating` state. This commit modifies the `systemd_service` module such that, when the status is `activating` or `deactivating`, the `systemctl is-active` will be periodically retried, up to a tunable amount of time (by default 3 seconds), until some result other than `activating` or `deactivating` is returned (or the timeout is reached, at which time the service will be assumed to be down). This will keep services from being misinterpreted as being dead when it just took a little longer than normal to start. I realize that there is already an `init_delay` argument for this state, but this _always_ sleeps for that period of time, and also applies to all `service` modules. The idea behind making changes to the `systemd_service` module is to catch many issues like this _before_ you have to start troubleshooting why it's being identified as dead when it's not actually dead. I'm open to suggestions.

terminalmage · 2020-01-02T15:13:39Z

How do services notify systemd?

Here's a good example: https://github.com/saltstack/salt/blob/d60060c/salt/utils/process.py#L161-L191

Would it be possible for Salt to either listen for this same sort of message, or have systemd then notify Salt?

Possible?... Maybe?

Feasible?... Arguably not.

dwoz · 2020-01-02T16:28:28Z

@terminalmage Thanks for resolving the conflicts.

sblaisot · 2020-02-14T16:16:10Z

salt/states/service.py

@@ -475,7 +487,7 @@ def running(name,
        time.sleep(init_delay)

    # only force a change state if we have explicitly detected them
-    after_toggle_status = __salt__['service.status'](name, sig)
+    after_toggle_status = __salt__['service.status'](name, sig, **kwargs)


There is a bug here when using service.running with reload keyword:

[ERROR ] An exception occurred in this state: Traceback (most recent call last): File "/usr/lib/python3/dist-packages/salt/state.py", line 1981, in call **cdata['kwargs']) File "/usr/lib/python3/dist-packages/salt/loader.py", line 1977, in wrapper return f(*args, **kwargs) File "/usr/lib/python3/dist-packages/salt/states/service.py", line 490, in running after_toggle_status = __salt__['service.status'](name, sig, **kwargs) TypeError: status() got an unexpected keyword argument 'reload'

terminalmage requested a review from a team as a code owner December 5, 2019 22:06

ghost requested a review from waynew December 5, 2019 22:07

terminalmage force-pushed the fix-systemd-race-condition branch from ba76029 to 4136ca7 Compare January 2, 2020 15:01

Merge branch 'master' into fix-systemd-race-condition

b18f9b6

dwoz approved these changes Jan 4, 2020

View reviewed changes

dwoz merged commit 069fd09 into saltstack:master Jan 4, 2020

sblaisot reviewed Feb 14, 2020

View reviewed changes

sblaisot mentioned this pull request Feb 14, 2020

Salt 3000: having reload on service.running makes state fail #56167

Closed

terminalmage deleted the fix-systemd-race-condition branch February 13, 2024 00:12

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix race condition in service.running on systemd #55540

Fix race condition in service.running on systemd #55540

terminalmage commented Dec 5, 2019 •

edited

Loading

dwoz commented Dec 23, 2019

waynew commented Dec 27, 2019

terminalmage commented Jan 2, 2020

dwoz commented Jan 2, 2020

sblaisot Feb 14, 2020

Fix race condition in service.running on systemd #55540

Fix race condition in service.running on systemd #55540

Conversation

terminalmage commented Dec 5, 2019 • edited Loading

dwoz commented Dec 23, 2019

waynew commented Dec 27, 2019

terminalmage commented Jan 2, 2020

dwoz commented Jan 2, 2020

sblaisot Feb 14, 2020

Choose a reason for hiding this comment

terminalmage commented Dec 5, 2019 •

edited

Loading