agent: add support for sdnotify protocol #20528

Merged
merged 1 commit into from
May 3, 2024

@tgross (Member) commented May 3, 2024

Nomad agents expect to receive SIGHUP to reload their configuration. The signal handler for this is installed fairly late in agent startup, after the client or server components are up and running. This means that configuration management tools can potentially reload the configuration before the agent can handle it, causing the agent to crash.

We don't want to allow configuration reload during client or server component startup, because it would significantly complicate initialization. Instead, we'll implement the systemd notify protocol. This causes systemd to block sending configuration reload signals until the agent is actually ready. Users can still bypass this by sending signals directly.

Note that there are several Go libraries that implement the sdnotify protocol, but most are part of much larger projects, which would create a lot of dependabot burden. The bits of the protocol we need are extremely simple to implement in just a couple of functions.

For non-Linux or non-systemd Linux systems, this feature is a no-op. In future work we could potentially implement service notification for Windows as well.
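
To illustrate how small the needed part of the protocol is, here is a minimal sketch rather than the code in this PR. The names `notifyTo`, `sdNotify`, and the `sd*` constants are hypothetical. It assumes only what the sd_notify protocol specifies: systemd passes a unix datagram socket path in the `NOTIFY_SOCKET` environment variable, and the service writes `KEY=VALUE` state strings to it. When the variable is unset, the call is a no-op, which covers non-systemd hosts.

```go
package main

import (
	"fmt"
	"net"
	"os"
)

// Notification states defined by the sd_notify protocol.
const (
	sdReady     = "READY=1"
	sdReloading = "RELOADING=1"
	sdStopping  = "STOPPING=1"
)

// notifyTo (hypothetical helper) sends state as a single datagram to the
// unix socket at path. It returns false with a nil error when path is
// empty, which is the no-op case for hosts not running under systemd.
func notifyTo(path, state string) (bool, error) {
	if path == "" {
		return false, nil
	}
	addr := &net.UnixAddr{Name: path, Net: "unixgram"}
	conn, err := net.DialUnix(addr.Net, nil, addr)
	if err != nil {
		return false, fmt.Errorf("dialing notify socket: %w", err)
	}
	defer conn.Close()
	if _, err := conn.Write([]byte(state)); err != nil {
		return false, fmt.Errorf("writing %q: %w", state, err)
	}
	return true, nil
}

// sdNotify (hypothetical helper) reads the socket path that systemd
// exports as NOTIFY_SOCKET and forwards the state to it.
func sdNotify(state string) (bool, error) {
	return notifyTo(os.Getenv("NOTIFY_SOCKET"), state)
}
```

Under this sketch, an agent would call `sdNotify(sdReady)` only after its SIGHUP handler is installed, send `sdReloading` before a reload and `sdReady` again once it finishes, and send `sdStopping` on shutdown. Exactly where the Nomad agent makes these calls is an assumption here, not something this description states.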

Fixes: #3885


Note on compatibility testing. Successful reloads with Type=notify look like this:

Reloading Nomad Agent...
==> Caught signal: hangup
==> Reloading configuration...
2024-05-03T11:50:38.505-0400 [INFO] client.fingerprint_mgr: reloading fingerprinter: fingerprinter=cni
Reloaded Nomad Agent.

If the unit file is changed from Type=notify to Type=simple (default) but the agent is still running, we end up with an error in the logs like the following, but everything still works as expected:

nomad.service: Got notification message from PID 2965, but reception is disabled.

@@ -26,6 +26,7 @@ After=network-online.target
# StartLimitInterval = 10s

[Service]
Type=notify
@tgross (Member, Author) commented:

Note to reviewers: ideally we'd have this as Type=notify-reload because that's a little nicer, but that's not available until systemd 253, which only the most recent LTS distros ship.

@tgross tgross merged commit 54fc146 into main May 3, 2024
23 checks passed
@tgross tgross deleted the sdnotify branch May 3, 2024 17:42

github-actions bot commented Jan 9, 2025

I'm going to lock this pull request because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active contributions.
If you have found a problem that seems related to this change, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Jan 9, 2025
Successfully merging this pull request may close these issues.

Nomad dies if HUP is sent during agent initialization