
Restarting the Zincati Service fails randomly #671

Closed
redRolf opened this issue Nov 4, 2021 · 7 comments

@redRolf

redRolf commented Nov 4, 2021

Bug Report

The following happened:

Our CI/CD pipeline updates the Zincati configuration whenever changes are made. After the new .toml files have been uploaded, the zincati service is restarted to load the latest configuration with the command sudo systemctl restart zincati.service. Lately, however, this restart fails intermittently; the error message can be seen below. This not only causes the CI/CD pipeline to fail but also leaves the server in a deadlocked state where no applications running on it are responsive and even establishing an SSH connection fails. What is causing this issue and how can I prevent it?

● zincati.service - Zincati Update Agent
     Loaded: loaded (/usr/lib/systemd/system/zincati.service; enabled; vendor preset: enabled)
     Active: activating (start) since Tue 2021-11-02 19:47:47 UTC; 1s ago
       Docs: https://github.com/coreos/zincati
   Main PID: 940074 (zincati)
      Tasks: 7 (limit: 9430)
     Memory: 1.4M
        CPU: 58ms
     CGroup: /system.slice/zincati.service
             └─940074 /usr/libexec/zincati agent -v

Nov 02 19:47:47 re.intra.redguard.ch-fcos systemd[1]: Starting Zincati Update Agent...
Nov 02 19:47:47 re.intra.redguard.ch-fcos zincati[940074]: [INFO  zincati::cli::agent] starting update agent (zincati 0.0.23)
Nov 02 19:47:47 re.intra.redguard.ch-fcos zincati[940074]: [INFO  zincati::cincinnati] Cincinnati service: https://updates.coreos.fedoraproject.org
Nov 02 19:47:47 re.intra.redguard.ch-fcos zincati[940074]: [INFO  zincati::cli::agent] agent running on node '14c0d09360844ae5a6bed8904e81eefa', in update group 'default'
Nov 02 19:47:47 re.intra.redguard.ch-fcos zincati[940074]: [INFO  zincati::update_agent::actor] registering as the update driver for rpm-ostree
Nov 02 19:47:48 re.intra.redguard.ch-fcos zincati[940074]: [ERROR zincati::rpm_ostree::cli_deploy] rpm-ostree deploy --register-driver failed:
Nov 02 19:47:48 re.intra.redguard.ch-fcos zincati[940074]:     error: Transaction in progress: deploy --lock-finalization revision=a44c3b4d10b94db300d420cba76249b6c6de368fa1f93613796e50d3ee8b3568 --disallow-downgrade
Nov 02 19:47:48 re.intra.redguard.ch-fcos zincati[940074]:      You can cancel the current transaction with `rpm-ostree cancel`
Nov 02 19:47:48 re.intra.redguard.ch-fcos zincati[940074]:
Nov 02 19:47:48 re.intra.redguard.ch-fcos zincati[940074]:     retrying in 1s

Environment

What hardware/cloud provider/hypervisor is being used?

Exoscale FCOS Template

Expected Behavior

The command sudo systemctl restart zincati.service restarts the service without failing.

Actual Behavior

The restart fails randomly with the rpm-ostree "Transaction in progress" error shown above, the systemctl command times out, and the server can end up unresponsive.

Reproduction Steps

  1. Run Fedora CoreOS.
  2. Run the command sudo systemctl restart zincati.service; this may randomly fail and cause a timeout.

Other Information

I wasn't quite sure what would be helpful, so I tried to include only what I thought was the most relevant information. If you would like any other logs, or would like me to test things, I am more than happy to oblige.

@lucab
Contributor

lucab commented Nov 4, 2021

It looks like you are restarting Zincati in the middle of an upgrade, which then leaves rpm-ostreed.service busy dealing with the new deployment. It should be possible to recover from that state by canceling the transaction or restarting the rpm-ostree daemon.
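
A minimal sketch of that recovery path, using only the commands already referenced in the journal output above (rpm-ostree cancel and the rpm-ostreed.service daemon unit); adjust to your environment:

    # Option 1: cancel the stuck transaction reported in the journal
    sudo rpm-ostree cancel

    # Option 2: restart the rpm-ostree daemon to flush its state
    sudo systemctl restart rpm-ostreed.service

    # ...then restart the update agent
    sudo systemctl restart zincati.service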

Taking a step back, I think what you need is a Reload method, which we currently don't have.
Though it's unclear to me how you ended up in a scenario where you want to change the Zincati configuration on the fly. As you can see from this bug, you are racing with upgrades that are happening under your current/previous configuration.

@lucab
Contributor

lucab commented Nov 4, 2021

For reference, the underlying bug is that we leave behind a transaction running in the rpm-ostree daemon even after the client has disappeared. This already came up in coreos/rpm-ostree#3194 (comment); we should enhance the daemon so that the lifetime of a transaction is automatically bound to its caller.

@redRolf
Author

redRolf commented Nov 4, 2021

@lucab thank you very much for the clarifications. I suspected that something along those lines was happening.

To your question on how I ended up in this situation:

  • We have a maintenance window every Tuesday evening, so I created a .toml file that only allows Zincati to reboot the server on Tuesdays between 23:00 and 00:00 (a sketch of such a config follows this list)
  • Once a month a fresh deployment is made to the server using a CI/CD pipeline (this pipeline runs before 23:00). During the deployment, the .toml file for the Zincati service configuration is updated (just in case modifications were made to it)
  • Naturally, after the .toml config file is updated, the CI/CD pipeline tries to restart the zincati service.
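
For reference, a maintenance window like that is typically expressed through Zincati's periodic update strategy. A minimal sketch, assuming the documented periodic-strategy keys; the drop-in file name is illustrative, and window times are in UTC by default:

    # /etc/zincati/config.d/55-updates-strategy.toml (illustrative path)
    [updates]
    strategy = "periodic"

    # finalization/reboot allowed only on Tuesdays, 23:00-00:00
    [[updates.periodic.window]]
    days = [ "Tue" ]
    start_time = "23:00"
    length_minutes = 60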

So that is how I arrived at this situation. The problem appears to be that Zincati is trying to finalize an update and is waiting for its reboot time window to come around. Then, once a month, my CI/CD pipeline comes along and tries to restart the service, which it doesn't like since it's still trying to finalize the update (understandable).

So if I swap the order, i.e. allow Zincati to finalize updates before running the CI/CD pipeline, I should be able to mitigate this problem to a large extent.

A Reload function would be awesome 😍 but I do fully understand that these features take time, effort and resources.

Aside: by "finalizing the update" I mean that Zincati is either actively installing updates or just waiting to reboot the server 😇; for my case and suggested approach it is not relevant which one it is.

I hope I understood you correctly :)

@lucab
Contributor

lucab commented Nov 5, 2021

Thanks for the additional context.

Yes, it looks like you are currently racing with Zincati trying to eagerly fetch/stage updates beforehand (so that they are ready to be applied as soon as your configuration allows it).
But you could just as well be racing with the finalization of updates (and the subsequent reboot), depending on the specific configuration and timing.

Unfortunately I don't currently have a perfect solution to suggest. Some mitigations could be:

  • having the CD job check rpm-ostree status for pending transactions before doing any restart (see the sketch after this list)
  • having the CD job restart both zincati and rpm-ostreed so that all previous pending state is flushed
  • implementing some graceful reload/restart in zincati itself (though in general not all old→new configuration combinations are possible)
  • implementing client-bound transactions in rpm-ostreed
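
A hedged sketch of the first mitigation for the CD job, for illustration only; it assumes the human-readable rpm-ostree status output reports "State: idle" when no transaction is in progress (wording may differ between rpm-ostree versions):

    # Wait (up to ~5 minutes) for rpm-ostree to go idle before restarting Zincati.
    for _ in $(seq 1 30); do
        if rpm-ostree status | grep -q 'State: idle'; then
            sudo systemctl restart zincati.service
            exit 0
        fi
        echo "rpm-ostree transaction still in progress, waiting..." >&2
        sleep 10
    done
    echo "rpm-ostree still busy after 5 minutes; not restarting zincati" >&2
    exit 1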

@redRolf
Author

redRolf commented Nov 5, 2021

Good morning :) Thank you very much for your help and your suggested mitigations. I will try these :)
From my standpoint, we can close this issue, as these mitigations should solve my problem.

Thank you again and have a great day and weekend.

@lucab
Contributor

lucab commented Nov 5, 2021

Ack, thanks! I will forward the last two bullet items to separate tickets (no ETA though, both of them may require quite a bit of work) and then close this.

@lucab
Contributor

lucab commented Nov 5, 2021

Followup tickets at #673 and coreos/rpm-ostree#3206.

@lucab lucab closed this as completed Nov 5, 2021