The agent should clean up "lost" systemd units #180
If I understand my experience correctly, this is a significant risk, since it means that on boot the old/wrong systemd unit may happen to start first, and that unit's application would then bind to a TCP socket so the intended application cannot start correctly. I see this now, since I have ended up with:
and so the journal is very noisy with the "wrong" zookeeper starting up and failing repeatedly due to the TCP port already being bound.
@pipern Yes, this issue has a high impact. The priority label and milestone reflect the urgency but not necessarily the severity.
Related to this, will the agent be periodically checking the systemd unit files (in a 'controller loop' fashion), such that if one is removed or edited by mistake, the agent will put it back?

I'm also thinking about which controller is the one doing the actual monitoring/controlling. If a pod stops, then, as I understand it, the kubelet reports this to the controller and the controller schedules a replacement. In stackable, if the daemon (e.g., zookeeper) stops, then actually systemd will restart it, so the krustlet would not report this to k8s. Do I understand correctly?

Similarly, on boot, "normally" in k8s the scheduler would allocate a pod to a node; so without kubelet running, the server would not run any applications. But in stackable, systemd will start up the applications regardless of the state of the krustlet or k8s. If the node has been offline a while, it could start up with out-of-date configuration, since krustlet hasn't had new information from the api-server while it was offline.

This relates to https://docs.stackable.tech/home/adr/ADR005-systemd_unit_file_location.html - did you consider having the unit files on volatile storage, so that during a reboot they are intentionally lost and the operator is the one who decides whether an application should start or not?

(https://github.com/stackabletech/agent/discussions is empty, but maybe that is the place for this. Shall I make a post there?)
@pipern Discussions is probably better suited for your questions because they are outside the scope of this issue. I am looking forward to answering them there.
Moved over to #303 |
We need to introduce some identifier which the agent can use to identify all services that were created by this agent.
This might be a fixed prefix, putting all services in a specific slice, or something similar.
Currently, when the agent crashes and a pod is removed from Kubernetes, the associated systemd unit becomes orphaned and is never cleaned up.
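A minimal sketch of the kind of cleanup such an identifier would enable, assuming a hypothetical `stackable-` unit-name prefix and shelling out to `systemctl`; the real agent's identifier, unit file locations, and error handling would of course look different:

```rust
use std::collections::HashSet;
use std::process::Command;

/// Hypothetical prefix marking units created by the agent.
const UNIT_PREFIX: &str = "stackable-";

/// Return all service units whose names carry the agent prefix.
fn managed_units() -> Vec<String> {
    let output = Command::new("systemctl")
        .args(["list-units", "--type=service", "--all", "--plain", "--no-legend"])
        .output()
        .expect("failed to run systemctl");

    String::from_utf8_lossy(&output.stdout)
        .lines()
        // The unit name is the first column of the plain, legend-free output.
        .filter_map(|line| line.split_whitespace().next())
        .filter(|unit| unit.starts_with(UNIT_PREFIX))
        .map(str::to_string)
        .collect()
}

/// Stop and disable every prefixed unit that no longer belongs to a pod
/// known to the agent.
fn cleanup_orphans(known_pod_units: &HashSet<String>) {
    for unit in managed_units() {
        if !known_pod_units.contains(&unit) {
            let _ = Command::new("systemctl").args(["stop", &unit]).status();
            let _ = Command::new("systemctl").args(["disable", &unit]).status();
            // The unit file itself would also have to be deleted; its location
            // depends on where the agent writes unit files (see ADR005).
        }
    }
}
```

Grouping the services under a dedicated slice instead would serve the same purpose at the cgroup level, since the agent could then identify everything attached to that slice without relying on a naming convention.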