This repository has been archived by the owner on Dec 21, 2021. It is now read-only.

The agent should clean up "lost" systemd units #180

Closed
soenkeliebau opened this issue May 27, 2021 · 5 comments · Fixed by #312
Assignees: siegfriedweber
Labels: priority/low, type/enhancement (New feature or request)
Milestone: Release #2

Comments
@soenkeliebau (Member)

We need to introduce an identifier that the agent can use to recognize all services it created.
This might be a fixed prefix, putting all services into a specific slice, or something similar.

Currently, when the agent crashes and a pod is removed from Kubernetes, the associated systemd unit becomes orphaned and is never cleaned up.
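
For illustration, one way to make agent-created units identifiable would be to group them all under a dedicated slice (the unit name, slice name, and ExecStart below are hypothetical, not the agent's actual layout):

    # /lib/systemd/system/default-zookeeper-server-example.service  (hypothetical name)
    [Unit]
    Description=Pod container managed by the Stackable agent

    [Service]
    # Hypothetical marker: every unit below stackable.slice was created by the agent,
    # so the agent can enumerate them and remove any that no longer belong to a known pod.
    Slice=stackable.slice
    ExecStart=/opt/zookeeper/bin/zkServer.sh start-foreground

    [Install]
    WantedBy=multi-user.target

A fixed name prefix would serve the same purpose: anything that matches the prefix (or carries the slice) but has no corresponding pod in the API server could be stopped and deleted.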

@lfrancke lfrancke added this to the Release #1 milestone Aug 10, 2021
@soenkeliebau soenkeliebau modified the milestones: Release #1, Release #2 Sep 6, 2021
@siegfriedweber siegfriedweber self-assigned this Sep 8, 2021
@pipern (Contributor) commented Sep 9, 2021

If I understand my experience correctly, this is a significant risk: on boot, the old/wrong systemd unit may happen to start first, and its application then binds to the TCP socket, so the intended application cannot start correctly.

I see this now, since I have ended up with:

lrwxrwxrwx 1 root root 87 Sep  8 09:52 /etc/systemd/system/multi-user.target.wants/default-zookeeper-simple-server-default-vm1-45ll6-zookeeper.service -> /lib/systemd/system/default-zookeeper-simple-server-default-vm1-45ll6-zookeeper.service
lrwxrwxrwx 1 root root 87 Sep  6 15:57 /etc/systemd/system/multi-user.target.wants/default-zookeeper-simple-server-default-vm1-bhflq-zookeeper.service -> /lib/systemd/system/default-zookeeper-simple-server-default-vm1-bhflq-zookeeper.service

and so the journal is very noisy with the "wrong" zookeeper starting up and failing repeatedly due to the TCP port already being bound.
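
For anyone who ends up in this state before the agent cleans up after itself, a manual cleanup sketch (the unit name is the stale one from the listing above; adjust to whatever is actually left over on your node):

    # stop the stale unit and remove its multi-user.target.wants symlink
    systemctl disable --now default-zookeeper-simple-server-default-vm1-45ll6-zookeeper.service
    # remove the leftover unit file and let systemd forget about it
    rm /lib/systemd/system/default-zookeeper-simple-server-default-vm1-45ll6-zookeeper.service
    systemctl daemon-reload
    systemctl reset-failed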

@siegfriedweber (Member)

@pipern Yes, this issue has a high impact. The priority label and milestone reflect the urgency but not necessarily the severity.

@pipern (Contributor) commented Sep 9, 2021

Related to this: will the agent periodically check the systemd unit files (in a 'controller loop' fashion), so that if one is removed or edited by mistake, the agent puts it back?

I'm also thinking about which controller is the one doing the actual monitoring/controlling.

If a pod stops, then, as I understand it, the kubelet reports this to the controller and the controller schedules a replacement.

In Stackable, if the daemon (e.g., ZooKeeper) stops, systemd will actually restart it, so the krustlet would not report this to k8s. Do I understand that correctly?
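
(I am assuming here that the generated units carry a restart policy roughly like the following; I have not checked the actual unit contents:)

    [Service]
    Restart=always
    RestartSec=5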

Similarly, on boot, "normally" in k8s the scheduler would allocate a pod to a node, so without the kubelet running, the server would not run any applications. But in Stackable, systemd will start the applications regardless of the state of the krustlet or k8s. If the node has been offline for a while, it could start up with out-of-date configuration, since the krustlet hasn't had new information from the api-server while it was offline. This relates to https://docs.stackable.tech/home/adr/ADR005-systemd_unit_file_location.html - did you consider keeping the unit files on volatile storage, so that they are intentionally lost during a reboot and the operator is the one who decides whether an application should start or not?
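
To make that concrete: systemd also loads units from /run/systemd/system, which lives on tmpfs, so a unit placed there disappears on reboot and nothing starts until the agent re-creates it from the API server. A sketch, reusing the unit name from the listing above:

    # write the unit to volatile storage instead of /lib/systemd/system
    cp default-zookeeper-simple-server-default-vm1-45ll6-zookeeper.service /run/systemd/system/
    systemctl daemon-reload
    systemctl start default-zookeeper-simple-server-default-vm1-45ll6-zookeeper.service
    # after a reboot /run is empty again, so the unit is gone until the agent recreates it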

(https://github.com/stackabletech/agent/discussions is empty, but maybe that is the place for this. Shall I make a post there?)

@siegfriedweber (Member)

@pipern Discussions is probably better suited for your questions because they are outside the scope of this issue. I am looking forward to answering them there.

@pipern (Contributor) commented Sep 13, 2021

> @pipern Discussions is probably better suited for your questions because they are outside the scope of this issue. I am looking forward to answering them there.

Moved over to #303
