This repository has been archived by the owner on Dec 21, 2021. It is now read-only.

Remove systemd units without a corresponding pod #312

Merged 3 commits on Sep 24, 2021

@siegfriedweber siegfriedweber commented Sep 23, 2021

Description

On startup, the systemd units in the system-stackable slice are compared to the pods assigned to this node. If a systemd unit matches what is expected, it is kept and the Stackable Agent takes ownership of it again at a later stage. If there is no corresponding pod, or the systemd unit differs from the pod specification, the unit is removed; where a pod still requires it, the Stackable Agent creates a new systemd unit afterwards.
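The rules above can be sketched as a small decision function. This is only an illustration in shell, not the Rust code of the Stackable Agent: the function name `decide_unit_action`, its arguments and the order of the checks are assumptions. The "terminating" case is taken from the follow-up comment and the log output further down in this conversation.

```shell
#!/bin/sh
# Hypothetical helper, not part of the Stackable Agent: mirrors the startup
# cleanup rules for a single systemd unit.
#   $1: pod state, one of "present", "terminating" or "missing"
#   $2: unit content expected from the pod specification
#   $3: unit content actually found on disk
decide_unit_action() {
    pod_state=$1
    expected=$2
    actual=$3

    case "$pod_state" in
        missing)
            echo "remove: no corresponding pod exists"
            ;;
        terminating)
            echo "remove: the corresponding pod is terminating"
            ;;
        present)
            # Only an exact match of the unit content counts as "as expected".
            if [ "$expected" = "$actual" ]; then
                echo "keep: a corresponding pod exists"
            else
                echo "remove: differs from the pod specification"
            fi
            ;;
        *)
            echo "unknown pod state: $pod_state" >&2
            return 1
            ;;
    esac
}
```

For example, `decide_unit_action present "$expected" "$actual"` prints a "keep" line only when both contents are identical.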

Closes #180

Test

This change cannot be tested with the agent-integration-tests: the systemd units must be prepared before the Stackable Agent is started, and that cannot be done over the Kubernetes API. It must therefore be tested manually.

The following script can be used for manual testing:

#!/bin/sh

setup_stackable_repository() {
echo Setup Stackable repository

echo -n "
apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
  name: repositories.stable.stackable.de
spec:
  group: stable.stackable.de
  versions:
    - name: v1
      served: true
      storage: true
      schema:
        openAPIV3Schema:
          type: object
          properties:
            spec:
              type: object
              properties:
                repo_type:
                  type: string
                properties:
                  type: object
                  additionalProperties:
                    type: string
  scope: Namespaced
  names:
    plural: repositories
    singular: repository
    kind: Repository
    shortNames:
    - repo
" | kubectl apply -f -

echo -n "
apiVersion: stable.stackable.de/v1
kind: Repository
metadata:
  name: integration-test-repository
  namespace: default
spec:
  repo_type: StackableRepo
  properties:
    url: https://raw.githubusercontent.com/stackabletech/integration-test-repo/main/
" | kubectl apply -f -
}

setup_unit_with_pod() {
echo Setup unit with pod

echo -n "[Unit]
Description=default-cleanup-test-ok-noop-service
StartLimitIntervalSec=0

[Service]
Environment=\"KUBECONFIG=/root/.kube/config\"
ExecStart=/opt/stackable/packages/noop-service-1.0.0/noop-service-1.0.0/start.sh
RemainAfterExit=no
Restart=always
RestartSec=2
Slice=system-stackable.slice
StandardError=journal
StandardOutput=journal
TimeoutStopSec=30

[Install]
WantedBy=multi-user.target" > /lib/systemd/system/default-cleanup-test-ok-noop-service.service

systemctl daemon-reload
systemctl enable default-cleanup-test-ok-noop-service.service
systemctl start default-cleanup-test-ok-noop-service.service

echo "
apiVersion: v1
kind: Pod
metadata:
  name: cleanup-test-ok
spec:
  containers:
    - name: noop-service
      image: noop-service:1.0.0
      command:
        - noop-service-1.0.0/start.sh
  nodeName: localhost
  nodeSelector:
    kubernetes.io/arch: stackable-linux
  tolerations:
    - key: kubernetes.io/arch
      operator: Equal
      value: stackable-linux
" | kubectl apply -f -
}

setup_unit_without_pod() {
echo Setup unit without pod

echo -n "[Unit]
Description=default-cleanup-test-no-pod-noop-service
StartLimitIntervalSec=0

[Service]
Environment=\"KUBECONFIG=/root/.kube/config\"
ExecStart=/opt/stackable/packages/noop-service-1.0.0/noop-service-1.0.0/start.sh
RemainAfterExit=no
Restart=always
RestartSec=2
Slice=system-stackable.slice
StandardError=journal
StandardOutput=journal
TimeoutStopSec=30

[Install]
WantedBy=multi-user.target" > /lib/systemd/system/default-cleanup-test-no-pod-noop-service.service

systemctl daemon-reload
systemctl enable default-cleanup-test-no-pod-noop-service.service
systemctl start default-cleanup-test-no-pod-noop-service.service
}

setup_unit_with_unexpected_content() {
echo Setup unit with unexpected content

echo -n "[Unit]
Description=You did not expect this, did you?

[Service]
ExecStart=/opt/stackable/packages/noop-service-1.0.0/noop-service-1.0.0/start.sh
Slice=system-stackable.slice

[Install]
WantedBy=multi-user.target" > /lib/systemd/system/default-cleanup-test-unexpected-content-noop-service.service

systemctl daemon-reload
systemctl enable default-cleanup-test-unexpected-content-noop-service.service
systemctl start default-cleanup-test-unexpected-content-noop-service.service

echo "
apiVersion: v1
kind: Pod
metadata:
  name: cleanup-test-unexpected-content
spec:
  containers:
    - name: noop-service
      image: noop-service:1.0.0
      command:
        - noop-service-1.0.0/start.sh
  nodeName: localhost
  nodeSelector:
    kubernetes.io/arch: stackable-linux
  tolerations:
    - key: kubernetes.io/arch
      operator: Equal
      value: stackable-linux
" | kubectl apply -f -
}

setup_unit_with_terminating_pod() {
echo Setup unit with terminating pod

echo -n "[Unit]
Description=default-cleanup-test-terminating-noop-service
StartLimitIntervalSec=0

[Service]
Environment=\"KUBECONFIG=/root/.kube/config\"
ExecStart=/opt/stackable/packages/noop-service-1.0.0/noop-service-1.0.0/start.sh
RemainAfterExit=no
Restart=always
RestartSec=2
Slice=system-stackable.slice
StandardError=journal
StandardOutput=journal
TimeoutStopSec=30

[Install]
WantedBy=multi-user.target" > /lib/systemd/system/default-cleanup-test-terminating-noop-service.service

systemctl daemon-reload
systemctl enable default-cleanup-test-terminating-noop-service.service
systemctl start default-cleanup-test-terminating-noop-service.service

echo "
apiVersion: v1
kind: Pod
metadata:
  name: cleanup-test-terminating
spec:
  containers:
    - name: noop-service
      image: noop-service:1.0.0
      command:
        - noop-service-1.0.0/start.sh
  nodeName: localhost
  nodeSelector:
    kubernetes.io/arch: stackable-linux
  tolerations:
    - key: kubernetes.io/arch
      operator: Equal
      value: stackable-linux
" | kubectl apply -f -

kubectl delete pod cleanup-test-terminating &
}

setup_stackable_repository
setup_unit_with_pod
setup_unit_without_pod
setup_unit_with_unexpected_content
setup_unit_with_terminating_pod
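The unit file names in the script above all follow the pattern `<namespace>-<pod>-<container>.service`. As a minimal sketch, the name can be built by a helper; `unit_name` is a hypothetical function for this test script, and the pattern is inferred from the names used here rather than taken from the agent's source.

```shell
#!/bin/sh
# Hypothetical helper: builds a unit file name in the form used by the test
# script, <namespace>-<pod>-<container>.service.
#   $1: namespace, $2: pod name, $3: container name
unit_name() {
    printf '%s-%s-%s.service\n' "$1" "$2" "$3"
}

# Prints: default-cleanup-test-ok-noop-service.service
unit_name default cleanup-test-ok noop-service
```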

The log output of the Stackable Agent should be:

[2021-09-24T11:26:39Z INFO  stackable_agent::provider::cleanup] The systemd unit [default-cleanup-test-unexpected-content-noop-service.service] will be removed because it differs from the corresponding pod specification.
    expected content:
    [Unit]
    Description=default-cleanup-test-unexpected-content-noop-service
    StartLimitIntervalSec=0

    [Service]
    Environment="KUBECONFIG=/root/.kube/config"
    ExecStart=/opt/stackable/packages/noop-service-1.0.0/noop-service-1.0.0/start.sh
    RemainAfterExit=no
    Restart=always
    RestartSec=2
    Slice=system-stackable.slice
    StandardError=journal
    StandardOutput=journal
    TimeoutStopSec=30

    [Install]
    WantedBy=multi-user.target

    actual content:
    [Unit]
    Description=You did not expect this, did you?

    [Service]
    ExecStart=/opt/stackable/packages/noop-service-1.0.0/noop-service-1.0.0/start.sh
    Slice=system-stackable.slice

    [Install]
    WantedBy=multi-user.target
[2021-09-24T11:26:39Z INFO  stackable_agent::provider::cleanup] The systemd unit [default-cleanup-test-terminating-noop-service.service] will be removed because the corresponding pod is terminating.
[2021-09-24T11:26:39Z INFO  stackable_agent::provider::cleanup] The systemd unit [default-cleanup-test-ok-noop-service.service] will be kept because a corresponding pod exists.
[2021-09-24T11:26:39Z INFO  stackable_agent::provider::cleanup] The systemd unit [default-cleanup-test-no-pod-noop-service.service] will be removed because no corresponding pod exists.
[2021-09-24T11:26:39Z INFO  warp::server] TlsServer::run; addr=127.0.0.1:3000
[2021-09-24T11:26:39Z INFO  warp::server] listening on https://127.0.0.1:3000
[2021-09-24T11:26:39Z INFO  krator::runtime] Got a watch restart. Resyncing queue...
[2021-09-24T11:26:39Z INFO  krator::runtime] Finished resync of objects.
[2021-09-24T11:26:39Z INFO  stackable_agent::provider::states::pod::downloading] Looking for package: noop-service:1.0.0 in known repositories
[2021-09-24T11:26:39Z INFO  stackable_agent::provider::states::pod::downloading] Package noop-service:1.0.0 has already been downloaded to "/opt/stackable/packages/_download", continuing with installation
[2021-09-24T11:26:39Z INFO  stackable_agent::provider::states::pod::downloading] Looking for package: noop-service:1.0.0 in known repositories
[2021-09-24T11:26:39Z INFO  stackable_agent::provider::states::pod::downloading] Package noop-service:1.0.0 has already been downloaded to "/opt/stackable/packages/_download", continuing with installation
[2021-09-24T11:26:39Z INFO  stackable_agent::provider::states::pod::installing] Package noop-service:1.0.0 has already been installed
[2021-09-24T11:26:39Z INFO  stackable_agent::provider::states::pod::installing] Package noop-service:1.0.0 has already been installed
[2021-09-24T11:26:39Z INFO  stackable_agent::provider::states::pod::creating_service] Creating service unit for service default-cleanup-test-ok
[2021-09-24T11:26:39Z INFO  stackable_agent::provider::states::pod::creating_service] Creating service unit for service default-cleanup-test-unexpected-content
[2021-09-24T11:26:39Z INFO  stackable_agent::provider::states::pod::starting] Starting systemd unit [default-cleanup-test-unexpected-content-noop-service.service]
[2021-09-24T11:26:39Z INFO  stackable_agent::provider::states::pod::starting] Enabling systemd unit [default-cleanup-test-unexpected-content-noop-service.service]
[2021-09-24T11:26:40Z INFO  stackable_agent::provider::states::pod::terminated] Pod default-cleanup-test-terminating was terminated
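To compare a real run against the expected output above, the cleanup decisions can be extracted from the log. The following is a sketch only: `summarize_cleanup` is a hypothetical helper, and it assumes the message wording shown in the log lines above.

```shell
#!/bin/sh
# Hypothetical helper: reads Stackable Agent log lines on stdin and prints
# "<unit> kept" or "<unit> removed" for every cleanup decision.
# All other log lines are ignored.
summarize_cleanup() {
    sed -n \
        -e 's/^.*provider::cleanup] The systemd unit \[\([^]]*\)] will be kept .*$/\1 kept/p' \
        -e 's/^.*provider::cleanup] The systemd unit \[\([^]]*\)] will be removed .*$/\1 removed/p'
}
```

Piping the agent's log output into `summarize_cleanup` should then list one unit per line together with the decision taken for it.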

Leftover

The implementation of SystemDUnit was adapted only as far as necessary; a complete refactoring is still required and will be done in #244.

Review Checklist

  • Code contains useful comments
  • (Integration-)Test cases added (or not applicable)
  • Documentation added (or not applicable)
  • Changelog updated (or not applicable)

@siegfriedweber siegfriedweber requested a review from a team September 23, 2021 10:30
@siegfriedweber siegfriedweber self-assigned this Sep 23, 2021
maltesander previously approved these changes Sep 23, 2021

@maltesander maltesander left a comment

Got similar output during testing. LGTM.

@siegfriedweber siegfriedweber commented

Systemd units whose corresponding pod is terminating are now removed as well.

I extended the test script with the function setup_unit_with_terminating_pod.

State before starting the Stackable Agent:

# kubectl get pod cleanup-test-terminating
NAME                       READY   STATUS        RESTARTS   AGE
cleanup-test-terminating   0/1     Terminating   0          14s

# systemctl status default-cleanup-test-terminating-noop-service.service
● default-cleanup-test-terminating-noop-service.service - default-cleanup-test-terminating-noop-service
   Loaded: loaded (/usr/lib/systemd/system/default-cleanup-test-terminating-noop-service.service; enabled; vendor preset: disabled)
  Drop-In: /run/systemd/system/default-cleanup-test-terminating-noop-service.service.d
           └─zzz-lxc-service.conf
   Active: active (running) since Fri 2021-09-24 11:52:10 UTC; 24s ago
 Main PID: 27865 (start.sh)
   CGroup: /system.slice/system-stackable.slice/default-cleanup-test-terminating-noop-service.service
           ├─27865 /bin/sh /opt/stackable/packages/noop-service-1.0.0/noop-service-1.0.0/start.sh
           └─27866 sleep 1d

Sep 24 11:52:10 centos7 systemd[1]: Started default-cleanup-test-terminating-noop-service.
Sep 24 11:52:10 centos7 start.sh[27865]: test-service started

Corresponding log output of the Stackable Agent:

[2021-09-24T11:54:43Z INFO  stackable_agent::provider::cleanup] The systemd unit [default-cleanup-test-terminating-noop-service.service] will be removed because the corresponding pod is terminating.
[2021-09-24T11:54:45Z INFO  stackable_agent::provider::states::pod::terminated] Pod default-cleanup-test-terminating was terminated

State after starting the Stackable Agent:

# kubectl get pod cleanup-test-terminating
Error from server (NotFound): pods "cleanup-test-terminating" not found

# systemctl status default-cleanup-test-terminating-noop-service.service
Unit default-cleanup-test-terminating-noop-service.service could not be found.
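The two checks above can be combined into one assertion for the manual test. This is a rough sketch: `assert_cleaned_up` is a hypothetical helper, and it treats any failure of `kubectl get pod` or `systemctl cat` as "not found", so an unreachable API server would also count as a missing pod.

```shell
#!/bin/sh
# Hypothetical helper: succeeds only if neither the pod nor the unit file can
# be found any more.
#   $1: pod name, $2: unit name
assert_cleaned_up() {
    pod=$1
    unit=$2

    # kubectl exits non-zero when the pod cannot be retrieved.
    if kubectl get pod "$pod" >/dev/null 2>&1; then
        echo "pod $pod still exists" >&2
        return 1
    fi

    # systemctl cat exits non-zero when no unit file can be found.
    if systemctl cat "$unit" >/dev/null 2>&1; then
        echo "unit $unit still exists" >&2
        return 1
    fi

    echo "cleaned up: $pod / $unit"
}
```

After the Stackable Agent has started, `assert_cleaned_up cleanup-test-terminating default-cleanup-test-terminating-noop-service.service` is expected to succeed.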

@soenkeliebau soenkeliebau left a comment

LGTM - gave the code a quick glance and tested it on a Debian T2 cluster

@siegfriedweber siegfriedweber merged commit aacbc1e into main Sep 24, 2021
@siegfriedweber siegfriedweber deleted the cleanup branch September 24, 2021 13:00
Successfully merging this pull request may close these issues.

The agent should clean up "lost" systemd units
3 participants