Adding ephemeral nodes to Monsoon #3

Open
wants to merge 2 commits into
base: main


@Sadi-a Sadi-a commented Jul 27, 2023

Until now, Typhoon required Flatcar Container Linux to be installed on a machine before it could start working on it. This impedes the implementation of ephemeral workers, hence this change. This commit adds a variable `enable_install`, which defaults to true. When it is set to false, the machines boot live, without installing Flatcar Container Linux, while using a persistent partition for things such as SSH host keys or Kubernetes certificates. The disk used for this persistent partition is either a disk specified by its path in the controller's or worker's definition or, failing that, the smallest disk available on the machine, since we do not need to store much data. This is done by introducing conditions and logic in the YAML templates, sending the controller/worker Ignition files instead of the install Ignition file, and moving some of the work previously done over SSH into the Ignition file.

This means that if a worker becomes unhealthy, then once the cause of the failure has been investigated, it can simply be rebooted with an IPMI `power reset` command or a `systemctl reboot`. It will then be reimaged and reprovisioned from the Ignition file, while persistent data such as some Kubernetes certificates or manifests and the SSH host keys are kept.
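
To illustrate the disk-selection rule for the persistent partition described above, here is a minimal Python sketch. It is not the code from this pull request (the real logic lives in the Butane/YAML templates); the `Disk` type, the `pick_persistent_disk` function, and the device paths are hypothetical names used only for illustration.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Disk:
    """Hypothetical description of a block device on a machine."""
    path: str
    size_bytes: int


def pick_persistent_disk(disks: Sequence[Disk],
                         configured_path: Optional[str] = None) -> str:
    """Return the path of the disk that should hold the persistent partition.

    Assumed rule, taken from the description above: an explicitly
    configured path wins; otherwise the smallest available disk is used,
    since little data needs to be stored.
    """
    if configured_path is not None:
        return configured_path
    if not disks:
        raise ValueError("no disks available for the persistent partition")
    return min(disks, key=lambda d: d.size_bytes).path


# Example with made-up devices and sizes.
disks = [Disk("/dev/sda", 500 * 10**9), Disk("/dev/sdb", 32 * 10**9)]
chosen_default = pick_persistent_disk(disks)
chosen_explicit = pick_persistent_disk(disks, "/dev/sda")
```

With no path configured, the sketch falls back to the smallest disk; with a path configured, that path is used regardless of size.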

This pull request only affects Monsoon's Flatcar Container Linux and does not change Monsoon's Fedora CoreOS.

This pull request has been tested with the test suite: bootstrapping works as expected, and a restarted worker registers with the cluster again once it has finished rebooting.

@Sadi-a Sadi-a changed the title Adding ephemeral worker nodes to Monsoon Adding ephemeral nodes to Monsoon Jul 28, 2023
@Sadi-a Sadi-a force-pushed the sadi/live branch 2 times, most recently from af5b134 to 3ace3f1 Compare August 4, 2023 11:00
flatcar-linux/kubernetes/butane/controller.yaml Outdated
flatcar-linux/kubernetes/butane/controller.yaml Outdated
flatcar-linux/kubernetes/worker/ssh.tf Outdated
flatcar-linux/kubernetes/profiles.tf Outdated
@Sadi-a Sadi-a force-pushed the sadi/live branch 6 times, most recently from 4db4e1a to 6bdaff2 Compare August 11, 2023 16:49
@Sadi-a Sadi-a requested a review from Snaipe August 14, 2023 16:08
@Sadi-a Sadi-a force-pushed the sadi/live branch 3 times, most recently from 5e7878a to 7cf641c Compare December 13, 2023 13:55

Sadi-a commented Dec 13, 2023

I had to work on this a little more: there was an issue where etcd data, which is what stores the Kubernetes objects in our case, was not persistent, so we would lose all our secrets, namespaces and so on whenever the controller rebooted.
I have finished fixing this and testing the fix.

Formerly, the Kubernetes manifests (such as the kubeconfig) stored in
the persistent partitions of our nodes were kept even after we ran
`terraform apply` and rebooted the nodes to pick up the new
configuration. A newly added node would thus receive the new
kubeconfig, while the other nodes kept the old one.
This commit fixes that by making sure the old Kubernetes manifests can
be overwritten when we reboot after a `terraform apply`.
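
The idea of replacing a stale manifest on the persistent partition can be sketched in Python as follows. This is only an assumed illustration, not how the commit implements it (the actual change is in the Ignition/Butane configuration); `sync_manifest` and the file names are hypothetical.

```python
import tempfile
from pathlib import Path


def sync_manifest(rendered: str, target: Path) -> bool:
    """Make `target` match the freshly rendered manifest.

    Returns True when the file was created or overwritten, and False when
    the persistent copy was already up to date.
    """
    if target.exists() and target.read_text() == rendered:
        return False
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(rendered)
    return True


# Example: a stale kubeconfig on a stand-in "persistent partition".
with tempfile.TemporaryDirectory() as persistent_dir:
    kubeconfig = Path(persistent_dir) / "kubernetes" / "kubeconfig"
    first_write = sync_manifest("old-config", kubeconfig)   # file created
    overwritten = sync_manifest("new-config", kubeconfig)   # stale copy replaced
    unchanged = sync_manifest("new-config", kubeconfig)     # nothing to do
    final_content = kubeconfig.read_text()
```

Under this sketch, every node ends up with the newly rendered kubeconfig after a reboot, rather than only the nodes that were added after the last `terraform apply`.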