forked from poseidon/typhoon
Adding ephemeral nodes to Monsoon #3
Open
Sadi-a wants to merge 2 commits into aristanetworks:main from Sadi-a:sadi/live
Sadi-a changed the title from "Adding ephemeral worker nodes to Monsoon" to "Adding ephemeral nodes to Monsoon" (Jul 28, 2023)
Sadi-a force-pushed the sadi/live branch 2 times, most recently from af5b134 to 3ace3f1 (August 4, 2023 11:00)
Snaipe reviewed (Aug 10, 2023)
Sadi-a force-pushed the sadi/live branch 6 times, most recently from 4db4e1a to 6bdaff2 (August 11, 2023 16:49)
Sadi-a force-pushed the sadi/live branch 3 times, most recently from 5e7878a to 7cf641c (December 13, 2023 13:55)
Until now, Typhoon required Flatcar Linux to be installed on a machine before the machine could be used. That requirement blocks the implementation of ephemeral workers, which is the reason for this change. This commit adds a variable `enable_install`, which defaults to true. When it is set to false, machines boot live, without installing Flatcar Linux, and use a persistent partition for data such as SSH host keys and Kubernetes certificates. The persistent partition lives either on the disk whose path is specified in the controller's or worker's definition or, if none is specified, on the smallest disk available on the machine, since little data needs to be stored. The change adds conditions and logic to the YAML templates, serves the controller/worker Ignition files instead of the install Ignition file, and moves some of the steps previously performed over SSH into the Ignition file. As a result, if a worker becomes unhealthy, it can simply be rebooted, once the cause of the failure has been investigated, with an IPMI `power reset` command or a `systemctl reboot`. It is then reimaged and reprovisioned from the Ignition file, while persistent data such as some Kubernetes certificates, manifests, and SSH host keys is preserved.
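The disk selection rule described above (an explicitly configured path if there is one, otherwise the smallest disk) can be sketched roughly as follows. This is only an illustration in Python, not the logic from the templates in this commit: the `Disk` type, the `pick_persistent_disk` name, and the tie-break on path are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Disk:
    """Hypothetical description of a block device on a node."""
    path: str         # device path, for example "/dev/sda"
    size_bytes: int   # capacity in bytes


def pick_persistent_disk(
    disks: Sequence[Disk], configured_path: Optional[str] = None
) -> Disk:
    """Pick the disk that would hold the persistent partition.

    An explicitly configured path wins. Otherwise the smallest disk is
    chosen, since the partition only stores a small amount of data.
    """
    if not disks:
        raise ValueError("no disks available")
    if configured_path is not None:
        for disk in disks:
            if disk.path == configured_path:
                return disk
        raise ValueError(f"configured disk {configured_path!r} not found")
    # Assumption: ties on size are broken by path so the result is stable.
    return min(disks, key=lambda d: (d.size_bytes, d.path))
```

Breaking ties deterministically matters for a persistent partition: if two disks have the same size, picking a different one on each boot would make the stored data appear to vanish.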
This needed a little more work: etcd data, which is where Kubernetes objects are stored in our case, was not persistent, so we would lose all our secrets, namespaces and so on whenever the controller rebooted.
Formerly, the Kubernetes manifests stored in the persistent partitions of our nodes, such as the kubeconfig, would remain in place even after we ran `terraform apply` and rebooted the nodes to pick up the new configuration. A node that had just been added would receive the new kubeconfig file, while the other nodes kept the old one. This commit fixes that by making sure the old Kubernetes manifests can be overwritten when we reboot after a `terraform apply`.
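To illustrate the behaviour described above, here is a minimal Python sketch of overwriting stale files in a persistent directory with freshly rendered ones while leaving unrelated persistent files alone. It is not the mechanism used by this commit, which is not spelled out here; the function name `refresh_manifests` and both directory arguments are hypothetical.

```python
import hashlib
from pathlib import Path
from typing import List


def _digest(path: Path) -> str:
    """Return the SHA-256 hex digest of a file's content."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def refresh_manifests(fresh_dir: Path, persistent_dir: Path) -> List[str]:
    """Overwrite persistent copies whose content differs from the fresh ones.

    Files that exist only in persistent_dir (for example SSH host keys)
    are never touched. Returns the relative names of the files that were
    created or overwritten.
    """
    updated: List[str] = []
    for src in sorted(p for p in fresh_dir.rglob("*") if p.is_file()):
        rel = src.relative_to(fresh_dir)
        dst = persistent_dir / rel
        if dst.is_file() and _digest(dst) == _digest(src):
            continue  # already up to date
        dst.parent.mkdir(parents=True, exist_ok=True)
        dst.write_bytes(src.read_bytes())
        updated.append(rel.as_posix())
    return updated
```

Running it a second time with the same inputs returns an empty list, which matches the intent that every node ends up with the same kubeconfig after a reboot.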
Until now, Typhoon required Flatcar Container Linux to be installed on a machine before the machine could be used. That requirement blocks the implementation of ephemeral workers, which is the reason for this change. This commit adds a variable `enable_install`, which defaults to true. When it is set to false, machines boot live, without installing Flatcar Container Linux, and use a persistent partition for data such as SSH host keys and Kubernetes certificates. The persistent partition lives either on the disk whose path is specified in the controller's or worker's definition or, if none is specified, on the smallest disk available on the machine, since little data needs to be stored. The change adds conditions and logic to the YAML templates, serves the controller/worker Ignition files instead of the install Ignition file, and moves some of the steps previously performed over SSH into the Ignition file.
As a result, if a worker becomes unhealthy, it can simply be rebooted, once the cause of the failure has been investigated, with an IPMI `power reset` command or a `systemctl reboot`. It is then reimaged and reprovisioned from the Ignition file, while persistent data such as some Kubernetes certificates, manifests, and SSH host keys is preserved.
This pull request only affects Monsoon's Flatcar Container Linux and does not change Monsoon's Fedora CoreOS.
This pull request has been tested with the test suite: bootstrapping works as expected, and a worker that is restarted registers with the cluster again once its reboot finishes.
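The choice described above between serving the install Ignition file and serving the controller/worker Ignition file directly can be summarised in a few lines. This is a hedged Python sketch of the decision only, assuming it is made when a machine boots over the network; `ignition_profile` and the profile names it returns are hypothetical and are not identifiers from this pull request.

```python
def ignition_profile(role: str, enable_install: bool = True) -> str:
    """Name the Ignition config a network-booting machine would receive.

    With enable_install left at its default (true), the machine gets the
    install config. With enable_install set to false, it boots live and
    gets the config for its role directly.
    """
    if role not in ("controller", "worker"):
        raise ValueError(f"unknown role: {role!r}")
    return "install" if enable_install else role
```

For example, `ignition_profile("worker", enable_install=False)` returns `"worker"`, which corresponds to the ephemeral worker case this pull request targets.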