
[WIP] Add kubeadm upgrades proposal #825

Closed

Conversation

luxas
Member

@luxas luxas commented Jul 19, 2017

This proposal has so far been developed in this Google doc: https://docs.google.com/document/d/1PRrC2tvB-p7sotIA5rnHy5WAOGdJJOIXPPv23hUFGrY/edit

Features issue: kubernetes/enhancements#296

@kubernetes/sig-cluster-lifecycle-proposals
@kubernetes/sig-onprem-proposals
for general review/approval

@kubernetes/sig-architecture-proposals
for review of Kubernetes upgrades with no external dependencies.
The long-term aim is to provide an easy way to do upgrades against any cluster that meets some basic requirements.

@kubernetes/sig-api-machinery-proposals a heads up on the expected retry loop while trying to bind to a port

@kubernetes/sig-apps-proposals for being able to upgrade DaemonSets using an "add first, then delete" strategy

@kubernetes/sig-node-proposals for the expected checkpointing functionality

I'm not sure whether the markdown converter I used preserved the styling well; if not, I'll update that in the coming days.

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Jul 19, 2017

What actually happens in the Static Pod -> Self-hosted transition?

First the PodSpec is modified. For instance, instead of hard-coding an IP for the API server to listen on, it is dynamically fetched from the Downward API. When it's OK to proceed, the self-hosted control plane DaemonSets are posted to the Static Pod API Server. The Static Pod API Server starts the self-hosted Pods normally. The Pods get in the Running state, but fail to bind to the port on the host and back off internally. During that time, kubeadm notices the Running state of the self-hosted Pod and deletes the Static Pod file, which causes the kubelet to stop the Static Pod-hosted component immediately. On the next run, the self-hosted component will try to bind to the port, which is now free, and will succeed. A self-hosted control plane has been created!
Contributor

Can you break all single-line comments into separate lines? Will be better for reviews.

Member Author

I'll do that
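
Below is a minimal Go sketch of the pivot sequence described in the quoted paragraph, assuming current client-go call signatures; the namespace, label convention and manifest path are illustrative, not kubeadm's actual implementation:

```go
package selfhosting

import (
	"context"
	"os"
	"time"

	appsv1 "k8s.io/api/apps/v1"
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/client-go/kubernetes"
)

// pivotComponent posts the self-hosted DaemonSet, waits for its Pod to report
// Running, then deletes the Static Pod manifest so the kubelet stops the old copy
// and the self-hosted replica can grab the now-free host port.
func pivotComponent(ctx context.Context, cs kubernetes.Interface, ds *appsv1.DaemonSet, staticManifestPath string) error {
	if _, err := cs.AppsV1().DaemonSets("kube-system").Create(ctx, ds, metav1.CreateOptions{}); err != nil {
		return err
	}
	// The self-hosted Pod will be Running but crash-looping on the port bind
	// until the Static Pod goes away; Running is all we wait for here.
	err := wait.PollImmediate(2*time.Second, 2*time.Minute, func() (bool, error) {
		pods, err := cs.CoreV1().Pods("kube-system").List(ctx, metav1.ListOptions{
			LabelSelector: "k8s-app=" + ds.Name, // assumed label convention
		})
		if err != nil {
			return false, nil // retry on transient API errors
		}
		for _, p := range pods.Items {
			if p.Status.Phase == corev1.PodRunning {
				return true, nil
			}
		}
		return false, nil
	})
	if err != nil {
		return err
	}
	// Removing the manifest makes the kubelet stop the Static Pod immediately.
	return os.Remove(staticManifestPath)
}
```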


Notably, not etcd or kubelet for the moment.

Self-hosting will be a [phase](https://github.com/kubernetes/kubeadm/blob/master/docs/design/design.md) from kubeadm's perspective. Since there are Static Pods on disk from earlier in the `kubeadm init` process, kubeadm can quite easily parse each manifest file and extract its PodSpec. This PodSpec will be modified a little to fit the self-hosting purpose and injected into a DaemonSet, one of which will be created for each control plane component (API Server, Scheduler and Controller Manager). kubeadm will wait for the self-hosted control plane Pods to be running and then destroy the Static Pod-hosted control plane.
Contributor

Assuming the DaemonSet controller continues to schedule DS pods, isn't it better for the controller manager to be a Deployment as opposed to a DaemonSet? If all control plane components run as DaemonSets, isn't the controller manager a single point of failure?

Contributor

The kubelet will keep the controller manager running. You could get into a state where the controller manager is unrunnable due to a configuration or coding error, but I don't see how that would be any better with deployments.

Contributor

I agree with @roberthbailey. Using deployments won't solve the problem. On the other hand, it'd be good to state clearly why a DaemonSet was chosen here.
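
For reference, a rough sketch (file locations, names and labels are assumptions, not kubeadm's real code) of the manifest-to-DaemonSet conversion described in the quoted paragraph:

```go
package selfhosting

import (
	"os"

	appsv1 "k8s.io/api/apps/v1"
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"sigs.k8s.io/yaml"
)

// loadStaticPod parses a manifest such as /etc/kubernetes/manifests/kube-apiserver.yaml.
func loadStaticPod(path string) (*corev1.Pod, error) {
	data, err := os.ReadFile(path)
	if err != nil {
		return nil, err
	}
	pod := &corev1.Pod{}
	if err := yaml.Unmarshal(data, pod); err != nil {
		return nil, err
	}
	return pod, nil
}

// daemonSetForComponent wraps the (lightly modified) PodSpec in a DaemonSet that is
// pinned to master nodes via a nodeSelector. Names and labels here are made up.
func daemonSetForComponent(component string, spec corev1.PodSpec) *appsv1.DaemonSet {
	labels := map[string]string{"k8s-app": "self-hosted-" + component}
	spec.NodeSelector = map[string]string{"node-role.kubernetes.io/master": ""}
	return &appsv1.DaemonSet{
		ObjectMeta: metav1.ObjectMeta{
			Name:      "self-hosted-" + component,
			Namespace: "kube-system",
			Labels:    labels,
		},
		Spec: appsv1.DaemonSetSpec{
			Selector: &metav1.LabelSelector{MatchLabels: labels},
			Template: corev1.PodTemplateSpec{
				ObjectMeta: metav1.ObjectMeta{Labels: labels},
				Spec:       spec,
			},
		},
	}
}
```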


2. Build an upgrading Operator that does the upgrading for us

* Would consume a TPR with details how to do the upgrade
Contributor

CRD instead of TPR

Member

if you want kubeadm to be a building block for widely varying clusters, be cautious about requiring a custom resource be part of the base install path.


In v1.7 and v1.8, etcd runs in a Static Pod on the master. In v1.7, the default etcd version for Kubernetes is v3.0.17 and in k8s v1.8 the recommended version will be something like v3.1.10. In the v1.7->v1.8 upgrade path, we could offer upgrading etcd as well, as an opt-in*. *This only applies to minor versions and 100% backwards-compatible upgrades like v3.0.x->v3.1.y, not backwards-incompatible upgrades like etcdv2 -> etcdv3.

The easiest way of achieving an etcd upgrade is probably to create a Job of some kind (with `.spec.completions=<number of masters>`, Pod anti-affinity and `.spec.parallelism=1`) that would upgrade the etcd Static Pod manifest on disk by writing a new manifest, waiting for it to restart cleanly, rolling back on failure, etc.
Contributor

Would be nice to add an example in the kubeadm repo.
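
For instance, such a Job could look roughly like the following when built with client-go types; the image, command and labels are placeholders, not a tested design:

```go
package upgrade

import (
	batchv1 "k8s.io/api/batch/v1"
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// etcdUpgradeJob runs one Pod per master, one at a time; each Pod rewrites the
// local etcd Static Pod manifest (the actual upgrade logic would live in the
// placeholder image/script below).
func etcdUpgradeJob(masters int32) *batchv1.Job {
	one := int32(1)
	labels := map[string]string{"k8s-app": "etcd-upgrade"}
	return &batchv1.Job{
		ObjectMeta: metav1.ObjectMeta{Name: "etcd-upgrade", Namespace: "kube-system"},
		Spec: batchv1.JobSpec{
			Completions: &masters, // one completion per master
			Parallelism: &one,     // upgrade one etcd member at a time
			Template: corev1.PodTemplateSpec{
				ObjectMeta: metav1.ObjectMeta{Labels: labels},
				Spec: corev1.PodSpec{
					RestartPolicy: corev1.RestartPolicyOnFailure,
					Affinity: &corev1.Affinity{
						PodAntiAffinity: &corev1.PodAntiAffinity{
							// Avoid co-scheduling two upgrade Pods on the same host.
							RequiredDuringSchedulingIgnoredDuringExecution: []corev1.PodAffinityTerm{{
								LabelSelector: &metav1.LabelSelector{MatchLabels: labels},
								TopologyKey:   "kubernetes.io/hostname",
							}},
						},
					},
					Containers: []corev1.Container{{
						Name:    "upgrade-etcd-manifest",
						Image:   "example.com/etcd-manifest-upgrader:latest", // placeholder
						Command: []string{"/upgrade-etcd-manifest.sh"},       // placeholder
					}},
				},
			},
		},
	}
}
```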


What actually happens in the Static Pod -> Self-hosted transition?

First the PodSpec is modified. For instance, instead of hard-coding an IP for the API server to listen on, it is dynamically fetched from the Downward API. When it's OK to proceed, the self-hosted control plane DaemonSets are posted to the Static Pod API Server. The Static Pod API Server starts the self-hosted Pods normally. The Pods get in the Running state, but fail to bind to the port on the host and back off internally. During that time, kubeadm notices the Running state of the self-hosted Pod and deletes the Static Pod file, which causes the kubelet to stop the Static Pod-hosted component immediately. On the next run, the self-hosted component will try to bind to the port, which is now free, and will succeed. A self-hosted control plane has been created!
Contributor

The Pods get in the Running state...

I believe that the behavior described is only true for the apiserver. The KCM and scheduler should just run fine since they aren't trying to bind to host ports.

Contributor

Both the KCM and scheduler use host networking and bind ports, so I think this may be true.

On the other hand, if this is not true, there'd be two copies of KCM and scheduler running. That may cause unexpected consequences.

Contributor

May want to clarify what "running" means here.

These master components all have liveness checks. If they cannot bind to the ports for their healthz endpoints, kubelet may try to kill them repeatedly. You may be able to avoid this by extending the initial delay of the liveness checks though.
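
For instance, the pivoted manifests could carry a probe with a longer initial delay; the values below are arbitrary examples, not recommendations:

```go
package selfhosting

import (
	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/util/intstr"
)

// apiServerLivenessProbe returns a /healthz probe whose initial delay is long enough
// to ride out the bind-retry loop during the pivot, so the kubelet doesn't kill the
// component prematurely. All numbers are arbitrary.
func apiServerLivenessProbe() *corev1.Probe {
	return &corev1.Probe{
		ProbeHandler: corev1.ProbeHandler{
			HTTPGet: &corev1.HTTPGetAction{
				Host:   "127.0.0.1",
				Path:   "/healthz",
				Port:   intstr.FromInt(6443),
				Scheme: corev1.URISchemeHTTPS,
			},
		},
		InitialDelaySeconds: 180,
		TimeoutSeconds:      15,
		FailureThreshold:    8,
	}
}
```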

Contributor

Are there any requirements for KCM and scheduler to use host networking? We've been successful running them both on the pod network & using service-account for apiserver access.

I believe the health check wouldn't come into play in the case of the apiserver, because it is just reaching out to the same host:port regardless of whether it's the static manifest or the self-hosted "replacement". It's a bit disingenuous in that the healthcheck is really only testing one of them.

One problematic piece here is that the apiserver can't actually bind on the port, so it will be in a restart loop. If it does this enough, the backoff period can make it seem like something went wrong (we start the pivot by removing the static pod, then the backoff period on the replacement causes it to not be restarted for a while).

As long as the kubelet itself doesn't die during that process, it will recover. It's just somewhat less than ideal.

Member

On the other hand, if this is not true, there'd be two copies of KCM and scheduler running. That may cause unexpected consequences.

It shouldn't; they are set up to do leader election (locking on Endpoints) by default.


Status as of v1.7: the self-hosted config file option exists, but it is not split out into a phase. The code is fragile and uses Deployments for the scheduler and controller-manager, which unfortunately [leads to a deadlock](https://github.com/kubernetes/kubernetes/issues/45717) at `kubeadm init` time. It does not leverage checkpointing, so the entire cluster burns to the ground if you reboot your computer(s). Long story short: self-hosting is not production-ready in 1.7.

The rest of the document assumes a self-hosted cluster. It will not be possible to use 'kubeadm upgrade' on a cluster unless it's self-hosted. The sig-cluster-lifecycle team will still provide some manual, documented steps that a user can follow to upgrade a non-self-hosted cluster.
Contributor

The rest of the document assumes a self-hosted cluster.

This applies to the whole doc, not just the rest, because the preceding section describes self-hosting. I think we need to state this at the top of the document rather than hiding it in this section. We should also provide a link at that point to the instructions for manually upgrading a non-self-hosted cluster.

Once this statement is at the top, then the section describing self-hosting can be seen as background. Right now this document is describing both how we implement self-hosting (do we have that described anywhere else?) and also how we do upgrades, nominally in a doc that is just supposed to explain upgrades.

Another option is to break out the self-hosting description into a different document.

Member Author

Yeah, I think that breaking out the self-hosting parts is a good idea.
Also, self-hosting turned out not to be as important in this process as we had thought.


Self-hosting implementation for v1.8

The API Server, controller manager and scheduler will all be deployed in DaemonSets with node affinity (using `nodeSelector`) to masters. The current way DaemonSet upgrades currently operate is "remove an old Pod, then add a new Pod". Since supporting “single masters” is a definite requirement, we have to either workaround removing the only replica of the scheduler/controller-manager by duplicating the existing DaemonSets (e.g. a `temp-self-hosted-kube-apiserver` DS will be created as a copy of the normal `self-hosted-kube-apiserver` DS during the upgrade) or ask for a [new upgrade strategy](https://github.com/kubernetes/kubernetes/issues/48841) (“add first, then delete”, which is generally useful) from sig-apps.
Contributor

with node affinity (using nodeSelector) to masters.

Are we also going to use taints?

Contributor

nit: the node affinity needs to be a hard requirement.

Member Author

clarified that we're using the nodeSelector feature right now, not the "real" node affinity one


Self-hosting implementation for v1.8

The API Server, controller manager and scheduler will all be deployed in DaemonSets with node affinity (using `nodeSelector`) to masters. The current way DaemonSet upgrades currently operate is "remove an old Pod, then add a new Pod". Since supporting “single masters” is a definite requirement, we have to either workaround removing the only replica of the scheduler/controller-manager by duplicating the existing DaemonSets (e.g. a `temp-self-hosted-kube-apiserver` DS will be created as a copy of the normal `self-hosted-kube-apiserver` DS during the upgrade) or ask for a [new upgrade strategy](https://github.com/kubernetes/kubernetes/issues/48841) (“add first, then delete”, which is generally useful) from sig-apps.
Contributor

Since supporting “single masters” is a definite requirement,

remove "definite".

Member Author

fixed


Checkpointing

In order to be able to reboot a self-hosted cluster (e.g. a single, self-hosted master), there has to be some kind of checkpointing mechanism. Basically the kubelet will write some state to disk for Pods that opt-in to checkpointing ("I’m running an API Server, let’s write that down so I remember it", etc.). Then if the kubelet reboots (in the single-master case for example), it will check the state store for Pods it was running before the reboot. It discovers that it ran an API Server, Controller Manager and Scheduler and starts those Pods now as well.
Contributor

Basically the kubelet will write

Remove Basically from the beginning of the sentence.

Member Author

fixed


Checkpointing

In order to be able to reboot a self-hosted cluster (e.g. a single, self-hosted master), there has to be some kind of checkpointing mechanism. Basically the kubelet will write some state to disk for Pods that opt-in to checkpointing ("I’m running an API Server, let’s write that down so I remember it", etc.). Then if the kubelet reboots (in the single-master case for example), it will check the state store for Pods it was running before the reboot. It discovers that it ran an API Server, Controller Manager and Scheduler and starts those Pods now as well.
Contributor

Please add a link to kubernetes/kubernetes#49236

Member Author

fixed


11. Has to define an API (CRD or the like) between the client and the Operator

**Decision**: Keep the logic inside of the kubeadm CLI (option 1) for the implementation in v1.8.0.
Contributor

This makes sense. I don't think it would be difficult to move the logic later (at least not more difficult than using an operator now).


One of the hardest parts of implementing the upgrade will be respecting customizations made by the user at `kubeadm init` time. The proposed solution would be to store the kubeadm configuration given at `init`-time in the API as a ConfigMap, and then retrieve that configuration at upgrade time, parse it using the API machinery and use it for the upgrade.

This highlights a very important point: **We have to get the kubeadm configuration API group to Beta (v1beta1) in time for v1.8.**
Contributor

Is this on track?
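
For illustration, the store-and-retrieve round trip could look like this; the ConfigMap name and key are assumptions, and decoding via the kubeadm API machinery is omitted:

```go
package upgrade

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

const (
	configMapName = "kubeadm-config"       // assumed name
	configMapKey  = "MasterConfiguration"  // assumed key
)

// storeInitConfig saves the serialized kubeadm configuration at `kubeadm init` time.
func storeInitConfig(ctx context.Context, cs kubernetes.Interface, marshaledCfg string) error {
	cm := &corev1.ConfigMap{
		ObjectMeta: metav1.ObjectMeta{Name: configMapName, Namespace: "kube-system"},
		Data:       map[string]string{configMapKey: marshaledCfg},
	}
	_, err := cs.CoreV1().ConfigMaps("kube-system").Create(ctx, cm, metav1.CreateOptions{})
	return err
}

// loadInitConfig retrieves the stored configuration again at upgrade time; parsing it
// back into a versioned config object is left to the kubeadm API machinery.
func loadInitConfig(ctx context.Context, cs kubernetes.Interface) (string, error) {
	cm, err := cs.CoreV1().ConfigMaps("kube-system").Get(ctx, configMapName, metav1.GetOptions{})
	if err != nil {
		return "", err
	}
	return cm.Data[configMapKey], nil
}
```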


This is a stretch goal for v1.8, but not a strictly necessary feature.

Pre-pulling of images/ensuring the upgrade doesn’t take too long
Contributor

This doesn't seem like a subsection of alternatives considered.


2. Make sure the cluster is healthy

1. Make sure the API Server’s `/healthz` endpoint returns `ok`
Contributor

Should we look at /componentstatuses too?


1. Make sure the API Server’s `/healthz` endpoint returns `ok`

2. Make sure all Nodes return `Ready` status
Contributor

Do you also check node conditions?
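
A sketch of these pre-flight checks (including node conditions, per the question above); this is illustrative rather than kubeadm's actual check code:

```go
package upgrade

import (
	"context"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// checkClusterHealth verifies that the API server answers /healthz with "ok" and
// that every Node reports the Ready condition as True.
func checkClusterHealth(ctx context.Context, cs kubernetes.Interface) error {
	body, err := cs.Discovery().RESTClient().Get().AbsPath("/healthz").DoRaw(ctx)
	if err != nil {
		return fmt.Errorf("API server /healthz check failed: %v", err)
	}
	if string(body) != "ok" {
		return fmt.Errorf("API server /healthz returned %q, expected \"ok\"", string(body))
	}

	nodes, err := cs.CoreV1().Nodes().List(ctx, metav1.ListOptions{})
	if err != nil {
		return err
	}
	for _, node := range nodes.Items {
		ready := false
		for _, cond := range node.Status.Conditions {
			if cond.Type == corev1.NodeReady && cond.Status == corev1.ConditionTrue {
				ready = true
			}
		}
		if !ready {
			return fmt.Errorf("node %s is not Ready", node.Name)
		}
	}
	return nil
}
```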


Self-hosting implementation for v1.8

The API Server, controller manager and scheduler will all be deployed in DaemonSets with node affinity (using `nodeSelector`) to masters. The current way DaemonSet upgrades currently operate is "remove an old Pod, then add a new Pod". Since supporting “single masters” is a definite requirement, we have to either workaround removing the only replica of the scheduler/controller-manager by duplicating the existing DaemonSets (e.g. a `temp-self-hosted-kube-apiserver` DS will be created as a copy of the normal `self-hosted-kube-apiserver` DS during the upgrade) or ask for a [new upgrade strategy](https://github.com/kubernetes/kubernetes/issues/48841) (“add first, then delete”, which is generally useful) from sig-apps.
Contributor

Explain the "single masters" requirement.

s/The current way DaemonSet upgrades currently operate is "remove/Currently, DaemonSet performs upgrades by "removing


Defining decent upgrading and version skew policies is important before implementing.

Definitions
Contributor

Make it a header for readability, e.g., ##Definitions


6. Example: running `kubeadm upgrade apply --version v1.9.0` against a v1.8.2 control plane will error out if the nodes are still on v1.7.x

4. This means that there are possibly two kinds of upgrades kubeadm can do:
Contributor

Is this part of the upgrade policy, or what's derived from it? I feel like this section is mixed with policy, what kubeadm can do, and implementation specifics. I'd suggest splitting them if possible.


4. Example: The `system:nodes` ClusterRoleBinding had to lose its binding to the `system:nodes` Group when upgrading to v1.7; otherwise the Node Authorizer wouldn’t have had any effect.

5. Using kubeadm, you must upgrade the control plane atomically.
Contributor

If the upgrade failed, would kubeadm roll back the changes?


Only *control plane components* will be in-scope for self-hosting for v1.8.

Notably, not etcd or kubelet for the moment.
Contributor

How do users of kubeadm upgrade kubelets today?

Member

It's done via the package manager (yum/apt).

Member

Perhaps it would be worth linking to the expected node upgrade process or having a brief description. I was wondering how the nodes get upgraded as well.


2. Build an upgrading Operator that does the upgrading for us

* Would consume a TPR with details how to do the upgrade
Contributor

What's TPR? Couldn't find it in this proposal...

Member

It's the old name for what is now CRD.

@roberthbailey
Contributor

Thanks for the comments @yujuhong!

@roberthbailey roberthbailey self-assigned this Jul 28, 2017

What actually happens in the Static Pod -> Self-hosted transition?

First the PodSpec is modified. For instance, instead of hard-coding an IP for the API server to listen on, it is dynamically fetched from the Downward API. When it's OK to proceed, the self-hosted control plane DaemonSets are posted to the Static Pod API Server. The Static Pod API Server starts the self-hosted Pods normally. The Pods get in the Running state, but fail to bind to the port on the host and back off internally. During that time, kubeadm notices the Running state of the self-hosted Pod and deletes the Static Pod file, which causes the kubelet to stop the Static Pod-hosted component immediately. On the next run, the self-hosted component will try to bind to the port, which is now free, and will succeed. A self-hosted control plane has been created!
Contributor

It would be nice to have these steps in a numbered list

Member Author

Will do


Self-hosting implementation for v1.8

The API Server, controller manager and scheduler will all be deployed in DaemonSets with node affinity (using `nodeSelector`) to masters. The current way DaemonSet upgrades currently operate is "remove an old Pod, then add a new Pod". Since supporting “single masters” is a definite requirement, we have to either workaround removing the only replica of the scheduler/controller-manager by duplicating the existing DaemonSets (e.g. a `temp-self-hosted-kube-apiserver` DS will be created as a copy of the normal `self-hosted-kube-apiserver` DS during the upgrade) or ask for a [new upgrade strategy](https://github.com/kubernetes/kubernetes/issues/48841) (“add first, then delete”, which is generally useful) from sig-apps.
Contributor

+1 add first, then delete. That sounds much nicer than hacking around it with temporary duplicates.

Member Author

I implemented this the hacky way for now, as the UpdateStrategy didn't make v1.8.
However, we definitely want "add first, then delete" longer-term.
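
For illustration, the temporary-duplicate workaround could look roughly like this; the helper names are made up and error handling is minimal:

```go
package upgrade

import (
	"context"

	appsv1 "k8s.io/api/apps/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// createTempCopy clones an existing self-hosted DaemonSet under a "temp-" name so
// one replica keeps serving while the original DaemonSet is replaced/upgraded.
func createTempCopy(ctx context.Context, cs kubernetes.Interface, name string) (*appsv1.DaemonSet, error) {
	orig, err := cs.AppsV1().DaemonSets("kube-system").Get(ctx, name, metav1.GetOptions{})
	if err != nil {
		return nil, err
	}
	tmp := orig.DeepCopy()
	tmp.ObjectMeta = metav1.ObjectMeta{
		Name:      "temp-" + name, // e.g. temp-self-hosted-kube-apiserver
		Namespace: orig.Namespace,
	}
	return cs.AppsV1().DaemonSets("kube-system").Create(ctx, tmp, metav1.CreateOptions{})
}

// removeTempCopy deletes the temporary DaemonSet once the upgraded original is healthy.
func removeTempCopy(ctx context.Context, cs kubernetes.Interface, name string) error {
	return cs.AppsV1().DaemonSets("kube-system").Delete(ctx, "temp-"+name, metav1.DeleteOptions{})
}
```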


Checkpointing

In order to be able to reboot a self-hosted cluster (e.g. a single, self-hosted master), there has to be some kind of checkpointing mechanism. Basically the kubelet will write some state to disk for Pods that opt-in to checkpointing ("I’m running an API Server, let’s write that down so I remember it", etc.). Then if the kubelet reboots (in the single-master case for example), it will check the state store for Pods it was running before the reboot. It discovers that it ran an API Server, Controller Manager and Scheduler and starts those Pods now as well.
Contributor

Make sure that if the Pods are still running, the Kubelet doesn't create duplicates on restart. There are many reasons the Kubelet could restart without a full node reboot: OOM kill, dynamic config, etc.

Member Author

@timothysc see comment on checkpointing ^


Checkpointing

In order to be able to reboot a self-hosted cluster (e.g. a single, self-hosted master), there has to be some kind of checkpointing mechanism. Basically the kubelet will write some state to disk for Pods that opt-in to checkpointing ("I’m running an API Server, let’s write that down so I remember it", etc.). Then if the kubelet reboots (in the single-master case for example), it will check the state store for Pods it was running before the reboot. It discovers that it ran an API Server, Controller Manager and Scheduler and starts those Pods now as well.
Contributor

Yes, why wouldn't we just checkpoint all pods by default? How much disk space would this consume? Would it be small enough that we could require opt-out instead of opt-in?


In order to be able to reboot a self-hosted cluster (e.g. a single, self-hosted master), there has to be some kind of checkpointing mechanism. Basically the kubelet will write some state to disk for Pods that opt-in to checkpointing ("I’m running an API Server, let’s write that down so I remember it", etc.). Then if the kubelet reboots (in the single-master case for example), it will check the state store for Pods it was running before the reboot. It discovers that it ran an API Server, Controller Manager and Scheduler and starts those Pods now as well.

This solves the chicken-and-egg problem that would otherwise occur when the kubelet comes back up: the kubelet tries to connect to the API server, but the API server hasn't been started yet, because it is supposed to run as a Pod on that very kubelet, which isn't aware of that.
Contributor

Just to be clear, you're saying that checkpointing solves the problem of the Kubelet hosting itself in a Pod if it can't contact the API server and learn that it should?

Side question: If we have checkpoints, can we start provisioning initial checkpoints instead of static pods?


2. Upgrading to a higher minor release than the kubeadm CLI version will **not** be supported.

3. Example: kubeadm v1.8.3 can upgrade your v1.8.3 cluster to v1.8.6 if you specify `--force` at the time of the upgrade, but kubeadm can never upgrade your v1.8.3 cluster to v1.9.0
Contributor

nit: be a little more specific - s/but kubeadm can never/but kubeadm v1.8.3 can never...


3. The control plane must be upgraded before the kubelets in the cluster

5. For kubeadm, the maximum amount of skew between the control plane and the kubelets is *one minor release*.
Contributor

I think we should support two trailing versions for clients. API changes are supposed to be backwards-compatible, so this shouldn't be a problem for kubeadm. In theory.
I'm not sure whether we should try to support skipping a minor version during an upgrade though. I think people typically do this one version at a time.
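
A sketch of enforcing that skew policy with the semantic-version parser from k8s.io/apimachinery; the exact rule encoded here is just the one-minor-release example above:

```go
package upgrade

import (
	"fmt"

	"k8s.io/apimachinery/pkg/util/version"
)

// checkKubeletSkew rejects an upgrade if any kubelet would be newer than the target
// control plane version or would trail it by more than one minor release.
func checkKubeletSkew(targetControlPlane string, kubeletVersions []string) error {
	target, err := version.ParseSemantic(targetControlPlane)
	if err != nil {
		return err
	}
	for _, kv := range kubeletVersions {
		kubelet, err := version.ParseSemantic(kv)
		if err != nil {
			return err
		}
		// e.g. a v1.9.0 control plane allows kubelets at v1.8.x but not v1.7.x.
		if kubelet.Major() != target.Major() ||
			kubelet.Minor() > target.Minor() ||
			target.Minor()-kubelet.Minor() > 1 {
			return fmt.Errorf("kubelet %s is outside the supported skew for control plane %s (max skew: one minor release)", kv, targetControlPlane)
		}
	}
	return nil
}
```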

@roberthbailey
Contributor

@luxas - you mentioned on Friday that you were going to wait until Tuesday to collect feedback and then update the proposal.

Friendly ping to let us know when the PR is ready to be re-reviewed.

@luxas
Member Author

luxas commented Aug 9, 2017

@roberthbailey I changed my mind ;)
I found it more valuable to actually focus my hours on working on the code in this phase of the cycle and get dependent PRs merged.
The WIP is here: kubernetes/kubernetes#48899

Feel free to take it for a spin.
I'm gonna do the cleanup of this doc later when more of the actual dependent code is merged; otherwise it would be too tight. This proposal as-is had consensus in the SIG, and the remaining comments are minor wording changes or clarifications, so it has lower priority for me than actually shipping the code.

I hope I can get to this by the end of next week at least.

Finally, thank you everyone for commenting -- I'm sorry I'm overloaded with other things at the moment on top of the upgrades implementation; I'll answer your questions as soon as I can.

@k8s-github-robot k8s-github-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Aug 15, 2017
@roberthbailey
Contributor

@luxas - should we clean this up and get it merged (now that kubeadm supports upgrade)?

@luxas
Member Author

luxas commented Oct 10, 2017 via email

@castrojo
Member

This change is Reviewable

@luxas luxas changed the title Add kubeadm upgrades proposal [WIP] Add kubeadm upgrades proposal Oct 25, 2017
@k8s-ci-robot k8s-ci-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Oct 25, 2017
@luxas
Member Author

luxas commented Oct 25, 2017

@timothysc @roberthbailey Please read this doc up to the "Various other notes"; I need to finish the last sections there yet. Thanks!

# kubeadm upgrades proposal

Authors: Lucas Käldström & the SIG Cluster Lifecycle team
Last updated: October 2017
Contributor

Add a newline before this so that there is a line break in the markdown.

Member Author

fixed


## Abstract

This proposal describes how kubeadm will support a upgrading clusters in an user-friendly and automated way using different techniques.
Contributor

support upgrading (remove the 'a')

Member Author

oops


## Abstract

This proposal describes how kubeadm will support a upgrading clusters in an user-friendly and automated way using different techniques.
Contributor

what different techniques? shouldn't there be a single technique that we plan to use for all kubeadm clusters?

Member Author

removed the ambiguity; I mean that kubeadm can actually perform different tasks for different clusters under the hood (self-hosted vs static pod hosted), but that isn't relevant here

- Support for upgrading the API Server, the Controller Manager, the Scheduler, kube-dns & kube-proxy
- Support for performing necessary post-upgrade steps like upgrading Bootstrap Tokens that were alpha in v1.6 & v1.7 to beta ones in v1.8
- Automated e2e tests running.
- GA in v1.10:
Contributor

note that this is the plan, since 1.10 isn't out yet and kubeadm itself is still beta.

Member Author

yup, I rephrased this as "in a future version" now instead

## Graduation requirements

- Beta in v1.8:
- Support for the `kubeadm upgrade plan` and `kubeadm upgrade apply` commands.
Contributor

some bulleted lines end with periods and others don't which is strange. please make them consistent one way or the other


**Rewrite manifests completely from config object or mutate existing manifest:**

Instead of generating new manifests from a versioned configuration object, we could try to add "filters" to the existing manifests and apply different filters depending on what the upgrade looks like. This approach, modifying existing manifests in various ways depending on the version bump, has some pros (modularity, simple mutating functions), but the matrix of functions and different policies would grow just too big for this kind of system, so we voted against this alternative in favor of the solution above.
Contributor

did we actually vote? maybe say decided instead?

Member Author

right, we didn't actually vote. Thanks for the reword suggestion there.

- In kubeadm v1.4 to v1.8, this is the default way of setting up the control plane
- Running the control plane in Kubernetes-hosted containers as DaemonSets; aka. [Self-Hosting](#TODO)
- When creating a self-hosted cluster, kubeadm first creates a Static Pod-hosted cluster and then pivots to the self-hosted control plane.
- This is the default way to deploy the control plane since v1.9; but the user can opt out of it and stick with the Static Pod-hosted cluster.
Contributor

s/since/beginning with/

also replace ; with ,

Member Author

done

- Creates a **backup directory** with the prefix `/etc/kubernetes/tmp/kubeadm-backup-manifests*`.
- Q: Why `/etc/kubernetes/tmp`?
- A: Possibly not very likely, but we concluded that there may be an attack area for computers where `/tmp` is shared and writable by all users.
We wouldn't want anyone to mock with the new Static Pod manifests being applied to the clusters. Hence we chose `/etc/kubernetes/tmp`, which is
Contributor

replace mock with muck or mess

Member Author

hehe, typo ;)

- A: Possibly not very likely, but we concluded that there may be an attack area for computers where `/tmp` is shared and writable by all users.
We wouldn't want anyone to mock with the new Static Pod manifests being applied to the clusters. Hence we chose `/etc/kubernetes/tmp`, which is
root-only owned.
- In a loop for all control plane components:
Contributor

does the order of the control plane components matter?

Member Author

added a comment that order shouldn't matter, but we do apiserver, ctlr-mgr, sched
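
As an illustration of the backup-then-replace loop described above (paths mirror the text; the health wait and rollback are only noted in comments):

```go
package upgrade

import (
	"fmt"
	"os"
	"path/filepath"
)

var components = []string{"kube-apiserver", "kube-controller-manager", "kube-scheduler"}

// upgradeStaticPodManifests backs up each existing manifest into a root-only
// directory under /etc/kubernetes/tmp before writing the upgraded manifest.
func upgradeStaticPodManifests(newManifests map[string][]byte) error {
	if err := os.MkdirAll("/etc/kubernetes/tmp", 0700); err != nil {
		return err
	}
	backupDir, err := os.MkdirTemp("/etc/kubernetes/tmp", "kubeadm-backup-manifests")
	if err != nil {
		return err
	}
	for _, component := range components { // order shouldn't matter (see comment above)
		current := filepath.Join("/etc/kubernetes/manifests", component+".yaml")
		backup := filepath.Join(backupDir, component+".yaml")
		if err := os.Rename(current, backup); err != nil {
			return err
		}
		if err := os.WriteFile(current, newManifests[component], 0600); err != nil {
			return fmt.Errorf("writing upgraded %s manifest: %v", component, err)
		}
		// A real implementation would now wait for the component to become
		// healthy again and restore the manifests from backupDir on failure.
	}
	return nil
}
```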

For instance, if the scheduler doesn't come up cleanly, kubeadm will roll back the previously (and successfully) upgraded API server and controller manager manifests as well as the scheduler manifest.

#### Self-hosted control plane
Contributor

It looks like this section isn't finished, so I'm going to stop reviewing here.

Member Author

Yeah, thanks for the review so far!

@roberthbailey
Contributor

Please poke when the remainder of the doc is finished so I can hopefully just do one more pass.

@0xmichalis
Contributor

@luxas @roberthbailey should this doc move inside https://github.com/kubernetes/community/tree/master/contributors/design-proposals/cluster-lifecycle?

@0xmichalis 0xmichalis added the sig/cluster-lifecycle Categorizes an issue or PR as relevant to SIG Cluster Lifecycle. label Nov 2, 2017
@luxas
Member Author

luxas commented Nov 4, 2017

@Kargakis yes indeed. I initially filed this PR before we did that, will update on my next round here.

@fejta fejta added lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. and removed lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. keep-open labels Dec 15, 2017
@kubernetes kubernetes deleted a comment from k8s-github-robot Dec 15, 2017
@k8s-github-robot k8s-github-robot added the kind/design Categorizes issue or PR as related to design. label Feb 6, 2018
@roberthbailey
Contributor

This PR has been idle coming up on 2 years now, so I think it should be closed.

/close

@k8s-ci-robot
Contributor

@roberthbailey: Closed this PR.

In response to this:

This PR has been idle coming up on 2 years now, so I think it should be closed.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

danehans pushed a commit to danehans/community that referenced this pull request Jul 18, 2023