Add "DaemonSet create-first rolling update" design proposal. #977
Conversation
This document captures the results of the discussion in kubernetes/kubernetes#48841. Relevant feature: kubernetes/enhancements#373
then a rolling update of the controller-manager would delete all the old pods
before the new pods were scheduled, leading to a control plane outage.

This feature is a blocker for supporting self-hosted Kubernetes in `kubeadm`.
Is it a blocker with an HA master setup and maxUnavailable set to 1?

This doesn't affect HA deployments, given that all components talk to a load-balanced endpoint of some kind.
@diegs We could point out that this is for the one-master case only, but that it is a generally useful feature, not a kubeadm specific thing.
Done
### Considerations / questions

1. How are `hostPort`s handled?
Aren't hostPorts likely to be common in self-hosted scenarios (one of the driving use cases for this proposal)?
This proposal is simply incompatible with hostPorts, barring some really clever work that isn't on any roadmap.
@liggitt In kubeadm's case, no. We're using DaemonSets with hostNetworking (due to another scheduling bug).
We're using DaemonSets with hostNetworking
If you are using host networking, won't your new containers still be unable to bind to the same ports (since they'd still be held by the old container), never go healthy, and fail the upgrade?

SO_REUSEPORT?
What I alluded to with "really clever work" would be to emulate that in the port-forwarding path, but that sounds disastrous.
I would argue that special cases for hostPorts, hostNetworking, and other scarce resources like GPUs should not be a goal here. In particular, the behavior you describe is also true of Deployments.
For example (and I tested this): create a Deployment where replicas = number of worker nodes, and use a hostPort. Set maxUnavailable to 0, and then update your Deployment. The new pods will be stuck in Pending since the scheduler can't place them anywhere.
This behavior is also true for DaemonSets as you point out (updated in the doc). And yes, I can see how DaemonSets may more often use hostPorts / hostNetworking in practice, but I think the scheduler's job is to place pods subject to the intersection of their resource constraints, not do resource planning for you.
We should absolutely call out this risk in the documentation. But I don't think that it should be this code's job to try to foresee potential resource conflicts.
I basically agree. I don't think it is tractable to try to fix this special case.
This should be documented and highlighted. The documentation can point the user to use the "delete before add" model.
Also, is it possible to test for a pod having a hostPath and refuse to do the "add before delete" option? (Update: it is noted in the document on line 136.)
I will update the documentation once we come to a decision about what (if any) constraints will be checked.
I think the two proposed approaches still in contention are:
1. Add very visible documentation saying that certain classes of resource types (such as host networking, host ports, etc.) may lead to unschedulable situations when using this strategy.
2. Add documentation as in option 1, but also do a hard validation check to disallow using this strategy in conjunction with hostPorts specifically.
I am becoming convinced that option 2 is somewhat reasonable since the DaemonSet controller has an explicit scheduling check for hostPort contention. But I would not be in favor of adding checks for any other resource types, at least with the initial implementation.
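To make option 2 concrete, here is a minimal sketch of what such a check could look like, written in the style of the existing DaemonSet validation code but assuming a hypothetical `SurgingRollingUpdate` strategy type; the names are illustrative, not part of the proposal itself.

```go
package validation

import (
	"k8s.io/apimachinery/pkg/util/validation/field"
	extensions "k8s.io/kubernetes/pkg/apis/extensions"
)

// validateSurgingRollingUpdate is a hypothetical helper that rejects the
// SurgingRollingUpdate strategy whenever the pod template declares a hostPort,
// since a surged second pod could never bind that port on the same node.
func validateSurgingRollingUpdate(ds *extensions.DaemonSet) field.ErrorList {
	allErrs := field.ErrorList{}
	if ds.Spec.UpdateStrategy.Type != "SurgingRollingUpdate" { // assumed strategy type
		return allErrs
	}
	containers := ds.Spec.Template.Spec.Containers
	for i := range containers {
		for j, port := range containers[i].Ports {
			if port.HostPort != 0 {
				path := field.NewPath("spec", "template", "spec", "containers").
					Index(i).Child("ports").Index(j).Child("hostPort")
				allErrs = append(allErrs, field.Forbidden(path,
					"hostPort cannot be used with the SurgingRollingUpdate strategy"))
			}
		}
	}
	return allErrs
}
```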
1. How are `hostPort`s handled?

They are not handled as part of this proposal. We can either:
Allowing a non-working configuration is a bad idea. I think option 3 should be implemented.
As I mentioned above, you can reach this state with Deployments too. I'll defer on picking an answer to this question until we resolve that discussion.
### Alternatives considered

The `maxSurge` parameter could have been added to the existing `RollingUpdate`
strategy (as is the case for Deployments). However, this would break backwards
break validation backwards compatibility

Please clarify that while all this would have been technically possible due to the shift to v1beta2, it was discarded just to be on the safe side.
Done
They are not handled as part of this proposal. We can either:

1. Attempt to determine a mechanism to handle this (unclear)
Point out that this may be a future improvement, but it will not be taken into consideration right now.
Possibly for GA?
2. Note in the documentation that this may cause port conflicts that prevent
   new pods from scheduling successfully (and therefore updates from completing
   successfully)
3. Add a validation check that rejects the combination of `hostPort` and the
👍 Please point this out as the accepted proposal.

Host ports are only one possible scarce resource (albeit the most likely). Validating just that still allows for impossible-to-satisfy updates involving scarce resources, which is worse for DaemonSets than for Deployments because there is no possibility of finding another node with the available resources.

As far as I remember, the DaemonSet controller checks only for hostPort conflicts. It will not allow the new pod to run on the node, and that will cause a deadlock.
TODO(diegs): investigate this.

3. How are other shared resources (e.g. local devices, GPUs, specific core
How do deployments handle this?
They rely on the scheduler to locate nodes with available resources.
Agreed, and if you think of DaemonSets as a special case of a Deployment with numReplicas = numNodes and antiaffinity then the scheduling will degenerate in the same way if you are using scarce resources.
It would require a lot of special-case engineering to work around all possible issues, and I'm sure there will always be more.
pods first, cleaning up the old pods once the new ones are running. As with
Deployments, users will use a `maxSurge` parameter to control how many nodes
are allowed to have >1 pod running during a rolling update.
Please add the modifications that you intend to make to the API. Generally, we add the types and type modifications that we intend to make as part of the design. The design docs for StatefulSet and DaemonSet RollingUpdate are examples.
Done
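To give readers of this thread a sense of the shape being discussed, here is a rough sketch of the kind of type addition the doc describes, assuming the new strategy lands next to the existing `RollingUpdateDaemonSet` type; the field and type names are illustrative, not the final API.

```go
import "k8s.io/apimachinery/pkg/util/intstr"

// DaemonSetUpdateStrategyType enumerates the update strategies; the last
// constant is the hypothetical addition this proposal discusses.
type DaemonSetUpdateStrategyType string

const (
	OnDeleteDaemonSetStrategyType             DaemonSetUpdateStrategyType = "OnDelete"
	RollingUpdateDaemonSetStrategyType        DaemonSetUpdateStrategyType = "RollingUpdate"
	SurgingRollingUpdateDaemonSetStrategyType DaemonSetUpdateStrategyType = "SurgingRollingUpdate"
)

// DaemonSetUpdateStrategy sketched with the proposed create-first variant.
// RollingUpdateDaemonSet is the existing delete-first configuration type.
type DaemonSetUpdateStrategy struct {
	Type DaemonSetUpdateStrategyType `json:"type,omitempty"`

	// RollingUpdate holds the existing maxUnavailable-based configuration.
	RollingUpdate *RollingUpdateDaemonSet `json:"rollingUpdate,omitempty"`

	// SurgingRollingUpdate holds the proposed create-first configuration.
	SurgingRollingUpdate *SurgingRollingUpdateDaemonSet `json:"surgingRollingUpdate,omitempty"`
}

// SurgingRollingUpdateDaemonSet mirrors a Deployment's maxSurge, but per node:
// at most maxSurge nodes may run two generations of the pod at once.
type SurgingRollingUpdateDaemonSet struct {
	// MaxSurge defaults to 1 and must be greater than 0 (per the discussion below).
	MaxSurge *intstr.IntOrString `json:"maxSurge,omitempty"`
}
```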
`maxUnavailable` to be > 0, but this is not required with
`SurgingRollingUpdate`.

There are plans to create a `v1beta2` iteration of the extensions API, so it
The v1beta2 API is enabled by default on master. We want to promote that surface to GA in 1.9 provided we get good feedback on the v1beta2 surface in 1.8.
Noted
may be possible to add the `maxSurge` strategy to `RollingUpdate` and deprecate
`SurgingRollingUpdate` then.

For more background see https://github.com/kubernetes/kubernetes/issues/48841.
Why is your proposal a better alternative to creating a new DaemonSet and only deleting the current DaemonSet after the first has saturated the desired number of Nodes? Can you discuss this more here?
Can you elaborate on this strategy? Do you mean having the user create a new DaemonSet with a different name, and then deleting this one?
If so I can think of many reasons (retaining history, minimizing clutter, actually being able to introspect the status of the update, etc.) but I want to make sure I understand which you mean.
Agree. Multiple DaemonSets should be listed as an alternative here, too.
### Considerations / questions

1. How are `hostPort`s handled?
I don't think it's acceptable to completely punt on hostPort conflicts. They are the most common way of communicating with applications in a DaemonSet. Using validation to block the creation of DaemonSets with this updateStrategy (as you suggest below) is one option, but it really limits the usefulness of the updateStrategy.
See above.
Agree, hostPort conflicts need to be handled for SurgingRollingUpdate to be useful.
@janetkuo handled in what sense? Can you comment on the discussion immediately above this one? We could block it in validation (since I guess the code already has a special case to block it from a scheduling perspective), but besides that I'm not sure what else we can do.
3. Add a validation check that rejects the combination of `hostPort` and the
   `SurgingRollingUpdate` strategy.

2. How will the scheduler handle hostPort collisions?
Unless application designers implement clever coordination, it will attempt to schedule both Pods, and one of them will continuously fail.
It will not schedule the new pod. https://github.com/kubernetes/kubernetes/blob/master/pkg/controller/daemon/daemon_controller.go#L1253
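For readers who don't want to chase the link, the referenced check is essentially a per-node hostPort conflict test; a simplified sketch of the idea (not the controller's actual code) follows.

```go
import v1 "k8s.io/api/core/v1"

// hostPortsConflict reports whether the candidate pod asks for a hostPort that
// is already claimed by a pod on the node. This is roughly the predicate the
// DaemonSet controller applies before deciding a pod can run on a node.
func hostPortsConflict(candidate *v1.Pod, podsOnNode []*v1.Pod) bool {
	used := map[int32]bool{}
	for _, p := range podsOnNode {
		for _, c := range p.Spec.Containers {
			for _, port := range c.Ports {
				if port.HostPort != 0 {
					used[port.HostPort] = true
				}
			}
		}
	}
	for _, c := range candidate.Spec.Containers {
		for _, port := range c.Ports {
			if port.HostPort != 0 && used[port.HostPort] {
				return true
			}
		}
	}
	return false
}
```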
Is that true for the default scheduler as well (assuming we move to it one day)?
I think that with the default scheduler the pod will be created and will stay in the Pending state.
That is the behavior I observed.
`RollingUpdate` strategy only deletes pods up to the `maxUnavailable` threshold
and then waits for the main reconciler to create new pods to replace them).

One minor complication is that the controller-manager's reconciliation loop

This isn't actually true. One Pod per node is best effort. In contrast to this proposal, which would relax this constraint, the community has requested exactly one Pod per node to provide better guarantees with respect to the overlap of lifecycles of the Pods in their DaemonSets. Right now it is possible for DaemonSet to schedule a new Pod while another Pod is still terminating. We consider this to be a bug (kubernetes/kubernetes#50477).
Clarified that the invariant is not 100% enforced yet but it is a goal.
This strategy is incompatible with the invariant, so either we need to relax it during updates as mentioned, or not implement this.
That bug is unexpected and was recently introduced by kubernetes/kubernetes#43077 (1.8). This behavior (don't create new pods until the old one is terminated) has been supported by DaemonSet controller for a long time.
3. How are other shared resources (e.g. local devices, GPUs, specific core
   types) handled?

This is a generalization of the `hostPort` question. I think that in this case
I think that consuming GPU and specific core types can be considered out of scope. It is not a common use case for DaemonSets. For hostPath volumes, you could perhaps advise using filesystem locks to prevent data corruption when applications use this update strategy. If you combine filesystem locks, a burden on application implementors (at least initially), with a robust solution to hostPort conflicts, you have something that most applications can consume.
This feature is a blocker for supporting self-hosted Kubernetes in `kubeadm`.

### Implementation plan
There are a few things which aren't clear to me.
- When can I select this strategy? Does it have to be set on creation? Can I change the updateStrategy from RollingUpdate to SurgingRollingUpdate (currently updateStrategy is mutable)?
- How will the controller handle the case where the updateStrategy is mutated to SurgingRollingUpdate when RollingUpdate was previously selected and an update is in progress?
- How will the controller handle the case where the updateStrategy is mutated to RollingUpdate when SurgingRollingUpdate was previously selected and an update is in progress?
- Is this on or off by default? Do you intend to use feature gate flags to make this an alpha feature, or will all clusters immediately get this feature?
Added answers in the "Considerations / questions" section.
Regarding alpha/beta/feature gate, I defer to @luxas and others.
`maxUnavailable` to be > 0, but this is not required with
`SurgingRollingUpdate`.

There are plans to create a `v1beta2` iteration of the extensions API, so it
"There are plans to create a v1beta2 iteration of the extensions API"

Do you mean v1beta2 of the apps API group? The goal is to eliminate the extensions API group, not perpetuate it.
Fixed
Yes, it should be apps/v1beta2. Since apps/v1beta2 is introduced in 1.8, this validation change can be made in apps/v1beta2 only if it's implemented in 1.8.
Addressed comments and updated doc.
I used a new commit so we wouldn't lose the comment threads. I can squash before I submit.
//---
// TODO: Update this to follow our convention for oneOf, whatever we decide it
// to be. Same as Deployment `strategy.rollingUpdate`.
// See https://github.com/kubernetes/kubernetes/issues/35345
Fix indentation?
Done
strategy.

This document presents a design for a new "add first, then delete" rolling
update strategy, called `SurgingRollingUpdate`. This strategy will create new
If this new DaemonSet update strategy is similar to Deployment's RollingUpdate (i.e. MaxSurge & MaxUnavailability), it can support both "delete first" and "add first" (depending on whether MaxUnavailability is 0 or not).
This is noted in Alternatives Considered below. It was decided in kubernetes/kubernetes#48841 to add a new strategy so validation would remain backwards-compatible.
I think it's a great idea to converge on strategies across all controllers. We could add MaxSurge and MaxUnavailability and keep the default as MaxSurge=0 and MaxUnavailability as 1 to keep it backward compatible. @diegs how does this break validation?
2. Determine how many previous-generation pods can be retired once their
   replacement pods are scheduled and running.
3. Instruct the controller-manager to add and delete pods based on the results
   of (1) and (2).
SurgingRollingUpdate will create a new DaemonSet pod before the old one gets terminated on the same node, correct?
For example, let's say you have 3 nodes, maxSurge=1, maxUnavailable=0. Node 1 has 2 pods, and each of the other 2 nodes has 1 pod (4 pods in total). Then you can only delete pods on node 1, but not on the other two.
If this is the requirement, this is not just about pods anymore; you need to mention nodes here.
Correct. Clarified.
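A tiny sketch of the per-node bookkeeping this implies, under the assumption that an old pod may only be retired on a node whose new-generation pod is already running and ready; the function names are illustrative, not the controller's actual code.

```go
import v1 "k8s.io/api/core/v1"

// podIsReady checks the PodReady condition.
func podIsReady(pod *v1.Pod) bool {
	for _, c := range pod.Status.Conditions {
		if c.Type == v1.PodReady {
			return c.Status == v1.ConditionTrue
		}
	}
	return false
}

// oldPodsEligibleForRetirement returns old-generation pods that may be deleted:
// only those on nodes where a new-generation pod is already running and ready,
// so no node ever drops below one working daemon pod during the update.
func oldPodsEligibleForRetirement(nodeToPods map[string][]*v1.Pod, isNewGeneration func(*v1.Pod) bool) []*v1.Pod {
	var eligible []*v1.Pod
	for _, pods := range nodeToPods {
		newReady := false
		var old []*v1.Pod
		for _, pod := range pods {
			if isNewGeneration(pod) {
				if pod.Status.Phase == v1.PodRunning && podIsReady(pod) {
					newReady = true
				}
			} else {
				old = append(old, pod)
			}
		}
		if newReady {
			eligible = append(eligible, old...)
		}
	}
	return eligible
}
```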
strategy (as is the case for Deployments). However, this would break backwards
compatibility, since the `maxUnavailable` parameter currently requires
`maxUnavailable` to be > 0, but this is not required with
`SurgingRollingUpdate`.
Also, most DaemonSet applications can't tolerate two pods running at the same time. That's why RollingUpdate is implemented that way (delete first, then create).
This document presents a design for a new "add first, then delete" rolling
update strategy, called `SurgingRollingUpdate`. This strategy will create new
pods first, cleaning up the old pods once the new ones are running. As with
Deployments, users will use a `maxSurge` parameter to control how many nodes
What happens if maxSurge is zero? Would that interrupt the algorithm in this paper? Also, what is the default value for maxSurge?
The default is 1, and it must be greater than 0. Updated the doc to say as much.
2. Note in the documentation that this may cause port conflicts that prevent
   new pods from scheduling successfully (and therefore updates from
   completing successfully)
3. Add a validation check that rejects the combination of `hostPort` and the
👍
To expand a bit: every pod has a phase change on deletion:

ReplicaSets create pods at 2 always because they are replicas (we provide "at-least 1" semantics). StatefulSets create pods at step 4 because they are strongly tied to pod identity (we provide "at-most 1" semantics). DaemonSets are in the middle - they are not replicas, but they are not pod identity. Instead, they are tied to node-identity.

The real decision that a user has to make is whether they can tolerate at-most-1 or at-least-1. The issue triggered by kubernetes/kubernetes#50477 was a user who cared about at-most-1. This issue was triggered by a user who needs at-least-1. Both are valid use cases.

The behavior in a distributed system that is required to guarantee at-most-1 is pretty aggressive - I don't think we currently guarantee that with daemonsets (for instance, just run two controller managers at the same time and they will race to create multiple pods on the node). So the user in 50477 is almost certainly in for a rude awakening at some point when they depend on at most one and we don't provide it. In a sense, that's a violation of #124 and we should fix that as soon as we can.

I think we probably need to either commit to giving the users a choice, or to EXPLICITLY always be at-least-1 so that users aren't surprised when we pretend to be at-most-1 but then fail to do so. Rule of thumb is that pretending to be at-most-1 is the worst thing you can do - there is no "try".

The upgrade strategy is subordinate to this choice - a rolling update strategy has to respect at most 1 or at least 1, but is probably free to do the right thing as long as the guarantee isn't violated. Any workload that supports at-least-1 probably wants the surge behavior on rolling update, no matter what.
Yes, I was under the assumption that no part of k8s guaranteed at-most-one.

I also agree that it's orthogonal to the strategy.

Going back to this proposal, the

EDIT to clarify my last point above: With the existing strategy the scheduler first sets a pod to terminating. Then it schedules a new pod on that node. Even with a long timeout, something could happen between those two steps in which the old pod terminates. If that pod was the only controller manager then there is nothing left to schedule the new pod.

In the new strategy the new pod would be scheduled before setting the old one to terminating. Even if the new pod doesn't come up before the old one completely terminates, it should eventually (it comes down to the node at that point). It's not a guarantee but it removes a major point of failure.
If we decide that daemonsets should be at-least-1, then there is no real need for the existing update strategy to be doing what it is doing today. In which case surge should be part of the existing strategy like ReplicaSets and we don't need or want a new strategy, we just want rolling update to do the right thing. If we are going to support at-most-1 we need to decide on that before we go to v1.

I would argue anyone can implement at-most-one on a node by using a shared filesystem lock. While that requires a shared hostPath volume, it's a common daemonset use case. Therefore, that's an argument that we should be at-least-one because we support a workaround.

What's clear is that this discussion blocks taking daemonset to GA. We need to follow up and ensure we reach closure on something this fundamental before going GA.
@smarterclayton agree that this could be implemented by adding maxSurge to the existing strategy.

The underlying motivation here was to achieve parity with the Deployment rolling update strategy, which has both params.
Two clarifications:
The best thing now is for interested parties to make a case for either at-most-1 as a feature or as a supported workaround as Joe noted. That will clarify next steps - reverting / altering 50477, an API change to DS to let a choice be possible, or delaying DS GA until we can make at-most-1 correct.
FWIW kubernetes/kubernetes#50477 is a regression (caused by kubernetes/kubernetes#43077). DaemonSet controller had

StatefulSet uses index as part of pod name to guarantee at-most-1 pod per index. A possible way to guarantee at-most-1 pod per node for DaemonSet is to use node name as part of DaemonSet pod's name, but the pod name might become too long.
I don't think this decision should gate v1. DaemonSet controller was always designed to attempt at-most-one semantics, but I believe we could loosen or tighten this constraint as a feature, post GA, without breaking backwards compatibility.

For at most one:

For at least one:

Also, we should not revert #50477. This reverted a change that introduced a bug where the DaemonSet controller would launch Pods while a current Pod was still terminating. Reverting this will not implement maxSurge behavior.
Looking at the code we'd need to tighten at least a few things to guarantee that (as you mentioned with node name). But I am a -1 on declaring something GA that has even the potential to violate a user assumption or violate pod safety guarantees, and that's where we are today. My working assumption was that daemonset is at-least-one because we are using generateName, and you can't use generateName and get at-most-one. The statement that it's at-most-one is surprising to me.
at-most-1 or at-least-1 is not about update strategy. It's a spec property independent of update. In the absence of an update during a drain or eviction the daemonset should follow the same property. It's fundamental to daemonset, the same way pod management policy applies regardless of update strategy.
Agree, but not-quite at-most-1 is also not the right thing to ship. By revert I meant, "don't almost give at-most-1 as the user who requested #50477 raised".
I think cluster-lifecycle's position is that it's actually not what they need, and other issues (cursory search includes kubernetes/kubernetes#51974 and kubernetes/kubernetes#48307) make it less than clear. Looking at my own use cases, I have a small number that want at-most-one and a much larger number that want at-least-one for rolling update.
I don't think this is the right mindset for providing an API to users based on our previous history here on statefulsets and deployments. Providing an ALMOST correct guarantee creates a scenario where users expect it to work except when it doesn't. That means it fails at the worst time and either eats data or blows up a system. Either we implement at-most-1, or we implement at-least-1 and tell people how to get at-most-1. We've never shied away from doing the right thing for workload APIs before, not sure why we should do it here. I'd probably be ok with a v1 that did at-least-1 that could be turned to almost-at-most-1 with caveats and warnings, where in the future we'd implement correct at-most-1. I can't reconcile shipping almost-at-most-1 only given #124 and history with statefulset.
Quick comment: Any application that really needs at most one on a node (e.g., because multiple instances would otherwise trash persistent data on local disk) absolutely has to use a file lock or other OS locking mechanism. We should document that. K8s cannot and should not provide application-level exclusion/handshaking.
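A minimal illustration of the kind of OS-level lock being recommended here, assuming the daemon mounts a hostPath volume shared by all of its generations on the node; the lock path is an arbitrary example, not something the proposal prescribes.

```go
package main

import (
	"log"
	"os"
	"syscall"
)

// acquireNodeLock takes an exclusive, non-blocking flock on a file that lives
// on a hostPath volume shared by all pods of the daemon on this node. If a
// previous-generation pod still holds the lock, the new pod fails fast instead
// of corrupting shared on-disk state.
func acquireNodeLock(path string) (*os.File, error) {
	f, err := os.OpenFile(path, os.O_CREATE|os.O_RDWR, 0644)
	if err != nil {
		return nil, err
	}
	if err := syscall.Flock(int(f.Fd()), syscall.LOCK_EX|syscall.LOCK_NB); err != nil {
		f.Close()
		return nil, err
	}
	return f, nil
}

func main() {
	lock, err := acquireNodeLock("/var/lib/mydaemon/lock") // hostPath-backed, illustrative path
	if err != nil {
		log.Fatalf("another instance appears to hold the node lock: %v", err)
	}
	defer lock.Close()
	// ... daemon work proceeds while the flock is held ...
}
```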
This is not at-most-one (to be clear for other participants in this thread) because any interruption or race in the controller can and will spawn two containers. This is "best effort at most one".
I don't follow. Nodes don't release resources until a pod is stopped, so a new pod launched by the daemonset to replace an evicted pod will be held up by resource release in the kubelet (or should, since this is a condition only the node can validate).
There is a reasonable expectation among a wide range of kubernetes users I have interacted with that kubernetes should work to avoid application downtime by being aggressive about scale up. Users have the same expectation with daemon sets.
Port conflicts are irrelevant since they can be avoided by the user omitting the port definition on the pod. A user today can remove the port conflict and do port handoff themselves with SO_REUSEPORT, but the current behavior of the daemonset prevents that. I don't know of any other conflict that impacts normal users, and omitting the pod's port definition is not burdensome to users if they can achieve availability. So the blocker is still the daemonset controller imposing at-most-one-ish.
The point though is that giving the appearance of a robust solution deters users from being actually robust which leads to application failure. We have documented and agreed on not violating user expectation previously.
Since the self-hosting folks are saying there's a gap, this seems a weak argument. V1 is EXACTLY when we should be confident it works. A perceived pressure to declare workload APIs v1 means very little to me if the semantics will lead to failure. I don't care whether workload v1 hits in 1.9 or 1.10, I care very much whether my production users are justified in their trust that we do the right thing for them.
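To illustrate the SO_REUSEPORT handoff mentioned above: a sketch of a host-network daemon that sets the option before binding, so an old and a new pod can briefly share the same port during a surge; the port number is an arbitrary example.

```go
package main

import (
	"context"
	"log"
	"net"
	"net/http"
	"syscall"

	"golang.org/x/sys/unix"
)

func main() {
	// ListenConfig.Control runs on the raw fd before bind(), which is where
	// SO_REUSEPORT has to be set for two processes to bind the same port.
	lc := net.ListenConfig{
		Control: func(network, address string, c syscall.RawConn) error {
			var sockErr error
			if err := c.Control(func(fd uintptr) {
				sockErr = unix.SetsockoptInt(int(fd), unix.SOL_SOCKET, unix.SO_REUSEPORT, 1)
			}); err != nil {
				return err
			}
			return sockErr
		},
	}

	ln, err := lc.Listen(context.Background(), "tcp", ":10249") // illustrative port
	if err != nil {
		log.Fatal(err)
	}
	log.Fatal(http.Serve(ln, http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.Write([]byte("ok\n"))
	})))
}
```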
Correct, to be clear, as stated above, it's just about respecting termination grace.

If we create another Pod on the same node while the current Pod is still modifying some host resource, say the file system, as part of its termination, and the new Pod attempts to utilize that resource, it is surprising to users, as in #50477.

The issue is that when users run a DaemonSet that provides a shared resource on the node that is utilized by many Pods (e.g. log shipping, distributed file system, metrics collection), the current recommended way of communicating with the Pod is via a well known host port. Many applications (e.g. HDFS) don't allow the user to configure SO_REUSEPORT. This means that updating the Pods in the DaemonSet is inherently disruptive to all of its client Pods. Local services would provide a layer of abstraction that allows a surge update strategy to be applied in such a way that it is not disruptive to the node local client Pods.
Agree that we need to clearly document the behavior and limitations, and if that documentation is presently lacking, no matter what else we do, we need to correct it for users of the current implementation, but I don't see how promoting what is currently implemented, documented, and in use on production clusters can violate the users expectations.
I strongly agree with this. Which is why I think we should not modify the existing semantics that have been in place since the original design, or ship something drastically different than the DaemonSet that we use for our addons and that users have come to rely on in their production systems.
After talking to the self hosting folks, they indicated they can achieve their objective by creating a new DaemonSet prior to deleting the previous one. This is the way that many users achieve blue-green for DaemonSets and StatefulSets today. Given that those users have a path forward, and that we can add this feature later to further improve their experience, I think we should prioritize providing our users, and our own cluster addons, stability over new features.
I agree that user surprise should be avoided, and I might even agree that the correct default for DS is at-most-one. On the other hand, we often slap people in Kubernetes hard the first time they use a particular controller specifically so that they know what they can rely on (the hard stop on graceful termination, the crash loop, the liveness check, replica scale out) up front without finding out later. And still, the number of users that assumed that replica set with scale = 1 means "at most one" is terrifying, but all of them were at least guided to the correct knowledge when their databases didn't scale up because the old pod was tearing down, or they saw two pods taking traffic. If we had a really clear path to at-most-one correctly in the near term that wouldn't break backward compatibility and we were positive wouldn't require API changes to achieve, I'm less worried. I'd like to spend our time trying to figure out how that can be provided.
Yeah, I'm just mostly thinking of things that use host port networking or that provide services (i.e. kube-proxy, dns, sdn) and can't rely on that.
If the pod watch is delayed and a resync is fired, it looks like the daemonset controller can create two pods on a node (from a cursory look and previous controller experience). Is that incorrect? Delayed pod watches have occurred in the last several etcd versions due to various bugs, and we have stomped several bugs in caches where they can also occur (beyond just normal distributed systems). If I have a pod that I have written that assumes that only one pod runs on a node at a time, nothing in Kube today pushes me to use filesystem locking because I'm trusting kube.
I'm fine with that. I'm still concerned about the previous point.
It looks like all that protects the daemonset from creating multiple pods on a node is the expectation system, which is single process based.
@smarterclayton we discussed this in SIG Apps. The advice at this point was to take it to SIG Architecture in office hours this week. Would be nice to get some unblocking decisions there.

Spawned the last part (daemonset not being resistant to controller restart and slow caches) as kubernetes/kubernetes#54096 and summarized the discussion here. Otherwise, this is on hold for 1.9 given sig-cluster-lifecycle having an alternative (correct?).
@smarterclayton yes, we discussed in sig-cluster-lifecycle today and the plan is to move forward with an alternative.

Personally, I'm in agreement that coming to a conclusion on the supported semantics of daemonsets should precede this discussion; once that is done then it would be nice to see if there is room for a maxSurge option. Once that is settled, other issues brought up here (scarce resources such as hostPorts, etc.) can be addressed.
### Considerations / questions

1. How are `hostPort`s handled?
Since such a use case brought me here, I'm wondering if my use case and similar ones could be solved by using a special form of NodePort that doesn't redirect to any pod but only to pods on the same node. This would avoid having to use hostPort in many cases where all you want is one pod per node that is reachable on a well-known port.
I know this is beyond the scope of this proposal, but knowing what a solution to this hostPort scenario would likely look like in the future might influence the design here.
In case you didn't know about these: in 1.5 and newer you can use the `service.beta.kubernetes.io/external-traffic=OnlyLocal` annotation, and it seems like in newer versions you can do `kubectl patch svc nodeport -p '{"spec":{"externalTrafficPolicy":"Local"}}'` to only allow local traffic.
This PR hasn't been active in 30 days. It will be closed in 59 days (Feb 10, 2018). cc @diegs @janetkuo @liggitt @smarterclayton You can add the 'keep-open' label to prevent this from happening, or add a comment to keep it open another 90 days.

Closing this because I'm no longer working on it and I'm not aware of what the expected future will be. We can re-open or re-visit at a later time if needed.
Why is this being abandoned? Seems like adding maxSurge would be ideal and at parity with Deployments.
I would like to know what should be done to put that suggestion back on the table. As for concrete user examples, I would like to suggest this. In a matter of HA for such an agent, it might even be desirable to have 2 agents on a node constantly, in order to mitigate that agent's unavailability or crash.
We have another use case of running an ingress controller as a DaemonSet on a separate set of nodes. Version updates should not result in a scenario where a node does not have a single pod running, but currently they do.
cc @kow3ns @erictune @roberthbailey @janetkuo @luxas @lpabon @aaronlevy @kubernetes/sig-cluster-lifecycle-feature-requests @kubernetes/sig-apps-feature-requests