Add "DaemonSet create-first rolling update" design proposal. #977

Closed
wants to merge 4 commits

Conversation

@diegs diegs commented Aug 24, 2017

This document captures the results of the discussion here:

kubernetes/kubernetes#48841

Relevant feature:

kubernetes/enhancements#373

cc @kow3ns @erictune @roberthbailey @janetkuo @luxas @lpabon @aaronlevy @kubernetes/sig-cluster-lifecycle-feature-requests @kubernetes/sig-apps-feature-requests

@k8s-ci-robot added the cncf-cla: yes, sig/cluster-lifecycle, kind/feature, and sig/apps labels on Aug 24, 2017

then a rolling update of the controller-manager would delete all the old pods
before the new pods were scheduled, leading to a control plane outage.

This feature is a blocker for supporting self-hosted Kubernetes in `kubeadm`.
Member

Is it a blocker with an HA master node setup with maxUnavailable set to 1?

Member

This doesn't affect HA deployments, provided all components talk to a load-balanced endpoint of some kind.

Member

@diegs We could point out that this is for the one-master case only, but that it is a generally useful feature, not a kubeadm specific thing.

Author

Done


### Considerations / questions

1. How are `hostPort`s handled?
Member

aren't hostPorts likely to be common in self-hosted scenarios (one of the driving use cases for this proposal)?

Member

This proposal is simply incompatible with hostPorts, barring some really clever work that isn't on any roadmap.

Member

@liggitt In kubeadm's case, no. We're using DaemonSets with hostNetworking (due to another scheduling bug).

Member

We're using DaemonSets with hostNetworking

if you are using host networking, won't your new containers still be unable to bind to the same ports (since they'd still be held by the old container), and never go healthy, and fail upgrade?

Member

SO_REUSEPORT ?

What I alluded to with "really clever work" would be to emulate that in the port-forwarding path, but that sounds disastrous.
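
For readers unfamiliar with the hand-off being alluded to here, below is a minimal Linux-only Go sketch of a listener that sets SO_REUSEPORT (via golang.org/x/sys/unix), which is one way an application running with host networking could let an old and a new pod bind the same port during a surge. This is only an illustration of the idea, not part of the proposal, and the port number is arbitrary.

```go
package main

import (
	"context"
	"fmt"
	"net"
	"net/http"
	"syscall"

	"golang.org/x/sys/unix"
)

// reusePortListener opens a TCP listener with SO_REUSEPORT set, so that a
// second process (for example, the surged replacement pod on the same node
// using host networking) can bind the same address while this one is still
// serving.
func reusePortListener(addr string) (net.Listener, error) {
	lc := net.ListenConfig{
		Control: func(network, address string, c syscall.RawConn) error {
			var sockErr error
			if err := c.Control(func(fd uintptr) {
				sockErr = unix.SetsockoptInt(int(fd), unix.SOL_SOCKET, unix.SO_REUSEPORT, 1)
			}); err != nil {
				return err
			}
			return sockErr
		},
	}
	return lc.Listen(context.Background(), "tcp", addr)
}

func main() {
	ln, err := reusePortListener(":9100") // illustrative port
	if err != nil {
		panic(err)
	}
	fmt.Println("listening with SO_REUSEPORT on", ln.Addr())
	_ = http.Serve(ln, http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprintln(w, "ok")
	}))
}
```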

Author

I would argue that special cases for hostPorts, hostNetworking, and other scarce resources like GPUs should not be a goal here. In particular, the behavior you describe is also true of Deployments.

For example (and I tested this), create a deployment where replicas = worker nodes, and use a hostport. Set maxUnavailable to 0, and then update your deployment. The new pods will be stuck in pending since the scheduler can't place the new pods anywhere.

This behavior is also true for DaemonSets as you point out (updated in the doc). And yes, I can see how DaemonSets may more often use hostPorts / hostNetworking in practice, but I think the scheduler's job is to place pods subject to the intersection of their resource constraints, not do resource planning for you.

We should absolutely call out this risk in the documentation. But I don't think that it should be this code's job to try to foresee potential resource conflicts.

Member

I basically agree. I don't think it is tractable to try to fix this special case.


This should be documented and highlighted. The documentation can point the user to the "delete before add" model.

Also, is it possible to test for a pod having a hostPath and refuse the "add before delete" option? (updated: it is noted in the document on line 136)

Author

I will update the documentation once we come to a decision about what (if any) constraints will be checked.

I think the two proposed approaches still in contention are:

  1. Add very visible documentation saying that certain classes of resource types (such as host networking, host ports, etc) may lead to unschedulable situations when using this strategy.
  2. Add documentation as in 1, but also do a hard validation check to disallow using this strategy in conjunction with hostPorts specifically.

I am becoming convinced that option 2 is somewhat reasonable since the DaemonSet controller has an explicit scheduling check for hostPort contention. But I would not be in favor of adding checks for any other resource types, at least with the initial implementation.
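
To make option 2 concrete, here is a rough, self-contained Go sketch of the kind of validation check being discussed. The struct definitions are simplified stand-ins for the real API types, and the strategy name string follows the proposal's naming; this is illustrative only, not the actual validation code from the PR.

```go
package validation

import "fmt"

// Simplified stand-ins for the relevant API fields; the real check would
// operate on the DaemonSet types in the apps API group.
type ContainerPort struct {
	HostPort int32
}

type Container struct {
	Name  string
	Ports []ContainerPort
}

type DaemonSetSpec struct {
	UpdateStrategyType string // "RollingUpdate", "OnDelete", or the proposed "SurgingRollingUpdate"
	Containers         []Container
}

// validateSurgingRollingUpdate rejects the combination of the proposed
// SurgingRollingUpdate strategy with any hostPort, since the surged pod on a
// node could never bind a port the old pod still holds.
func validateSurgingRollingUpdate(spec DaemonSetSpec) error {
	if spec.UpdateStrategyType != "SurgingRollingUpdate" {
		return nil
	}
	for _, c := range spec.Containers {
		for _, p := range c.Ports {
			if p.HostPort != 0 {
				return fmt.Errorf("container %q uses hostPort %d, which is not allowed with the SurgingRollingUpdate strategy", c.Name, p.HostPort)
			}
		}
	}
	return nil
}
```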

@thockin thockin assigned liggitt and janetkuo and unassigned thockin Aug 24, 2017

1. How are `hostPort`s handled?

They are not handled as part of this proposal. We can either:
Contributor

Allowing a non-working configuration is a bad idea. I think option 3 should be implemented.

Author

As I mentioned above, you can reach this state with Deployments too. I'll defer on picking an answer to this question until we resolve that discussion.

### Alternatives considered

The `maxSurge` parameter could have been added to the existing `RollingUpdate`
strategy (as is the case for Deployments). However, this would break backwards
Member

break validation backwards compatibility

Please clarify that while all this would have been technically possible due to the shift to v1beta2, it was discarded just to be on the safe side.

Author

Done



They are not handled as part of this proposal. We can either:

1. Attempt to determine a mechanism to handle this (unclear)
Member

point out that this may be a future improvement, but will not be taken into consideration right now.
Possibly for GA?

2. Note in the documentation that this may cause port conflicts that prevent
new pods from scheduling successfully (and therefore updates from completing
successfully)
3. Add a validation check that rejects the combination of `hostPort` and the
Member

👍 please point out as the accepted proposal

Member

host ports are only one possible scarce resource (albeit the most likely). validating just that still allows for impossible-to-satisfy updates involving scarce resources, which is worse for daemonsets than for deployments because there is no possibility of finding another node with the available resources.

Contributor

As far as I remember, the DaemonSet controller checks only for hostPort conflicts. It will not allow the new pod to run on the node, and that will cause a deadlock.


TODO(diegs): investigate this.

3. How are other shared resources (e.g. local devices, GPUs, specific core
Member

How do deployments handle this?

Member
@liggitt liggitt Aug 24, 2017

they rely on the scheduler to locate nodes with available resources

Author

Agreed, and if you think of DaemonSets as a special case of a Deployment with numReplicas = numNodes and antiaffinity then the scheduling will degenerate in the same way if you are using scarce resources.

It would require a lot of special-case engineering to work around all possible issues, and I'm sure there will always be more.

pods first, cleaning up the old pods once the new ones are running. As with
Deployments, users will use a `maxSurge` parameter to control how many nodes
are allowed to have >1 pod running during a rolling update.

Member

Please add the modifications that you intend to make to the API. Generally, we add the types and type modifications that we intend to make as part of the design. The design docs for StatefulSet and DaemonSet RollingUpdate are examples.

Author

Done
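
For reference, the kind of type addition being asked for would look roughly like the sketch below. The strategy and field names (`SurgingRollingUpdate`, `maxSurge`) follow the proposal's naming, but the exact Go shape here is a reconstruction modeled on the existing `RollingUpdateDaemonSet` type rather than a copy of the PR's code.

```go
package v1beta2

import "k8s.io/apimachinery/pkg/util/intstr"

type DaemonSetUpdateStrategyType string

const (
	RollingUpdateDaemonSetStrategyType DaemonSetUpdateStrategyType = "RollingUpdate"
	OnDeleteDaemonSetStrategyType      DaemonSetUpdateStrategyType = "OnDelete"
	// Proposed: create the new pod on a node before deleting the old one.
	SurgingRollingUpdateDaemonSetStrategyType DaemonSetUpdateStrategyType = "SurgingRollingUpdate"
)

type RollingUpdateDaemonSet struct {
	MaxUnavailable *intstr.IntOrString `json:"maxUnavailable,omitempty"`
}

// SurgingRollingUpdateDaemonSet is the proposed configuration block, used
// only when Type is SurgingRollingUpdate.
type SurgingRollingUpdateDaemonSet struct {
	// MaxSurge is the maximum number of nodes that may run two generations
	// of a DaemonSet pod at once during an update. Defaults to 1; must be > 0.
	MaxSurge *intstr.IntOrString `json:"maxSurge,omitempty"`
}

type DaemonSetUpdateStrategy struct {
	Type                 DaemonSetUpdateStrategyType    `json:"type,omitempty"`
	RollingUpdate        *RollingUpdateDaemonSet        `json:"rollingUpdate,omitempty"`
	SurgingRollingUpdate *SurgingRollingUpdateDaemonSet `json:"surgingRollingUpdate,omitempty"`
}
```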

`maxUnavailable` to be > 0, but this is not required with
`SurgingRollingUpdate`.

There are plans to create a `v1beta2` iteration of the extensions API, so it
Member

The v1beta2 API is enabled by default on master. We want to promote that surface to GA in 1.9 provided we get good feedback on the v1beta2 surface in 1.8.

Author

Noted

may be possible to add the `maxSurge` strategy to `RollingUpdate` and deprecate
`SurgingRollingUpdate` then.

For more background see https://github.com/kubernetes/kubernetes/issues/48841.
Member
@kow3ns kow3ns Aug 24, 2017

Why is your proposal a better alternative to creating a new DaemonSet and only deleting the current DaemonSet after the first has saturated the desired number of Nodes? Can you discuss this more here?

Author

Can you elaborate on this strategy? Do you mean having the user create a new DaemonSet with a different name, and then deleting this one?

If so I can think of many reasons (retaining history, minimizing clutter, actually being able to introspect the status of the update, etc.) but I want to make sure I understand which you mean.

Member

Agree. Multiple DaemonSets should be listed as an alternative here, too.

### Considerations / questions

1. How are `hostPort`s handled?

Member

I don't think it's acceptable to completely punt on hostPort conflicts. They are the most common way of communicating with applications in a DaemonSet. Using validation to block the creation of DaemonSets with this updateStrategy (as you suggest below) is one option, but it really limits the usefulness of the updateStrategy.

Author

See above.

Member

Agree, hostPort conflicts need to be handled for SurgingRollingUpdate to be useful.

Author

@janetkuo handled in what sense? Can you comment on the discussion immediately above this one? We could block it in validation (since I guess the code already has a special case to block it from a scheduling perspective), but besides that I'm not sure what else we can do.

3. Add a validation check that rejects the combination of `hostPort` and the
`SurgingRollingUpdate` strategy.

2. How will the scheduler handle hostPort collisions?
Member

Unless application designers implement clever coordination, it will attempt to schedule both Pods, and one of them will continuously fail.

Member
@kow3ns kow3ns Aug 24, 2017

Is that true for the default scheduler as well (assuming we move to it one day)?

Contributor
@lukaszo lukaszo Aug 24, 2017

I think that with the default scheduler the pod will be created and will stay in the Pending state.

Author

That is the behavior I observed.

`RollingUpdate` strategy only deletes pods up to the `maxUnavailable` threshold
and then waits for the main reconciler to create new pods to replace them).

One minor complication is that the controller-manager's reconciliation loop
Member
@kow3ns kow3ns Aug 24, 2017

This isn't actually true. One Pod per node is best effort. In contrast to this proposal, which would relax this constraint, the community has requested exactly one Pod per node to provide better guarantees with respect to the overlap of lifecycles of the Pods in their DaemonSets. Right now it is possible for DaemonSet to schedule a new Pod while another Pod is still terminating. We consider this to be a bug (kubernetes/kubernetes#50477).

Author

Clarified that the invariant is not 100% enforced yet but it is a goal.

This strategy is incompatible with the invariant, so either we need to relax it during updates as mentioned, or not implement this.

Member

That bug is unexpected and was recently introduced by kubernetes/kubernetes#43077 (1.8). This behavior (don't create new pods until the old one is terminated) has been supported by DaemonSet controller for a long time.

3. How are other shared resources (e.g. local devices, GPUs, specific core
types) handled?

This is a generalization of the `hostPort` question. I think that in this case
Member

I think that consuming GPU and specific core types can be considered out of scope. It is not a common use case for DaemonSets. For hostPath volumes, you could perhaps advise using filesystem locks to prevent data corruption when applications use this update strategy. If you combine filesystem locks, a burden on application implementors (at least initially), with a robust solution to hostPort conflicts, you have something that most applications can consume.


This feature is a blocker for supporting self-hosted Kubernetes in `kubeadm`.

### Implementation plan
Member

There are a few things which aren't clear to me.

  1. When can I select this strategy? Does it have to be set on creation? Can I change the updateStrategy from RollingUpdate to SurgeRollingUpdate (currently updateStrategy is mutable)?
  2. How will the controller handle the case where the updateStrategy is mutated to SurgeRollingUpdate when RollingUpdate was previously selected and an update is in progress?
  3. How will the controller handle the case where the updateStrategy is mutated to RollingUpdate when SurgeRollingUpdate was previously selected and an update is in progress?
  4. Is this on or off by default? Do you intend to use feature gate flags to make this an alpha feature, or will all clusters immediately get this feature?

Author

Added answers in the "Considerations / questions" section.

Author

Regarding alpha/beta/feature gate, I defer to @luxas and others.

`maxUnavailable` to be > 0, but this is not required with
`SurgingRollingUpdate`.

There are plans to create a `v1beta2` iteration of the extensions API, so it
Member

There are plans to create a v1beta2 iteration of the extensions API

do you mean v1beta2 of the apps API group? The goal is to eliminate the extensions API group, not perpetuate it

Author

Fixed

Member

Yes it should be apps/v1beta2. Since apps/v1beta2 is introduced in 1.8, this validation change can be made in apps/v1beta2 only if it's implemented in 1.8.

@k8s-ci-robot added the size/L label (100-499 lines changed, ignoring generated files) and removed the size/M label on Aug 25, 2017
Author
@diegs diegs left a comment

Addressed comments and updated doc.

I used a new commit so we wouldn't lose the comment threads. I can squash before I submit.

//---
// TODO: Update this to follow our convention for oneOf, whatever we decide it
// to be. Same as Deployment `strategy.rollingUpdate`.
// See https://github.com/kubernetes/kubernetes/issues/35345

fix indentation?

Author

Done

strategy.

This document presents a design for a new "add first, then delete" rolling
update strategy, called `SurgingRollingUpdate`. This strategy will create new
Member

If this new DaemonSet update strategy is similar to Deployment's RollingUpdate (i.e. MaxSurge & MaxUnavailability), it can support both "delete first" and "add first" (depending on whether MaxUnavailability is 0 or not).

Author

This is noted in Alternatives Considered below. It was decided in kubernetes/kubernetes#48841 to add a new strategy so validation would remain backwards-compatible.


I think it's a great idea to converge on strategies across all controllers. We could add MaxSurge and MaxUnavailability and keep the default as MaxSurge=0 and MaxUnavailability as 1 to keep it backward compatible. @diegs how does this break validation?

2. Determine how many previous-generation pods can be retired once their
replacement pods are scheduled and running.
3. Instruct the controller-manager to add and delete pods based on the results
of (1) and (2).
Member

SurgingRollingUpdate will create a new DaemonSet pod before the old one on the same node gets terminated, correct?

For example, let's say you have 3 nodes, maxSurge=1, maxUnavailable=0. Node 1 has 2 pods, and each of the other 2 nodes has 1 pod (4 pods in total). Then you can only delete pods on node 1, but not the other two.

If this is the requirement, this is not just about pods anymore, you need to mention nodes here.

Author

Correct. Clarified.
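
To make the node-level accounting described above concrete, here is a minimal Go sketch of the bookkeeping a controller could do, assuming surge is tracked per node and old pods are only retired once their replacement on the same node is ready. The function and parameter names are illustrative, not the controller's actual code.

```go
package sketch

// planSurge decides, for one sync pass, on which nodes a new-generation pod
// may be created and on which nodes the old pod may now be deleted.
// oldPodByNode reports nodes that still run an old-generation pod;
// newPodReadyByNode is keyed by nodes where a new-generation pod exists, with
// the value reporting whether it is running and ready.
func planSurge(nodes []string, oldPodByNode, newPodReadyByNode map[string]bool, maxSurge int) (createOn, deleteOn []string) {
	surging := 0
	for node, ready := range newPodReadyByNode {
		if !oldPodByNode[node] {
			continue // node already fully updated
		}
		if ready {
			// Replacement is up; the old pod on this node can be retired.
			// (For simplicity, such nodes no longer count against the budget.)
			deleteOn = append(deleteOn, node)
		} else {
			// Node currently runs two generations and the new pod is not
			// ready yet; it counts against maxSurge.
			surging++
		}
	}
	for _, node := range nodes {
		_, started := newPodReadyByNode[node]
		if oldPodByNode[node] && !started && surging < maxSurge {
			// Start surging this node: create the new pod next to the old one.
			createOn = append(createOn, node)
			surging++
		}
	}
	return createOn, deleteOn
}
```

In the example above (3 nodes, maxSurge=1, maxUnavailable=0), only the node that already has both generations running is eligible for a deletion, which matches the per-node reading of the strategy.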

strategy (as is the case for Deployments). However, this would break backwards
compatibility, since the `maxUnavailable` parameter currently requires
`maxUnavailable` to be > 0, but this is not required with
`SurgingRollingUpdate`.
Member

Also, most DaemonSet applications can't tolerate two pods running at the same time. That's why RollingUpdate is implemented that way (delete first, then create).

This document presents a design for a new "add first, then delete" rolling
update strategy, called `SurgingRollingUpdate`. This strategy will create new
pods first, cleaning up the old pods once the new ones are running. As with
Deployments, users will use a `maxSurge` parameter to control how many nodes

What happens if maxSurge is zero? Would that interrupt the algorithm in this proposal? Also, what is the default value for maxSurge?

Author

The default is 1, and it must be greater than 0. Updated the doc to say as much.


2. Note in the documentation that this may cause port conflicts that prevent
new pods from scheduling successfully (and therefore updates from
completing successfully)
3. Add a validation check that rejects the combination of `hostPort` and the

👍

Author
@diegs diegs left a comment

@janetkuo @liggitt any chance we can close this out so we can move back over to the PR?

I think the main open questions are:

  • Whether we add special validation for hostPorts
  • Whether this is feature gated somehow, or classified as alpha / beta

Thanks,
Diego

@smarterclayton
Contributor

smarterclayton commented Oct 12, 2017

To expand a bit: every pod has a phase change on deletion:

  1. running
  2. graceful termination
  3. terminated
  4. removed from etcd

ReplicaSets always create pods at step 2 because they are replicas (we provide "at-least 1" semantics). StatefulSets create pods at step 4 because they are strongly tied to pod identity (we provide "at-most 1" semantics). DaemonSets are in the middle - they are not replicas, but they do not have pod identity either. Instead, they are tied to node identity. The real decision that a user has to make is whether they can tolerate at-most-1 or at-least-1. The issue triggered by kubernetes/kubernetes#50477 was a user who cared about at-most-1. This issue was triggered by a user who needs at-least-1. Both are valid use cases.

The behavior in a distributed system that is required to guarantee at-most-1 is pretty aggressive - I don't think we currently guarantee that with daemonsets (for instance, just run two controller managers at the same time and they will race to create multiple pods on the node). So the user in 50477 is almost certainly in for a rude awakening at some point when they depend on at most one and we don't provide it. In a sense, that's a violation of #124 and we should fix that as soon as we can.

I think we probably need to either commit to giving the users a choice, or to EXPLICITLY always be at-least-1 so that users aren't surprised when we pretend to be at-most-1 but then fail to do so. Rule of thumb is that pretending to be at-most-1 is the worst thing you can do - there is no "try".

The upgrade strategy is subordinate to this choice - a rolling update strategy has to respect at most 1 or at least 1, but is probably free to do the right thing as long as the guarantee isn't violated. Any workload that supports at-least-1 probably wants the surge behavior on rolling update, no matter what.

@jbeda
Contributor

jbeda commented Oct 12, 2017

I'm not totally read in to this stuff but the at-most-1 vs. at-least-1 seems to be a useful way to break this down.

For at-least-1 we'd want to have some patterns well documented on how to do a graceful handoff from the old thing to the new thing. That could be just a graceful shutdown or it could also be a migration of data to the new thing to keep state around and alive. This is a bit of a "live migration" type of situation.

@diegs
Author

diegs commented Oct 12, 2017

Yes, I was under the assumption that no part of k8s guaranteed at-most-1, and was surprised by kubernetes/kubernetes#50477. I completely agree that it will probably be a difficult constraint to truly guarantee in perpetuity.

I also agree that it's orthogonal to the strategy; maxUnavailable > 0 with maxSurge = 0 will terminate pods before scheduling new ones, and maxUnavailable = 0 with maxSurge > 0 will schedule new pods before terminating old ones; whether there are actually 0, 1, or 2 pods running in either scenario is not guaranteed.

Going back to this proposal, the maxSurge parameter in the strategy as I implemented it was not affected by kubernetes/kubernetes#50477. But it goes back to what I said about preferring explicitly implementing the semantics we want instead of relying on implementation details that may change. In the surging strategy we will always attempt to schedule a new pod before setting the old one to terminating, so that gives a higher chance of success than waiting for the old pod to be in a terminating state before scheduling the new one, as is the behavior with the existing strategy.

EDIT to clarify my last point above:

With the existing strategy the scheduler first sets a pod to terminating. Then it schedules a new pod on that node. Even with a long timeout, something could happen between those two steps in which the old pod terminates. If that pod was the only controller manager then there is nothing left to schedule the new pod.

In the new strategy the new pod would be scheduled before setting the old one to terminating. Even if the new pod doesn't come up before the old one completely terminates, it should eventually (it comes down to the node at that point). It's not a guarantee but it removes a major point of failure.
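
A minimal sketch of the ordering difference described above, with hypothetical helper functions standing in for the controller actions (none of these are real controller-manager APIs; the stubs exist only so the sketch compiles):

```go
package sketch

// Hypothetical helpers standing in for controller actions; stubs only.
func createNewPod(node string) error             { return nil }
func deleteOldPod(node string) error             { return nil }
func waitUntilRunningAndReady(node string) error { return nil }

// rollingUpdateNode: existing strategy on a given node. Delete first, then
// rely on the reconciler to create the replacement; a failure or delay in
// between leaves the node with zero pods.
func rollingUpdateNode(node string) error {
	if err := deleteOldPod(node); err != nil {
		return err
	}
	return createNewPod(node)
}

// surgingRollingUpdateNode: proposed strategy on a given node. Create first,
// wait for the replacement to be running, then retire the old pod; a failure
// before the delete leaves the old pod serving.
func surgingRollingUpdateNode(node string) error {
	if err := createNewPod(node); err != nil {
		return err
	}
	if err := waitUntilRunningAndReady(node); err != nil {
		return err
	}
	return deleteOldPod(node)
}
```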

@smarterclayton
Contributor

If we decide that daemonsets should be at-least-1, then there is no real need for the existing update strategy to be doing what it is doing today. In which case surge should be part of the existing strategy like ReplicaSets and we don't need or want a new strategy, we just want rolling update to do the right thing.

If we are going to support at-most-1 we need to decide on that before we go to v1.

I would argue anyone can implement at-most-one on a node by using a shared filesystem lock. While that requires a shared hostPath volume, it's a common daemonset use case. Therefore, that's an argument that we should be at-least-one because we support a workaround.

What's clear is that this discussion blocks taking daemonset to GA. We need to follow up and ensure we reach closure on something this fundamental before going GA.

@diegs
Author

diegs commented Oct 12, 2017

@smarterclayton agree that this could be implemented by adding maxSurge to the existing strategy; it was not to retain backwards-compatibility for validation.

The underlying motivation here was to achieve parity with the Deployment rolling update strategy, which has both params.

@smarterclayton
Contributor

Two clarifications:

  1. Stateful sets guarantee at most one pod per identity and do so safely. We guarantee at most one pod with a given name is running on the cluster at any one time.
  2. Safety rules and ensuring user data is safe trump backwards compatibility. If we determine our current semantics are confusing users into thinking they get at most 1, it's within our remit to break users in order to make them safe.

The best thing now is for interested parties to make a case for either at-most-1 as a feature or as a supported workaround as Joe noted. That will clarify next steps - reverting / altering 50477, an API change to DS to let a choice be possible, or delaying DS GA until we can make at-most-1 correct.

@janetkuo
Member

janetkuo commented Oct 12, 2017

FWIW kubernetes/kubernetes#50477 is a regression (caused by kubernetes/kubernetes#43077). DaemonSet controller had at-most-1 logic implemented long before kubernetes/kubernetes#50477. DaemonSet controller will not try to create DaemonSet pod on a node until it finds all the previous ones are gone.

StatefulSet uses index as part of pod name to guarantee at-most-1 pod per index. A possible way to guarantee at-most-1 pod per node for DaemonSet is to use node name as part of DaemonSet pod's name, but the pod name might become too long.
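
A hedged sketch of that naming idea (illustrative only, not existing controller code): derive the pod name from the DaemonSet name plus a short hash of the node name, which keeps the name deterministic per node while bounding its length for long node names.

```go
package naming

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// daemonSetPodName returns a deterministic, per-node pod name for a DaemonSet
// pod. Hashing the node name keeps the result short and valid even when node
// names are long, at the cost of readability.
func daemonSetPodName(dsName, nodeName string) string {
	sum := sha256.Sum256([]byte(nodeName))
	return fmt.Sprintf("%s-%s", dsName, hex.EncodeToString(sum[:])[:10])
}
```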

@kow3ns
Member

kow3ns commented Oct 13, 2017

I don't think this decision should gate v1. DaemonSet controller was always designed to attempt at most one semantics, but I believe we could loosen or tighten this constraint as a feature, post GA, without breaking backwards compatibility.

For at most one:

  1. I'm not entirely convinced that trying to tighten the constraint (e.g. by incorporating the Node name in the generated Pod name) would be a breaking change if we turned this on in the general case.

    1. The general expectation is that the controller is already trying to achieve at most one.
    2. Name generation for DaemonSet Pods is random, so if we use the Node name to make the Pods Node unique it doesn't violate users' expectations (for DaemonSets the name of a Pod is an implementation detail, not a conformance requirement).
  2. If we think it is a breaking change, then we could make it opt in via the addition of a flag on the DaemonSet Spec (e.g. spec.nameGeneration=Random|Node) where the default uses random name generation.

  3. As pointed out above, applications can use advisory file system locks for tighter guarantees of uniqueness to achieve stricter semantics where necessary, so I don't think we have to provide stronger guarantees as a prerequisite for promotion.

For at least once:

  1. We can modify the update strategy to use maxSurge, or add a new UpdateStrategy. In either case the new behavior can be implemented without breaking backward compatibility.

Also, we should not revert #50477. This reverted a change that introduced a bug where DaemonSet controller would launch Pods while a current Pod was still terminating. Reverting this will not implement maxSurge behavior.

@smarterclayton
Contributor

smarterclayton commented Oct 13, 2017

DaemonSet controller was always designed to attempt at most one semantics, but I believe we could loosen or tighten this constraint as a feature, post GA, without breaking backwards compatibility.

Looking at the code we'd need to tighten at least a few things to guarantee that (as you mentioned with node name). But I am a -1 on declaring something GA that has even the potential to violate a user assumption or violate pod safety guarantees, and that's where we are today. My working assumption was that daemonset is at-least-one because we are using generateName, and you can't use generateName and get at-most-one. The statements that it's at-most-one is surprising to me.

We can modify the update strategy to use maxSurge, or add a new UpdateStrategy

at-most-1 or at-least-1 is not about update strategy. It's a spec property independent of update. In the absence of an update during a drain or eviction the daemonset should follow the same property. It's fundamental to daemonset, the same way pod management policy applies regardless of update strategy.

Also, We should not revert #50477

Agree, but not-quite at-most-1 is also not the right thing to ship. By revert I meant, "don't almost give at-most-1 as the user who requested #50477 raised".

The general expectation is that the controller is already trying to achieve at most one.

I think cluster-lifecycle's position is that it's actually not what they need, and other issues (a cursory search includes kubernetes/kubernetes#51974 and kubernetes/kubernetes#48307) make it less than clear. Looking at my own use cases, I have a small number that want at-most-one and a much larger number that want at-least-one for rolling update.

As pointed out above, applications can use advisory file system locks for tighter guarantees of uniqueness to achieve stricter semantics where necessary, so I don't think we have to provide stronger guarantees as a prerequisite for promotion.

I don't think this is the right mindset for providing an API to users based on our previous history here on statefulsets and deployments. Providing an ALMOST correct guarantee creates a scenario where users expect it to work except when it doesn't. That means it fails at the worst time and either eats data or blows up a system. Either we implement at-most-1, or we implement at-least-1 and tell people how to get at-most-1. We've never shied away from doing the right thing for workload APIs before, not sure why we should do it here.

I'd probably be ok with a v1 that did at-least-1 that could be turned to almost-at-most-1 with caveats and warnings, where in the future we'd implement correct at-most-1. I can't reconcile shipping almost-at-most-1 only given #124 and history with statefulset.

@bgrant0607
Member

Quick comment: Any application that really needs at most one on a node (e.g., because multiple instances would otherwise trash persistent data on local disk) absolutely has to use a file lock or other OS locking mechanism. We should document that. K8s cannot and should not provide application-level exclusion/handshaking.
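
As one concrete form the suggested documentation could take, below is a minimal Linux-only Go sketch of such a lock, assuming the DaemonSet pod mounts a hostPath volume and the lock file path is chosen by the application (the path here is purely illustrative):

```go
package main

import (
	"fmt"
	"os"
	"syscall"
)

// acquireNodeLock takes an exclusive, non-blocking advisory lock on a file
// that lives on a hostPath volume shared by every generation of the DaemonSet
// pod on this node. Only one process on the node can hold it at a time, and
// the kernel releases it automatically when the process exits.
func acquireNodeLock(path string) (*os.File, error) {
	f, err := os.OpenFile(path, os.O_CREATE|os.O_RDWR, 0644)
	if err != nil {
		return nil, err
	}
	if err := syscall.Flock(int(f.Fd()), syscall.LOCK_EX|syscall.LOCK_NB); err != nil {
		f.Close()
		return nil, fmt.Errorf("another instance already holds the node lock: %w", err)
	}
	return f, nil
}

func main() {
	lock, err := acquireNodeLock("/var/lock/my-agent.lock") // illustrative hostPath location
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	defer lock.Close()
	fmt.Println("node lock held; safe to touch node-local state")
	// ... application work ...
}
```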

@kow3ns
Member

kow3ns commented Oct 13, 2017

I'd probably be ok with a v1 that did at-least-1 that could be turned to almost-at-most-1 with caveats and warnings, where in the future we'd implement correct at-most-1. I can't reconcile shipping almost-at-most-1 only given #124 and history with statefulset.

  1. What almost at most one means right now is that DaemonSet respects the termination grace period of the Pods that it created. Currently the DaemonSet controller will not launch a new Pod until any other Pod, that it has created on the same node, is completely deleted. A failure to correctly implement this was the root cause of #50477.

  2. As @smarterclayton points out above, this behavior is important for DaemonSets that use scarce resources as a new Pod may interfere with the termination lifecycle of the current Pod, and this can occur during unplanned termination (e.g. eviction due to taints). I don't think that it ever makes sense to modify this behavior for DaemonSet Pods during unplanned disruptions.

  3. For this proposal, and for #51974 (#48307 is about the interaction between Deployments and PDBs and I don't see why it's relevant here), the best way to deal with port conflicts would be to implement something like Node Local Services first. If we don't do this, I don't see how any modification to the existing strategy, or any new strategy that attempts to launch more than one Pod per Node, will provide value in the general case (i.e. It might work for self hosting, but it will not solve #51974).

  4. It sounds like we have broad agreement that advisory filesystem locking, or some other locking method, must be used to achieve strong guarantees about at most one semantics. If we want to give users the option of having DaemonSet controller generate a Node unique name to preclude the controller from creating two Pods on the same Node, we can do so using an addition to the spec as discussed above, but if at most one is critical, implementing that assurance via local, application level coordination will always provide a more robust solution.

  5. The DaemonSet feature was opened in Sept 2014, though it has evolved, its general design constraints haven't changed significantly since then. Given the number of addons that are currently using DaemonSet, and that we are even considering what needs to be done for self hosting, I think we have a strong signal, from our own usage, that the current implementation is mature enough for promotion. We can do (3), (4), or (3) and (4) after GA promotion without breaking backward compatibility.

@smarterclayton
Contributor

smarterclayton commented Oct 14, 2017

What almost at most one means right now is that DaemonSet respects the termination grace period of the Pods that it created

This is not at-most-one (to be clear for other participants in this thread) because any interruption or race in the controller can and will spawn two containers. This is "best effort at most one".

this behavior is important for DaemonSets that use scarce resources as a new Pod may interfere with the termination lifecycle of the current Pod, and this can occur during unplanned termination (e.g. eviction due to taints

I don't follow. Nodes don't release resources until a pod is stopped, so a new pod launched by the daemonset to replace an evicted pod will be held up by resource release in the kubelet (or should, since this is a condition only the node can validate).

#48307 specifically comments from a user:

Another thing is a planned maintenance on a node: it should be fairly simple to make sure extra pods are started elsewhere before we shut down this node for maintenance

There is a reasonable expectation among a wide range of kubernetes users I have interacted with that kubernetes should work to avoid application downtime by being aggressive about scale up. Users have the same expectation with daemon sets.

Port conflicts ... If we don't do this, I don't see how any modification to the existing strategy, or any new strategy that attempts to launch more than one Pod per Node, will provide value in the general case

Port conflicts are irrelevant since they can be avoided by the user omitting the port definition on the pod. A user today can remove the port conflict and do port handoff themselves with SO_REUSEPORT, but the current behavior of the daemonset prevents that. I don't know of any other conflict that impacts normal users, and omitting the pod's port definition is not burdensome to users if they can achieve availability. So the blocker is still the daemonset controller imposing at-most-one-ish.

At most one through locking ... more robust

The point though is that giving the appearance of a robust solution deters users from being actually robust which leads to application failure. We have documented and agreed on not violating user expectation previously.

Given the number of addons that are currently using DaemonSet, and that we are even considering what needs to be done for self hosting...

Since the self-hosting folks are saying there's a gap, this seems a weak argument. V1 is EXACTLY when we should be confident it works. A perceived pressure to declare workload APIs v1 means very little to me if the semantics will lead to failure. I don't care whether workload v1 hits in 1.9 or 1.10, I care very much whether my production users are justified in their trust that we do the right thing for them.

@kow3ns
Member

kow3ns commented Oct 14, 2017

This is not at-most-one (to be clear for other participants in this thread) because any interruption or race in the controller can and will spawn two containers. This is "best effort at most one".

Correct, to be clear, as stated above, it's just about respecting termination grace.

I don't follow. Nodes don't release resources until a pod is stopped, so a new pod launched by the daemonset to replace an evicted pod will be held up by resource release in the kubelet (or should, since this is a condition only the node can validate).

If we create another Pod on the same node while the current Pod is modifying some host resource, say the file system, as part of its termination, and the new Pod attempts to utilize that resource, it is surprising to users, as in #50477.

Port conflicts are irrelevant since they can be avoided by the user omitting the port definition on the pod. A user today can remove the port conflict and do port handoff themselves with SO_REUSEPORT,

The issue is that when users run a DaemonSet that provides a shared resource on the node that is utilized by many Pods (e.g. log shipping, distributed file system, metrics collection), the current recommended way of communicating with the Pod is via a well-known host port. Many applications (e.g. HDFS) don't allow the user to configure SO_REUSEPORT. This means that updating the Pods in the DaemonSet is inherently disruptive to all of its client Pods. Local services would provide a layer of abstraction that allows a surge update strategy to be applied in such a way that it is not disruptive to the node-local client Pods.

The point though is that giving the appearance of a robust solution deters users from being actually robust which leads to application failure. We have documented and agreed on not violating user expectation previously.

Agree that we need to clearly document the behavior and limitations, and if that documentation is presently lacking, no matter what else we do, we need to correct it for users of the current implementation, but I don't see how promoting what is currently implemented, documented, and in use on production clusters can violate the users expectations.

V1 is EXACTLY when we should be confident it works.

I strongly agree with this. Which is why I think we should not modify the existing semantics that have been in place since the original design, or ship something drastically different than the DaemonSet that we use for our addons and that users have come to rely on in their production systems.

Since the self-hosting folks are saying there's a gap, this seems a weak argument.

After talking to the self hosting folks, they indicated they can achieve their objective by creating a new DaemonSet prior to deleting the previous one. This is the way that many users achieve blue-green for DaemonSets and StatefulSets today. Given that those users have a path forward, and that we can add this feature later to further improve their experience, I think we should prioritize providing our users, and our own cluster addons, stability over new features.

@smarterclayton
Contributor

If we create another Pod on the same node while the current Pod is modifying some host resource, say the file system, as part of its termination, and the new Pod attempts to utilize that resource, it is surprising to users, as in #50477.

I agree that user surprise should be avoided, and I might even agree that the correct default for DS is at-most-one. On the other hand, we often slap people in Kubernetes hard the first time they use a particular controller specifically so that they know what they can rely on (the hard stop on graceful termination, the crash loop, the liveness check, replica scale out) up front without finding out later. And still, the number of users that assumed that replica set with scale = 1 means "at most one" is terrifying, but all of them were at least guided to the correct knowledge when their databases didn't scale up because the old pod was tearing down, or they saw two pods taking traffic.

If we had a really clear path to at-most-one correctly in the near term that wouldn't break backward compatibility and we were positive wouldn't require API changes to achieve, I'm less worried. I'd like to spend our time trying to figure out how that can be provided.

Local services would provide a layer of abstraction that allows a surge update strategy to be applied in such a way that it is not disruptive to the node local client Pods.

Yeah, I'm just mostly thinking of things using host port networking or are the ones providing services (i.e. kube-proxy, dns, sdn) that can't rely on that.

I don't see how promoting what is currently implemented, documented, and in use on production clusters can violate the users expectations.

If the pod watch is delayed and a resync is fired, it looks like the daemonset controller can create two pods on a node (from a cursory look and previous controller experience). Is that incorrect? Delayed pod watches have occurred in the last several etcd versions due to various bugs, and we have stomped several bugs in caches where they can also occur (beyond just normal distributed systems). If I have a pod that I have written that assumes that only one pod runs on a node at a time, nothing in Kube today pushes me to use filesystem locking because I'm trusting kube.

indicated they can achieve their objective by creating a new DaemonSet prior to deleting the previous one

I'm fine with that. I'm still concerned about the previous point.

@smarterclayton
Contributor

smarterclayton commented Oct 17, 2017

It looks like all that protects the daemonset from creating multiple pods on a node is the expectation system, which is single-process based.

  1. new node is added at RV=9
  2. controller creates pod at RV=10
  3. controller crashes
  4. new controller pod cache syncs to pods at RV=9 (this happens all the time for multiple reasons in the api server)
  5. controller assumes pod caches are up to date, has no expectations, creates a second pod at RV=11

@mattfarina
Contributor

@smarterclayton we discussed this in SIG Apps. The advice at this point was to take it to SIG Architecture in office hours this week. Would be nice to get some unblocking decisions there.

@smarterclayton
Contributor

Spawned the last part (daemonset not being resistant to controller restart and slow caches) as kubernetes/kubernetes#54096 and summarized the discussion here.

Otherwise, this is on hold for 1.9 given sig-cluster-lifecycle having an alternative (correct?).

@diegs
Author

diegs commented Oct 17, 2017

@smarterclayton yes, we discussed in sig-cluster-lifecycle today and the plan is to move forward with an alternative.

Personally, I'm in agreement that coming to a conclusion on the supported semantics of daemonsets should precede this discussion; once that is done then it would be nice to see if there is room for a maxSurge parameter in the rolling update strategy. For example, if maxUnavailable > 0 && maxSurge == 0 then you get at-most-one semantics, if maxUnavailable == 0 && maxSurge > 0 you get at-least-one semantics.

Once that is settled other issues brought up here (scarce resources such as hostPorts, etc) can be addressed.


### Considerations / questions

1. How are `hostPort`s handled?


Since such a use case brought me here, I'm wondering if my use case and similar ones could be solved by using a special form of NodePort that doesn't redirect to any pod but only to pods on the same node. This would avoid having to use hostPort in many cases where all you want is one pod per node that is reachable on a well-known port.

I know this is beyond the scope of this, but knowing how a solution to this hostPort scenario would likely look in the future might influence this design here.

@zoidyzoidzoid zoidyzoidzoid Nov 12, 2017

In case you didn't know about these:

In 1.5 and newer you can use the `service.beta.kubernetes.io/external-traffic=OnlyLocal` annotation, and it seems like in newer versions you can do `kubectl patch svc nodeport -p '{"spec":{"externalTrafficPolicy":"Local"}}'` to only allow local traffic.

@k8s-github-robot

This PR hasn't been active in 30 days. It will be closed in 59 days (Feb 10, 2018).

cc @diegs @janetkuo @liggitt @smarterclayton

You can add 'keep-open' label to prevent this from happening, or add a comment to keep it open another 90 days

@diegs
Author

diegs commented Jan 8, 2018

Closing this because I'm no longer working on it and I'm not aware of what the expected future will be. We can re-open or re-visit at a later time if needed.

@diegs diegs closed this Jan 8, 2018
@krmayankk

Why is this being abandoned? It seems like adding maxSurge would be ideal and at parity with Deployments.

@curlup

curlup commented Jul 17, 2018

I would like to know what should be done to put this suggestion back on the table.
Can we reopen it or something?

As for concrete user examples, I would like to suggest this one.
A monitoring agent on a node, checking the health of all the hardware and/or services as well. It seems that a DaemonSet is the right choice for that.
But having (for a short period during an update) 2 agents on a node, checking and reporting, is much better than having no agent reporting for a period.

As a matter of HA for such an agent, it might even be desirable to have 2 agents on a node constantly, in order to mitigate the agent's unavailability or crashes.

@captn3m0

We have another use case: running an ingress controller as a DaemonSet on a separate set of nodes. Version updates should not result in a scenario where a node has no pod running at all, but currently they do.

Labels: cncf-cla: yes, kind/design, kind/feature, sig/apps, sig/cluster-lifecycle, sig/scheduling, size/L