Add PodSpec.NodeFailurePolicy = {Reschedule, Delete, Ignore} #6393
LGTM. I'd prefer that we expand this PR into a complete implementation, once there is agreement on the approach. I think that this is a good compromise between user experience and control. In particular, I think that we should set ... It will also simplify the user experience of ...
Some things it would affect: ...
Is creating a controller really that much different than creating a singleton Pod? It doesn't have to be intrinsically harder. What is the problem with this, really? In Borg (and copycat systems), users create jobs, not tasks. What problem(s) are you trying to solve?

Relevant design docs/sections: ... and the last paragraph of the design overview: ...
ReplicationController name discussion: #3024
Pod migration discussion: #3949
Forgiveness proposal: #1574

We have carefully designed the division of responsibilities between pods, controllers, and services, and thoughtfully developed other fundamental tenets of the API, such as transparency of the control plane. Yes, it's not exactly the same as the Borg API, Swarm API, the GCE API, Managed Instance Groups, or AWS's Auto Scaling Groups, but for good reasons IMO.

How would "rescheduling" be implemented? A replacement pod can't have the same Name or UID, since both old and new objects must coexist, to extract final status from the old pod, if nothing else. For the foreseeable future, the replacement can't use the same IP address, either. Nor will it have access to the same local storage. Essentially, a new pod is required. A new pod also simplifies state-machine behavior, which we've been working hard to keep simple, by keeping it monotonic and not requiring additional persistent state checkpointing. Some external schedulers, such as Mesos, also currently expect pods to be scheduled just once.

So, the old pod would need to somehow spawn a new pod. What component is going to do that? Sounds like exactly what the Replication Controller component does. Is the old pod the only representation of intent that a replacement is desired? If so, we need a special mechanism to protect it from garbage collection. Also, note that etcd doesn't yet support multi-object transactions, which complicates object lifecycle.

I argue that someone transitioning from a single node to a cluster wouldn't have identical behavior, because there is no reasonable way a standalone Kubelet could implement rescheduling behavior, and we've been working towards unification of the Kubelet API and the cluster-level Pod API.

From the beginning, we've planned to support multiple types of controllers, covering classes of pods not currently covered (e.g., the job controller #1624), which was reaffirmed by consensus at the December gathering and again in #3058. It sounds like you also want to conflate persistent network identities with those pods, without separate services, nominal services, headless services, locks, task queues, or any other mechanism. The implementation of a comparable mechanism in GCE is built upon lower-level APIs that allow names and addresses to change, and fairly complex mechanisms to support network and storage migration. The comparable API in Borg created significant complexity and limitations due to the lack of such an underlying API. The "N reschedulable pods" feature would require at least as much complexity to implement as nominal services #260. Anyway, last I checked, this use case was not among our 1.0 objectives.

We have some tooling for creating all 3 kinds of objects in a single command for simple use cases (kubectl run-container). We have discussed simplifying replication controller creation by setting more of its fields by default (e.g., we could default the replica count to 1 and default the selector to match the pod's labels #5366).
We have also discussed creating higher-level API objects (e.g., Deployment Controller) to simplify common patterns further, but when we do so, we aren't going to wrap and hide the lower-level APIs. We could create a "forever pod" controller that embedded a pod template, but it would amount to a replication controller without a replica count and selector. I'm unconvinced there's a lot of value in that. If you want to simplify, there are even more radical ways to do it, such as a minimalistic run-container-style API. However, if we were going to do something like this, we'd still expose the underlying one-off pods. We could argue about which object names should be longer.

It is true that we're not introducing Kubernetes concepts (services, replication controllers, pods, labels) gently enough, nor sufficiently advertising the tooling that exists (e.g., run-container). We also have some annoying warts (initialization of service environment variables depending on creation order) and missing functionality (deployment controller, auto-sizing, usage-based scheduling). However, we should address those problems regardless.
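To make the counterproposal concrete: a minimal sketch, assuming the internal Go types of the time (the import path, field shapes, and the helper below are approximations/hypothetical, not code from this repo), of how a single "reschedulable" pod can instead be expressed as a replication controller whose replica count defaults to 1 and whose selector defaults to the pod's labels, per #5366:

```go
package example

import "github.com/GoogleCloudPlatform/kubernetes/pkg/api" // illustrative import path

// WrapPodInController is a hypothetical helper sketching the #5366 idea:
// default the replica count to 1 and the selector to the pod's labels, so
// the controller -- not the pod -- carries the intent that a replacement
// should exist after a node failure.
func WrapPodInController(name string, podSpec api.PodSpec, labels map[string]string) *api.ReplicationController {
	return &api.ReplicationController{
		ObjectMeta: api.ObjectMeta{Name: name, Labels: labels},
		Spec: api.ReplicationControllerSpec{
			Replicas: 1,      // defaulted: exactly one pod, recreated if it disappears
			Selector: labels, // defaulted: match the pod template's labels
			Template: &api.PodTemplateSpec{
				ObjectMeta: api.ObjectMeta{Labels: labels},
				Spec:       podSpec,
			},
		},
	}
}
```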
And we should complete the implementation of graceful termination of replication controllers -- we already have the needed API mechanism in place.
Fleshing out the proposal in this PR more... Upon rescheduling:
Phase would obviously become non-monotonic (which has consequences beyond just pods), and fields that are currently immutable after being initialized would become mutable.
I don't know how logs from the old node would be accessed, since repurposing the name and UID and clearing Host loses the ability to name and find them.
We'd need to explain phantom pods (pods still running on their original hosts though they've been rescheduled).
Kubelet would need to ensure it doesn't post back status for pods that have been evicted. Kubelet would ignore NodeFailurePolicy.
I'm inferring that you'd want to expose DNS for individual pods based on their resource names, despite caching problems.
What would you want to occur on other calls to delete the pod? Currently that's the mechanism we use to trigger rescheduling. Would you want a special-purpose eviction/reschedule API instead, since a reschedulable pod is presumably a pet?
Would you want in-place updates of reschedulable pods? So far we've avoided most in-place updates. Would you want in-place rolling updates of sets of N reschedulable pods? Again, so far we've avoided that.
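For contrast, a minimal sketch (hypothetical, assuming the pod fields discussed above; not code from this PR) of what an in-place "reschedule" would have to mutate on the existing pod object, which is where the non-monotonic phase, phantom pod, and lost-logs concerns come from:

```go
// Hypothetical sketch only: what rescheduling the *same* pod object would touch.
// Field names approximate the internal API of the time.
func rescheduleInPlace(pod *api.Pod) {
	pod.Spec.Host = ""                // unbind from the failed node so the scheduler re-binds it
	pod.Status.Phase = api.PodPending // phase goes Running -> Pending, so it is no longer monotonic
	pod.Status.PodIP = ""             // the old IP cannot follow the pod to a new node
	pod.Status.HostIP = ""
	// Name and UID are reused, so logs and final status on the old node can no
	// longer be addressed by them, and a "phantom" copy may still be running there.
}
```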
Re. "N reschedulable pods via 1 RC": Assuming you are going to want in-place rolling updates, would replication controller then need to keep an array of pod templates? For pods modified after creation, such as for auto-sizing or even just field defaulting, in-place updates imply an intelligent 3-way diff and merge essentially the same as that being implemented for configuration would be required: #1702, #6027. |
cc @thockin
Continuing down the logical path proposed by this PR:
And revert the earlier decisions to create different controllers for different use cases, and instead extend replication controller to handle bounded-duration pods, subsuming the job controller and per-node controller.
And the 3-way diff will be deemed too complex for replication controller to perform, so the defaulting policy would be changed to one where default values were implicitly inferred (#1502) or otherwise hidden from users (#1178). That would then require custom APIs for overriding user-provided values behind the scenes (e.g., for auto-sizing), and would make it all the more critical to eliminate setting of defaults in kubectl, since kubectl defaulting would then have different semantics than defaulting in the apiserver. Again, given replicated pets, I can also predict the request for per-replica template overrides, which would have complex interactions with scaling and rolling updates.
Google, Facebook, Twitter, Airbnb, and many other companies are using scheduling systems whose primary interfaces are collection-level APIs: ...
I acknowledge the conceptual gap between what people know to ask for and ...
The first thing that popped into my head is that UID is no longer U. We'd ...
Aside from that, the user's intent is already captured in the restartPolicy ...
And what of volumes - those are defined as "scoped to pod lifetime"?
If we really want an immortal pod, we should build it as a clean layer on ...
These Mortal Pods would be a great name for a band
+1 to the band name. Even on a single node, users need to wrap ...
I think the usability-related problems with single pods are solvable. The "surprise" aspect, I agree, is a problem, but as Brian notes it is very similar to a user going onto their system, running "./apache -C someconfig", and expecting that shell command to survive a restart. Proper explanations, like "a pod is an instance of a process running in the cluster," lead to "oh, I need a process manager," which Overseer or Supervisor would both convey.
I think the key difference here is that kubectl run-container ..., which creates a replication controller, is inherently more confusing than kubectl run .... For example, how do I know what to delete? If I delete the Pod, I'm going to get confused because it will get created again. If I list pods, I'm going to be confused because there are random strings at the end of the name.

@smarterclayton's comparison to a process is misleading, because in this case, it is not the process that has crashed, but rather the manager. A good operating system will deal with the failure of a processor by moving the process onto a different processor, not by terminating the process. I think that we need to provide an equivalent experience to end users, while at the same time telling them they may want to use other interfaces to the system.

Regarding @thockin's point about UID, the UID is still a UID; the pod has just moved host. The behavior is to clear the Host field and let the pod be re-scheduled onto a different machine. Yes, this will have some ugliness in terms of volumes, restart count, and more, but I think that ugliness and lack of purity is way better than the ugliness in the user experience due to forcing them to learn two abstractions just to run one thing, especially because those experiences are going to crop up only in the 1% case of node failure, rather than the 99% case of experiencing the tooling.

I think that we have to admit that this is an ugly necessity, and one that is a dead end. There is no migration path from permanent pods into controllers. Controllers will always set ...

When you get started with Node.js, it says "./node foo.js"; it doesn't say "configure your systemd infrastructure, etc." We need to provide a similar experience to end users.
Also, @bgrant0607: regarding large internet-scale companies that use collections, no one is arguing that controllers aren't the right level for people to use in the end; we are simply arguing that the leap from zero up to controllers is too high. We have repeatedly gotten feedback from users that our concept count is too high. We need to create a nice, smooth, incremental path from container -> Pod -> Controller -> Service.
I think the comparison to using a process manager is appropriate. Processes can fail due to various external factors.

With respect to run-container: my proposal (#1695) was to decouple object expansion from the verbs, so that create, get, update, delete could be applied to the container abstraction. This could be done as a configuration generator or as a full-blown API layer. A configuration generator would not hide the underlying objects, however.

With respect to random suffixes at the ends of names: names of the actual docker containers are even uglier, and change every time we restart a container. The PID changes every time a process is restarted. We should just explain the pod names in the same vein.

Now that v1beta3 work and other API proposals have subsided a bit, I'll work on a proposal that is more congruent with the API and system designs. A "rescheduled" pod should be a new object (so, for example, UID would change), and deletion policy (forgiveness, even if a simplified form) needs to be separate from re-creation policy. There's also the problem of where the pod template should come from, since I want to preserve the capability to initialize spec fields after creation. And, there's also the problem of how to decide when NOT to recreate a pod, such as for pods that should only be restarted on failure. I don't think we should add that capability before we have an actual job controller, which we've so far pushed beyond 1.0.
And, there's the issue of network identity for singleton pods, which is implicit in the discussion here. I'd like a model that's not incompatible with other planned features (#260, #1607 (comment)).
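To illustrate the separation argued for above between deletion (forgiveness) policy and re-creation policy, a purely hypothetical sketch (these type and field names are invented here, not proposed API):

```go
// Hypothetical sketch only: the two policies live on different objects.

// Forgiveness belongs with the pod: how long to tolerate its node being
// unreachable before the pod is deleted.
type PodForgiveness struct {
	TolerateNodeUnreachableSeconds int64
}

// Re-creation policy belongs with a controller: whether a replacement pod is
// created, and under what conditions.
type RecreationPolicy string

const (
	RecreateAlways        RecreationPolicy = "Always"
	RecreateOnFailureOnly RecreationPolicy = "OnFailure"
	RecreateNever         RecreationPolicy = "Never"
)
```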
As long as you are not SURE the old pod is dead, the UID is not really ...
I don't think the analogy holds, nor does your "failed CPU" analogy. But again, ...
This is not a reason NOT to make things simpler. Here's my ...
If we can build a layer on top, that's simpler. If we can push this ...
Frankly, I think that if we do this, it will just become the de facto ...
Interesting discussion. I agree 1000% with the problem statement as elucidated by @alex-mohr and @brendandburns. But I tend to agree with @thockin that at this point, it is more feasible to accomplish this goal by building something on top of what we have rather than by so significantly modifying the semantics of Pod. I actually think this could work, because in my simple mind I map what @alex-mohr wants roughly to the equivalent of a Borg Job, yet under the covers Borg Jobs are actually implemented using something roughly equivalent to Pods (they're called attempts) and Replication Controllers (state machines for tasks). I think it's totally fine if Pod and ReplicationController ultimately become objects that we only expect "expert" users to manipulate directly, and something more like a Borg Job becomes the way most users interact with the system.
+1 to transparent, composable macro/helper APIs. Pod is definitely intended to be a low-level primitive to be managed by controllers, and replication controllers are intended to be manipulated by deployment components/tools and auto-scalers. If we were to implement new APIs, a singleton pet API wouldn't be my first choice. It would probably be some kind of deployment controller #1743 and some mechanism for identity assignment #260.
I think the terminology we've chosen may actually be a big part of the problem here. If the objects were called task and job rather than pod and replication controller, I expect they'd be much more approachable to newcomers even if they kept the exact same semantics. Those names are already common and match up with abstractions that most people already understand, whereas pod and replication controller sound like new things that have to be learned. The simpler names would also make it easier to grok that job is the primary level of abstraction that most people should interact with - as things are, we've pushed the pod concept heavily enough that it's not easy to recognize that it's usually better to create replication controllers.

Ignoring the question of whether pods should be re-schedulable, I'd very much be in favor of a job controller (preferably referred to as a job, not a job controller) as initially proposed in #1624, which could then be presented as the core unit of work to newcomers.
I agree that Job is preferred over JobController. Note that that's intended only for terminating (bounded-duration) workloads, not for indefinitely serving workloads. We're converging on Deployment for the latter (#1743). I would have preferred to rename replication controller to something more appropriate #3024, but lost that battle. Task is used by Borg and its clones, and by ECS, but I personally find it somewhat inappropriate. The term pod was chosen to be suggestive of its meaning, a play on Docker's logo, and concise. I'm not 100% attached to it, but it would be expensive to change at this point.
A few comments on @a-robinson's comments: ...
A start to updating PodSpec to include NodeFailurePolicy, which
describes what the user would like the system to do when the pod is
unable to run on its bound node, such as when that node is marked as
failed or removed from the cluster.
NodeFailurePolicy may be Reschedule, Delete, or Ignore, with a
default policy of Reschedule.
Default behavior will change such that pods reschedule in the presence
of node failures by restarting on working nodes instead of being
deleted. If users do not want the default behavior, they can opt for
deletion -- or disable the automatic handling entirely by setting the
policy to Ignore and using some other mechanism to react.
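As a sketch of what the API addition might look like in pkg/api (my
reading of the intent; exact naming, comments, and JSON tags are
assumptions, not final code):

```go
// NodeFailurePolicy describes what the system should do with a pod when the
// node it is bound to is marked as failed or removed from the cluster.
type NodeFailurePolicy string

const (
	// NodeFailureReschedule clears the pod's binding so it can be scheduled
	// onto another working node (the proposed default).
	NodeFailureReschedule NodeFailurePolicy = "Reschedule"
	// NodeFailureDelete deletes the pod, matching today's eviction behavior.
	NodeFailureDelete NodeFailurePolicy = "Delete"
	// NodeFailureIgnore leaves the pod alone so some other mechanism can react.
	NodeFailureIgnore NodeFailurePolicy = "Ignore"
)

type PodSpec struct {
	// ... existing fields elided ...

	// NodeFailurePolicy defaults to Reschedule when unset.
	NodeFailurePolicy NodeFailurePolicy `json:"nodeFailurePolicy,omitempty"`
}
```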
This PR is only a sketch of the API and discusses the rationale and
proposed plan. After general consensus on the overall structure, I'll
implement it. Plan: update the rest of pkg/api to support this field,
add inter-version conversion logic, update the existing pod eviction
code in cloudprovider/controller/nodecontroller.go to reschedule or
delete based on policy, add tests, and update docs to reflect new
usage.
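A rough sketch of how the eviction path in
cloudprovider/controller/nodecontroller.go might branch on the policy
(function, field, and client-call names here are placeholders; the real
change will differ):

```go
// Placeholder sketch of per-policy handling for a pod on a failed node.
func (nc *NodeController) handlePodOnFailedNode(pod *api.Pod) error {
	switch pod.Spec.NodeFailurePolicy {
	case api.NodeFailureReschedule, "": // unset defaults to Reschedule
		// Clear the binding so the scheduler assigns the pod to a working node.
		pod.Spec.Host = ""
		pod.Status.Phase = api.PodPending
		_, err := nc.kubeClient.Pods(pod.Namespace).Update(pod)
		return err
	case api.NodeFailureDelete:
		// Today's behavior: evict by deleting the pod.
		return nc.kubeClient.Pods(pod.Namespace).Delete(pod.Name)
	case api.NodeFailureIgnore:
		// Leave the pod alone; some other mechanism will react.
		return nil
	}
	return nil
}
```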
I realize we've discussed similar topics before (most recently #5353,
but also #5334, #5335, and #260) with alternate approaches based on
philosophical perspectives as to what pods should be, but on further
reflection, I think we can do better by our users by meeting them
where they are and being pragmatic about allowing pods to survive
failures by rescheduling onto other nodes.
In particular, pod lifetime becomes bound to cluster lifetime rather
than the lifetime of a given node. That change furthers our goal of
helping users to think about pods and containers rather than specific
nodes. Consider two specific use cases: (1) moving from node to
cluster and (2) resizing a cluster.
(1) If someone is using Kubelet (or docker or monit or supervisord) on
a particular node to keep its containers running, when they transition
to Kubernetes, they get the container-cluster equivalent of that
behavior, and the pods they create benefit from the new environment.
(2) With our current behavior, on static clusters composed of robust
VMs, pods run for a long time. But if the cluster is resized smaller,
pods on deleted nodes unexpectedly disappear. With reschedulable
pods, they'll keep existing and running.
Other benefits:
Yes, there will be costs to this change. I rather obviously think the
benefits outweigh those. We won't be able to delete pods when it
might be convenient to do so. And with some of the current overlay
networks, in the short term a pod will change IP addresses because pod
IPs are drawn from ranges bound to a particular host (though
presumably we'll eventually remove that constraint). It's important
that Kubernetes provide a great Container Cluster Construction Set --
but it's even more important that the out-of-the-box experience for
vanilla Kubernetes be a great experience too.
@smarterclayton, @erictune, @brendandburns because they've at various
times seemed to think it might be reasonable to reschedule
pods. And @bgrant0607 for obvious reasons.