
Add PodSpec.NodeFailurePolicy = {Reschedule, Delete, Ignore} #6393

Closed
wants to merge 1 commit into from

Conversation

alex-mohr
Contributor

A start to updating PodSpec to include NodeFailurePolicy, which
describes what the user would like the system to do when the pod is
unable to run on its bound node, such as when that node is marked as
failed or removed from the cluster.

NodeFailurePolicy may be Reschedule, Delete, or Ignore, with a
default policy of Reschedule.

Default behavior will change such that pods reschedule in the presence
of node failures by restarting on working nodes instead of being
deleted. And if users do not want the default behavior, they can opt
for deletion -- or disable the automation entirely by setting the
policy to Ignore and using some other mechanism to react.

This PR is only a sketch of the API and discusses the rationale and
proposed plan. After general consensus on the overall structure, I'll
implement it. Plan: update the rest of pkg/api to support this field,
add inter-version conversion logic, update the existing pod eviction
code in cloudprovider/controller/nodecontroller.go to reschedule or
delete based on policy, add tests, and update docs to reflect new
usage.
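
To make the shape of the change concrete, here is a minimal sketch of how the field might be declared in pkg/api/types.go. Only the field name and the three values come from the description above; the constant names, doc comments, and tag are placeholders.

```go
package api

// Sketch only: constant and field names are illustrative.

// NodeFailurePolicy describes what the system should do with a pod when the
// node it is bound to is marked failed or is removed from the cluster.
type NodeFailurePolicy string

const (
	// NodeFailureReschedule clears the pod's binding so it can run on
	// another healthy node (the proposed default).
	NodeFailureReschedule NodeFailurePolicy = "Reschedule"
	// NodeFailureDelete deletes the pod, matching today's behavior.
	NodeFailureDelete NodeFailurePolicy = "Delete"
	// NodeFailureIgnore leaves the pod alone so some other mechanism can react.
	NodeFailureIgnore NodeFailurePolicy = "Ignore"
)

// PodSpec, abridged, with the proposed field added.
type PodSpec struct {
	// ... existing fields: Volumes, Containers, RestartPolicy, Host, ...

	// NodeFailurePolicy controls what happens to the pod when its bound
	// node fails. Defaults to Reschedule.
	NodeFailurePolicy NodeFailurePolicy `json:"nodeFailurePolicy,omitempty"`
}
```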

I realize we've discussed similar topics before (most recently #5353,
but also #5334, #5335, and #260) with alternate approaches based on
philosophical perspectives as to what pods should be, but on further
reflection, I think we can do better by our users by meeting them
where they are and being pragmatic about allowing pods to survive
failures by rescheduling onto other nodes.

In particular, pod lifetime becomes bound to cluster lifetime rather
than the lifetime of a given node. That change furthers our goal of
helping users to think about pods and containers rather than specific
nodes. Consider two specific use cases: (1) moving from node to
cluster and (2) resizing a cluster.

(1) If someone is using Kubelet (or docker or monit or supervisord) on
a particular node to keep its containers running, when they transition
to Kubernetes, they receive a container cluster equivalent and
their created pods benefit from the new environment.

(2) With our current behavior, on static clusters composed of robust
VMs, pods run for a long time. But if the cluster is resized smaller,
pods on deleted nodes unexpectedly disappear. With reschedulable
pods, they'll keep existing and running.

Other benefits:

  • Names (IDs) remain unchanged as a pod moves around the cluster, which helps debugging and introspection.
  • Users can more-incrementally adopt Kubernetes concepts: they need only use pods and not replication controllers at first.
  • Replication controllers potentially become simpler: their config describes a factory that is able to create pods with specified properties, but the handling of specific names and ids is reflected in the pod.
    • E.g. consider the difference between "an EBS volume of 100GB" and "specific EBS volume Foo".
    • ... and we no longer need to create replication controllers of size 1.
    • ... and it becomes feasible to have N reschedulable pods via 1 RC instead of N RCs. (Or write a RCRC to replicate RCs of size 1.)

Yes, there will be costs to this change. I rather-obviously think the
benefits outweigh those. We won't be able to delete pods when it
might be convenient to do so. And with some of the current overlay
networks, in the short term a pod will change IP addresses because pod
IPs are drawn from ranges bound to a particular host (though
presumably we'll eventually remove that constraint). It's important
that Kubernetes provide a great Container Cluster Construction Set --
but it's even more important that the out-of-the-box experience for
vanilla Kubernetes be a great experience too.

@smarterclayton, @erictune, @brendandburns because they've at various
times seemed to think it might be reasonable to reschedule pods. And
@bgrant0607 for obvious reasons.

@brendandburns
Contributor

@bgrant0607

LGTM. I'd prefer that we expand this PR into a complete implementation, once there is agreement on the approach.

I think that this is a good compromise between user experience and control. In particular, I think that we should set NodeFailurePolicy = "Delete" for Pods created by replication controllers, since a higher-level construct is responsible for their restart.

It will also simplify the user experience of kubectl run-container ... --replicas=1 since right now it creates a replication controller, without really explaining to the user what it is doing, or how to delete the RC and pods that are created.
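
For illustration, if RC-created pods defaulted to Delete, the change in the replication controller manager could be as small as stamping the policy onto pods built from the template. Roughly (helper and field names here are illustrative, not the actual manager code):

```go
package controller

import api "github.com/GoogleCloudPlatform/kubernetes/pkg/api"

// podFromRCTemplate builds a pod from a replication controller's template.
// Since the RC itself replaces lost replicas, node-failure rescheduling is
// turned off for the pods it owns. Sketch only.
func podFromRCTemplate(rc *api.ReplicationController) *api.Pod {
	pod := &api.Pod{
		ObjectMeta: api.ObjectMeta{
			Labels:       rc.Spec.Template.Labels,
			GenerateName: rc.Name + "-",
		},
		Spec: rc.Spec.Template.Spec,
	}
	pod.Spec.NodeFailurePolicy = api.NodeFailureDelete // the RC handles replacement
	return pod
}
```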

@smarterclayton
Contributor

Some things it would affect:

  • initializers would not be able to run without knowledge that the node is down (scheduler would have to clear instance specific data on the pod)
  • a failed run-once pod on a node might be rescheduled to another node (and started) before the kubelet reports the pod has failed. (run-once pods should not be allowed to be rescheduled?)
  • pods cannot reason about their past placement in external systems (can store podname+host)

@bgrant0607
Member

Is creating a controller really that much different than creating a singleton Pod? It doesn't have to be intrinsically harder. What is the problem with this, really? In Borg (and copycat systems), users create jobs, not tasks. What problem(s) are you trying to solve?

Relevant design docs/sections:
https://github.com/GoogleCloudPlatform/kubernetes/blob/master/docs/pods.md#durability-of-pods-or-lack-thereof
https://github.com/GoogleCloudPlatform/kubernetes/blob/master/docs/replication-controller.md#responsibilities-of-the-replication-controller

and the last paragraph of the design overview:
https://github.com/GoogleCloudPlatform/kubernetes/blob/master/DESIGN.md#overview

ReplicationController name discussion: #3024

Pod migration discussion: #3949

Forgiveness proposal: #1574

We have carefully designed the division of responsibilities between pods, controllers, and services, and thoughtfully developed other fundamental tenets of the API, such as transparency of the control plane.

Yes, it's not exactly the same as the Borg API, Swarm API, the GCE API, Managed Instance Groups, nor AWS's Auto-scaling Groups, but for good reasons IMO.

How would "rescheduling" be implemented? A replacement pod can't have the same Name nor UID, since both old and new objects must coexist, to extract final status from the old pod, if nothing else. For the foreseeable future, the replacement can't use the same IP address, either. Nor will it have access to the same local storage. Essentially, a new pod is required. A new pod also simplifies state-machine behavior, which we've been working hard to keep simple, by keeping it monotonic and not requiring additional persistent state checkpointing. Some external schedulers, such as Mesos, also currently expect pods to be scheduled just once.

So, the old pod would need to somehow spawn a new pod. What component is going to do that? Sounds like exactly what the Replication Controller component does. Is the old pod the only representation of intent that a replacement is desired? If so, we need a special mechanism to protect it from garbage collection. Also, note that etcd doesn't yet support multi-object transactions, which complicates object lifecycle.
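
For contrast, the existing replication controller loop already embodies "spawn a replacement as a new object": each replacement is a brand-new pod with its own name and UID, and failed pods stick around until their final status has been read. A very rough sketch of that loop, for illustration only (not the actual manager code; names are mine):

```go
package controller

import api "github.com/GoogleCloudPlatform/kubernetes/pkg/api"

// syncReplicas counts live replicas and creates fresh pods for any shortfall.
// The create callback stands in for the manager's pod-creation path.
func syncReplicas(rc *api.ReplicationController, pods []api.Pod, create func(*api.PodTemplateSpec)) {
	active := 0
	for _, p := range pods {
		if p.Status.Phase != api.PodSucceeded && p.Status.Phase != api.PodFailed {
			active++
		}
	}
	for i := active; i < rc.Spec.Replicas; i++ {
		create(rc.Spec.Template) // a brand-new pod object; old ones are never reused
	}
}
```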

I argue that someone transitioning from a single node to a cluster wouldn't have identical behavior, because there is no reasonable way a standalone Kubelet could implement rescheduling behavior, and we've been working towards unification of the Kubelet API and the cluster-level Pod API.

From the beginning, we've planned to support multiple types of controllers, covering classes of pods not currently covered (e.g., the job controller #1624), which was reaffirmed by consensus at the December gathering and again in #3058.

It sounds like you also want to conflate persistent network identities with those pods, without separate services, nominal services, headless services, locks, task queues, or any other mechanism. The implementation of a comparable mechanism in GCE is built upon lower-level APIs that allow names and addresses to change, and fairly complex mechanisms to support network and storage migration. The comparable API in Borg created significant complexity and limitations due to the lack of such an underlying API. The "N reschedulable pods" would require at least as much complexity to implement as nominal services #260. Anyway, last I checked, this use case was not among our 1.0 objectives.

We have some tooling for creating all 3 kinds of objects in a single command for simple use cases (kubectl run-container). We have discussed simplifying replication controller creation by setting more of its fields by default (e.g., we could default the replica count to 1 and default the selector to match the pod's labels #5366).
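
A sketch of that defaulting idea (#5366), purely for illustration; with a plain int field, "unset" and an explicit 0 are indistinguishable, so real defaulting would live on the versioned type:

```go
package defaults

import api "github.com/GoogleCloudPlatform/kubernetes/pkg/api"

// defaultReplicationController fills in a replica count of 1 and copies the
// selector from the pod template's labels when neither is specified.
// Illustrative only, not the actual defaulting code.
func defaultReplicationController(rc *api.ReplicationController) {
	if rc.Spec.Replicas == 0 {
		rc.Spec.Replicas = 1
	}
	if len(rc.Spec.Selector) == 0 && rc.Spec.Template != nil {
		rc.Spec.Selector = rc.Spec.Template.Labels
	}
}
```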

We have also discussed creating higher-level API objects (e.g., Deployment Controller) to simplify common patterns further, but when we do so, we aren't going to wrap and hide the lower-level APIs.

We could create a "forever pod" controller that embedded a pod template, but it would amount to a replication controller without a replica count and selector. I'm unconvinced there's a lot of value to that. If you want to simplify, there are even more radical ways to do it, such as a minimalistic run-container-style API. However, if we were going to do something like this, we'd still expose the underlying one-off pods. We could argue about which object names should be longer.

It is true that we're not introducing Kubernetes concepts (services, replication controllers, pods, labels) gently enough, nor sufficiently advertising the tooling that exists (e.g., run-container). We also have some annoying warts (initialization of service environment variables depending on creation order) and missing functionality (deployment controller, auto-sizing, usage-based scheduling). However, we should address those problems, regardless.

@bgrant0607
Member

And we should complete the implementation of graceful termination of replication controllers -- we already have the needed API mechanism in place.

@bgrant0607
Member

Fleshing out the proposal in this PR more...

Upon rescheduling:

  • Spec.Host would be cleared
  • All of Status would be essentially reset to the initial state:
    • Status.HostIP and Status.PodIP would be cleared
    • Phase would revert to Pending
    • Conditions would be cleared
    • ContainerStatuses would be cleared
    • Message would indicate the pod was evicted
  • Annotations set upon binding might need to stay as-is until the scheduler observed the pod again -- this was discussed in #4103 ("Binding annotations should be preserved alongside Host assignment"), but I don't remember the details

Phase would obviously become non-monotonic (which has consequences beyond just pods), and fields that are currently immutable after being initialized would become mutable.

I don't know how logs from the old node would be accessed, since repurposing the name and UID and clearing Host loses the ability to name and find them.

We'd need to explain phantom pods (pods still running on their original hosts though they've been rescheduled).

Kubelet would need to ensure it doesn't post back status for pods that have been evicted.

Kubelet would ignore NodeFailurePolicy.

I'm inferring that you'd want to expose DNS for individual pods based on their resource names, despite caching problems.

What would you want to occur on other calls to delete the pod? Currently that's the mechanism we use to trigger rescheduling. Would you want a special-purpose eviction/reschedule API instead, since a reschedulable pod is presumably a pet?

Would you want in-place updates of reschedulable pods? So far we've avoided most in-place updates.

Would you want in-place rolling updates of sets of N reschedulable pods? Again, so far we've avoided that.
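
To make the reset in the list above concrete, this is roughly what the node controller would do to a pod whose policy is Reschedule (the field names follow the current API; everything else here is an assumption):

```go
package nodecontroller

import api "github.com/GoogleCloudPlatform/kubernetes/pkg/api"

// resetPodForRescheduling clears the binding and reverts status so the
// scheduler will place the pod again. Sketch only.
func resetPodForRescheduling(pod *api.Pod) {
	pod.Spec.Host = "" // unbind from the failed node
	pod.Status = api.PodStatus{
		Phase:   api.PodPending,             // phase reverts, becoming non-monotonic
		Message: "evicted from failed node", // record why the pod was reset
	}
	// Replacing Status wholesale clears HostIP, PodIP, Conditions, and
	// ContainerStatuses. Binding-time annotations are deliberately left
	// untouched, pending the #4103 discussion.
}
```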

@bgrant0607
Member

Re. "N reschedulable pods via 1 RC":

Assuming you are going to want in-place rolling updates, would replication controller then need to keep an array of pod templates?

For pods modified after creation, such as for auto-sizing or even just field defaulting, in-place updates would require an intelligent 3-way diff and merge, essentially the same as that being implemented for configuration: #1702, #6027.

@bgrant0607
Member

cc @thockin

@bgrant0607
Member

Continuing down the logical path proposed by this PR:

  • Change from randomly generated name suffixes to predictable, densely indexed pod names
  • Merge the service API into the replication controller API

@bgrant0607
Member

And revert the earlier decisions to create different controllers for different use cases, and instead extend replication controller to handle bounded-duration pods, subsuming the job controller and per-node controller.

@bgrant0607
Member

And the 3-way diff will be deemed too complex for replication controller to perform, so the defaulting policy would be changed to one where default values were implicitly inferred (#1502) or otherwise hidden from users (#1178), which would then require custom APIs for overriding user-provided values behind the scenes (e.g., for auto-sizing), as well as making it all the more critical to eliminate setting of defaults in kubectl, since that would then have different semantics than defaulting in the apiserver.

Again, given replicated pets, I can also predict the request for per-replica template overrides, which would have complex interactions with scaling and rolling updates.

@ghost

ghost commented Apr 3, 2015

I acknowledge the conceptual gap between what people know to ask for and
what we're offering them, and I 100% agree that we should try to close the
gap. That said, I don't think that changing Pod into a durable thing is
the way.

The first thing that popped into my head is that UID is no longer U. We'd
need an ID that represents the abstract pod and an ID that represents each
attempt to manifest that abstract pod. Then we have to retool everything
to know the difference and deal with the inevitable fallout of getting the
wrong one (or always treat them as indices into the same data).

Aside from that, the user's intent is already captured in the restartPolicy
field - why add two levels of policy that control the same idea, especially
in a world where we don't actually want users thinking about nodes?

And what of volumes - those are defined as "scoped to pod lifetime"?
Things like shared memory segments are similarly scoped, but those are at
least obviously host-bound.

If we really want an immortal pod, we should build it as a clean layer on
mortal pods - much like the suggested "deployment" object which would build
on top of replication controllers. Could we build a Job object instead,
and avoid some of the pain?

On Thu, Apr 2, 2015 at 10:54 PM, Brian Grant notifications@github.com
wrote:

Google, Facebook, Twitter, AirBnb, and many other companies are using
scheduling systems whose primary interfaces are collection-level APIs:



@smarterclayton
Contributor

These Mortal Pods would be a great name for a band


@bgrant0607
Member

+1 to the band name.

Even on a single node, users need to wrap docker run with a process supervisor, such as systemd. Is a replication controller so different? This is one reason why I wanted to rename ReplicationController to Overseer -- it's really a self-healing mechanism more than anything else. It happens to restart N instances instead of 1.

@smarterclayton
Contributor

I think the usability-related problems wrt single pods are solvable. The "surprise" aspect, I agree, is a problem, but as Brian notes it is very similar to a user going onto their system and running "./apache -C someconfig" and expecting that shell command to survive a restart. Proper explanations, like "a pod is an instance of a process running in the cluster", lead to "oh, I need a process manager", which Overseer or Supervisor would both convey.


@brendandburns
Contributor

I think the key difference here is that:

kubectl run-container ...

which creates a replication controller is inherently more confusing than

kubectl run ...

For example, how do I know what to delete? If I delete the Pod, I'm going to get confused because it will get created again. If I list pods, I'm going to be confused because there are random strings at the end of the name.

@smarterclayton's comparison to a process is misleading, because in this case, it is not the process that has crashed, but rather the manager. A good operating system will deal with the failure of a processor by moving the process onto a different processor, not by terminating the process. I think that we need to provide an equivalent experience to end users, while at the same time telling them they may want to use other interfaces to the system.

Regarding @thockin's point about UID, the UID is still a UID, the pod has just moved host. The behavior is to clear the Host field and let the pod be re-scheduled onto a different machine. Yes, this will have some ugliness in terms of volumes, restart count, and more, but I think that ugliness and lack of purity is way better than the ugliness in the user experience due to forcing them to learn two abstractions just to run one thing. Especially because those experiences are going to crop up only in the 1% case of node failure, rather than the 99% case of experiencing the tooling.

I think that we have to admit that this is an ugly necessity, and one that is a dead end. There is no migration path from permanent pods into controllers. Controllers will always set NodeFailurePolicy=delete. That's OK; when people are ready to consume the high-level abstractions, we will be there for them, just like Monit/Systemd/etc are there for people when they are ready to move from a process running in the background to a process that is started by the system. But that's a conceptual leap in Unix too, and asking someone to start with systemd would be just as much of a non-starter for end users as it is to ask a user to start with a replication controller.

When you get started with Node.js, it says "./node foo.js"; it doesn't say "configure your systemd infrastructure, etc." We need to provide a similar experience to end users.

@brendandburns
Contributor

Also, @bgrant0607, regarding large internet-scale companies that use collections: no one is arguing that controllers aren't the right level for people to use in the end; we are simply arguing that the leap from zero up to controllers is too high.

We have repeatedly gotten the feedback from users that our concept count is too high; we need to create a nice, smooth, incremental path from container -> Pod -> Controller -> Service.

@bgrant0607
Member

I think the comparison to using a process manager is appropriate. Processes can fail due to various external factors.

With respect to run-container: My proposal (#1695) was to decouple object expansion from the verbs, so that create, get, update, delete could be applied to the container abstraction. This could be done as a configuration generator or as a full-blown API layer. A configuration generator would not hide the underlying objects, however.

With respect to random suffixes at the ends of names: Names of the actual docker containers are even uglier, and change every time we restart a container. The PID changes every time a process is restarted. We should just explain the pod names in the same vein.

Now that v1beta3 work and other API proposals have subsided a bit, I'll work on a proposal that is more congruent with the API and system designs. A "rescheduled" pod should be a new object (so, for example, UID would change), and deletion policy (forgiveness, even if a simplified form) needs to be separate from re-creation policy. There's also the problem of where the pod template should come from, since I want to preserve the capability to initialize spec fields after creation.

And, there's also the problem of how to decide when NOT to recreate a pod, such as for pods that should only be restarted on failure. I don't think we should add that capability before we have an actual job controller, which we've so far pushed beyond 1.0.

@bgrant0607
Member

And, there's the issue of network identity for singleton pods, which is implicit in the discussion here. I'd like a model that's not incompatible with other planned features (#260, #1607 (comment)).

@thockin
Member

thockin commented Apr 9, 2015

On Tue, Apr 7, 2015 at 11:07 AM, Brendan Burns notifications@github.com wrote:

Regarding @thockin 's point about UID, the UID is still a UID, the pod has just moved host. The behavior is to clear the Host field and let the pod be re-scheduled onto a different machine. Yes this will have some ugliness in terms of volumes, restart count, and more, but I think that ugliness and lack of purity is way better than the ugliness in the user experience due to forcing them to learn two abstractions just to run one thing. Especially because those experiences are going to crop up only in the 1% case of node failure, rather than the 99% case of experiencing the tooling.

As long as you are not SURE the old pod is dead, the UID is not really
safe to re-use. Even if it is guaranteed, I'm not sure reusing the
UID is correct. This is not, of itself, a reason not to pursue ideas
here.

When you get started with Node.js, it says: "./node foo.js" it doesn't say: "configure your systemd infrastructure, etc", we need to provide a similar experience to end users.

I don't think the analogy holds, nor does your "failed CPU" analogy.
If a CPU starts failing in a hypothetical shared-memory system, and
your app was running on that CPU, it is not safe to just restart your
app on another CPU if you are not sure the old one is gone. If you
want to draw analogies, they really should be comparable. My point
being that there are things in this world that are different and don't
have good analogs in a single-machine universe. This is part of why
we model things the way we do, rather than pretend the cluster is a
single host.

But again, this is not a reason NOT to make things simpler. Here's my
real concern with this proposal: It's too late in the game for such a
fundamental design change. I'm already concerned that we've taken on
a lot of churn of late that has caused some real regressions. We do
not have a stable enough base to accept a change like this right now.
Not as long as we're still pushing for 1.0 in O(months).

If we can build a layer on top, that's simpler. If we can push this
discussion until after 1.0, that's more tenable (though maybe
pointless).

Frankly, I think that if we do this, it will just become the de facto
way of operating. ReplicationController has no active role in this
world.

@davidopp
Member

Interesting discussion. I agree 1000% with the problem statement as elucidated by @alex-mohr and @brendandburns . But I tend to agree with @thockin that at this point, it is more feasible to accomplish this goal by building something on top of what we have rather than so significantly modifying the semantics of Pod. I actually think this could work, because in my simple mind I map what @alex-mohr wants to roughly the equivalent of a Borg Job, yet under the covers Borg Jobs are actually implemented using something roughly equivalent to Pods (they're called attempts) and Replication Controllers (state machines for tasks). I think it's totally fine if Pod and ReplicationController ultimately become objects that we only expect "expert" users to manipulate directly, and something more like a Borg Job becomes the way most users interact with the system.

@bgrant0607
Member

+1 to transparent, composable macro/helper APIs
-1 to opaque wrappers

Pod is definitely intended to be a low-level primitive to be managed by controllers, and replication controllers are intended to be manipulated by deployment components/tools and auto-scalers.

If we were to implement new APIs, a singleton pet API wouldn't be my first choice. It would probably be some kind of deployment controller #1743 and some mechanism for identity assignment #260.

@a-robinson
Contributor

I think the terminology we've chosen may actually be a big part of the problem here. If the objects were called task and job rather than pod and replication controller, I expect they'd be much more approachable to newcomers even if they kept the exact same semantics. Those names are already common and match up with abstractions that most people already understand, whereas pod and replication controller sound like new things that have to be learned. The simpler names would also make it easier to grok that job is the primary level of abstraction that most people should interact with - as things are, we've pushed the pod concept heavily enough that it's not easy to recognize that it's usually better to create replication controllers.

Ignoring the question of whether pods should be re-schedulable, I'd very much be in favor of a job controller (preferably referred to as a job, not a job controller) as initially proposed in #1624, which could then be presented as the core unit of work to newcomers.

@bgrant0607
Member

I agree that Job is preferred over JobController. Note that that's intended only for terminating (bounded-duration) workloads, not for indefinitely serving workloads. We're converging on Deployment for the latter (#1743).

I would have preferred to rename replication controller to something more appropriate #3024, but lost that battle.

Task is used by Borg and its clones, and by ECS, but I personally find it somewhat inappropriate. The term pod was chosen to be suggestive of its meaning, a play on Docker's logo, and concise. I'm not 100% attached to it, but it would be expensive to change at this point.

@davidopp
Member

A few comments on @a-robinson's comments:

  1. I'm not sure we should rename "replication controller" to "job" but I do agree that the term "replication controller" has a couple of issues. First, as you mentioned, it is unfamiliar to people (compared to job, a concept that has been around in HPC for decades, and is used in systems like Borg, Aurora, Tupperware, etc.) Second, the name sounds like it's a component/piece of logic, yet it's actually a REST object and is defined declaratively. People probably expect the concept of the collection (and its definition) to be separate from the concept of the logic that implements the runtime behavior of the collection (I promise not to call it a state machine). But we have only a single term for this (replication controller).

  2. I don't agree about "task" though. ECS does this but I find it confusing because "task" sounds atomic, yet a task (by your definition and theirs) can contain multiple containers. I think it would be fine to say a job is composed of pods. Anyway I think it is too late for us to back out of the "pod" terminology.

  3. I think we haven't really figured out yet what is the right object for users to interact with. It's definitely not pods, for the reasons you mentioned and discussed earlier in this issue. I don't think it's replication controller either, because among other things it's complicated to reason about rolling updates at the level of replication controllers. I think ultimately we will want users to interact with some higher-level object, like maybe the deployment object, which in turn would generate replication controllers (and in turn pods). We could perhaps call the thing you feed as input to the deployment object, and that it manages, a "job." Replication controllers should be a hidden internal detail (much as task and collection state machines are hidden from users in Borg).

@tmrts
Contributor

tmrts commented Apr 21, 2015

ReplicationController could definitely use a friendlier name. But ReplicationController has simpler responsibilities than a Job: it only ensures that a specified number of pods are running. A Job suggests bigger responsibilities, such as having a deadline, goals, and templates (e.g. batch processing). IMHO, renaming ReplicationController to Job would be confusing as well.
