Default node termination policy is inappropriate for StatefulSets #177

akshayks · 2019-10-18T18:22:33Z

We recently introduced escalator into our platform as a means to auto scale worker nodes in EKS. It has worked great for stateless workloads, realized as Kubernetes Deployments. However, when attempting to use escalator for ASGs with StatefulSets, we observed behavior that was, at least initially, unexpected.

When scaling down a StatefulSet, Kubernetes terminates pods in reverse order of the pod ordinal / index. As an example, for a StatefulSet with n replicas, pod n-1 is terminated first followed by n-2 and so on. Given that we run a single instance of StatefulSet pods per EKS worker node, what we see is that Kubernetes scales down the newest pod (pod with highest index) while escalator terminates the oldest node first. This results in the following sequence of events:

Kubernetes terminates pod n-1 running on node n-1 (one of the newer nodes).
escalator determines a scale down action is required.
escalator terminates node 0 (oldest node). This results in pod 0 being evicted and ending up in PENDING state.
Kubernetes reschedules pod 0 on node n-1.

As you can imagine, this sequence is disruptive as a stateful workload is being forcefully relocated when terminating the newly vacated node might have been sufficient. Reading through the escalator documentation seemed to suggest this was indeed expected. However, these two statements in particular seemed contradictory to each other:

https://github.com/atlassian/escalator/blob/master/docs/scale-process.md

If hard_delete_grace_period is reached, the node will be terminated regardless, even if there are sacred pods running on it.

https://github.com/atlassian/escalator/blob/master/docs/configuration/nodegroup.md

Remove any nodes that have already been tainted and have exceed the grace period and are considered empty.

In summary, I have the following comments / questions:

Do you believe the docs need updating to modify the second statement above? It appears escalator does not exclude non-empty nodes when scaling down while the statement seems to suggest it terminates nodes only if empty.
To work around the issue, we modified escalator to only consider empty nodes during scale down. Obviously, this violates assumptions in the rest of the code base and causes unit test failures. What would the most idiomatic fix to our problem? I suppose Different node selection methods for termination #105 would help but I don't see any progress on that ticket at this time.

Thanks in advance for any help on this topic and for an amazing product.

The text was updated successfully, but these errors were encountered:

awprice · 2019-10-18T21:07:19Z

@akshayks

Thanks for using Escalator!

It sounds like Escalator's current method for selecting nodes to terminate for scale down (oldest first) isn't the most ideal for your use case. #105 definitely sounds like it could help, and a least utilised method for selecting nodes to terminate is probably the best option. We were a little hesitant to implement this as it can lead to situations where some nodes stay around for ever and aren't ever terminated because they are always being utilised.

I agree, the wording on that second statement is a little vague, this is probably better:

Remove any nodes that have already been tainted and have exceed the grace period and/or considered empty.

Have you considered https://github.com/kubernetes/autoscaler/tree/master/cluster-autoscaler ? Escalator is very narrowly focused towards job based workloads that will start and always stop, and wasn't really designed with deployments/statefulsets in mind as you've noticed.

akshayks · 2019-10-21T20:10:31Z

@awprice Appreciate your quick response.

As you can imagine, the cluster autoscaler was the first option we explored. However, there seem to be multiple caveats w.r.t auto scaling groups spanning multiple availability zones that reduced its appeal. Escalator was highly recommended as an alternative which lead us to evaluate and eventually adopt it for our use case.

Is #105 on your roadmap at this time? If not, I suppose we could look into contributing a PR with some guidance from the escalator team. This is rather critical for us and any visibility/assistance you can provide on this topic will be very useful to us.

Jacobious52 · 2019-10-22T02:54:42Z

Hi @akshayks

Yes unfortunately we also understand the caveats of cluster autoscaler. If you're having difficulties with StatefulSets and Volumes across multiple regions with cluster autoscaler, one thing you can do is split your nodegroup across separate ASGs (1 for each AZ). Here's an example of that..

However if Escalator is the way to go for your use case, we'd certainly be happy for you to contribute! We can provide some visibility and guidelines if you want to implement #105

Please make sure all the changes are backwards compatible with functionality and configuration and don't change the default behaviour. For example if you add a configuration option to choose between oldest and least_used, please make oldestthe default when the key isn't given.

Also as long as you contribute to the unit tests with the changed behaviour everything should be good.

sort.go is where the current selection ordering comes from. It implements the sort.Interface which is used in taintOldestN when selecting the nodes to taint first and untaintNewestN for reversing that decision when needing to scale up by untainting tainted nodes first.
You could probably start by implementing your own sort.Interface for both tainting and untainting which could then be passed to these functions. Finally implementing a simple method in the config of choosing which sorter to use.

When using a least utilized method you'll probably want to use Allocatable instead of Capacity for the node usage, and may want to think about using least_used as a percentage of the nodes usage rather than raw values (cpu,mem). One reason for this is escalator supports Launch-Templates with multiple instance types, so 1000m CPU on one node may be less used in reality than say one with 100m CPU.

akshayks · 2019-10-23T22:02:50Z

Hi @Jacobious52

We did consider maintaining one ASG per AZ but the increased complexity that would introduce (3x the number of ASGs per cluster across multiple clusters) drove us towards alternate solutions.

Thank you for the detailed guide on implementing #105. This definitely gives us a good idea of the effort needed. Quick question though - If we opted for something simpler and instead introduced a configurable parameter per node group that determined whether or not non-empty nodes should be excluded during scale down, would that be acceptable? Do you see this violating any assumptions in escalator? I suppose a "least utilized" strategy could obviate the need for such a configuration parameter as it would naturally de-prioritize empty nodes. However, I do see the value in guaranteeing the sanctity of non-empty nodes in use cases such as ours. What are your thoughts?

Jacobious52 · 2019-10-24T02:10:20Z

I see. Would you be able to briefly explain how Escalator solves that problem? From what I believe Escalator does not do anything smart with AZ balancing compared to cluster autoscaler, it just asks the cloud provider for more capacity and relies on the cloud provider to balance across AZs. Is there something we've missed that helps in your case compared to cluster autoscaler? FWIW, we mainly use Escalator for job based workload ASGs, and run 3x ASGs for stateful ASGs.

For your solution, could this alternatively be solved by adjusting the configuration to suit your requirements for not deleting nodes with pods running? This could possibly done by setting hard_delete_grace_period to an extremely large number, so the nodes are never force deleted until they are empty? If this method for works for you, we'd be willing to accept an option to have for example, 0 as "infinite", rather than a separate setting?

akshayks · 2019-11-05T19:27:44Z

@Jacobious52 Sorry for the radio silence. Your proposal sounds reasonable and I believe we should be able to contribute a PR for it.

As I understand it, both escalator and the cluster auto scaler make API calls to EC2 for scaling up or scaling down ASGs. The only difference is that escalator does not require maintaining a separate ASG per AZ. The argument simply is, everything else being the same (or similar), it is preferable to use an auto scaler that results in less operational complexity. The workloads we are looking to autoscale do not require persistent storage and therefore are not sensitive to AZ placement. Issues like instance unavailability, AZ failure etc are handled automatically by EC2 (Scaling up automatically spawns instances in AZs where capacity is available). Let me know if this makes sense.

awprice added help wanted Extra attention is needed question Further information is requested labels Oct 18, 2019

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Default node termination policy is inappropriate for StatefulSets #177

Default node termination policy is inappropriate for StatefulSets #177

akshayks commented Oct 18, 2019

awprice commented Oct 18, 2019

akshayks commented Oct 21, 2019

Jacobious52 commented Oct 22, 2019 •

edited

Loading

akshayks commented Oct 23, 2019

Jacobious52 commented Oct 24, 2019 •

edited

Loading

akshayks commented Nov 5, 2019

Default node termination policy is inappropriate for StatefulSets #177

Default node termination policy is inappropriate for StatefulSets #177

Comments

akshayks commented Oct 18, 2019

awprice commented Oct 18, 2019

akshayks commented Oct 21, 2019

Jacobious52 commented Oct 22, 2019 • edited Loading

akshayks commented Oct 23, 2019

Jacobious52 commented Oct 24, 2019 • edited Loading

akshayks commented Nov 5, 2019

Jacobious52 commented Oct 22, 2019 •

edited

Loading

Jacobious52 commented Oct 24, 2019 •

edited

Loading