[Termination Handler] Explore graceful termination for EC2 Instances #105

ellistarn · 2020-11-03T18:47:49Z

As described in the Design, termination handlers must be layered independently on top of Karpenter's autoscaler. By design, the node termination handler should have no knowledge of autoscaling behavior or configuration, or even of what triggered the scale down (e.g. manual, preemption, autoscaling).

Potential requirements/solutions include:

Protect instances that are being deleted/scaled down so that PodDisruptionBudgets are respected
Build a Karpenter CRD to model lifecycle hooks?
Use some sort of CloudProvider model to hook into EC2 lifecycle hooks to protect instances.
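
To make the PodDisruptionBudget requirement concrete, here is a minimal sketch of the check a termination handler would need before draining a node. It is an illustration only: `Budget` and `can_drain_at_once` are hypothetical names, and the arithmetic (healthy pods minus the required minimum) is a simplification of what Kubernetes tracks in a budget's allowed-disruptions status. A real drain would go through the Eviction API, which enforces budgets one pod at a time.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Budget:
    """Hypothetical, simplified view of one PodDisruptionBudget."""
    name: str
    healthy: int        # pods covered by the budget that are currently healthy
    min_available: int  # pods that must remain healthy

    @property
    def disruptions_allowed(self) -> int:
        # Never negative: an already-violated budget allows zero disruptions.
        return max(0, self.healthy - self.min_available)


def can_drain_at_once(pod_budgets: list[str], budgets: dict[str, Budget]) -> bool:
    """Return True if every pod on the node could be evicted simultaneously.

    pod_budgets holds, for each pod on the node, the name of the budget
    that covers it (assumption: each pod is covered by exactly one budget).
    """
    needed = Counter(pod_budgets)
    return all(
        budgets[name].disruptions_allowed >= count
        for name, count in needed.items()
    )


if __name__ == "__main__":
    budgets = {"web": Budget(name="web", healthy=3, min_available=2)}
    print(can_drain_at_once(["web"], budgets))         # one pod to evict
    print(can_drain_at_once(["web", "web"], budgets))  # two pods to evict
```

When the check fails, the instance has to be protected from termination until pods can be evicted gradually, which is the gap the lifecycle-hook ideas in this list are meant to fill.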

ellistarn · 2020-11-03T18:48:18Z

This may be a complete solution, with no work on our side: https://github.com/aws/aws-node-termination-handler

ellistarn · 2020-11-03T18:48:33Z

Also worth investigating this guy: https://github.com/pusher/k8s-spot-termination-handler

bwagner5 · 2020-11-03T20:58:56Z

I think the aws-node-termination-handler would integrate with no additional work required for Karpenter. NTH does require quite a bit of customer setup (creating the lifecycle hooks, EventBridge rules, and SQS queue), but after the initial setup, it can respond to a lot of events very easily.
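
For a sense of what the EventBridge part of that setup covers, here is a rough sketch of the event patterns involved and a toy matcher. The `source` and `detail-type` strings are the ones AWS publishes for these events; the rule labels and the `matching_rules` helper are hypothetical, and the matcher only handles top-level fields, so it is not a full implementation of EventBridge pattern matching or of NTH's actual configuration.

```python
# Event patterns keyed by a hypothetical rule label. Each pattern lists the
# allowed values per top-level field, in the style of EventBridge patterns.
RULES = {
    "spot-interruption": {
        "source": ["aws.ec2"],
        "detail-type": ["EC2 Spot Instance Interruption Warning"],
    },
    "rebalance": {
        "source": ["aws.ec2"],
        "detail-type": ["EC2 Instance Rebalance Recommendation"],
    },
    "state-change": {
        "source": ["aws.ec2"],
        "detail-type": ["EC2 Instance State-change Notification"],
    },
    "asg-terminate": {
        "source": ["aws.autoscaling"],
        "detail-type": ["EC2 Instance-terminate Lifecycle Action"],
    },
}


def matching_rules(event: dict) -> list[str]:
    """Return the labels of every rule whose pattern the event satisfies."""
    return [
        label
        for label, pattern in RULES.items()
        if all(event.get(field) in allowed for field, allowed in pattern.items())
    ]


if __name__ == "__main__":
    event = {
        "source": "aws.ec2",
        "detail-type": "EC2 Spot Instance Interruption Warning",
    }
    print(matching_rules(event))
```

Each matched rule would forward its events to the SQS queue that the handler polls.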

Since Karpenter is already managing node groups, I'm curious if some of that setup (at least the lifecycle hooks) could be abstracted away, so that NTH could just plug in to handle them or receive them from Karpenter. The plan for NTH to ease the setup burden is to integrate with ACK, but it's still early and most of the resources we need are not available yet.

Knowledge of the actual event does come in handy when processing events. NTH takes slightly different actions depending on the event. For example, ASG lifecycle terminations have different post-draining actions than an EC2 Status Change event, since the latter has no lifecycle hook that needs to be completed. Also, EC2 scheduled maintenance event reboots are handled differently in the NTH IMDS processor, since the node can be labeled and automatically brought back into service after the reboot.
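
The event-dependent behavior described above can be sketched roughly as a dispatch on event kind. This is only an illustration of the idea, not NTH's code: `EventKind`, `post_drain_actions`, and the action names are all hypothetical.

```python
from enum import Enum, auto


class EventKind(Enum):
    """Hypothetical labels for the three event types discussed above."""
    ASG_LIFECYCLE_TERMINATION = auto()
    EC2_STATUS_CHANGE = auto()
    SCHEDULED_MAINTENANCE_REBOOT = auto()


def post_drain_actions(kind: EventKind) -> list[str]:
    """Return the follow-up steps to run once the node has been drained."""
    if kind is EventKind.ASG_LIFECYCLE_TERMINATION:
        # The ASG waits on its lifecycle hook, so it must be completed.
        return ["complete_lifecycle_hook"]
    if kind is EventKind.EC2_STATUS_CHANGE:
        # No lifecycle hook is involved, so nothing follows the drain.
        return []
    if kind is EventKind.SCHEDULED_MAINTENANCE_REBOOT:
        # The node survives the reboot: label it, then return it to service.
        return ["label_node_for_reboot", "uncordon_after_reboot"]
    raise ValueError(f"unhandled event kind: {kind}")


if __name__ == "__main__":
    for kind in EventKind:
        print(kind.name, post_drain_actions(kind))
```

A handler that knew nothing about the triggering event could only run the drain step itself, which is why the event type matters here.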

ellistarn · 2021-06-04T00:06:39Z

My current stance on this is that we should recommend that users rely on https://github.com/aws/aws-node-termination-handler for at least v0.4. We should explore building a Karpenter-native interruption handler, but this use case (rebalance) should be scoped into the defragmentation design.

ellistarn added the hackathon label Nov 3, 2020

njtran mentioned this issue May 17, 2021

Graceful Termination Design Doc #405

Merged

bwagner5 closed this as completed Sep 23, 2021

bwagner5 mentioned this issue Sep 23, 2021

Native Support for Spot Termination #702

Closed

gfcroft pushed a commit to gfcroft/karpenter-provider-aws that referenced this issue Nov 25, 2023

test: ensure any waiters are reset between tests (aws#105)

1f047d5
