Add/lifecycle heartbeat #1116

Merged
merged 31 commits into from
Jan 29, 2025
d991814
add lifecycle heartbeat
Dec 31, 2024
0e4b686
Lifecycle heartbeat unit test
Dec 31, 2024
1fbd7cb
Refactor heartbeat logging statements
Jan 2, 2025
d0f1ef4
Heartbeat e2e test
Jan 2, 2025
df0696f
Merge branch 'aws:main' into add/lifecycle-heartbeat
hyeong01 Jan 2, 2025
d7f8e07
Remove error handling for using heartbeat and imds together
Jan 6, 2025
a6cfd89
add e2e test for lifecycle heartbeat
Jan 7, 2025
64e9cff
Add check heartbeat timeout and compare to heartbeat interval
Jan 7, 2025
d3047a0
Add error handling for using heartbeat and imds together
Jan 7, 2025
559adc3
fix config error message
Jan 7, 2025
7012bab
update error message for heartbeat config
Jan 7, 2025
bc79eb7
Fix heartbeat flag explanation
Jan 8, 2025
75400a9
Update readme for new heartbeat feature
Jan 8, 2025
bbddcfa
Fix readme for heartbeat section
Jan 8, 2025
029fdf7
Update readme on the concurrency of heartbeat
Jan 10, 2025
56b3f55
fix: stop heartbeat when target is invalid
Jan 10, 2025
7221ed2
Added heartbeat test for handling invalid lifecycle action
Jan 10, 2025
4bcb916
incorporated unsupoorted error types for unit testing
Jan 10, 2025
4ff40d9
fix unit-test: reset heartbeatCallCount each test
Jan 10, 2025
fe7fcc1
Merge branch 'aws:main' into add/lifecycle-heartbeat
hyeong01 Jan 15, 2025
265828d
use helper function to reduce repetitive code in heartbeat unit test
Jan 17, 2025
044fc3a
Update readme. Moved heartbeat under Queue Processor
Jan 22, 2025
2732775
Fix config.go for better readability and check until < interval
Jan 22, 2025
0492976
Update heartbeat to have better logging
Jan 22, 2025
1631bb6
Update unit test to cover whole process of heartbeat start and closure
Jan 22, 2025
b41751d
Update heartbeat e2e test. Auto-value calculations for future modific…
Jan 22, 2025
9e3fe77
Add inline comment for heartbeatUntil default behavior
Jan 22, 2025
dbdeec1
Fixed e2e variables to have double quotes
Jan 22, 2025
80b88a4
fix readme for heartbeat
Jan 23, 2025
9c54964
Added new flags in config test
Jan 23, 2025
56ea41d
Fixed typo in heartbeat e2e test
Jan 23, 2025
73 changes: 71 additions & 2 deletions README.md
Expand Up @@ -81,6 +81,75 @@ When using the ASG Lifecycle Hooks, ASG first sends the lifecycle action notific
#### Queue Processor with Instance State Change Events
When using the EC2 Console or EC2 API to terminate the instance, a state-change notification is sent and the instance termination is started. EC2 does not wait for a "continue" signal before beginning to terminate the instance. Terminating an EC2 instance should trigger a graceful operating system shutdown, which sends a SIGTERM to the kubelet; the kubelet in turn starts shutting down pods by propagating that SIGTERM to the containers on the node. If the containers do not shut down within the kubelet's `podTerminationGracePeriod` (k8s default is 30s), the kubelet sends a SIGKILL to forcefully terminate them. Setting `podTerminationGracePeriod` to a maximum of 90 seconds (probably a bit less than that) delays the termination of pods, which helps with graceful shutdown.

#### Issuing Lifecycle Heartbeats

You can set NTH to send heartbeats to ASG in Queue Processor mode. This allows for a much longer termination grace period (up to 48 hours) than the maximum heartbeat timeout of two hours. The feature is useful when pods take a long time to drain, or when you need a shorter heartbeat timeout combined with a longer grace period.

Contributor

I would add a line that explains when this feature would be useful: e.g. When a customer has pods that have long-running drain tasks.

##### How it works

- When NTH receives an ASG lifecycle termination event, it starts sending heartbeats to ASG to renew the heartbeat timeout associated with the ASG's termination lifecycle hook.
- The heartbeat timeout acts as a timer that starts when the termination event begins.
- Until the timeout reaches zero, the termination process is held at the `Terminating:Wait` stage.
- By issuing heartbeats, NTH can extend the graceful termination period up to 48 hours, limited by the global timeout.
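
The loop described above can be illustrated with a minimal Go sketch. It is not the code from this PR: `runHeartbeat`, `simulate`, and the `send` callback are hypothetical names, and `send` stands in for whatever call renews the lifecycle hook's heartbeat timeout. The sketch assumes heartbeats stop as soon as a send fails (for example, when the lifecycle action is no longer valid) or when the `Heartbeat Until` window elapses.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// runHeartbeat calls send once per interval until the `until` window has
// elapsed or send returns an error. It returns how many times send was called.
// (Hypothetical helper for illustration; not the function used by NTH.)
func runHeartbeat(interval, until time.Duration, send func() error) int {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	deadline := time.NewTimer(until)
	defer deadline.Stop()

	calls := 0
	for {
		select {
		case <-ticker.C:
			calls++
			if err := send(); err != nil {
				// Assumption: a failed heartbeat ends the loop.
				return calls
			}
		case <-deadline.C:
			// The heartbeat window is over; stop renewing the timeout.
			return calls
		}
	}
}

// simulate runs the loop with millisecond-scale values so the demo is quick.
func simulate(intervalMs, untilMs int, fail bool) int {
	send := func() error {
		if fail {
			return errors.New("lifecycle action is no longer valid")
		}
		return nil
	}
	return runHeartbeat(
		time.Duration(intervalMs)*time.Millisecond,
		time.Duration(untilMs)*time.Millisecond,
		send,
	)
}

func main() {
	fmt.Println("heartbeats sent:", simulate(20, 70, false))
	fmt.Println("calls before stopping on error:", simulate(20, 200, true))
}
```

In a real deployment the interval and window are in seconds (30 to 3600 and 60 to 172800 respectively); the millisecond values here only keep the demonstration short.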

##### How to use

- Configure a termination lifecycle hook on ASG (required). Set the heartbeat timeout value to be longer than the `Heartbeat Interval`. Each heartbeat signal resets this timeout, extending the duration that an instance remains in the `Terminating:Wait` state. Without this lifecycle hook, the instance will terminate immediately when a termination event occurs.
- Configure `Heartbeat Interval` (required) and `Heartbeat Until` (optional). NTH operates normally without heartbeats if neither value is set. If only the interval is specified, `Heartbeat Until` defaults to 172800 seconds (48 hours) and heartbeats will be sent. `Heartbeat Until` must be provided together with a valid `Heartbeat Interval`; otherwise NTH will fail to start. Any invalid value (wrong type or out of range) will also prevent NTH from starting.
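
As a rough illustration of these startup rules, the following self-contained Go sketch mirrors the checks described above. The function name `resolveHeartbeat` and the constant names are hypothetical; the `-1` sentinel for "not provided", the ranges, and the 172800-second default follow the values in this PR's `config.go`.

```go
package main

import (
	"errors"
	"fmt"
)

const (
	unset        = -1     // sentinel meaning "value not provided"
	defaultUntil = 172800 // 48 hours, used when only the interval is set
)

// resolveHeartbeat validates the two settings and applies the default for
// until. It returns (unset, unset, nil) when heartbeats are simply disabled.
// (Hypothetical helper for illustration only.)
func resolveHeartbeat(queueMode bool, interval, until int) (int, int, error) {
	if !queueMode && (interval != unset || until != unset) {
		return 0, 0, errors.New("heartbeat is only supported in Queue Processor mode")
	}
	if interval == unset && until == unset {
		return unset, unset, nil
	}
	if interval == unset {
		return 0, 0, errors.New("heartbeat-interval is required when heartbeat-until is set")
	}
	if interval < 30 || interval > 3600 {
		return 0, 0, fmt.Errorf("heartbeat-interval %d is outside 30-3600 seconds", interval)
	}
	if until == unset {
		until = defaultUntil
	}
	if until < 60 || until > 172800 {
		return 0, 0, fmt.Errorf("heartbeat-until %d is outside 60-172800 seconds", until)
	}
	if interval > until {
		return 0, 0, errors.New("heartbeat-interval must be less than or equal to heartbeat-until")
	}
	return interval, until, nil
}

func main() {
	cases := []struct {
		queueMode       bool
		interval, until int
	}{
		{true, 1000, unset},  // only the interval: until falls back to the default
		{true, unset, 4500},  // until without interval: rejected
		{false, 1000, 4500},  // IMDS mode with heartbeat flags: rejected
		{true, unset, unset}, // nothing set: heartbeats disabled
	}
	for _, c := range cases {
		interval, until, err := resolveHeartbeat(c.queueMode, c.interval, c.until)
		fmt.Println(interval, until, err)
	}
}
```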

##### Configurations
###### `Heartbeat Interval` (Required)
- Time period between consecutive heartbeat signals (in seconds)
- Specifying this value enables heartbeats
- Range: 30 to 3600 seconds (30 seconds to 1 hour)
- Helm value (`values.yaml` or `--set`): `heartbeatInterval`
- CLI flag: `--heartbeat-interval`
- Default value: none (heartbeats are disabled unless this value is set)

###### `Heartbeat Until` (Optional)
- Duration over which heartbeat signals are sent (in seconds)
- Must be provided together with a valid `Heartbeat Interval`
- Range: 60 to 172800 seconds (1 minute to 48 hours)
- Helm value (`values.yaml` or `--set`): `heartbeatUntil`
- CLI flag: `--heartbeat-until`
- Default value: 172800 (48 hours)

###### Example Case

- `Heartbeat Interval`: 1000 seconds
- `Heartbeat Until`: 4500 seconds
- `Heartbeat Timeout`: 3000 seconds

| Time (s) | Event | Heartbeat Timeout (HT) | Heartbeat Until (HU) | Action |
|----------|-------------|------------------|----------------------|--------|
| 0 | Start | 3000 | 4500 | Termination Event Received |
| 1000 | HB1 Issued | 2000 -> 3000 | 3500 | Send Heartbeat |
| 2000 | HB2 Issued | 2000 -> 3000 | 2500 | Send Heartbeat |
| 3000 | HB3 Issued | 2000 -> 3000 | 1500 | Send Heartbeat |
| 4000 | HB4 Issued | 2000 -> 3000 | 500 | Send Heartbeat |
| 4500 | HB Expires | 2500 | 0 | Stop Heartbeats |
| 7000 | Termination | - | - | Instance Terminates |

Note: The instance can terminate earlier if its pods finish draining and are ready for termination.
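
The arithmetic in the table can be reproduced with a short Go sketch. The helper names are hypothetical, and the model is simplified: it assumes a heartbeat is sent at every multiple of the interval up to and including `Heartbeat Until`, that pods never finish draining early, and it ignores the global timeout.

```go
package main

import "fmt"

// heartbeatTimes lists the times (seconds after the termination event) at
// which heartbeats are sent under the simplified model described above.
func heartbeatTimes(interval, until int) []int {
	var times []int
	for t := interval; t <= until; t += interval {
		times = append(times, t)
	}
	return times
}

// expectedTermination returns when the instance leaves Terminating:Wait:
// one heartbeat timeout after the last heartbeat. If the first heartbeat
// would not arrive before the timeout expires, no renewal ever happens.
func expectedTermination(interval, until, timeout int) int {
	times := heartbeatTimes(interval, until)
	if len(times) == 0 || interval >= timeout {
		return timeout
	}
	return times[len(times)-1] + timeout
}

func main() {
	// Values from the example case: interval 1000, until 4500, timeout 3000.
	fmt.Println(heartbeatTimes(1000, 4500))            // [1000 2000 3000 4000]
	fmt.Println(expectedTermination(1000, 4500, 3000)) // 7000
}
```

The last heartbeat lands at 4000 seconds, so the 3000-second timeout expires at 7000 seconds, matching the final row of the table.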

##### Example Helm Command

```sh
helm upgrade --install aws-node-termination-handler \
  --namespace kube-system \
  --set enableSqsTerminationDraining=true \
  --set heartbeatInterval=1000 \
  --set heartbeatUntil=4500 \
  # other inputs...
```

##### Important Notes

- Be aware of the global timeout. Instances cannot remain in a wait state indefinitely. The global timeout is 48 hours or 100 times the heartbeat timeout, whichever is smaller. This is the maximum amount of time that you can keep an instance in the `Terminating:Wait` state.
- Lifecycle heartbeats are only supported in Queue Processor mode. Helm prevents setting `enableSqsTerminationDraining=false` together with heartbeat flags. Directly editing the deployment settings to bypass this will cause NTH to fail.
- The heartbeat interval should be sufficiently shorter than the heartbeat timeout. There is a time gap between instance startup and NTH initialization. If the interval is equal to or only slightly smaller than the timeout, the heartbeat timeout expires before the first heartbeat is issued. Provide adequate buffer time for NTH to complete initialization.
- Issuing heartbeats is part of the termination process. The number of instances whose termination NTH can handle concurrently is limited by the number of workers. This means that heartbeats can be issued simultaneously for at most the number of instances specified by the `workers` flag.
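
To make the global timeout rule concrete, here is a small Go sketch of the calculation stated in the first note. The function and constant names are hypothetical, and all values are in seconds.

```go
package main

import "fmt"

// maxGlobalTimeoutSec is the 48-hour ceiling on time spent in a wait state.
const maxGlobalTimeoutSec = 48 * 60 * 60

// globalTimeout returns the smaller of 48 hours and 100 times the lifecycle
// hook's heartbeat timeout.
func globalTimeout(heartbeatTimeoutSec int) int {
	limit := 100 * heartbeatTimeoutSec
	if limit > maxGlobalTimeoutSec {
		return maxGlobalTimeoutSec
	}
	return limit
}

func main() {
	fmt.Println(globalTimeout(3000)) // 172800: the 48-hour ceiling applies
	fmt.Println(globalTimeout(300))  // 30000: 100 times the heartbeat timeout
}
```

With a 300-second heartbeat timeout, for example, a `Heartbeat Until` value above 30000 seconds would not keep the instance waiting any longer, because the global timeout is reached first.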

### Which one should I use?
| Feature | IMDS Processor | Queue Processor |
| :-------------------------------------------: | :------------: | :-------------: |
Expand All @@ -91,6 +160,7 @@ When using the EC2 Console or EC2 API to terminate the instance, a state-change
| ASG Termination Lifecycle State Change | ✅ | ❌ |
| AZ Rebalance Recommendation | ❌ | ✅ |
| Instance State Change Events | ❌ | ✅ |
| Issue Lifecycle Heartbeats | ❌ | ✅ |

### Kubernetes Compatibility

Expand Down Expand Up @@ -626,5 +696,4 @@ In IMDS mode, metrics can be collected as follows:
Contributions are welcome! Please read our [guidelines](https://github.com/aws/aws-node-termination-handler/blob/main/CONTRIBUTING.md) and our [Code of Conduct](https://github.com/aws/aws-node-termination-handler/blob/main/CODE_OF_CONDUCT.md)

## License
This project is licensed under the Apache-2.0 License.

This project is licensed under the Apache-2.0 License.
Original file line number Diff line number Diff line change
Expand Up @@ -168,6 +168,10 @@ spec:
value: {{ .Values.deleteSqsMsgIfNodeNotFound | quote }}
- name: WORKERS
value: {{ .Values.workers | quote }}
- name: HEARTBEAT_INTERVAL
value: {{ .Values.heartbeatInterval | quote }}
- name: HEARTBEAT_UNTIL
value: {{ .Values.heartbeatUntil | quote }}
{{- with .Values.extraEnv }}
{{- toYaml . | nindent 12 }}
{{- end }}
36 changes: 35 additions & 1 deletion pkg/config/config.go
Expand Up @@ -112,6 +112,9 @@ const (
queueURLConfigKey = "QUEUE_URL"
completeLifecycleActionDelaySecondsKey = "COMPLETE_LIFECYCLE_ACTION_DELAY_SECONDS"
deleteSqsMsgIfNodeNotFoundKey = "DELETE_SQS_MSG_IF_NODE_NOT_FOUND"
// heartbeat
heartbeatIntervalKey = "HEARTBEAT_INTERVAL"
heartbeatUntilKey = "HEARTBEAT_UNTIL"
)

// Config arguments set via CLI, environment variables, or defaults
Expand Down Expand Up @@ -166,6 +169,8 @@ type Config struct {
CompleteLifecycleActionDelaySeconds int
DeleteSqsMsgIfNodeNotFound bool
UseAPIServerCacheToListPods bool
HeartbeatInterval int
Contributor

We need to have some coverage around these newly added configs in config-test.go file...

HeartbeatUntil int
}

// ParseCliArgs parses cli arguments and uses environment variables as fallback values
Expand Down Expand Up @@ -230,6 +235,8 @@ func ParseCliArgs() (config Config, err error) {
flag.IntVar(&config.CompleteLifecycleActionDelaySeconds, "complete-lifecycle-action-delay-seconds", getIntEnv(completeLifecycleActionDelaySecondsKey, -1), "Delay completing the Autoscaling lifecycle action after a node has been drained.")
flag.BoolVar(&config.DeleteSqsMsgIfNodeNotFound, "delete-sqs-msg-if-node-not-found", getBoolEnv(deleteSqsMsgIfNodeNotFoundKey, false), "If true, delete SQS Messages from the SQS Queue if the targeted node(s) are not found.")
flag.BoolVar(&config.UseAPIServerCacheToListPods, "use-apiserver-cache", getBoolEnv(useAPIServerCache, false), "If true, leverage the k8s apiserver's index on pod's spec.nodeName to list pods on a node, instead of doing an etcd quorum read.")
flag.IntVar(&config.HeartbeatInterval, "heartbeat-interval", getIntEnv(heartbeatIntervalKey, -1), "The time period in seconds between consecutive heartbeat signals. Valid range: 30-3600 seconds (30 seconds to 1 hour).")
flag.IntVar(&config.HeartbeatUntil, "heartbeat-until", getIntEnv(heartbeatUntilKey, -1), "The duration in seconds over which heartbeat signals are sent. Valid range: 60-172800 seconds (1 minute to 48 hours).")
flag.Parse()

if isConfigProvided("pod-termination-grace-period", podTerminationGracePeriodConfigKey) && isConfigProvided("grace-period", gracePeriodConfigKey) {
Expand Down Expand Up @@ -274,6 +281,27 @@ func ParseCliArgs() (config Config, err error) {
panic("You must provide a node-name to the CLI or NODE_NAME environment variable.")
}

	// heartbeat value boundary and compatibility check
	if !config.EnableSQSTerminationDraining && (config.HeartbeatInterval != -1 || config.HeartbeatUntil != -1) {
		return config, fmt.Errorf("currently using IMDS mode. Heartbeat is only supported for Queue Processor mode")
	}
	if config.HeartbeatInterval != -1 && (config.HeartbeatInterval < 30 || config.HeartbeatInterval > 3600) {
		return config, fmt.Errorf("invalid heartbeat-interval passed: %d Should be between 30 and 3600 seconds", config.HeartbeatInterval)
	}
	if config.HeartbeatUntil != -1 && (config.HeartbeatUntil < 60 || config.HeartbeatUntil > 172800) {
		return config, fmt.Errorf("invalid heartbeat-until passed: %d Should be between 60 and 172800 seconds", config.HeartbeatUntil)
	}
	if config.HeartbeatInterval == -1 && config.HeartbeatUntil != -1 {
		return config, fmt.Errorf("invalid heartbeat configuration: heartbeat-interval is required when heartbeat-until is set")
	}
	if config.HeartbeatInterval != -1 && config.HeartbeatUntil == -1 {
		config.HeartbeatUntil = 172800
		log.Info().Msgf("Since heartbeat-until is not set, defaulting to %d seconds", config.HeartbeatUntil)
	}
	if config.HeartbeatInterval != -1 && config.HeartbeatUntil != -1 && config.HeartbeatInterval > config.HeartbeatUntil {
		return config, fmt.Errorf("invalid heartbeat configuration: heartbeat-interval should be less than or equal to heartbeat-until")
	}

// client-go expects these to be set in env vars
os.Setenv(kubernetesServiceHostConfigKey, config.KubernetesServiceHost)
os.Setenv(kubernetesServicePortConfigKey, config.KubernetesServicePort)
Expand Down Expand Up @@ -332,6 +360,8 @@ func (c Config) PrintJsonConfigArgs() {
Str("ManagedTag", c.ManagedTag).
Bool("use_provider_id", c.UseProviderId).
Bool("use_apiserver_cache", c.UseAPIServerCacheToListPods).
Int("heartbeat_interval", c.HeartbeatInterval).
Int("heartbeat_until", c.HeartbeatUntil).
Msg("aws-node-termination-handler arguments")
}

Expand Down Expand Up @@ -383,7 +413,9 @@ func (c Config) PrintHumanConfigArgs() {
"\tmanaged-tag: %s,\n"+
"\tuse-provider-id: %t,\n"+
"\taws-endpoint: %s,\n"+
"\tuse-apiserver-cache: %t,\n",
"\tuse-apiserver-cache: %t,\n"+
"\theartbeat-interval: %d,\n"+
"\theartbeat-until: %d\n",
c.DryRun,
c.NodeName,
c.PodName,
Expand Down Expand Up @@ -424,6 +456,8 @@ func (c Config) PrintHumanConfigArgs() {
c.UseProviderId,
c.AWSEndpoint,
c.UseAPIServerCacheToListPods,
c.HeartbeatInterval,
c.HeartbeatUntil,
)
}

23 changes: 19 additions & 4 deletions pkg/config/config_test.go
Expand Up @@ -37,7 +37,7 @@ func TestParseCliArgsEnvSuccess(t *testing.T) {
t.Setenv("ENABLE_SCHEDULED_EVENT_DRAINING", "true")
t.Setenv("ENABLE_SPOT_INTERRUPTION_DRAINING", "false")
t.Setenv("ENABLE_ASG_LIFECYCLE_DRAINING", "false")
t.Setenv("ENABLE_SQS_TERMINATION_DRAINING", "false")
t.Setenv("ENABLE_SQS_TERMINATION_DRAINING", "true")
t.Setenv("ENABLE_REBALANCE_MONITORING", "true")
t.Setenv("ENABLE_REBALANCE_DRAINING", "true")
t.Setenv("GRACE_PERIOD", "12345")
Expand All @@ -54,6 +54,8 @@ func TestParseCliArgsEnvSuccess(t *testing.T) {
t.Setenv("METADATA_TRIES", "100")
t.Setenv("CORDON_ONLY", "false")
t.Setenv("USE_APISERVER_CACHE", "true")
t.Setenv("HEARTBEAT_INTERVAL", "30")
t.Setenv("HEARTBEAT_UNTIL", "60")
nthConfig, err := config.ParseCliArgs()
h.Ok(t, err)

Expand All @@ -64,7 +66,7 @@ func TestParseCliArgsEnvSuccess(t *testing.T) {
h.Equals(t, true, nthConfig.EnableScheduledEventDraining)
h.Equals(t, false, nthConfig.EnableSpotInterruptionDraining)
h.Equals(t, false, nthConfig.EnableASGLifecycleDraining)
h.Equals(t, false, nthConfig.EnableSQSTerminationDraining)
h.Equals(t, true, nthConfig.EnableSQSTerminationDraining)
h.Equals(t, true, nthConfig.EnableRebalanceMonitoring)
h.Equals(t, true, nthConfig.EnableRebalanceDraining)
h.Equals(t, false, nthConfig.IgnoreDaemonSets)
Expand All @@ -80,6 +82,8 @@ func TestParseCliArgsEnvSuccess(t *testing.T) {
h.Equals(t, 100, nthConfig.MetadataTries)
h.Equals(t, false, nthConfig.CordonOnly)
h.Equals(t, true, nthConfig.UseAPIServerCacheToListPods)
h.Equals(t, 30, nthConfig.HeartbeatInterval)
h.Equals(t, 60, nthConfig.HeartbeatUntil)

// Check that env vars were set
value, ok := os.LookupEnv("KUBERNETES_SERVICE_HOST")
Expand All @@ -101,7 +105,7 @@ func TestParseCliArgsSuccess(t *testing.T) {
"--enable-scheduled-event-draining=true",
"--enable-spot-interruption-draining=false",
"--enable-asg-lifecycle-draining=false",
"--enable-sqs-termination-draining=false",
"--enable-sqs-termination-draining=true",
"--enable-rebalance-monitoring=true",
"--enable-rebalance-draining=true",
"--ignore-daemon-sets=false",
Expand All @@ -117,6 +121,8 @@ func TestParseCliArgsSuccess(t *testing.T) {
"--metadata-tries=100",
"--cordon-only=false",
"--use-apiserver-cache=true",
"--heartbeat-interval=30",
"--heartbeat-until=60",
}
nthConfig, err := config.ParseCliArgs()
h.Ok(t, err)
Expand All @@ -128,7 +134,7 @@ func TestParseCliArgsSuccess(t *testing.T) {
h.Equals(t, true, nthConfig.EnableScheduledEventDraining)
h.Equals(t, false, nthConfig.EnableSpotInterruptionDraining)
h.Equals(t, false, nthConfig.EnableASGLifecycleDraining)
h.Equals(t, false, nthConfig.EnableSQSTerminationDraining)
h.Equals(t, true, nthConfig.EnableSQSTerminationDraining)
h.Equals(t, true, nthConfig.EnableRebalanceMonitoring)
h.Equals(t, true, nthConfig.EnableRebalanceDraining)
h.Equals(t, false, nthConfig.IgnoreDaemonSets)
Expand All @@ -145,6 +151,8 @@ func TestParseCliArgsSuccess(t *testing.T) {
h.Equals(t, false, nthConfig.CordonOnly)
h.Equals(t, false, nthConfig.EnablePrometheus)
h.Equals(t, true, nthConfig.UseAPIServerCacheToListPods)
h.Equals(t, 30, nthConfig.HeartbeatInterval)
h.Equals(t, 60, nthConfig.HeartbeatUntil)

// Check that env vars were set
value, ok := os.LookupEnv("KUBERNETES_SERVICE_HOST")
Expand Down Expand Up @@ -176,6 +184,9 @@ func TestParseCliArgsOverrides(t *testing.T) {
t.Setenv("WEBHOOK_TEMPLATE", "no")
t.Setenv("METADATA_TRIES", "100")
t.Setenv("CORDON_ONLY", "true")
t.Setenv("HEARTBEAT_INTERVAL", "3601")
t.Setenv("HEARTBEAT_UNTIL", "172801")

os.Args = []string{
"cmd",
"--use-provider-id=false",
Expand All @@ -201,6 +212,8 @@ func TestParseCliArgsOverrides(t *testing.T) {
"--cordon-only=false",
"--enable-prometheus-server=true",
"--prometheus-server-port=2112",
"--heartbeat-interval=3600",
"--heartbeat-until=172800",
}
nthConfig, err := config.ParseCliArgs()
h.Ok(t, err)
Expand Down Expand Up @@ -229,6 +242,8 @@ func TestParseCliArgsOverrides(t *testing.T) {
h.Equals(t, false, nthConfig.CordonOnly)
h.Equals(t, true, nthConfig.EnablePrometheus)
h.Equals(t, 2112, nthConfig.PrometheusPort)
h.Equals(t, 3600, nthConfig.HeartbeatInterval)
h.Equals(t, 172800, nthConfig.HeartbeatUntil)

// Check that env vars were set
value, ok := os.LookupEnv("KUBERNETES_SERVICE_HOST")