
Commit 33c6238

Support ASG-to-SQS Lifecycle Termination events (#568)
* Support ASG->SQS lifecycle termination events
* Update README with instructions for ASG->SQS direct lifecycle term events
* Improve readability and naming
* Rearrange QP setup instructions for clarity
* Update logging to clarify unsupported cases
1 parent 6335fe6 commit 33c6238

4 files changed: +146 -40 lines changed

README.md

Lines changed: 54 additions & 36 deletions
@@ -194,45 +194,12 @@ For a full list of configuration options see our [Helm readme](https://github.co
194194

195195
The termination handler deployment requires some infrastructure to be set up before deploying the application. You'll need the following AWS infrastructure components:
196196

197-
1. AutoScaling Group Termination Lifecycle Hook
198-
2. Amazon Simple Queue Service (SQS) Queue
197+
1. Amazon Simple Queue Service (SQS) Queue
198+
2. AutoScaling Group Termination Lifecycle Hook
199199
3. Amazon EventBridge Rule
200200
4. IAM Role for the aws-node-termination-handler Queue Processing Pods
201201

202-
#### 1. Setup a Termination Lifecycle Hook on an ASG:
203-
204-
Here is the AWS CLI command to create a termination lifecycle hook on an existing ASG, although this should really be configured via your favorite infrastructure-as-code tool like CloudFormation or Terraform:
205-
206-
```
207-
$ aws autoscaling put-lifecycle-hook \
208-
--lifecycle-hook-name=my-k8s-term-hook \
209-
--auto-scaling-group-name=my-k8s-asg \
210-
--lifecycle-transition=autoscaling:EC2_INSTANCE_TERMINATING \
211-
--default-result=CONTINUE \
212-
--heartbeat-timeout=300
213-
```
214-
215-
#### 2. Tag the ASGs:
216-
217-
By default the aws-node-termination-handler will only manage terminations for ASGs tagged w/ `key=aws-node-termination-handler/managed`
218-
219-
```
220-
$ aws autoscaling create-or-update-tags \
221-
--tags ResourceId=my-auto-scaling-group,ResourceType=auto-scaling-group,Key=aws-node-termination-handler/managed,Value=,PropagateAtLaunch=true
222-
```
223-
224-
The value of the key does not matter.
225-
226-
This functionality is helpful in accounts where there are ASGs that do not run kubernetes nodes or you do not want aws-node-termination-handler to manage their termination lifecycle.
227-
However, if your account is dedicated to ASGs for your kubernetes cluster, then you can turn off the ASG tag check by setting the flag `--check-asg-tag-before-draining=false` or environment variable `CHECK_ASG_TAG_BEFORE_DRAINING=false`.
228-
229-
You can also control what resources NTH manages by adding the resource ARNs to your Amazon EventBridge rules.
230-
231-
Take a look at the docs on how to create rules that only manage certain ASGs [here](https://docs.aws.amazon.com/autoscaling/ec2/userguide/cloud-watch-events.html).
232-
233-
See all the different events docs [here](https://docs.aws.amazon.com/eventbridge/latest/userguide/event-types.html#auto-scaling-event-types).
234-
235-
#### 3. Create an SQS Queue:
202+
#### 1. Create an SQS Queue:
236203

237204
Here is the AWS CLI command to create an SQS queue to hold termination events from ASG and EC2, although this should really be configured via your favorite infrastructure-as-code tool like CloudFormation or Terraform:
238205

@@ -270,8 +237,59 @@ EOF
270237
$ aws sqs create-queue --queue-name "${SQS_QUEUE_NAME}" --attributes file:///tmp/queue-attributes.json
271238
```
272239

240+
If you are sending lifecycle termination events from the ASG directly to SQS instead of through EventBridge, you will also need to create an IAM service role that gives Amazon EC2 Auto Scaling access to your SQS queue. Follow [these instructions to create the IAM service role](https://docs.aws.amazon.com/autoscaling/ec2/userguide/configuring-lifecycle-hook-notifications.html#sqs-notifications).
241+
Note the ARNs for the SQS queue and the associated IAM role for Step 2.
242+
243+
#### 2. Set up a Termination Lifecycle Hook on an ASG:
244+
245+
Here is the AWS CLI command to create a termination lifecycle hook on an existing ASG when using EventBridge, although this should really be configured via your favorite infrastructure-as-code tool like CloudFormation or Terraform:
246+
247+
```
248+
$ aws autoscaling put-lifecycle-hook \
249+
--lifecycle-hook-name=my-k8s-term-hook \
250+
--auto-scaling-group-name=my-k8s-asg \
251+
--lifecycle-transition=autoscaling:EC2_INSTANCE_TERMINATING \
252+
--default-result=CONTINUE \
253+
--heartbeat-timeout=300
254+
```
255+
256+
If you want to avoid EventBridge and send ASG lifecycle events directly to SQS, use the following command instead, supplying the ARNs from Step 1:
257+
258+
```
259+
$ aws autoscaling put-lifecycle-hook \
260+
--lifecycle-hook-name=my-k8s-term-hook \
261+
--auto-scaling-group-name=my-k8s-asg \
262+
--lifecycle-transition=autoscaling:EC2_INSTANCE_TERMINATING \
263+
--default-result=CONTINUE \
264+
--heartbeat-timeout=300 \
265+
--notification-target-arn <your test queue ARN here> \
266+
--role-arn <your SQS access role ARN here>
267+
```
268+
269+
#### 3. Tag the ASGs:
270+
271+
By default, the aws-node-termination-handler will only manage terminations for ASGs tagged with `key=aws-node-termination-handler/managed`
272+
273+
```
274+
$ aws autoscaling create-or-update-tags \
275+
--tags ResourceId=my-auto-scaling-group,ResourceType=auto-scaling-group,Key=aws-node-termination-handler/managed,Value=,PropagateAtLaunch=true
276+
```
277+
278+
The value of the key does not matter.
279+
280+
This functionality is helpful in accounts where some ASGs do not run Kubernetes nodes, or where you do not want aws-node-termination-handler to manage their termination lifecycle.
281+
However, if your account is dedicated to ASGs for your Kubernetes cluster, then you can turn off the ASG tag check by setting the flag `--check-asg-tag-before-draining=false` or environment variable `CHECK_ASG_TAG_BEFORE_DRAINING=false`.
282+
283+
You can also control what resources NTH manages by adding the resource ARNs to your Amazon EventBridge rules.
284+
285+
Take a look at the docs on how to create rules that only manage certain ASGs [here](https://docs.aws.amazon.com/autoscaling/ec2/userguide/cloud-watch-events.html).
286+
287+
See all the different events docs [here](https://docs.aws.amazon.com/eventbridge/latest/userguide/event-types.html#auto-scaling-event-types).
288+
273289
#### 4. Create Amazon EventBridge Rules
274290

291+
You may skip this step if sending events from ASG to SQS directly.
292+
275293
Here are AWS CLI commands to create Amazon EventBridge rules so that ASG termination events, Spot Interruptions, Instance state changes, Rebalance Recommendations, and AWS Health Scheduled Changes are sent to the SQS queue created in Step 1. This should really be configured via your favorite infrastructure-as-code tool like CloudFormation or Terraform:
276294

277295
```

pkg/monitor/sqsevent/asg-lifecycle-event.go

Lines changed: 4 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -32,11 +32,11 @@ import (
3232
"id": "782d5b4c-0f6f-1fd6-9d62-ecf6aed0a470",
3333
"detail-type": "EC2 Instance-terminate Lifecycle Action",
3434
"source": "aws.autoscaling",
35-
"account": "896453262834",
35+
"account": "123456789012",
3636
"time": "2020-07-01T22:19:58Z",
3737
"region": "us-east-1",
3838
"resources": [
39-
"arn:aws:autoscaling:us-east-1:896453262834:autoScalingGroup:26e7234b-03a4-47fb-b0a9-2b241662774e:autoScalingGroupName/testt1.demo-0a20f32c.kops.sh"
39+
"arn:aws:autoscaling:us-east-1:123456789012:autoScalingGroup:26e7234b-03a4-47fb-b0a9-2b241662774e:autoScalingGroupName/testt1.demo-0a20f32c.kops.sh"
4040
],
4141
"detail": {
4242
"LifecycleActionToken": "0befcbdb-6ecd-498a-9ff7-ae9b54447cd6",
@@ -55,6 +55,8 @@ type LifecycleDetail struct {
5555
LifecycleHookName string `json:"LifecycleHookName"`
5656
EC2InstanceID string `json:"EC2InstanceId"`
5757
LifecycleTransition string `json:"LifecycleTransition"`
58+
RequestID string `json:"RequestId"`
59+
Time string `json:"Time"`
5860
}
5961

6062
func (m SQSMonitor) asgTerminationToInterruptionEvent(event *EventBridgeEvent, message *sqs.Message) (*monitor.InterruptionEvent, error) {
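
The two new fields, `RequestID` and `Time`, exist so that a lifecycle message delivered straight from the ASG (with no EventBridge envelope) can be decoded on its own. The following is a minimal sketch of that decoding, not the project's code. The struct is a local copy of the fields visible in this diff and its test fixture, the JSON tags for `LifecycleActionToken` and `AutoScalingGroupName` are assumed, and `decodeLifecycleDetail` is a hypothetical helper name.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// LifecycleDetail is a local stand-in for the struct in this diff.
// The tags on the first two fields are assumptions; the rest appear in the diff.
type LifecycleDetail struct {
	LifecycleActionToken string `json:"LifecycleActionToken"`
	AutoScalingGroupName string `json:"AutoScalingGroupName"`
	LifecycleHookName    string `json:"LifecycleHookName"`
	EC2InstanceID        string `json:"EC2InstanceId"`
	LifecycleTransition  string `json:"LifecycleTransition"`
	RequestID            string `json:"RequestId"`
	Time                 string `json:"Time"`
}

// decodeLifecycleDetail (hypothetical helper) parses an SQS message body
// that carries a bare lifecycle notification.
func decodeLifecycleDetail(body string) (LifecycleDetail, error) {
	var detail LifecycleDetail
	err := json.Unmarshal([]byte(body), &detail)
	return detail, err
}

func main() {
	// Values borrowed from the test fixture added in this commit.
	body := `{
		"LifecycleHookName": "test-nth-asg-to-sqs",
		"RequestId": "3775fac9-93c3-7ead-8713-159816566000",
		"LifecycleTransition": "autoscaling:EC2_INSTANCE_TERMINATING",
		"AutoScalingGroupName": "my-asg",
		"Time": "2022-01-31T23:07:47.872Z",
		"EC2InstanceId": "i-040107f6ba000e5ee",
		"LifecycleActionToken": "b4dd0f5b-0ef2-4479-9dad-6c55f027000e"
	}`

	detail, err := decodeLifecycleDetail(body)
	if err != nil {
		fmt.Println("decode failed:", err)
		return
	}
	fmt.Println(detail.RequestID, detail.Time, detail.EC2InstanceID)
}
```

`RequestId` and `Time` are the values the monitor later copies into the synthetic event's `ID` and `Time`, which is why they had to be added to the struct.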

pkg/monitor/sqsevent/sqs-monitor.go

Lines changed: 30 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -101,9 +101,38 @@ func (m SQSMonitor) processSQSMessage(message *sqs.Message) (*EventBridgeEvent,
101101
event := EventBridgeEvent{}
102102
err := json.Unmarshal([]byte(*message.Body), &event)
103103

104+
if err != nil {
105+
return &event, err
106+
}
107+
108+
if len(event.DetailType) == 0 {
109+
event, err = m.processLifecycleEventFromASG(message)
110+
}
111+
104112
return &event, err
105113
}
106114

115+
// processLifecycleEventFromASG checks for a Lifecycle event from ASG to SQS, and wraps it in an EventBridgeEvent
116+
func (m SQSMonitor) processLifecycleEventFromASG(message *sqs.Message) (EventBridgeEvent, error) {
117+
eventBridgeEvent := EventBridgeEvent{}
118+
lifecycleEvent := LifecycleDetail{}
119+
err := json.Unmarshal([]byte(*message.Body), &lifecycleEvent)
120+
121+
if err != nil || lifecycleEvent.LifecycleTransition != "autoscaling:EC2_INSTANCE_TERMINATING" {
122+
log.Err(err).Msg("only lifecycle termination events from ASG to SQS are supported outside EventBridge")
123+
err = fmt.Errorf("unsupported message type (%s)", message.String())
124+
return eventBridgeEvent, err
125+
}
126+
127+
eventBridgeEvent.Source = "aws.autoscaling"
128+
eventBridgeEvent.Time = lifecycleEvent.Time
129+
eventBridgeEvent.ID = lifecycleEvent.RequestID
130+
eventBridgeEvent.Detail, err = json.Marshal(lifecycleEvent)
131+
132+
log.Debug().Msg("processing lifecycle termination event from ASG")
133+
return eventBridgeEvent, err
134+
}
135+
107136
// processEventBridgeEvent processes an EventBridge event and returns interruption event wrappers
108137
func (m SQSMonitor) processEventBridgeEvent(eventBridgeEvent *EventBridgeEvent, message *sqs.Message) []InterruptionEventWrapper {
109138
interruptionEventWrappers := []InterruptionEventWrapper{}
@@ -150,7 +179,7 @@ func (m SQSMonitor) processInterruptionEvents(interruptionEventWrappers []Interr
150179
case eventWrapper.Err != nil:
151180
// Log errors and record as failed events. Don't delete the message in order to allow retries
152181
log.Err(eventWrapper.Err).Msg("ignoring interruption event due to error")
153-
failedInterruptionEventsCount++ // seems useless
182+
failedInterruptionEventsCount++
154183

155184
case eventWrapper.InterruptionEvent == nil:
156185
log.Debug().Msg("dropping non-actionable interruption event")

pkg/monitor/sqsevent/sqs-monitor_test.go

Lines changed: 58 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -67,6 +67,16 @@ var asgLifecycleEvent = sqsevent.EventBridgeEvent{
6767
}`),
6868
}
6969

70+
var asgLifecycleEventFromSQS = sqsevent.LifecycleDetail{
71+
LifecycleHookName: "test-nth-asg-to-sqs",
72+
RequestID: "3775fac9-93c3-7ead-8713-159816566000",
73+
LifecycleTransition: "autoscaling:EC2_INSTANCE_TERMINATING",
74+
AutoScalingGroupName: "my-asg",
75+
Time: "2022-01-31T23:07:47.872Z",
76+
EC2InstanceID: "i-040107f6ba000e5ee",
77+
LifecycleActionToken: "b4dd0f5b-0ef2-4479-9dad-6c55f027000e",
78+
}
79+
7080
var rebalanceRecommendationEvent = sqsevent.EventBridgeEvent{
7181
Version: "0",
7282
ID: "5d5555d5-dd55-5555-5555-5555dd55d55d",
@@ -87,7 +97,7 @@ func TestKind(t *testing.T) {
8797
h.Assert(t, sqsevent.SQSMonitor{}.Kind() == sqsevent.SQSTerminateKind, "SQSMonitor kind should return the kind constant for the event")
8898
}
8999

90-
func TestMonitor_Success(t *testing.T) {
100+
func TestMonitor_EventBridgeSuccess(t *testing.T) {
91101
spotItnEventNoTime := spotItnEvent
92102
spotItnEventNoTime.Time = ""
93103
for _, event := range []sqsevent.EventBridgeEvent{spotItnEvent, asgLifecycleEvent, spotItnEventNoTime, rebalanceRecommendationEvent} {
@@ -134,6 +144,53 @@ func TestMonitor_Success(t *testing.T) {
134144
}
135145
}
136146

147+
func TestMonitor_AsgDirectToSqsSuccess(t *testing.T) {
148+
event := asgLifecycleEventFromSQS
149+
eventBytes, err := json.Marshal(&event)
150+
h.Ok(t, err)
151+
eventStr := string(eventBytes)
152+
msg := sqs.Message{Body: &eventStr}
153+
h.Ok(t, err)
154+
messages := []*sqs.Message{
155+
&msg,
156+
}
157+
sqsMock := h.MockedSQS{
158+
ReceiveMessageResp: sqs.ReceiveMessageOutput{Messages: messages},
159+
ReceiveMessageErr: nil,
160+
}
161+
dnsNodeName := "ip-10-0-0-157.us-east-2.compute.internal"
162+
ec2Mock := h.MockedEC2{
163+
DescribeInstancesResp: getDescribeInstancesResp(dnsNodeName, true, true),
164+
}
165+
drainChan := make(chan monitor.InterruptionEvent, 1)
166+
167+
sqsMonitor := sqsevent.SQSMonitor{
168+
SQS: sqsMock,
169+
EC2: ec2Mock,
170+
ManagedAsgTag: "aws-node-termination-handler/managed",
171+
ASG: mockIsManagedTrue(nil),
172+
CheckIfManaged: true,
173+
QueueURL: "https://test-queue",
174+
InterruptionChan: drainChan,
175+
}
176+
177+
err = sqsMonitor.Monitor()
178+
h.Ok(t, err)
179+
180+
select {
181+
case result := <-drainChan:
182+
h.Equals(t, sqsevent.SQSTerminateKind, result.Kind)
183+
h.Equals(t, result.NodeName, dnsNodeName)
184+
h.Assert(t, result.PostDrainTask != nil, "PostDrainTask should have been set")
185+
h.Assert(t, result.PreDrainTask != nil, "PreDrainTask should have been set")
186+
err = result.PostDrainTask(result, node.Node{})
187+
h.Ok(t, err)
188+
default:
189+
h.Ok(t, fmt.Errorf("Expected an event to be generated"))
190+
}
191+
192+
}
193+
137194
func TestMonitor_DrainTasks(t *testing.T) {
138195
testEvents := []sqsevent.EventBridgeEvent{spotItnEvent, asgLifecycleEvent, rebalanceRecommendationEvent}
139196
messages := make([]*sqs.Message, 0, len(testEvents))
