Document EventBridge rules with resource filter based on ASG tags #536

Closed
stevehipwell opened this issue Nov 18, 2021 · 19 comments

@stevehipwell
Contributor

Describe the feature
When using NTH in queue mode we need to create EventBridge rules to match our resources. The examples don't include any filters, but that won't work in a real-world account. I'd like to see documentation on how the rules can be filtered based on ASG tags, so we can match resources from many ASGs with a single rule.

Is the feature request related to a problem?
When using resources as a filter, the rule reaches its maximum size before all of our ASGs can be monitored.

Describe alternatives you've considered
I've created a rule per ASG.

@stevehipwell
Contributor Author

@bwagner5 the v2 discussions reminded me about this issue.

@bwagner5
Contributor

bwagner5 commented Nov 18, 2021

I do not believe it is currently possible to specify an EventBridge ASG source by tag, only by ASG name. An ASG name prefix may not be optimal depending on how the infra is set up, but it's at least better than individual names: https://docs.aws.amazon.com/eventbridge/latest/userguide/eb-event-patterns-content-based-filtering.html#eb-filtering-prefix-matching
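
For reference, a prefix match on the ASG name would look roughly like this (a minimal sketch; the rule name and the "my-cluster-" prefix are placeholders):

# One prefix entry matches every ASG whose name starts with the cluster
# prefix, instead of listing each ASG ARN in "resources".
resource "aws_cloudwatch_event_rule" "asg_by_prefix" {
  name = "asg-termination-by-prefix"

  event_pattern = jsonencode({
    "source" : ["aws.autoscaling"],
    "detail-type" : ["EC2 Instance-terminate Lifecycle Action"],
    "detail" : {
      "AutoScalingGroupName" : [{ "prefix" : "my-cluster-" }]
    }
  })
}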

@bwagner5 bwagner5 added the docs label Nov 19, 2021
@stevehipwell
Contributor Author

Do you know if it's possible to use a wildcard in the name?

@bwagner5
Contributor

I don't think so. I believe it's only prefix for strings.

@stevehipwell
Contributor Author

Does the ASG name come through in the event? All our ASGs for a cluster have the same prefix.

@bwagner5
Contributor

@stevehipwell
Contributor Author

What about for spot terminations and rebalance events?

@stevehipwell
Contributor Author

@bwagner5 I'm pretty sure I've got the rules wrong for my spot notifications, as I've got the ASG ARN as a resource filter based on the event patterns in the doc you linked above. Does NTH check whether the node is in K8s before it evaluates the ASG tag, or does it check the tag first? Basically, should the tag be unique to the cluster if I might have multiple clusters in an account?

@bwagner5
Contributor

Spot Termination and Rebalance events do not have the ASG in them; only ASG events do. Spot and Rebalance events work outside of ASG, which is why they don't have ASG context associated with them.

This is a good discussion that we need to update docs on!

If you are using an ASG w/ capacity-rebalance enabled, then you do not need Spot and Rebalance events enabled w/ EventBridge.

ASG will send a termination lifecycle hook for spot interruptions while it's launching a new instance.

ASG will send a termination lifecycle hook for rebalance events after it brings up a new node in the ASG.

If you do not have capacity-rebalance enabled on the ASG, then spot interruptions will cause a termination lifecycle hook as the interruption comes in, not while it's launching the new instance.
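
A minimal Terraform sketch of the ASG side (every name and value here is illustrative, including the hypothetical aws_launch_template.node):

# With capacity_rebalance enabled, the ASG reacts to rebalance
# recommendations itself, so the EC2 Spot/Rebalance EventBridge rules are
# unnecessary; the termination lifecycle hook still fires.
resource "aws_autoscaling_group" "spot" {
  name                = "my-cluster-spot"
  capacity_rebalance  = true
  min_size            = 1
  max_size            = 10
  vpc_zone_identifier = var.subnet_ids

  mixed_instances_policy {
    instances_distribution {
      spot_allocation_strategy = "capacity-optimized"
    }
    launch_template {
      launch_template_specification {
        launch_template_id = aws_launch_template.node.id
      }
      # override blocks for additional instance types omitted
    }
  }
}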

@stevehipwell
Contributor Author

stevehipwell commented Nov 22, 2021

@bwagner5 could you give me an example of what the infrastructure should look like using capacity-rebalance?

I'd also be interested in the optimal way to configure ASG spot pools and options for EKS: basically lowest-price vs capacity-optimized, and the number of pools to configure per instance type. I've currently left this on the defaults (for the terraform-aws-eks TF module) but am happy to change it to work better with capacity-rebalance.

I also assume that in a scenario using spot termination where there are multiple clusters in an account region (region can be filtered in the rule) that the ASG tag needs to be unique to the cluster?

This is my refactored configuration in Terraform, using instance refresh and spot terminations.

resource "aws_autoscaling_lifecycle_hook" "default" {
  count = length(local.asg_ids)

  name                   = "aws-node-termination-handler"
  autoscaling_group_name = local.asg_ids[count.index]
  lifecycle_transition   = "autoscaling:EC2_INSTANCE_TERMINATING"
  heartbeat_timeout      = 600
  default_result         = "CONTINUE"
}

resource "aws_cloudwatch_event_rule" "asg" {
  name = "${var.cluster_name}-asg-termination"

  event_pattern = jsonencode(
    {
      "source" : [
        "aws.autoscaling"
      ]
      "detail-type" : [
        "EC2 Instance-terminate Lifecycle Action"
      ],
      "region" : [var.region]
      "detail" : {
        "AutoScalingGroupName" : [{ "prefix" : var.cluster_name }]
      }
    }
  )

  tags = var.tags
}

resource "aws_cloudwatch_event_target" "asg" {
  target_id = "1"
  rule      = aws_cloudwatch_event_rule.asg.name
  arn       = aws_sqs_queue.default.arn
}

resource "aws_cloudwatch_event_rule" "spot" {
  name = "${var.cluster_name}-spot-termination"

  event_pattern = jsonencode(
    {
      "source" : [
        "aws.ec2"
      ]
      "detail-type" : [
        "EC2 Spot Instance Interruption Warning"
      ]
      "region" : [var.region]
    }
  )

  tags = var.tags
}

resource "aws_cloudwatch_event_target" "spot" {
  target_id = "1"
  rule      = aws_cloudwatch_event_rule.spot.name
  arn       = aws_sqs_queue.default.arn
}

@bwagner5
Contributor

could you give me an example of what the infrastructure should look like using capacity-rebalance?

I'd also be interested in the optimal way to configure ASG spot pools and options for EKS? Basically lowest-price vs capacity-optimized and the number of pools to configure per instance type. I've currently left this on the defaults (for the terraform-aws-eks TF module) but am happy to change this to work better with capacity-rebalance.

If you're using capacity-rebalance on an ASG, then you should never use the lowest-price allocation strategy; always use capacity-optimized. Using lowest-price w/ capacity-rebalance can cause a lot of churn.

When using cluster-autoscaler, you'll need each of your ASGs to be a similar instance shape, and you'll increase the number of ASGs you operate with (Karpenter doesn't suffer from this limitation :) ). We recommend providing as many instance pools as you can that match a similar shape, e.g. the sketch below. We have a tool that helps: https://github.com/aws/amazon-ec2-instance-selector. Generally, 3-4 pools is pretty good though.
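
For example, "similar shape" in ASG terms means several pools with the same vCPU/memory footprint (a sketch; the instance types and the aws_launch_template.node reference are examples only):

mixed_instances_policy {
  instances_distribution {
    spot_allocation_strategy = "capacity-optimized"
  }
  launch_template {
    launch_template_specification {
      launch_template_id = aws_launch_template.node.id
    }
    # Four 4 vCPU / 16 GiB pools for capacity-optimized to choose between
    override { instance_type = "m5.xlarge" }
    override { instance_type = "m5a.xlarge" }
    override { instance_type = "m5d.xlarge" }
    override { instance_type = "m4.xlarge" }
  }
}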

I also assume that in a scenario using spot termination where there are multiple clusters in an account region (region can be filtered in the rule) that the ASG tag needs to be unique to the cluster?

Yes, that is correct.

@stevehipwell
Contributor Author

@bwagner5 we have our ASGs in good shape; the question is whether we should use the default of 10 pools or a pool per instance type available to the ASG.

I think a switch to capacity-rebalance with capacity-optimized placement makes sense for us, especially if it means we can just watch the termination events to deal with spot instances being replaced. I take it that this would allow us to have longer than the default 120 seconds to deal with termination events?

@stevehipwell
Contributor Author

@bwagner5 never mind, it looks like we need to set the pools value to 0 for capacity-optimized placement. Not that the docs were much use: they give a very generic definition for pools and then fail to mention them again, other than in circular references back to the original sparse definition.
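
In Terraform terms, the distinction I was missing (a sketch; values illustrative):

instances_distribution {
  # With "lowest-price", spot_instance_pools spreads Spot across the N
  # cheapest pools (e.g. spot_instance_pools = 10).
  # With "capacity-optimized", the ASG picks the deepest pools itself,
  # so spot_instance_pools must be left at 0 / unset.
  spot_allocation_strategy = "capacity-optimized"
}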

@stevehipwell
Contributor Author

I take it that this would allow us to have longer than the default 120 seconds to deal with termination events?

@bwagner5 any advice on this? I'm not sure if it's related, but we saw a node fail to terminate correctly, which then resulted in a CSI driver failure to unmount/mount.

@github-actions

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. If you want this issue to never become stale, please ask a maintainer to apply the "stalebot-ignore" label.

@github-actions github-actions bot added the stale label Jan 13, 2022
@stevehipwell
Contributor Author

/not-stale

@github-actions github-actions bot removed the stale label Jan 14, 2022
@github-actions

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. If you want this issue to never become stale, please ask a maintainer to apply the "stalebot-ignore" label.

@github-actions github-actions bot added the stale label Feb 14, 2022
@stevehipwell
Contributor Author

/not-stale

@snay2 snay2 added the stalebot-ignore label and removed the stale label Feb 14, 2022
@cjerad cjerad added the Pending-Release label Jun 21, 2023
@cjerad
Member

cjerad commented Jun 22, 2023

This has been released in v1.20.0, chart version 0.22.0

@cjerad cjerad closed this as completed Jun 22, 2023
@cjerad cjerad removed the Pending-Release label Jun 22, 2023
@cjerad cjerad removed the stalebot-ignore label Jun 22, 2023