Document EventBridge rules with resource filter based on ASG tags #536

Closed
stevehipwell opened this issue Nov 18, 2021 · 19 comments

@stevehipwell
Contributor

Describe the feature
When using NTH in queue mode we need to create EventBridge rules to match our resources. The examples don't include any filters, but that won't work in a real-world account. I'd like to see documentation on how the rules can be filtered based on ASG tags, so we can match resources from many ASGs with a single rule.

Is the feature request related to a problem?
When using resources as a filter, the rule reaches its maximum size before all of our ASGs can be monitored.

Describe alternatives you've considered
I've created a rule per ASG.

@stevehipwell
Contributor Author

@bwagner5 the v2 discussions reminded me about this issue.

@bwagner5
Contributor

bwagner5 commented Nov 18, 2021

I do not believe it is currently possible to specify an EventBridge ASG source by tag, only by ASG name. An ASG name prefix may not be optimal depending on how the infra is set up, but it's at least better than individual names: https://docs.aws.amazon.com/eventbridge/latest/userguide/eb-event-patterns-content-based-filtering.html#eb-filtering-prefix-matching
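
For reference, a prefix match on the ASG name would look roughly like this (a minimal sketch; the rule name and the "my-cluster-" prefix are placeholders):

# One prefix entry matches every ASG whose name starts with the cluster
# prefix, instead of listing each ASG ARN in "resources".
resource "aws_cloudwatch_event_rule" "asg_by_prefix" {
  name = "asg-termination-by-prefix"

  event_pattern = jsonencode({
    "source" : ["aws.autoscaling"],
    "detail-type" : ["EC2 Instance-terminate Lifecycle Action"],
    "detail" : {
      "AutoScalingGroupName" : [{ "prefix" : "my-cluster-" }]
    }
  })
}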

@bwagner5 bwagner5 added the docs label Nov 19, 2021
@stevehipwell
Contributor Author

Do you know if it's possible to use a wildcard in the name?

@bwagner5
Contributor

I don't think so. I believe it's only prefix for strings.

@stevehipwell
Contributor Author

Does the ASG name come through in the event? All our ASGs for a cluster have the same prefix.

@bwagner5
Contributor

@stevehipwell
Contributor Author

What about for spot terminations and rebalance events?

@stevehipwell
Contributor Author

@bwagner5 I'm pretty sure I've got the rules wrong for my spot notifications, as I've got the ASG ARN as a resource filter based on the event patterns in the doc you linked above. Does NTH check whether the node is in K8s before it evaluates the ASG tag, or does it check the tag first? Basically, should the tag be unique to the cluster if I might have multiple clusters in an account?

@bwagner5
Contributor

Spot Termination and Rebalance events do not have the ASG in them; only ASG events do. Spot and Rebalance events work outside of ASG, which is why they don't have ASG context associated with them.

This is a good discussion that we need to update docs on!

If you are using an ASG w/ capacity-rebalance enabled, then you do not need Spot and Rebalance events enabled w/ EventBridge.

ASG will send a termination lifecycle hook for spot interruptions while it's launching a new instance.

ASG will send a termination lifecycle hook for rebalance events after it brings up a new node in the ASG.

If you do not have capacity-rebalance enabled on the ASG, then spot interruptions will cause a termination lifecycle hook as the interruption comes in, not while it's launching the new instance.
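
A minimal Terraform sketch of the ASG side (every name and value here is illustrative, including the hypothetical aws_launch_template.node):

# With capacity_rebalance enabled, the ASG reacts to rebalance
# recommendations itself, so the EC2 Spot/Rebalance EventBridge rules are
# unnecessary; the termination lifecycle hook still fires.
resource "aws_autoscaling_group" "spot" {
  name                = "my-cluster-spot"
  capacity_rebalance  = true
  min_size            = 1
  max_size            = 10
  vpc_zone_identifier = var.subnet_ids

  mixed_instances_policy {
    instances_distribution {
      spot_allocation_strategy = "capacity-optimized"
    }
    launch_template {
      launch_template_specification {
        launch_template_id = aws_launch_template.node.id
      }
      # override blocks for additional instance types omitted
    }
  }
}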

@stevehipwell
Contributor Author

stevehipwell commented Nov 22, 2021

@bwagner5 could you give me an example of what the infrastructure should look like using capacity-rebalance?

I'd also be interested in the optimal way to configure ASG spot pools and options for EKS: basically lowest-price vs capacity-optimized, and the number of pools to configure per instance type. I've currently left this on the defaults (for the terraform-aws-eks TF module) but am happy to change it to work better with capacity-rebalance.

I also assume that in a scenario using spot termination where there are multiple clusters in an account region (region can be filtered in the rule) that the ASG tag needs to be unique to the cluster?

This is my refactored configuration in Terraform, using instance refresh and spot terminations.

resource "aws_autoscaling_lifecycle_hook" "default" {
  count = length(local.asg_ids)

  name                   = "aws-node-termination-handler"
  autoscaling_group_name = local.asg_ids[count.index]
  lifecycle_transition   = "autoscaling:EC2_INSTANCE_TERMINATING"
  heartbeat_timeout      = 600
  default_result         = "CONTINUE"
}

resource "aws_cloudwatch_event_rule" "asg" {
  name = "${var.cluster_name}-asg-termination"

  event_pattern = jsonencode(
    {
      "source" : [
        "aws.autoscaling"
      ]
      "detail-type" : [
        "EC2 Instance-terminate Lifecycle Action"
      ],
      "region" : [var.region]
      "detail" : {
        "AutoScalingGroupName" : [{ "prefix" : var.cluster_name }]
      }
    }
  )

  tags = var.tags
}

resource "aws_cloudwatch_event_target" "asg" {
  target_id = "1"
  rule      = aws_cloudwatch_event_rule.asg.name
  arn       = aws_sqs_queue.default.arn
}

resource "aws_cloudwatch_event_rule" "spot" {
  name = "${var.cluster_name}-spot-termination"

  event_pattern = jsonencode(
    {
      "source" : [
        "aws.ec2"
      ]
      "detail-type" : [
        "EC2 Spot Instance Interruption Warning"
      ]
      "region" : [var.region]
    }
  )

  tags = var.tags
}

resource "aws_cloudwatch_event_target" "spot" {
  target_id = "1"
  rule      = aws_cloudwatch_event_rule.spot.name
  arn       = aws_sqs_queue.default.arn
}

@bwagner5
Contributor

could you give me an example of what the infrastructure should look like using capacity-rebalance?

I'd also be interested in the optimal way to configure ASG spot pools and options for EKS? Basically lowest-price vs capacity-optimized and the number of pools to configure per instance type. I've currently left this on the defaults (for the terraform-aws-eks TF module) but am happy to change this to work better with capacity-rebalance.

If you're using capacity-rebalance on an ASG, then you should never use the lowest-price allocation strategy; always use capacity-optimized. Using lowest-price w/ capacity-rebalance can cause a lot of churn.

When using cluster-autoscaler, you'll need each of your ASGs to be a similar instance shape, and you'll increase the number of ASGs you operate with (Karpenter doesn't suffer from this limitation :) ). We recommend providing as many instance pools as you can that match a similar shape, e.g. the sketch below. We have a tool that helps: https://github.com/aws/amazon-ec2-instance-selector. Generally, 3-4 pools is pretty good though.
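
For example, "similar shape" in ASG terms means several pools with the same vCPU/memory footprint (a sketch; the instance types and the aws_launch_template.node reference are examples only):

mixed_instances_policy {
  instances_distribution {
    spot_allocation_strategy = "capacity-optimized"
  }
  launch_template {
    launch_template_specification {
      launch_template_id = aws_launch_template.node.id
    }
    # Four 4 vCPU / 16 GiB pools for capacity-optimized to choose between
    override { instance_type = "m5.xlarge" }
    override { instance_type = "m5a.xlarge" }
    override { instance_type = "m5d.xlarge" }
    override { instance_type = "m4.xlarge" }
  }
}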

I also assume that in a scenario using spot termination where there are multiple clusters in an account region (region can be filtered in the rule) that the ASG tag needs to be unique to the cluster?

Yes, that is correct.

@stevehipwell
Contributor Author

@bwagner5 we have our ASGs in good shape; the question is whether we should use the default of 10 pools or a pool per instance type available to the ASG.

I think a switch to capacity-rebalance with capacity-optimized placement makes sense for us, especially if it means we can just watch the termination events to deal with spot instances being replaced. I take it that this would allow us to have longer than the default 120 seconds to deal with termination events?

@stevehipwell
Contributor Author

@bwagner5 never mind, it looks like we need to set the pools value to 0 for capacity-optimized placement. Not that the docs were much use: they give a very generic definition for pools and then fail to mention them again, other than in circular references back to the original sparse definition.
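
In Terraform terms, the distinction I was missing (a sketch; values illustrative):

instances_distribution {
  # With "lowest-price", spot_instance_pools spreads Spot across the N
  # cheapest pools (e.g. spot_instance_pools = 10).
  # With "capacity-optimized", the ASG picks the deepest pools itself,
  # so spot_instance_pools must be left at 0 / unset.
  spot_allocation_strategy = "capacity-optimized"
}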

@stevehipwell
Contributor Author

I take it that this would allow us to have longer than the default 120 seconds to deal with termination events?

@bwagner5 any advice on this? I'm not sure if it's related, but we saw a node fail to terminate correctly, which then resulted in a CSI driver failure to unmount/mount.

@github-actions

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. If you want this issue to never become stale, please ask a maintainer to apply the "stalebot-ignore" label.

@github-actions github-actions bot added the stale label Jan 13, 2022
@stevehipwell
Contributor Author

/not-stale

@github-actions github-actions bot removed the stale label Jan 14, 2022
@github-actions

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. If you want this issue to never become stale, please ask a maintainer to apply the "stalebot-ignore" label.

@github-actions github-actions bot added the stale label Feb 14, 2022
@stevehipwell
Contributor Author

/not-stale

@snay2 snay2 added the stalebot-ignore label and removed the stale label Feb 14, 2022
@cjerad cjerad added the Pending-Release label Jun 21, 2023
@cjerad
Member

cjerad commented Jun 22, 2023

This has been released in v1.20.0, chart version 0.22.0

@cjerad cjerad closed this as completed Jun 22, 2023
@cjerad cjerad removed the Pending-Release label Jun 22, 2023
@cjerad cjerad removed the stalebot-ignore label Jun 22, 2023