Use EC2 Root Volume Replacement to replace macOS hosts #149

lyoung-confluent · 2024-02-25T20:02:41Z

Note

I have not tested this change; it is based purely on my reading of the AWS documentation.

Currently, this stack recycles (terminates) each EC2 instance after it has successfully executed a single CI job to ensure that each CI job has a fresh environment without any risk of broken tooling/configuration from a previous job/execution.

However, this pattern is problematic for macOS machines because EC2 Mac instances carry a 24-hour minimum allocation period:

Billing for EC2 Mac instances is per second with a 24-hour minimum allocation period to comply with the Apple macOS Software License Agreement

As a result, a CI job that takes only a few seconds, or even a full hour (the default timeout), will be billed for 24 hours of compute time, which can quickly become expensive.
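To make the arithmetic concrete, here is a minimal sketch. It assumes, as described above, that every recycled instance incurs a fresh 24-hour minimum; the function names are illustrative and no real prices are used.

```python
# Sketch of the billing arithmetic described above. The 24-hour minimum comes
# from the quoted AWS text; the function names are illustrative only.
MINIMUM_ALLOCATION_SECONDS = 24 * 60 * 60


def billed_seconds(job_seconds: int) -> int:
    """Seconds billed for one allocation that runs a single job."""
    return max(job_seconds, MINIMUM_ALLOCATION_SECONDS)


def utilization(job_seconds: int) -> float:
    """Fraction of the billed time that the job actually used."""
    return job_seconds / billed_seconds(job_seconds)


for label, seconds in [("30-second job", 30), ("1-hour job", 3600)]:
    print(
        f"{label}: billed {billed_seconds(seconds) // 3600} h, "
        f"utilization {utilization(seconds):.2%}"
    )
```

Under that assumption, even a job that runs for the full one-hour timeout uses only 1/24 of the time it is billed for.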

Fortunately, there is a potential solution. EC2 supports quickly restoring an instance to its original state, as described in the AWS blog post "Quickly Restore Amazon EC2 Mac Instances using Replace Root Volume capability":

The second use case is during continuous integration and continuous deployment (CI/CD) when you need to restore an Amazon EC2 Mac instance to a defined well-known state at the end of a build.

To restart your EC2 Mac instance in its initial state without stopping or terminating them, we created the ability to replace the root volume of an Amazon EC2 Mac instance with another EBS volume. This new EBS volume is created either from a new AMI, an Amazon EBS Snapshot, or from the initial volume state during boot.

This PR attempts to implement this feature by adding the ec2:CreateReplaceRootVolumeTask IAM action to the instance IAM role (using the aws:userid condition trick to limit that action to the instance making the request). Then, instead of calling TerminateInstanceInAutoScalingGroup when an instance needs to be replaced, the instance calls CreateReplaceRootVolumeTask with its own instance ID and AMI ID (both read from the instance metadata service) to restart the machine with a fresh OS image.
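For illustration, the replacement request could be sketched as follows. This is not the code in the PR (which changes terminate-instance.sh); it is a Python sketch under stated assumptions: `ec2_client` behaves like `boto3.client("ec2")`, and `read_metadata` is a hypothetical helper that returns the body of an instance metadata path. Deleting the old root volume is also an assumption, not something the PR description specifies.

```python
from typing import Any, Callable, Dict

# Instance metadata paths for the instance's own ID and its launch AMI.
INSTANCE_ID_PATH = "latest/meta-data/instance-id"
AMI_ID_PATH = "latest/meta-data/ami-id"


def request_root_volume_replacement(
    ec2_client: Any,
    read_metadata: Callable[[str], str],
) -> str:
    """Ask EC2 to rebuild this instance's root volume from its launch AMI.

    ec2_client is expected to behave like boto3.client("ec2").
    read_metadata is a hypothetical helper that returns the body of an
    instance metadata path (for example via an IMDSv2 request).
    """
    instance_id = read_metadata(INSTANCE_ID_PATH).strip()
    ami_id = read_metadata(AMI_ID_PATH).strip()

    response: Dict[str, Any] = ec2_client.create_replace_root_volume_task(
        InstanceId=instance_id,
        ImageId=ami_id,
        # Assumption: the old root volume is not needed once replaced.
        DeleteReplacedRootVolume=True,
    )
    # The task ID can be used later to poll the replacement's progress.
    return response["ReplaceRootVolumeTask"]["ReplaceRootVolumeTaskId"]
```

Passing the client and the metadata reader as arguments keeps the function easy to exercise with fakes, which matters here because the real call restarts the machine.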

I am a bit concerned about this note in the AWS blog post:

During the replacement, the instance will be unable to respond to health checks and hence might be marked as unhealthy if placed inside an Auto Scaled Group. You can write a custom health check to change that behavior.

It's unclear whether such a custom health check would be required for this use case, or whether the instance will restart quickly enough. To sidestep the question, I've updated the code to temporarily put the instance into standby mode before the replacement starts, which disables health checking and enforcement for that instance; start-agent.sh then takes it out of standby.
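The standby round trip can be sketched like this. Again, this is an illustrative Python sketch rather than the PR's shell code; it assumes `autoscaling_client` behaves like `boto3.client("autoscaling")`, and the choice to decrement desired capacity is an assumption made so the group does not launch a replacement while the instance is in standby.

```python
from typing import Any


def enter_standby(autoscaling_client: Any, group_name: str, instance_id: str) -> None:
    """Move the instance to standby before the root volume is replaced."""
    autoscaling_client.enter_standby(
        InstanceIds=[instance_id],
        AutoScalingGroupName=group_name,
        # Assumption: decrement so the group does not start a substitute
        # instance while this one is being restored.
        ShouldDecrementDesiredCapacity=True,
    )


def exit_standby(autoscaling_client: Any, group_name: str, instance_id: str) -> None:
    """Return the instance to service once the agent starts again."""
    autoscaling_client.exit_standby(
        InstanceIds=[instance_id],
        AutoScalingGroupName=group_name,
    )
```

One thing worth checking during testing: decrementing desired capacity may be rejected if it would take the group below its configured minimum size.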

lucaspin · 2024-03-06T12:38:02Z

@lyoung-confluent thanks for the pull request. This looks interesting. I'll try to run a few tests and see if we need to adjust anything here.

lyoung-confluent · 2024-03-20T17:28:30Z

@lucaspin Did you by chance get time to test this? I'm not set up to easily test this myself; otherwise I would have.

lucaspin · 2024-03-20T20:51:07Z

@lyoung-confluent unfortunately, not yet. I'll try to reserve a little bit of time next week for this.

lyoung-confluent · 2024-05-13T03:22:03Z

@lucaspin Checking in on this, any luck finding some time to test this PR?

lucaspin · 2024-05-13T12:19:25Z

@lyoung-confluent unfortunately, no. This completely slipped past me again. It might be best to create a support ticket for this one, just so we can more easily track and prioritize it alongside the other things we currently have going on.

lyoung-confluent added 12 commits February 25, 2024 11:40

Update aws-semaphore-agent-stack.js

1a9b594

Update terminate-instance.sh

9eb2a5f

Update terminate-instance.sh

876697d

Update terminate-instance.sh

f27c1a8

Update start-agent.sh

769f37e

Update aws-semaphore-agent-stack.js

f692ea2

Update aws-semaphore-agent-stack.js

390d531

Update aws-semaphore-agent-stack.js

a74fccb

Update start-agent.sh

8ed30bd

Update terminate-instance.sh

979318e

Update aws-semaphore-agent-stack.js

8efc507

Update start-agent.sh

bdf0733
