Use EC2 Root Volume Replacement to replace macOS hosts #149
@lyoung-confluent thanks for the pull request. This looks interesting. I'll try to run a few tests and see if we need to adjust anything here.
@lucaspin Did you by chance get time to test this? I'm not set up to easily test this myself; otherwise I would have.
@lyoung-confluent unfortunately, not yet. I'll try to reserve a little bit of time next week for this.
@lucaspin Checking in on this, any luck finding some time to test this PR?
@lyoung-confluent unfortunately, no. This completely slipped past me again. It might be best to create a support ticket for this one, just so we can more easily track and prioritize it alongside the other current things we have going on.
Note
I have not tested this change; it is based purely on reading the AWS documentation.
Currently, this stack recycles (terminates) each EC2 instance after it has successfully executed a single CI job to ensure that each CI job has a fresh environment without any risk of broken tooling/configuration from a previous job/execution.
However, this pattern is problematic for macOS machines, as EC2 Mac does not bill per second.
As a result, a CI job that takes only a few seconds, or even a full hour (the default timeout), will be billed for 24 hours of compute time, which can quickly become expensive.
Fortunately, there is a potential solution. EC2 supports quickly restoring an instance to its original state, as described in the AWS blog post "Quickly Restore Amazon EC2 Mac Instances using Replace Root Volume capability".
This PR attempts to implement this feature by adding the `ec2:CreateReplaceRootVolumeTask` IAM action/permission to the instance IAM role (utilizing the aws:userid condition trick to limit that action to only the instance itself making the request). Then, instead of calling `TerminateInstanceInAutoScalingGroup` when an instance needs to be replaced, it will call `CreateReplaceRootVolumeTask` using its own instance ID and AMI ID (extracted via the metadata service) to restart the machine with a fresh OS image.
I am a bit concerned about this note in the AWS blog post:
It's unclear if such a custom health check would be required for this use case or if the instance will restart quickly enough. To avoid the problem, I've updated the code to put the instance temporarily in standby mode before the replacement starts, which disables checking/enforcing the health check; then, in `start-agent.sh`, it exits standby.
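For reviewers, the flow described above could look roughly like the following shell sketch. It is untested and is not the code in this PR: the function names (`imds_get`, `replace_root_volume`, `exit_standby`) and passing the Auto Scaling group name as an argument are assumptions made for illustration. It also assumes IMDSv2 is available and that decrementing the desired capacity while in standby is the intended behavior (otherwise the group would try to launch a replacement instance).

```shell
#!/usr/bin/env bash
# Illustrative sketch only: helper names and the ASG-name argument are
# hypothetical and not taken from this PR.

IMDS="http://169.254.169.254/latest"

# Read one value from the instance metadata service (IMDSv2).
imds_get() {
  local token
  token=$(curl -sf -X PUT "$IMDS/api/token" \
    -H "X-aws-ec2-metadata-token-ttl-seconds: 60") || return 1
  curl -sf -H "X-aws-ec2-metadata-token: $token" "$IMDS/meta-data/$1"
}

# Instead of terminating the instance: enter standby so the ASG stops
# enforcing health checks, then ask EC2 to restore the root volume from
# the AMI the instance was launched with.
replace_root_volume() {
  local asg_name="$1" instance_id ami_id
  instance_id=$(imds_get instance-id) || return 1
  ami_id=$(imds_get ami-id) || return 1

  aws autoscaling enter-standby \
    --instance-ids "$instance_id" \
    --auto-scaling-group-name "$asg_name" \
    --should-decrement-desired-capacity || return 1

  aws ec2 create-replace-root-volume-task \
    --instance-id "$instance_id" \
    --image-id "$ami_id"
}

# Intended to run from start-agent.sh once the fresh OS image has booted.
exit_standby() {
  local asg_name="$1" instance_id
  instance_id=$(imds_get instance-id) || return 1
  aws autoscaling exit-standby \
    --instance-ids "$instance_id" \
    --auto-scaling-group-name "$asg_name"
}
```

One open point with this shape: on the very first boot the instance is not in standby, so the `exit-standby` call would likely return an error that `start-agent.sh` needs to tolerate.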