New pipeline: deregister old AMIs #1414

Merged
merged 4 commits into from
Dec 17, 2024
Changes from 2 commits
9 changes: 9 additions & 0 deletions .buildkite/docker-compose.yml
@@ -9,3 +9,12 @@ services:
    volumes:
      - ..:/code:ro
    command: go test -v ./...

  ruby:
    image: ruby:3.3
    working_dir: /work
    environment:
      - DRY_RUN
      - AWS_REGION
    volumes:
      - ..:/work:ro
19 changes: 19 additions & 0 deletions .buildkite/pipeline.cleanamis.yaml
@@ -0,0 +1,19 @@
steps:
  - name: ":broom: Delete AMIs ({{matrix}})"
    command: .buildkite/steps/clean-old-amis
    agents:
      queue: "oss-deploy"
    env:
      DRY_RUN: true
Contributor Author

For now, we're only dry-running it. Once we're confident the output looks the way we expect, we'll remove this.
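
One thing to keep in mind while this is dry-run only: the script reads `dry_run = ENV["DRY_RUN"]`, so any value that is set counts as a dry run (including the string "false"), and an unset variable means a live run. A minimal sketch of stricter parsing, assuming we would rather default to a dry run; `dry_run?` and `FALSY_VALUES` are hypothetical names, not part of this PR:

```ruby
# Hypothetical helper, not part of this PR: interpret DRY_RUN explicitly.
# Anything other than an explicit "off" value is treated as a dry run, so a
# missing or misspelled value never deletes anything.
FALSY_VALUES = %w[0 false no off].freeze

def dry_run?(value)
  !FALSY_VALUES.include?(value.to_s.strip.downcase)
end

mode = dry_run?(ENV["DRY_RUN"]) ? "dry run: nothing will be deleted" : "live run: AMIs will be deregistered"
puts mode
```

With this shape, turning off the dry run later means setting `DRY_RUN` to an explicit falsy value rather than removing the variable.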

      AWS_REGION: "{{matrix}}"
    matrix:
      - "us-east-1"
      - "us-west-2"
      - "ap-southeast-2"
Contributor Author

I'll add the other 17 regions in a follow-up PR.

    plugins:
      - aws-assume-role-with-web-identity#v1.1.0:
          role-arn: arn:aws:iam::172840064832:role/pipeline-buildkite-elastic-stack-for-aws-ami-cleaner
      - docker-compose#v5.4.1:
          run: ruby
          config: .buildkite/docker-compose.yml
          propagate-aws-auth-tokens: true
110 changes: 110 additions & 0 deletions .buildkite/steps/clean-old-amis
@@ -0,0 +1,110 @@
#!/usr/bin/env ruby

require "bundler/inline"
require "date"

gemfile do
  source "https://rubygems.org"

  gem "oga" # an xml parser is required by aws-sdk
  gem "aws-sdk-ec2"
  gem "ostruct"
  gem "logger"
  gem "base64"
end

def die(msg)
  $stderr.puts msg
  exit 1
end

MAX_DELETIONS = 10
Contributor

Should this be an arg/env var? Wondering if this script would be useful to run in anger

Contributor Author

I was thinking of starting with 10, just to contain the blast radius of any early mistakes. Then increasing to something like 100 and letting it make progress via a scheduled build (daily? weekly?)
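
If we do make the cap configurable later, a minimal sketch could look like the following. It assumes an environment variable named `MAX_DELETIONS` (hypothetical, not in this PR) and keeps 10 as the default, so a scheduled build behaves the same unless someone opts in to a larger number:

```ruby
DEFAULT_MAX_DELETIONS = 10

# Hypothetical helper, not part of this PR: read the deletion cap from an
# env-like object (ENV or a plain Hash), falling back to the default of 10.
def max_deletions(env = ENV)
  raw = env.fetch("MAX_DELETIONS", DEFAULT_MAX_DELETIONS.to_s)
  value = Integer(raw, 10) # raises ArgumentError on non-numeric input
  raise ArgumentError, "MAX_DELETIONS must be >= 0, got #{value}" if value.negative?
  value
end

puts "Deleting at most #{max_deletions} AMIs in this run"
```

Failing loudly on a bad value seems safer here than silently falling back, since the whole point of the cap is to contain the blast radius.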


region = ARGV[0] || ENV["AWS_REGION"]
dry_run = ENV["DRY_RUN"]

die("region not found") if region.nil? || region == ""

client = Aws::EC2::Client.new(region: region)

# Fetch all AMIs that we own in the current region
res = client.describe_images(owners: ["self"], include_deprecated: true)
all_images = []
res.images.each do |image|
  all_images << image
end

# Filter the list of AMIs down to just those that were published by the elastic stack
# pipeline. There might be other AMIs in this account, and we don't want to mess with them
all_images.select! { |image|
  image.name.start_with?("buildkite-stack-") || # The name we used until mid 2019
    image.name.start_with?("buildkite-stack-linux-") || # The name we used for linux amd64/arm64 from mid 2019
    image.name.start_with?("buildkite-stack-windows-") # The name we used for windows amd64 from mid 2019
}
Contributor Author

I would prefer if we had a tag on the AMIs that we could use to find 100% of elastic stack images. Maybe ElasticStack: true or similar. This is a good start for now though
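
A rough sketch of what tag-based selection might look like once such a tag exists. The tag key `ElasticStack` and value `"true"` are only the suggestion from this comment, and `StubTag`/`StubImage` are stand-ins for the SDK's image objects so the example runs without AWS credentials:

```ruby
# Stand-ins for the image and tag objects returned by the SDK (assumption:
# only the fields used below matter for this sketch).
StubTag = Struct.new(:key, :value)
StubImage = Struct.new(:image_id, :name, :tags)

# Hypothetical tag, taken from the suggestion above; no AMIs carry it today.
STACK_TAG_KEY = "ElasticStack"

def elastic_stack_image?(image)
  (image.tags || []).any? { |tag| tag.key == STACK_TAG_KEY && tag.value == "true" }
end

images = [
  StubImage.new("ami-111", "buildkite-stack-linux-x86_64-2024-01-01", [StubTag.new("ElasticStack", "true")]),
  StubImage.new("ami-222", "some-other-team-image", [StubTag.new("Team", "data")]),
  StubImage.new("ami-333", "untagged-image", nil)
]

stack_images = images.select { |image| elastic_stack_image?(image) }
puts stack_images.map(&:image_id).inspect

# The same selection could probably be pushed server-side with a
# `tag:ElasticStack` filter on describe_images, which would avoid fetching
# unrelated AMIs at all.
```

Until older AMIs are back-filled with the tag, the name-prefix check would still be needed alongside it.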


# We'd like to process the images oldest to newest
all_images.sort_by! { |image| image.creation_date }

# Each AMI *can* be used in multiple elastic stack releases. It's rare, but it happens. This will extract one
# of the versions - if any - from the tags. Enough to confirm this image is one we published in CloudFormation
# templates on github.com and customers might be using it
def get_stack_version_from_tags(image)
  image.tags.each do |tag|
    if tag.key.start_with?("Version:")
      return tag.key[/^Version:(.+)$/, 1]
    end
  end
  nil
end

# If we only deregister the AMI then we'll be left with orphaned snapshots and keep paying for storage. This
# extracts the snapshot IDs that the AMI is pointing at, so we can delete them as well
def get_snapshot_ids(image)
  image.block_device_mappings.map { |blk| blk.ebs&.snapshot_id }.compact
end

# Deregister an AMI, and delete any associated snapshots
def deregister_image(client, image, dry_run)
  snapshot_ids = get_snapshot_ids(image)
  tag = dry_run ? "[DRY RUN]" : ""

  puts "- #{tag} deregistering image #{image.image_id}"
  client.deregister_image({ image_id: image.image_id }) unless dry_run

  snapshot_ids.each do |snapshot_id|
    puts "- #{tag} deleting snapshot #{snapshot_id}"
    client.delete_snapshot({ snapshot_id: snapshot_id }) unless dry_run
  end
end

one_year_ago = Time.now - (60 * 60 * 24 * 365)
deleted_counter = 0
Contributor

puts this at the end?


# Time to get down to business.
#
# Loop over each elastic stack AMI, skip over any that we want to keep, and anything else we can deregister
# and save some money

puts "Found #{all_images.size} images"
puts
puts

all_images.each do |image|
  puts "ID: #{image.image_id}, Name: #{image.name}, Created: #{image.creation_date}, Last Launched: #{image.last_launched_time}, Public: #{image.public}, Version Tag: #{get_stack_version_from_tags(image)}"

  if get_stack_version_from_tags(image)
    puts "- keep (released version)"
  elsif DateTime.parse(image.creation_date).to_time >= one_year_ago
    puts "- keep (created recently)"
  elsif image.last_launched_time && DateTime.parse(image.last_launched_time).to_time >= one_year_ago
    puts "- keep (launched recently)"
  elsif deleted_counter >= MAX_DELETIONS
    puts "- deletion candidate, but we've reached MAX_DELETIONS (#{MAX_DELETIONS})"
  else
    deregister_image(client, image, dry_run)
    deleted_counter += 1
  end
  puts
  puts
end
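
The keep/delete rules in the loop above could also be expressed as a small pure function, which would make them testable without touching AWS. This is only a sketch of the same ordering of checks: `classify`, `ImageRecord`, and the returned symbols are hypothetical names, and `version_tag` stands in for the result of `get_stack_version_from_tags`:

```ruby
require "time"

ONE_YEAR = 60 * 60 * 24 * 365

# Stand-in for the SDK image object (assumption: dates arrive as ISO 8601
# strings, as the script's DateTime.parse calls imply).
ImageRecord = Struct.new(:creation_date, :last_launched_time, :version_tag, keyword_init: true)

# Applies the same checks as the loop, in the same order, and returns a
# symbol describing the decision instead of acting on it.
def classify(image, now:, deleted_so_far:, max_deletions:)
  cutoff = now - ONE_YEAR
  return :keep_released if image.version_tag
  return :keep_created_recently if Time.parse(image.creation_date) >= cutoff
  if image.last_launched_time && Time.parse(image.last_launched_time) >= cutoff
    return :keep_launched_recently
  end
  return :skip_limit_reached if deleted_so_far >= max_deletions

  :delete
end

now = Time.utc(2024, 12, 17)
old_image = ImageRecord.new(creation_date: "2019-06-01T00:00:00.000Z")
puts classify(old_image, now: now, deleted_so_far: 0, max_deletions: 10)
```

Passing `now` in explicitly keeps the one-year cutoff deterministic in tests.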