SSM issues after upgrade to latest 5.14 #4032
The change referred to was made in release 5.11.0 via PR #3918 (changes: https://github.com/philips-labs/terraform-aws-github-runner/pull/3918/files). The change requires that you also update your lambda packages (did you rebuild them or download the latest release?). If you are not on the matching release, your SSM parameter is not tagged with the instance id, in which case the instance has no access to the generated JIT config or token.
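For context, a minimal sketch of what "tagged with the instance id" means here. This is not the module's actual code; the parameter path and tag key are assumptions inferred from the policies quoted later in this thread (AWS SDK for JavaScript v3):

```typescript
import { SSMClient, PutParameterCommand } from '@aws-sdk/client-ssm';

const ssm = new SSMClient({});

// Store the generated JIT config / registration token under the instance's
// token path and tag the parameter with the instance id, so a tag-based IAM
// condition can scope access to that single instance.
async function storeRunnerToken(instanceId: string, jitConfig: string): Promise<void> {
  await ssm.send(
    new PutParameterCommand({
      Name: `/github-action-runners/default/runners/tokens/${instanceId}`, // assumed path
      Value: jitConfig,
      Type: 'SecureString',
      Tags: [{ Key: 'InstanceId', Value: instanceId }], // assumed tag key
    }),
  );
}
```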
Yes, I upgraded all the lambdas to the latest.
The SSM change was already introduced earlier, and it works in our case without an issue. I also tested the examples, both default and multi-runner; they both work as well. Could you please try the default example with either locally built lambdas or the release downloads?
See also #3959
@npalm yes, I upgraded my lambdas to the latest before I ran; then I deleted, updated, and re-ran just to be sure, but ran into the same problem. I will try again later today so as not to block the team with downtime (which happened before). Hopefully I can figure it out. Thanks for looking into it.
Sad to hear your team had downtime. Even though we manage this open source repo on a best-effort basis, we always first deploy to staging and verify a release upgrade. For this module we continuously focus on non-breaking changes and smooth upgrades.
No problem at all @npalm - it happens, since we don't have the notion of a staging setup for our runners. It's a very manual upgrade at the moment where I have to download the lambdas, update the version numbers, then apply. Here is my main.tf:

locals {
aws_region = "us-east-2"
}
locals {
github_app_id = "xxxxxx"
}
data "aws_caller_identity" "current" {}
data "aws_ssm_parameter" "github_app_pem_key" {
name = "github_app_pem_key"
}
data "aws_vpc" "default" {
default = true
}
data "aws_subnets" "default" {
filter {
name = "vpc-id"
values = [data.aws_vpc.default.id]
}
}
resource "aws_s3_bucket" "github_self_hosted_runner_lambdas_bucket" {
bucket = var.lambda_s3_bucket
tags = {
Name = "Self-Hosted Lambda S3 Bucket"
}
}
resource "random_id" "random" {
byte_length = 20
}
module "runners" {
source = "philips-labs/github-runner/aws"
create_service_linked_role_spot = true
aws_region = local.aws_region
vpc_id = data.aws_vpc.default.id
subnet_ids = data.aws_subnets.default.ids
# ssm_paths = {
# root = "github-action-runners" # Matches your existing structure
# app = "app"
# runners = "default/runners" # Organizes token parameters under "runners"
# webhook = "webhook"
# use_prefix = false # Ensures full path is used
# }
prefix = "default"
tags = {
Project = "Github-Action-Self-Hosted-Runners"
}
github_app = {
key_base64 = data.aws_ssm_parameter.github_app_pem_key.value
id = local.github_app_id
webhook_secret = random_id.random.hex
}
# Grab zip files via lambda_download
webhook_lambda_zip = "webhook.zip"
runner_binaries_syncer_lambda_zip = "runner-binaries-syncer.zip"
runners_lambda_zip = "runners.zip"
ami_housekeeper_lambda_zip = "ami-housekeeper.zip"
enable_organization_runners = true
runner_extra_labels = ["default", "self-hosted-runners", "runner-ubuntu-jammy-22-lts-amd64"]
# enable access to the runners via SSM
enable_ssm_on_runners = true
# AMI selection
# ami_owners = ["xxxxx"] # Canonical's Amazon account ID
# provide the owner id of the AMI
ami_owners = [data.aws_caller_identity.current.account_id]
# When we rebuild an image with packer we replace that value here
ami_filter = {
name = ["github-runner-ubuntu-jammy-22-lts-amd64-202408011442"],
state = ["available"]
}
lambda_s3_bucket = aws_s3_bucket.github_self_hosted_runner_lambdas_bucket.bucket
webhook_lambda_s3_key = "webhook.zip"
runners_lambda_s3_key = "runners.zip"
syncer_lambda_s3_key = "runner-binaries-syncer.zip"
ami_housekeeper_lambda_s3_key = "ami-housekeeper.zip"
# instance_termination_watcher_lambda_s3_key = "termination-watcher.zip"
# enable S3 versioning for runners S3 bucket
runner_binaries_s3_versioning = "Enabled"
# Uncomment idle config to have idle runners from 8 to 5 in time zone Los Angeles Pacific
idle_config = [{
cron = "* * 7-17 * * 1-5"
timeZone = "America/Los_Angeles"
idleCount = 3
evictionStrategy = "oldest_first"
}]
instance_types = ["t3.large", "m5.large"]
# override delay of events in seconds
delay_webhook_event = 5
runners_maximum_count = 12
# set up a fifo queue to retain order
enable_fifo_build_queue = true
# override scaling down
scale_down_schedule_expression = "cron(* * * * ? *)"
# enable this flag to publish webhook events to workflow job queue
# enable_workflow_job_events_queue = true
# Example of simple pool usages
# pool_runner_owner = "Videate"
# pool_config = [{
# size = 5
# schedule_expression = "cron(0 15 ? * MON-FRI *)"
# }]
runner_run_as = "ubuntu"
enable_userdata = false
userdata_template = "./modules/runners/templates/user-data.sh"
enable_user_data_debug_logging_runner = true
# prefix GitHub runners with a special name
runner_name_prefix = "github_self_hosted_"
# Enable debug logging for the lambda functions
# log_level = "debug"
enable_ami_housekeeper = true
ami_housekeeper_cleanup_config = {
ssmParameterNames = ["*/ami-id"]
minimumDaysOld = 10
amiFilters = [
{
Name = "name"
Values = ["*runner-ubuntu*"]
}
]
}
# instance_termination_watcher = {
# enable = true
# enable_metric = {
# spot_warning = true
# }
# s3_bucket = aws_s3_bucket.github_self_hosted_runner_lambdas_bucket.bucket
# s3_key = "termination-watcher.zip"
# }
}
module "lambdas" {
source = "philips-labs/github-runner/aws//modules/download-lambda"
lambdas = [
{
name = "ami-housekeeper"
tag = "v5.14.0"
},
{
name = "webhook"
tag = "v5.14.0"
},
{
name = "runners"
tag = "v5.14.0"
},
{
name = "runner-binaries-syncer"
tag = "v5.14.0"
},
{
name = "termination-watcher"
tag = "v5.14.0"
}
]
}
module "webhook_github_app" {
source = "philips-labs/github-runner/aws//modules/webhook-github-app"
depends_on = [module.runners]
github_app = {
key_base64 = data.aws_ssm_parameter.github_app_pem_key.value
id = local.github_app_id
webhook_secret = random_id.random.hex
}
webhook_endpoint = module.runners.webhook.endpoint
}
Hey @npalm, I upgraded to the latest. The two things that I see which are different between the good (working) and bad (broken) policies:

{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Action": [
"ssm:GetParameter",
"ssm:GetParameters",
"ssm:GetParametersByPath",
"ssm:DeleteParameter"
],
"Resource": [
"arn:aws:ssm:us-east-2:xxxxxxxx:parameter/github-action-runners/default/runners/tokens/*",
"arn:aws:ssm:us-east-2:xxxxxxxx:parameter/github-action-runners/default/runners/config",
"arn:aws:ssm:us-east-2:xxxxxxxx:parameter/github-action-runners/default/runners/config/*"
]
}
]
}

and (not working):

{
"Version": "2012-10-17",
"Statement": [
{
"Action": [
"ssm:DeleteParameter",
"ssm:GetParameters",
"ssm:GetParameter"
],
"Condition": {
"StringLike": {
"ec2:SourceInstanceARN": "*/${aws:ResourceTag/InstanceId}"
}
},
"Effect": "Allow",
"Resource": "arn:aws:ssm:us-east-2:xxxxxxxx:parameter/github-action-runners/default/runners/tokens/*"
},
{
"Action": [
"ssm:GetParameter",
"ssm:GetParameters",
"ssm:GetParametersByPath"
],
"Effect": "Allow",
"Resource": [
"arn:aws:ssm:us-east-2:xxxxxxxx:parameter/github-action-runners/default/runners/config",
"arn:aws:ssm:us-east-2:xxxxxxxx:parameter/github-action-runners/default/runners/config/*"
]
}
]
}

They look very similar minus this:

"Condition": {
"StringLike": {
"ec2:SourceInstanceARN": "*/${aws:ResourceTag/InstanceId}"
}
}

(This condition grants access only when the calling instance's ARN ends with the value of the parameter's InstanceId tag, i.e. each runner can read only the parameter tagged with its own instance id. An untagged parameter matches nothing, so access is denied.)
All symptoms point to incorrect versions of your lambdas. We upgraded prod to 5.14 last week without an issue.
Ok, I'll double check, but I uploaded the latest 5.14.1 into my S3 folder and they are referenced in my latest main.tf, so I would be shocked if there was something off there, but I'll definitely make sure.
I will close the issue for now. Hope your problems are resolved. Created issue #4077 to address the problem.
@npalm The problems are not resolved and this issue shouldn't be closed/moved to a lambda version ticket because I do not think it has anything to do with the versions. I understand, as a developer myself, and as an engineering leader, that sometimes it's easy to say - "sorry, the problem is on your side, and there is nothing we can do" and you might be right. But let me work through this with you just to make sure we don't close/change an issue that might actually be an issue. Here are the steps I followed:
Initializing the backend...
Upgrading modules...
Downloading registry.terraform.io/philips-labs/github-runner/aws 5.15.1 for runners...
- runners in .terraform/modules/runners

I uploaded the latest zip files to my S3 bucket and I also downloaded them locally into the same directory, just to be sure I had the latest versions across the board. Here is my main.tf:

locals {
aws_region = "us-east-2"
}
locals {
github_app_id = "xxxxxx"
}
data "aws_caller_identity" "current" {}
data "aws_ssm_parameter" "github_app_pem_key" {
name = "github_app_pem_key"
}
data "aws_vpc" "default" {
default = true
}
data "aws_subnets" "default" {
filter {
name = "vpc-id"
values = [data.aws_vpc.default.id]
}
}
resource "aws_s3_bucket" "github_self_hosted_runner_lambdas_bucket" {
bucket = var.lambda_s3_bucket
tags = {
Name = "Self-Hosted Lambda S3 Bucket"
}
}
resource "random_id" "random" {
byte_length = 20
}
module "runners" {
source = "philips-labs/github-runner/aws"
create_service_linked_role_spot = true
aws_region = local.aws_region
vpc_id = data.aws_vpc.default.id
subnet_ids = data.aws_subnets.default.ids
prefix = "default"
tags = {
Project = "Github-Action-Self-Hosted-Runners"
}
github_app = {
key_base64 = data.aws_ssm_parameter.github_app_pem_key.value
id = local.github_app_id
webhook_secret = random_id.random.hex
}
# Grab zip files via lambda_download
webhook_lambda_zip = "./webhook.zip"
runner_binaries_syncer_lambda_zip = "./runner-binaries-syncer.zip"
runners_lambda_zip = "./runners.zip"
ami_housekeeper_lambda_zip = "./ami-housekeeper.zip"
enable_organization_runners = true
runner_extra_labels = ["default", "self-hosted-runners", "runner-ubuntu-jammy-22-lts-amd64"]
# enable access to the runners via SSM
enable_ssm_on_runners = true
ami_owners = [data.aws_caller_identity.current.account_id]
# When we rebuild an image with packer we replace that value here
ami_filter = {
name = ["github-runner-ubuntu-jammy-22-lts-amd64-202408011442"],
state = ["available"]
}
lambda_s3_bucket = aws_s3_bucket.github_self_hosted_runner_lambdas_bucket.bucket
webhook_lambda_s3_key = "webhook.zip"
runners_lambda_s3_key = "runners.zip"
syncer_lambda_s3_key = "runner-binaries-syncer.zip"
ami_housekeeper_lambda_s3_key = "ami-housekeeper.zip"
# enable S3 versioning for runners S3 bucket
runner_binaries_s3_versioning = "Enabled"
# Uncomment idle config to have idle runners from 8 to 5 in time zone Los Angeles Pacific
idle_config = [{
cron = "* * 7-17 * * 1-5"
timeZone = "America/Los_Angeles"
idleCount = 3
evictionStrategy = "oldest_first"
}]
instance_types = ["t3.large", "m5.large"]
# override delay of events in seconds
delay_webhook_event = 5
runners_maximum_count = 12
# set up a fifo queue to retain order
enable_fifo_build_queue = true
# override scaling down
scale_down_schedule_expression = "cron(* * * * ? *)"
runner_run_as = "ubuntu"
enable_userdata = false
userdata_template = "./modules/runners/templates/user-data.sh"
enable_user_data_debug_logging_runner = true
# prefix GitHub runners with a special name
runner_name_prefix = "github_self_hosted_"
# Enable debug logging for the lambda functions
# log_level = "debug"
enable_ami_housekeeper = true
ami_housekeeper_cleanup_config = {
ssmParameterNames = ["*/ami-id"]
minimumDaysOld = 10
amiFilters = [
{
Name = "name"
Values = ["*runner-ubuntu*"]
}
]
}
}
module "webhook_github_app" {
source = "philips-labs/github-runner/aws//modules/webhook-github-app"
depends_on = [module.runners]
github_app = {
key_base64 = data.aws_ssm_parameter.github_app_pem_key.value
id = local.github_app_id
webhook_secret = random_id.random.hex
}
webhook_endpoint = module.runners.webhook.endpoint
}

Here is my variables.tf:

variable "lambda_s3_bucket" {
description = "S3 bucket from which to specify lambda functions. This is an alternative to providing local files directly."
type = string
default = "gh-self-hosted-runner-lambdas"
}

These are the terraform versions I'm using:

terraform -v
Terraform v1.5.7
on darwin_arm64
+ provider registry.terraform.io/hashicorp/aws v5.63.0
+ provider registry.terraform.io/hashicorp/local v2.5.1
+ provider registry.terraform.io/hashicorp/null v3.2.2
+ provider registry.terraform.io/hashicorp/random v3.6.2

This is the policy that works:

{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Action": [
"ssm:GetParameter",
"ssm:GetParameters",
"ssm:GetParametersByPath",
"ssm:DeleteParameter"
],
"Resource": [
"arn:aws:ssm:us-east-2:xxxxxxxxxxx:parameter/github-action-runners/default/runners/tokens/*",
"arn:aws:ssm:us-east-2:xxxxxxxxxxx:parameter/github-action-runners/default/runners/config",
"arn:aws:ssm:us-east-2:xxxxxxxxxxx:parameter/github-action-runners/default/runners/config/*"
]
}
]
}

This is what happens when I run terraform apply:

policy = jsonencode(
~ {
~ Statement = [
~ {
~ Action = [
- "ssm:GetParameter",
- "ssm:GetParameters",
- "ssm:GetParametersByPath",
"ssm:DeleteParameter",
+ "ssm:GetParameters",
+ "ssm:GetParameter",
]
+ Condition = {
+ StringLike = {
+ "ec2:SourceInstanceARN" = "*/${aws:ResourceTag/InstanceId}"
}
}
~ Resource = [
- "arn:aws:ssm:us-east-2:434020178465:parameter/github-action-runners/default/runners/tokens/*",
- "arn:aws:ssm:us-east-2:434020178465:parameter/github-action-runners/default/runners/config",
- "arn:aws:ssm:us-east-2:434020178465:parameter/github-action-runners/default/runners/config/*",
] -> "arn:aws:ssm:us-east-2:434020178465:parameter/github-action-runners/default/runners/tokens/*"
# (1 unchanged attribute hidden)
},
+ {
+ Action = [
+ "ssm:GetParameter",
+ "ssm:GetParameters",
+ "ssm:GetParametersByPath",
]
+ Effect = "Allow"
+ Resource = [
+ "arn:aws:ssm:us-east-2:434020178465:parameter/github-action-runners/default/runners/config",
+ "arn:aws:ssm:us-east-2:434020178465:parameter/github-action-runners/default/runners/config/*",
]
},
]
# (1 unchanged attribute hidden)
}
)

I feel like the clues are there. The versions are updated and completely aligned, so it's something else, either with the path to the config/tokens or with the Condition block shown above.
@npalm - I am facing the same issue. I've confirmed my lambdas are up to date at v5.15.2, the same as my module. I also see that my parameters are not tagged. You mentioned the parameters should be tagged with the instance ID - where is this done?

Update: Let me know if I can provide any additional information.
That is all very annoying. I am not able to reproduce the problem. Our in-house deployments also use the release lambdas, and we have no problem. Instance variables are properly tagged, and we are currently on the latest release as well. We test the module with the examples. How can we move forward to solve the issues? Keep in mind we share the module open source and do our best to keep the quality high. In the end it is open source and there is no commercial value for us; we simply rely on the help we also get from the community.
Still, the most logical explanation seems to be that for some reason your lambda is using the old code :(
@npalm - First the good news: it is working for me now. The "bad" news is that I had to run a full destroy and re-apply. As I showed above, I had the functions updated per the 5.15.2 tag and saw the timestamps for the S3 objects updated, and the last update time on the lambda resource itself was updated and in sync. I agree with you that it should have worked, and I don't think this is an issue with the code in this repo. I confirmed this by opening the zip file from S3 and seeing the tagging code was there. It's very odd. The only thing I can suggest is, when updating, confirm that the deployed lambda code actually changed (one way to do this is sketched below).
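A sketch of one way to do that check, assuming AWS SDK for JavaScript v3; the function names in the loop are illustrative, not the module's actual names:

```typescript
import { LambdaClient, GetFunctionConfigurationCommand } from '@aws-sdk/client-lambda';

const lambda = new LambdaClient({});

// Print the deployed package hash and last-modified time so you can confirm
// the running code actually changed after an upgrade.
async function printDeployedVersion(functionName: string): Promise<void> {
  const cfg = await lambda.send(
    new GetFunctionConfigurationCommand({ FunctionName: functionName }),
  );
  console.log(`${functionName}: CodeSha256=${cfg.CodeSha256} LastModified=${cfg.LastModified}`);
}

// Hypothetical function names; substitute the names your deployment uses.
for (const fn of ['default-scale-up', 'default-scale-down', 'default-webhook']) {
  printDeployedVersion(fn).catch(console.error);
}
```

Comparing CodeSha256 against the hash of the zip you uploaded removes any doubt about whether the lambda is running old code.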
@npalm - This may be far-fetched, but could this have to do with SSM parameters not being able to be overwritten without explicitly providing the overwrite option? I did not see this the first time I tried to deploy the new version in the staging account we have set up; however, in our dev account, I got an error about the parameter already existing. That made me think it may be a similar issue in the lambda: if the parameter is already created and the lambda then tries to update it without the overwrite option, the call fails (see the AWS CLI put-parameter docs).
I have my doubts, since this is creating the token parameter for the instance, and I'm not certain whether PutParameter runs on it more than once. I don't have time to investigate further at this moment.
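One detail that would make an overwrite path interesting, sketched below: the SSM PutParameter API rejects a request that combines Tags with Overwrite, so any overwrite code path has to tag in a separate call. This is speculation in line with the comment above, not confirmed module behavior:

```typescript
import {
  SSMClient,
  PutParameterCommand,
  AddTagsToResourceCommand,
} from '@aws-sdk/client-ssm';

const ssm = new SSMClient({});

// PutParameter rejects a request that combines Tags with Overwrite: true,
// so an overwrite path must tag the parameter in a second AddTagsToResource
// call. If that second step is skipped or fails, the parameter ends up
// untagged and the tag-based IAM condition denies the instance access.
async function overwriteAndTag(name: string, value: string, instanceId: string): Promise<void> {
  await ssm.send(
    new PutParameterCommand({ Name: name, Value: value, Type: 'SecureString', Overwrite: true }),
  );
  await ssm.send(
    new AddTagsToResourceCommand({
      ResourceType: 'Parameter',
      ResourceId: name, // for parameters, the resource id is the parameter name
      Tags: [{ Key: 'InstanceId', Value: instanceId }],
    }),
  );
}
```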
@jkruse14 I think you are onto something. I was only able to get it working again with a destroy and re-apply, so that idea has some legs.
Not sure, since parameters are not overwritten. It is the lambda that is tagging the created parameter, and the parameter is always new (the name is the instance id), so I don't see how an update is involved. Did you try one of the suggested examples?
I cannot replicate now that everything is up to date. Also, to add to the mystery, I did not have to destroy the workspace when applying to our dev account. I did run into some issues with resources apparently not being managed by TF in dev, but not in qa and prod, per the screenshot above, but that wouldn't have to do with this code. I have not seen any issues with the instances not having access to the parameters.
I also experienced this issue. I am using a fork that merged code from v5.16.0 and that is building lambdas from source. I downloaded the scale-up lambda source and I see that it contains the code change in PR #3918. In my case I am using multi-runner with two runner-config templates defined. The first, original runner-config that uses the initial template is working, but the 2nd is failing. When I checked the logs I saw that the 2nd one is looking for a runner-group config parameter. I recalled that I had manually added the first SSM parameter and set its value to 1. I don't know why I had to do this initially for the original template; it was some time ago. But for me things began to work when I manually added the 2nd SSM parameter and set its value to 1 as well. So for my case it would help if I could understand why I had to add these SSM parameters manually and why the lambda code didn't add them. Before I manually added this SSM parameter, the logs showed a warning that the parameter did not exist. I see code in scale-up.ts that seems like it should attempt to add the parameter if it did not exist, so I can conclude that the code which was supposed to call putParameter when it found the SSM parameter missing did not work in my case either (see the sketch below).
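For illustration, a sketch of the get-or-create flow described above. This is not the actual scale-up.ts code; the parameter path, prefix, and helper are assumptions:

```typescript
import {
  SSMClient,
  GetParameterCommand,
  PutParameterCommand,
  ParameterNotFound,
} from '@aws-sdk/client-ssm';

const ssm = new SSMClient({});

// Hypothetical stand-in for the real GitHub API lookup of a runner group id.
async function lookupRunnerGroupIdViaGitHub(group: string): Promise<string> {
  throw new Error('stub: resolve the runner group id via the GitHub API');
}

// Get-or-create: read the cached runner group id, and create the parameter
// on a ParameterNotFound. Note that any *other* failure (throttling, access
// denied surfacing as UnknownError, ...) is rethrown and never reaches the
// putParameter branch, so the parameter would never be created.
async function getRunnerGroupId(group: string): Promise<string> {
  const name = `/github-action-runners/default/runners/config/runner-group/${group}`; // assumed path
  try {
    const result = await ssm.send(new GetParameterCommand({ Name: name }));
    return result.Parameter!.Value!;
  } catch (e) {
    if (!(e instanceof ParameterNotFound)) throw e;
    const id = await lookupRunnerGroupIdViaGitHub(group);
    await ssm.send(new PutParameterCommand({ Name: name, Value: id, Type: 'String' }));
    return id;
  }
}
```

If the read fails with anything other than ParameterNotFound (the GetParameterError: UnknownError in the next comment looks like such a case), the create branch would never run, which would match the behavior described here.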
I'm also experiencing the same error when using the multi-runner example on v5.17.0, with the following modifications:
All the rest is exactly the same as the multi-runner sample.
I have a very hard time pinpointing what is going wrong. I tried with a simpler config (using the default example with the same modifications for lambdas and source) and I'm not facing the issue there; the runner starts properly and is able to pick up jobs.

[UPDATE] Looking at debug logs in CloudWatch, I do see this error for the scale-up group:

"error": {
"name": "GetParameterError",
"location": "/var/task/index.js:151668",
"message": "UnknownError",
"stack": "GetParameterError: UnknownError\n at SSMProvider.get (/var/task/index.js:151668:19)\n at process.processTicksAndRejections (node:internal/process/task_queues:95:5)\n at async getParameter (/var/task/index.js:148245:20)\n at async getRunnerGroupId (/var/task/index.js:147798:27)\n at async createJitConfig (/var/task/index.js:147997:27)\n at async createStartRunnerConfig (/var/task/index.js:147964:9)\n at async createRunners (/var/task/index.js:147842:9)\n at async scaleUp (/var/task/index.js:147915:13)\n at async Runtime.scaleUpHandler [as handler] (/var/task/index.js:147082:9)"
}

And a WARN:

SSM Parameter "/github-action-runners/ephe5170/runners/config/runner-group/Default" for Runner group Default does not exist
[UPDATE Bis]
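As a stopgap in line with the earlier comment (a sketch, not a recommended fix): the missing runner-group config parameter can be pre-created manually. The path comes from the warning above; the value "1" is what the earlier commenter used for their groups, so verify the actual group id for your organization:

```typescript
import { SSMClient, PutParameterCommand } from '@aws-sdk/client-ssm';

// Workaround sketch: pre-create the runner-group config parameter the
// scale-up lambda reports as missing, so getRunnerGroupId finds it.
async function main(): Promise<void> {
  const ssm = new SSMClient({});
  await ssm.send(
    new PutParameterCommand({
      Name: '/github-action-runners/ephe5170/runners/config/runner-group/Default',
      Value: '1', // assumed runner group id; verify via the GitHub API
      Type: 'String',
    }),
  );
}

main().catch(console.error);
```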
I had to manually override my default-runner-role policies, otherwise my runners never started executing jobs; the diffs I had to make are shown below. It led to errors on my instances during startup.

I ran terraform apply after the manual changes that got me working, just to show what was changed (and broken). This is what jumped out at me the most in the output of terraform apply.

Is there something new in the latest main.tf config that I am now missing after not upgrading for a couple of months?