SSM issues after upgrade to latest 5.14 #4032

Open
videate-josh opened this issue Aug 1, 2024 · 21 comments

@videate-josh

I had to manually override my default-runner-role policies; otherwise my runners never started executing jobs. The diffs I had to make are shown below. Without the override, my instances logged the following errors during startup:

An error occurred (AccessDeniedException) when calling the GetParameter operation: User: arn:aws:sts::xxxxxxx:assumed-role/default-runner-role/i-xxxxxxx is not authorized to perform: ssm:GetParameter on resource: arn:aws:ssm:us-east-2:xxxxxxx:parameter/github-action-runners/default/runners/tokens/i-xxxxxxx because no identity-based policy allows the ssm:GetParameter action
Waiting for GH Runner config to become available in AWS SSM

I ran terraform apply after the manual changes that got me working just to show what was changed (and broken). This is what jumped out at me the most:

Condition = {
                          + StringLike = {
                              + "ec2:SourceInstanceARN" = "*/${aws:ResourceTag/InstanceId}"

Output of terraform apply

# module.runners.module.runners.aws_iam_role_policy.ssm_parameters will be updated in-place
  ~ resource "aws_iam_role_policy" "ssm_parameters" {
        id     = "default-runner-role:runner-ssm-parameters"
        name   = "runner-ssm-parameters"
      ~ policy = jsonencode(
          ~ {
              ~ Statement = [
                  ~ {
                      ~ Action    = [
                          - "ssm:GetParameter",
                          - "ssm:GetParameters",
                          - "ssm:GetParametersByPath",
                            "ssm:DeleteParameter",
                          + "ssm:GetParameters",
                          + "ssm:GetParameter",
                        ]
                      + Condition = {
                          + StringLike = {
                              + "ec2:SourceInstanceARN" = "*/${aws:ResourceTag/InstanceId}"
                            }
                        }
                      ~ Resource  = [
                          - "arn:aws:ssm:us-east-2:xxxxxxx:parameter/github-action-runners/default/runners/tokens/*",
                          - "arn:aws:ssm:us-east-2:xxxxxxx:parameter/github-action-runners/default/runners/config",
                          - "arn:aws:ssm:us-east-2:xxxxxxx:parameter/github-action-runners/default/runners/config/*",
                        ] -> "arn:aws:ssm:us-east-2:xxxxxxx:parameter/github-action-runners/default/runners/tokens/*"
                        # (1 unchanged attribute hidden)
                    },
                  + {
                      + Action   = [
                          + "ssm:GetParameter",
                          + "ssm:GetParameters",
                          + "ssm:GetParametersByPath",
                        ]
                      + Effect   = "Allow"
                      + Resource = [
                          + "arn:aws:ssm:us-east-2:xxxxxxx:parameter/github-action-runners/default/runners/config",
                          + "arn:aws:ssm:us-east-2:xxxxxxx:parameter/github-action-runners/default/runners/config/*",
                        ]
                    },
                ]
                # (1 unchanged attribute hidden)
            }
        )
        # (1 unchanged attribute hidden)
    }

Is there something new in the latest main.tf config that I am now missing after not upgrading for a couple of months?

@npalm
Member

npalm commented Aug 5, 2024

The change you are referring to was made in release 5.11.0 via PR #3918 (changes: https://github.com/philips-labs/terraform-aws-github-runner/pull/3918/files). The change requires that you also update your lambda packages (did you rebuild or download the latest release?). If you are not on the matching release, your SSM parameter is not tagged with the instance ID, in which case the instance has no access to the generated JIT config or token.
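To make the tagging dependency concrete, here is a minimal sketch of how the new condition on the token parameters could evaluate. It is a simplified model, not the IAM policy engine: the function names, the placeholder account ID and the instance ID are made up for illustration, and it assumes that an unresolved `${aws:ResourceTag/InstanceId}` variable simply makes the condition fail.

```python
import re


def string_like(pattern: str, value: str) -> bool:
    """IAM-style StringLike: '*' matches any run of characters, '?' matches one."""
    regex = "".join(
        ".*" if ch == "*" else "." if ch == "?" else re.escape(ch)
        for ch in pattern
    )
    return re.fullmatch(regex, value, flags=re.DOTALL) is not None


def token_access_allowed(source_instance_arn: str, parameter_tags: dict) -> bool:
    """Model "ec2:SourceInstanceARN" StringLike "*/${aws:ResourceTag/InstanceId}"."""
    instance_id = parameter_tags.get("InstanceId")
    if not instance_id:
        # Assumption: with no InstanceId tag the policy variable cannot be
        # resolved, so the Allow statement does not apply.
        return False
    return string_like("*/" + instance_id, source_instance_arn)


# Placeholder account ID and instance ID, not taken from the thread.
arn = "arn:aws:ec2:us-east-2:111122223333:instance/i-0abc1234"
print(token_access_allowed(arn, {"InstanceId": "i-0abc1234"}))  # parameter tagged by the lambda
print(token_access_allowed(arn, {}))                            # parameter left untagged
```

Under these assumptions the first call prints True and the second prints False, which matches the AccessDeniedException an instance would see when an older lambda creates the token parameter without the tag.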

@videate-josh
Author

Yes, I upgraded all the lambdas to the latest 5.14 but it still had the same issue until I manually altered the default-runner-role policy. Is there any sort of config option or anything else I need to do? It's working but it breaks when I revert back to the suggested changes from the module.

@npalm
Member

npalm commented Aug 6, 2024

The SSM change was introduced earlier and works in our case without an issue. I also tested both the default and multi-runner examples, and both work as well. Could you please try the default example with either locally built lambdas or the release downloads?

@npalm
Member

npalm commented Aug 6, 2024

See also #3959

@videate-josh
Author

@npalm yes I upgraded my lambdas to the latest before I ran, then I deleted and updated and re-ran just to be sure but ran into the same problem. I will try again later today so as not to block the team with downtime (which happened before). Hopefully I can figure it out. Thanks for looking into it.

@npalm
Member

npalm commented Aug 7, 2024

Sad to hear your team had downtime. Even though we manage this open source repo on a best-effort basis, we always deploy to staging first and verify a release upgrade. For this module we continuously focus on non-breaking changes and smooth upgrades.

@videate-josh
Author

No problem at all @npalm - it happens, since we don't have the notion of a staging setup for our runners. It's a very manual upgrade at the moment: I have to download the lambdas, update the version numbers, then run terraform init --upgrade and then terraform apply. Here is my main.tf; let me know if you see anything that looks off.

locals {
  aws_region = "us-east-2"
}

locals {
  github_app_id = "xxxxxx"
}

data "aws_caller_identity" "current" {}

data "aws_ssm_parameter" "github_app_pem_key" {
  name = "github_app_pem_key"
}

data "aws_vpc" "default" {
  default = true
}

data "aws_subnets" "default" {
  filter {
    name   = "vpc-id"
    values = [data.aws_vpc.default.id]
  }
}

resource "aws_s3_bucket" "github_self_hosted_runner_lambdas_bucket" {
  bucket = var.lambda_s3_bucket
  tags = {
    Name = "Self-Hosted Lambda S3 Bucket"
  }
}

resource "random_id" "random" {
  byte_length = 20
}

module "runners" {
  source                          = "philips-labs/github-runner/aws"
  create_service_linked_role_spot = true
  aws_region                      = local.aws_region
  vpc_id                          = data.aws_vpc.default.id
  subnet_ids                      = data.aws_subnets.default.ids

  # ssm_paths = {
  #   root       = "github-action-runners"   # Matches your existing structure
  #   app        = "app"
  #   runners    = "default/runners"                 # Organizes token parameters under "runners"
  #   webhook    = "webhook"
  #   use_prefix = false                     # Ensures full path is used
  # }

  prefix = "default"
  tags = {
    Project = "Github-Action-Self-Hosted-Runners"
  }

  github_app = {
    key_base64     = data.aws_ssm_parameter.github_app_pem_key.value
    id             = local.github_app_id
    webhook_secret = random_id.random.hex
  }

  # Grab zip files via lambda_download
  webhook_lambda_zip                = "webhook.zip"
  runner_binaries_syncer_lambda_zip = "runner-binaries-syncer.zip"
  runners_lambda_zip                = "runners.zip"
  ami_housekeeper_lambda_zip        = "ami-housekeeper.zip"

  enable_organization_runners = true
  runner_extra_labels         = ["default", "self-hosted-runners", "runner-ubuntu-jammy-22-lts-amd64"]

  # enable access to the runners via SSM
  enable_ssm_on_runners = true

  # AMI selection 
  # ami_owners        = ["xxxxx"] # Canonical's Amazon account ID
  # provide the owner id of
  ami_owners = [data.aws_caller_identity.current.account_id]

  # When we rebuild an image with packer we replace that value here
  ami_filter = {
    name  = ["github-runner-ubuntu-jammy-22-lts-amd64-202408011442"],
    state = ["available"]
  }

  lambda_s3_bucket              = aws_s3_bucket.github_self_hosted_runner_lambdas_bucket.bucket
  webhook_lambda_s3_key         = "webhook.zip"
  runners_lambda_s3_key         = "runners.zip"
  syncer_lambda_s3_key          = "runner-binaries-syncer.zip"
  ami_housekeeper_lambda_s3_key = "ami-housekeeper.zip"
  # instance_termination_watcher_lambda_s3_key = "termination-watcher.zip"

  # enable S3 versioning for runners S3 bucket
  runner_binaries_s3_versioning = "Enabled"

  # Uncomment idle config to have idle runners from 8 to 5 in time zone Los Angeles Pacific
  idle_config = [{
    cron             = "* * 7-17 * * 1-5"
    timeZone         = "America/Los_Angeles"
    idleCount        = 3
    evictionStrategy = "oldest_first"
  }]

  instance_types = ["t3.large", "m5.large"]

  # override delay of events in seconds
  delay_webhook_event   = 5
  runners_maximum_count = 12

  # set up a FIFO queue to preserve order
  enable_fifo_build_queue = true

  # override scaling down
  scale_down_schedule_expression = "cron(* * * * ? *)"
  # enable this flag to publish webhook events to workflow job queue
  # enable_workflow_job_events_queue  = true

  # Example of simple pool usages
  # pool_runner_owner = "Videate"
  # pool_config = [{
  #   size                = 5
  #   schedule_expression = "cron(0 15 ? * MON-FRI *)"
  # }]
  runner_run_as                         = "ubuntu"
  enable_userdata                       = false
  userdata_template                     = "./modules/runners/templates/user-data.sh"
  enable_user_data_debug_logging_runner = true

  # prefix GitHub runners with a special name
  runner_name_prefix = "github_self_hosted_"

  # Enable debug logging for the lambda functions
  # log_level = "debug"

  enable_ami_housekeeper = true
  ami_housekeeper_cleanup_config = {
    ssmParameterNames = ["*/ami-id"]
    minimumDaysOld    = 10
    amiFilters = [
      {
        Name   = "name"
        Values = ["*runner-ubuntu*"]
      }
    ]
  }
  
  # instance_termination_watcher = {
  #   enable = true
  #   enable_metric = {
  #     spot_warning = true
  #   }
  #   s3_bucket = aws_s3_bucket.github_self_hosted_runner_lambdas_bucket.bucket
  #   s3_key = "termination-watcher.zip" 
  # }
}

module "lambdas" {
  source = "philips-labs/github-runner/aws//modules/download-lambda"
  lambdas = [
    {
      name = "ami-housekeeper"
      tag  = "v5.14.0"
    },
    {
      name = "webhook"
      tag  = "v5.14.0"
    },
    {
      name = "runners"
      tag  = "v5.14.0"
    },
    {
      name = "runner-binaries-syncer"
      tag  = "v5.14.0"
    },
    {
      name = "termination-watcher"
      tag  = "v5.14.0"
    }
  ]
}

module "webhook_github_app" {
  source     = "philips-labs/github-runner/aws//modules/webhook-github-app"
  depends_on = [module.runners]

  github_app = {
    key_base64     = data.aws_ssm_parameter.github_app_pem_key.value
    id             = local.github_app_id
    webhook_secret = random_id.random.hex
  }
  webhook_endpoint = module.runners.webhook.endpoint
}

@videate-josh
Author

Hey @npalm I upgraded to 5.14.1 and ran into the same issue. Is there anything missing from my setup above that would give you a clue?

The two things that I see which are different from good (working) vs bad (broken) policies:
(working)

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "ssm:GetParameter",
                "ssm:GetParameters",
                "ssm:GetParametersByPath",
                "ssm:DeleteParameter"
            ],
            "Resource": [
                "arn:aws:ssm:us-east-2:xxxxxxxx:parameter/github-action-runners/default/runners/tokens/*",
                "arn:aws:ssm:us-east-2:xxxxxxxx:parameter/github-action-runners/default/runners/config",
                "arn:aws:ssm:us-east-2:xxxxxxxx:parameter/github-action-runners/default/runners/config/*"
            ]
        }
    ]
}

and

(not working)

{
	"Version": "2012-10-17",
	"Statement": [
		{
			"Action": [
				"ssm:DeleteParameter",
				"ssm:GetParameters",
				"ssm:GetParameter"
			],
			"Condition": {
				"StringLike": {
					"ec2:SourceInstanceARN": "*/${aws:ResourceTag/InstanceId}"
				}
			},
			"Effect": "Allow",
			"Resource": "arn:aws:ssm:us-east-2:xxxxxxxx:parameter/github-action-runners/default/runners/tokens/*"
		},
		{
			"Action": [
				"ssm:GetParameter",
				"ssm:GetParameters",
				"ssm:GetParametersByPath"
			],
			"Effect": "Allow",
			"Resource": [
				"arn:aws:ssm:us-east-2:xxxxxxxx:parameter/github-action-runners/default/runners/config",
				"arn:aws:ssm:us-east-2:xxxxxxxx:parameter/github-action-runners/default/runners/config/*"
			]
		}
	]
}

They look very similar minus this:

"Condition": {
    "StringLike": {
	"ec2:SourceInstanceARN": "*/${aws:ResourceTag/InstanceId}"
    }
 }

@npalm
Member

npalm commented Aug 13, 2024

All symptoms point to you having incorrect versions of your lambdas. We upgraded prod to 5.14 last week without an issue.

@videate-josh
Author

Ok, I'll double check, but I uploaded the latest 5.14.1 into my s3 folder and they are referenced in my latest main.tf so I would be shocked if there was something off there but I'll definitely make sure.

@npalm
Member

npalm commented Aug 16, 2024

I will close the issue for now. Hope your problems are resolved. Created issue #4077 to address the problem.

@npalm npalm closed this as completed Aug 16, 2024
@videate-josh
Author

@npalm The problems are not resolved and this issue shouldn't be closed/moved to a lambda version ticket because I do not think it has anything to do with the versions.

I understand, as a developer myself, and as an engineering leader, that sometimes it's easy to say - "sorry, the problem is on your side, and there is nothing we can do" and you might be right.

But let me work through this with you just to make sure we don't close/change an issue that might actually be an issue.

Here are the steps I followed:

  • terraform init --upgrade
    Which led to:
Initializing the backend...
Upgrading modules...
Downloading registry.terraform.io/philips-labs/github-runner/aws 5.15.1 for runners...
- runners in .terraform/modules/runners

I uploaded the latest zip files to my S3 bucket and I also downloaded them locally into the same directory just to be sure I had the latest versions across the board.
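One way to double-check the "versions are aligned" claim is to compare the module version that the registry resolved with the release tags of the lambda zips. The helper below is a hypothetical sketch, not part of the module. The sample values are the ones visible in this thread (module 5.15.1 from the init output, v5.14.0 from the download-lambda block in the earlier main.tf); whether the zips actually deployed were still on that tag is not established here.

```python
def mismatched_lambdas(module_version: str, lambda_tags: dict) -> list:
    """Return the names of lambdas whose release tag differs from the module version."""
    expected = module_version.lstrip("v")
    return sorted(
        name for name, tag in lambda_tags.items() if tag.lstrip("v") != expected
    )


# Tags as pinned in the download-lambda block shown earlier in the thread.
tags = {
    "webhook": "v5.14.0",
    "runners": "v5.14.0",
    "runner-binaries-syncer": "v5.14.0",
    "ami-housekeeper": "v5.14.0",
}

# Module version reported by `terraform init --upgrade` above.
print(mismatched_lambdas("5.15.1", tags))
```

With these inputs every lambda is reported as mismatched. Because the module block has no version argument, terraform init --upgrade moves the module to the newest release on its own; pinning the module with a version argument would keep it in step with the zips.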

Here is my main.tf:

locals {
  aws_region = "us-east-2"
}

locals {
  github_app_id = "xxxxxx"
}

data "aws_caller_identity" "current" {}

data "aws_ssm_parameter" "github_app_pem_key" {
  name = "github_app_pem_key"
}

data "aws_vpc" "default" {
  default = true
}

data "aws_subnets" "default" {
  filter {
    name   = "vpc-id"
    values = [data.aws_vpc.default.id]
  }
}

resource "aws_s3_bucket" "github_self_hosted_runner_lambdas_bucket" {
  bucket = var.lambda_s3_bucket
  tags = {
    Name = "Self-Hosted Lambda S3 Bucket"
  }
}

resource "random_id" "random" {
  byte_length = 20
}

module "runners" {
  source                          = "philips-labs/github-runner/aws"
  create_service_linked_role_spot = true
  aws_region                      = local.aws_region
  vpc_id                          = data.aws_vpc.default.id
  subnet_ids                      = data.aws_subnets.default.ids

  prefix = "default"
  tags = {
    Project = "Github-Action-Self-Hosted-Runners"
  }

  github_app = {
    key_base64     = data.aws_ssm_parameter.github_app_pem_key.value
    id             = local.github_app_id
    webhook_secret = random_id.random.hex
  }

  # Grab zip files via lambda_download
  webhook_lambda_zip                = "./webhook.zip"
  runner_binaries_syncer_lambda_zip = "./runner-binaries-syncer.zip"
  runners_lambda_zip                = "./runners.zip"
  ami_housekeeper_lambda_zip        = "./ami-housekeeper.zip"

  enable_organization_runners = true
  runner_extra_labels         = ["default", "self-hosted-runners", "runner-ubuntu-jammy-22-lts-amd64"]

  # enable access to the runners via SSM
  enable_ssm_on_runners = true

  ami_owners = [data.aws_caller_identity.current.account_id]

  # When we rebuild an image with packer we replace that value here
  ami_filter = {
    name  = ["github-runner-ubuntu-jammy-22-lts-amd64-202408011442"],
    state = ["available"]
  }

  lambda_s3_bucket              = aws_s3_bucket.github_self_hosted_runner_lambdas_bucket.bucket
  webhook_lambda_s3_key         = "webhook.zip"
  runners_lambda_s3_key         = "runners.zip"
  syncer_lambda_s3_key          = "runner-binaries-syncer.zip"
  ami_housekeeper_lambda_s3_key = "ami-housekeeper.zip"

  # enable S3 versioning for runners S3 bucket
  runner_binaries_s3_versioning = "Enabled"

  # Uncomment idle config to have idle runners from 8 to 5 in time zone Los Angeles Pacific
  idle_config = [{
    cron             = "* * 7-17 * * 1-5"
    timeZone         = "America/Los_Angeles"
    idleCount        = 3
    evictionStrategy = "oldest_first"
  }]

  instance_types = ["t3.large", "m5.large"]

  # override delay of events in seconds
  delay_webhook_event   = 5
  runners_maximum_count = 12

  # set up a FIFO queue to preserve order
  enable_fifo_build_queue = true

  # override scaling down
  scale_down_schedule_expression = "cron(* * * * ? *)"

  runner_run_as                         = "ubuntu"
  enable_userdata                       = false
  userdata_template                     = "./modules/runners/templates/user-data.sh"
  enable_user_data_debug_logging_runner = true

  # prefix GitHub runners with a special name
  runner_name_prefix = "github_self_hosted_"

  # Enable debug logging for the lambda functions
  # log_level = "debug"

  enable_ami_housekeeper = true
  ami_housekeeper_cleanup_config = {
    ssmParameterNames = ["*/ami-id"]
    minimumDaysOld    = 10
    amiFilters = [
      {
        Name   = "name"
        Values = ["*runner-ubuntu*"]
      }
    ]
  }
}

module "webhook_github_app" {
  source     = "philips-labs/github-runner/aws//modules/webhook-github-app"
  depends_on = [module.runners]

  github_app = {
    key_base64     = data.aws_ssm_parameter.github_app_pem_key.value
    id             = local.github_app_id
    webhook_secret = random_id.random.hex
  }
  webhook_endpoint = module.runners.webhook.endpoint
}

Here is my variables.tf:

variable "lambda_s3_bucket" {
  description = "S3 bucket from which to specify lambda functions. This is an alternative to providing local files directly."
  type        = string
  default     = "gh-self-hosted-runner-lambdas"
}

These are the terraform versions I'm using:

terraform -v  
Terraform v1.5.7
on darwin_arm64
+ provider registry.terraform.io/hashicorp/aws v5.63.0
+ provider registry.terraform.io/hashicorp/local v2.5.1
+ provider registry.terraform.io/hashicorp/null v3.2.2
+ provider registry.terraform.io/hashicorp/random v3.6.2

This is the policy that works:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "ssm:GetParameter",
                "ssm:GetParameters",
                "ssm:GetParametersByPath",
                "ssm:DeleteParameter"
            ],
            "Resource": [
                "arn:aws:ssm:us-east-2:xxxxxxxxxxx:parameter/github-action-runners/default/runners/tokens/*",
                "arn:aws:ssm:us-east-2:xxxxxxxxxxx:parameter/github-action-runners/default/runners/config",
                "arn:aws:ssm:us-east-2:xxxxxxxxxxx:parameter/github-action-runners/default/runners/config/*"
            ]
        }
    ]
}

This is what happens when I run terraform apply:

policy = jsonencode(
          ~ {
              ~ Statement = [
                  ~ {
                      ~ Action    = [
                          - "ssm:GetParameter",
                          - "ssm:GetParameters",
                          - "ssm:GetParametersByPath",
                            "ssm:DeleteParameter",
                          + "ssm:GetParameters",
                          + "ssm:GetParameter",
                        ]
                      + Condition = {
                          + StringLike = {
                              + "ec2:SourceInstanceARN" = "*/${aws:ResourceTag/InstanceId}"
                            }
                        }
                      ~ Resource  = [
                          - "arn:aws:ssm:us-east-2:434020178465:parameter/github-action-runners/default/runners/tokens/*",
                          - "arn:aws:ssm:us-east-2:434020178465:parameter/github-action-runners/default/runners/config",
                          - "arn:aws:ssm:us-east-2:434020178465:parameter/github-action-runners/default/runners/config/*",
                        ] -> "arn:aws:ssm:us-east-2:434020178465:parameter/github-action-runners/default/runners/tokens/*"
                        # (1 unchanged attribute hidden)
                    },
                  + {
                      + Action   = [
                          + "ssm:GetParameter",
                          + "ssm:GetParameters",
                          + "ssm:GetParametersByPath",
                        ]
                      + Effect   = "Allow"
                      + Resource = [
                          + "arn:aws:ssm:us-east-2:434020178465:parameter/github-action-runners/default/runners/config",
                          + "arn:aws:ssm:us-east-2:434020178465:parameter/github-action-runners/default/runners/config/*",
                        ]
                    },
                ]
                # (1 unchanged attribute hidden)
            }
        )

I feel like the clues are there. The versions are updated and completely aligned, so it's something else with either the path to the config/tokens or this block:

                     Condition = {
                          StringLike = {
                              "ec2:SourceInstanceARN" = "*/${aws:ResourceTag/InstanceId}"
                          }
                      }

@jkruse14
Contributor

jkruse14 commented Aug 22, 2024

@npalm - I am facing the same issue. I've confirmed my lambdas are up to date at v5.15.2, the same as my module.
image

I also see that my parameters are not tagged:
image

You mentioned the parameters should be tagged with the instance ID - where is this done? I see the PutParameter event for the scale-up lambda, but do not see tags associated with the request. I also couldn't find any AddTagsToResource events in CloudTrail.

Update:
I found where tagging is done in the scale up lambda. I don't know why it wouldn't get tagged based on the code there. I also don't see any AccessDenied or other errors in CloudTrail associated with the lambda (filtering by username with the lambda name).

Let me know if I can provide any additional information

@npalm
Member

npalm commented Aug 22, 2024

That is all very annoying. I am not able to reproduce the problem. Our in-house deployments also use the release lambdas, and we have no problems. Instance variables are properly tagged, and we are currently on the latest release as well.

image

We test the module with the examples, mostly examples/default and examples/multi-runner/ with locally built lambdas.

How can we move forward to solve the issues?

Keep in mind we share the module open source and do our best to keep the quality high. In the end it is open source and there is no commercial value for us. We simply rely on the help we also got from the community.

  • Can you try to deploy either the default example or the multi-runner example and check whether you have the same problem? You can build the lambdas by running the docker build script in the ci dir or, if you have Node, by running yarn && yarn dist in the lambda dir. If you comment out all the lambda zip file settings, the module will pick up the zips created by yarn (npm will work as well).
  • On your failing environment you can enable the debug logging by setting the LOG_LEVEL to debug. Hopefully this gives more information.

Still, the most logical explanation seems to be that, for some reason, your lambda is using the old code :(

@npalm npalm reopened this Aug 22, 2024
@jkruse14
Contributor

jkruse14 commented Aug 22, 2024

@npalm - First the good news - it is working for me now.

The "bad" news is that I had to run terraform destroy, then terraform apply. Not the biggest deal, but just odd.

As I showed above, I had the functions updated per the 5.15.2 tag and saw the timestamps for the S3 objects updated and the last update time on the lambda resource itself updated and in sync. I agree with you it should have worked and I don't think this is an issue with the code in this repo. I confirmed this by opening the zip file from S3 and saw the tagging code was there.

It's very odd. The only thing I can suggest is when updating, confirm with the terraform plan that your lambdas will be updated. That may be obvious and not particularly helpful... but it's what I've got at this point 🤷
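The "confirm with the plan" suggestion can be automated. The sketch below assumes the plan was saved and exported with terraform show -json, and reads the resource_changes list of that JSON output. The function name and the resource addresses in the sample are illustrative, not copied from a real plan.

```python
import json


def lambdas_to_be_changed(plan_json: str) -> list:
    """List addresses of aws_lambda_function resources that the plan would change."""
    plan = json.loads(plan_json)
    changed = []
    for rc in plan.get("resource_changes", []):
        actions = rc.get("change", {}).get("actions", [])
        is_lambda = rc.get("type") == "aws_lambda_function"
        if is_lambda and actions and actions not in (["no-op"], ["read"]):
            changed.append(rc["address"])
    return changed


# Hand-written sample in the shape of `terraform show -json` output.
sample_plan = json.dumps({
    "resource_changes": [
        {"address": "module.runners.aws_lambda_function.scale_up",
         "type": "aws_lambda_function",
         "change": {"actions": ["no-op"]}},
        {"address": "module.runners.aws_lambda_function.webhook",
         "type": "aws_lambda_function",
         "change": {"actions": ["update"]}},
        {"address": "module.runners.aws_iam_role_policy.ssm_parameters",
         "type": "aws_iam_role_policy",
         "change": {"actions": ["update"]}},
    ]
})

print(lambdas_to_be_changed(sample_plan))
```

In this sample the IAM policy is updated while the scale-up lambda is a no-op, which is the combination that would leave new token parameters untagged under the stricter policy.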

@jkruse14
Contributor

@npalm - This may be far-fetched, but could this have to do with SSM parameters, which cannot be overwritten without explicitly providing the overwrite option?

I did not see this the first time I tried to deploy the new version in the staging account we have set up. However, in our dev account, I got this:
image

That made me think it may be a similar issue in the lambda. If the parameter is created, then the lambda tries to update it without overwrite set to true, then there might be an issue. This would also explain why destroying the workspace fixed it. That being said, I would have expected some error message somewhere to have popped up were this the case.

CLI docs here

--overwrite | --no-overwrite (boolean)

Overwrite an existing parameter. The default value is false

I have my doubts since this is creating the token parameter for the instance, but I'm not certain if PutParameter runs on it more than once. I don't have time to investigate further at this moment.
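The overwrite default quoted above can be pictured with a small in-memory model. This is a hypothetical stand-in for Parameter Store, not the AWS SDK and not the lambda code; it only illustrates that a second put on an existing name is rejected unless overwrite is requested. The instance ID in the name is made up.

```python
class ParameterAlreadyExists(Exception):
    """Raised by the model when a put targets an existing name without overwrite."""


class FakeParameterStore:
    """In-memory stand-in for SSM Parameter Store (for illustration only)."""

    def __init__(self):
        self._params = {}

    def put_parameter(self, name: str, value: str, overwrite: bool = False) -> None:
        # Mirrors the documented default: overwrite is False unless requested.
        if name in self._params and not overwrite:
            raise ParameterAlreadyExists(name)
        self._params[name] = value

    def get_parameter(self, name: str) -> str:
        return self._params[name]


store = FakeParameterStore()
name = "/github-action-runners/default/runners/tokens/i-0abc1234"
store.put_parameter(name, "first")
try:
    store.put_parameter(name, "second")
    outcome = "overwritten"
except ParameterAlreadyExists:
    outcome = "rejected"
print(outcome, store.get_parameter(name))
```

The model keeps the first value and rejects the second put, so this theory only applies if the same parameter name is written more than once.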

@videate-josh
Author

@jkruse14 I think you are onto something. I was only able to get it working again with destroy and re-apply so that idea has some legs.

@npalm
Member

npalm commented Aug 23, 2024

Not sure, since parameters are not overwritten. It is the lambda that is tagging the created variable. The variable is always new (name is instance id). So I don't see how the update is involved.

Did you try one of the suggested examples?

@jkruse14
Contributor

I cannot replicate now that everything is up-to-date. Also, to add to the mystery, I did not have to destroy the workspace when applying to our dev account. I did run into some issues with resources apparently not being managed by TF in dev, but not in qa and prod, per the screen shot above, but that wouldn't have to do with this code. I have not seen any issues with the instances not having access to the parameters.

@sjelisonI

I also experienced this issue. I am using a fork that merged code from v5.16.0 and that is building lambdas from source. I downloaded the scale-up lambda source and I see that it contains the code change in PR #3918.

In my case I am using multi-runner with two runner-config templates defined. The first, original runner-config that uses the initial template is working, but the 2nd is failing. When I checked logs I saw that the 2nd one is looking for a parameter such as
github-action-runners/ephemeral/myco-linux-x64-ubuntu24/runners/config/runner-group/Default
and it does not exist. The first, original template uses this:
github-action-runners/ephemeral/myco-linux-x64-ubuntu/runners/config/runner-group/Default
which does exist. So it appears as if the scale-up lambda is able to launch the instance, but then not able to create the SSM parameter, so later when the instance goes looking for it it does not find it and gives the error listed in this issue.

I recalled that I had manually added the first SSM parameter and set the value of it to 1. I don't know why I had to do this initially for the original template--it was some time ago. But for me things began to work when I manually added the 2nd SSM parameter and set its value to 1 as well.

So for my case it would help if I could understand why I had to add these SSM parameters manually and why the lambda code didn't add these.

Before I manually added this SSM parameter, the logs showed this entry:
SSM Parameter "/github-action-runners/ephemeral/myco-linux-x64-ubuntu24/runners/config/runner-group/Default" for Runner group Default does not exist
...but they did NOT show anything else other than one more log entry that says
Ignoring error: HttpError: Not Found - https://docs.github.com/rest/actions/self-hosted-runner-groups#list-self-hosted-runner-groups-for-an-organization
...which is the last log entry.

I see code in scale-up.ts that seems like it should attempt to add the parameter if it did not exist. So I can conclude that the code which was supposed to call putParameter when it found that this SSM parameter did not exist either

  1. did not run (best guess),
  2. failed (though I did NOT see the entry "Error storing runner group id in SSM Parameter Store" in the log), or
  3. ran successfully, but then something else removed the parameter (doubtful).
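The first hypothesis can be illustrated with a simplified model of a "read from SSM, otherwise look up and store" flow. This is not the code from scale-up.ts; the function, the NotFound exception and the dict-based store are hypothetical stand-ins. It assumes the lookup error is swallowed, as the "Ignoring error: HttpError: Not Found" log entry suggests.

```python
class NotFound(Exception):
    """Stand-in for the GitHub API 404 seen in the logs."""


def get_runner_group_id(store: dict, path: str, lookup):
    """Return the cached group ID, or look it up and cache it on success."""
    if path in store:
        return store[path]
    try:
        group_id = lookup()
    except NotFound:
        # The error is ignored, so the put below never runs and the
        # parameter stays missing for every later invocation.
        return None
    store[path] = str(group_id)
    return store[path]


def failing_lookup():
    raise NotFound("list-self-hosted-runner-groups-for-an-organization")


path = "/github-action-runners/ephemeral/myco-linux-x64-ubuntu24/runners/config/runner-group/Default"
store = {}
print(get_runner_group_id(store, path, failing_lookup), path in store)

store[path] = "1"  # the manual workaround described above
print(get_runner_group_id(store, path, failing_lookup))
```

In this model a failed lookup leaves no parameter and no storage error in the log, and manually creating the parameter with value 1 short-circuits the lookup entirely, which is consistent with the behaviour reported above.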

@Fgerthoffert
Contributor

Fgerthoffert commented Oct 12, 2024

I'm also experiencing the same error when using the multi-runner example using v5.17.0, with the following modifications:

  • Uncommented the lambdas to use the local ones
  • Referencing the modules via a remote source (i.e. source = "git::https://github.com/philips-labs/terraform-aws-github-runner.git//examples/base?ref=v5.17.0")

Everything else is exactly the same as in the multi-runner example.

Oct 12 15:57:55 ip-10-0-2-39 user-data: An error occurred (AccessDeniedException) when calling the GetParameter operation: User: arn:aws:sts::REDACTED:assumed-role/sdjzkx-linux-x64-ubuntu-runner-role/i-0bb6079208455690f is not authorized to perform: ssm:GetParameter on resource: arn:aws:ssm:eu-west-1:REDACTED:parameter/github-action-runners/sdjzkx/linux-x64-ubuntu/runners/tokens/i-0bb6079208455690f because no identity-based policy allows the ssm:GetParameter action

I am having a very hard time pinpointing what is going wrong. I tried a simpler config (using the default example with the same modifications for lambdas and source) and I do not face the issue there: the runner starts properly and is able to pick up jobs.

[UPDATE]

Looking at debug logs in cloudwatch, I do see this error for the scale-up group:

    "error": {
        "name": "GetParameterError",
        "location": "/var/task/index.js:151668",
        "message": "UnknownError",
        "stack": "GetParameterError: UnknownError\n    at SSMProvider.get (/var/task/index.js:151668:19)\n    at process.processTicksAndRejections (node:internal/process/task_queues:95:5)\n    at async getParameter (/var/task/index.js:148245:20)\n    at async getRunnerGroupId (/var/task/index.js:147798:27)\n    at async createJitConfig (/var/task/index.js:147997:27)\n    at async createStartRunnerConfig (/var/task/index.js:147964:9)\n    at async createRunners (/var/task/index.js:147842:9)\n    at async scaleUp (/var/task/index.js:147915:13)\n    at async Runtime.scaleUpHandler [as handler] (/var/task/index.js:147082:9)"
    }

And a WARN:

SSM Parameter \"/github-action-runners/ephe5170/runners/config/runner-group/Default\"\n         for Runner group Default does not exist

[UPDATE Bis]
After manually creating a parameter called: /github-action-runners/ephe5170/runners/config/runner-group/Default and setting its value to 1 it did work
