-
Notifications
You must be signed in to change notification settings - Fork 630
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix: Retry aws metadata token download #3292
Conversation
Additional information about this fix. Based on testing, the problem occurred:
Newer versions of Powershell support a |
By the way, the update in this pull request was definitely necessary for me. Please consider these reasons to merge the changes:
Thanks. |
Thanks for creating the PR. I will check the PR asap. Sorry for the delay |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good, tested with the mult-runners
@npalm thanks for merging. We are starting to deploy runners, and generally everything is working. However, there have been a couple cases where a workflow gets stuck because it's missing one (or two) runners. All the jobs will be finished except for one job that doesn't have a runner. I investigated today and found in the scale-down logs: "Runner 'i-04f4184eb5f7660d6' is orphaned and will be removed." I believe that was a similar symptom addressed in this pull request. However, now it's Ubuntu 22, not Windows 2019. Going back to this pull request, all of the focus was on fixing Windows 2019. Linux had been working. Adding the "--retry" on curl, while safe, and a good idea, didn't really get stress tested because Ubuntu was always succeeding. Now what seems to be happening is 1% of the time, infrequently, Ubuntu is failing. I believe curl "--retry" doesn't retry everything. Some errors will get retried, other errors will not get retried. And so, the fix may be to apply the full logic on Ubuntu. That is: check the results of the $token. If $token is empty, go into a loop, and re-request the token. What was done for Windows. So, I will test that, and then send over another update, to review. |
Feel free to make a PR for other scripts as well, thanks for the feedback. |
Adding more thorough logic to fully retry the aws metadata token download, on Linux. Continuing #3292. See explanation there. I have tested and it works, however Ubuntu usually succeeds. We will observe further how this goes or if more changes are needed.
Testing Windows 2019 using multi-runner and these image scripts https://github.com/philips-labs/terraform-aws-github-runner/tree/main/images/windows-core-2019 . Installed a number of choco packages in the AMI such as cmake, python, curl, visual studio. Ran github actions. Most of the time, the windows 2019 runners would be launched but never actually process any jobs. The jobs would be frozen forever.
Log into the server and checked runner-startup.log. Here is the output:
What's happening is it fails to retrieve the aws metadata token on the first attempt, and then all subsequent actions fail.
The solution was to add a retry when requesting the metadata token. This fixed the problem.
For other operating systems besides Windows 2019, even if they usually succeed, adding a retry doesn't hurt anything, and helps in case there are any random connection failures.