Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: Retry aws metadata token download #3292

Merged
merged 1 commit into from
Jul 20, 2023
Merged

Conversation

sdarwin
Copy link
Contributor

@sdarwin sdarwin commented May 25, 2023

Testing Windows 2019 using multi-runner and these image scripts https://github.com/philips-labs/terraform-aws-github-runner/tree/main/images/windows-core-2019 . Installed a number of choco packages in the AMI such as cmake, python, curl, visual studio. Ran github actions. Most of the time, the windows 2019 runners would be launched but never actually process any jobs. The jobs would be frozen forever.

Log into the server and checked runner-startup.log. Here is the output:

Retrieving TOKEN from AWS API
PS>TerminatingError(Invoke-RestMethod): "Unable to connect to the remote server"
Invoke-RestMethod : Unable to connect to the remote server
At C:\start-runner.ps1:21 char:8
+ $token=Invoke-RestMethod -Method PUT -Uri "http://169.254.169.254/lat ...
+        ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    + CategoryInfo          : InvalidOperation: (System.Net.HttpWebRequest:HttpWebRequest) [Invoke-RestMethod],
WebException
    + FullyQualifiedErrorId : WebCmdletWebResponseException,Microsoft.PowerShell.Commands.InvokeRestMethodCommand
Invoke-RestMethod : Unable to connect to the remote server
At C:\start-runner.ps1:21 char:8
+ $token=Invoke-RestMethod -Method PUT -Uri "http://169.254.169.254/lat ...
+        ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    + CategoryInfo          : InvalidOperation: (System.Net.HttpWebRequest:HttpWebRequest) [Invoke-RestMethod], WebExc
   eption
    + FullyQualifiedErrorId : WebCmdletWebResponseException,Microsoft.PowerShell.Commands.InvokeRestMethodCommand

>> TerminatingError(Invoke-RestMethod): "Object reference not set to an instance of an object."
Invoke-RestMethod : Object reference not set to an instance of an object.
At C:\start-runner.ps1:23 char:9
+ $ami_id=Invoke-RestMethod -Uri "http://169.254.169.254/latest/meta-da ...
+         ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    + CategoryInfo          : NotSpecified: (:) [Invoke-RestMethod], NullReferenceException
    + FullyQualifiedErrorId : System.NullReferenceException,Microsoft.PowerShell.Commands.InvokeRestMethodCommand
Invoke-RestMethod : Object reference not set to an instance of an object.
At C:\start-runner.ps1:23 char:9
+ $ami_id=Invoke-RestMethod -Uri "http://169.254.169.254/latest/meta-da ...

What's happening is it fails to retrieve the aws metadata token on the first attempt, and then all subsequent actions fail.

The solution was to add a retry when requesting the metadata token. This fixed the problem.

For other operating systems besides Windows 2019, even if they usually succeed, adding a retry doesn't hurt anything, and helps in case there are any random connection failures.

@sdarwin sdarwin changed the title Retry aws metadata token download fix: Retry aws metadata token download May 25, 2023
@sdarwin
Copy link
Contributor Author

sdarwin commented May 29, 2023

Additional information about this fix. Based on testing, the problem occurred:

  • frequently on windows 2019
  • infrequently on windows 2022
  • not on Ubuntu

Newer versions of Powershell support a -MaximumRetryCount flag, which would probably be preferable, but the default powershell doesn't have that option. So a simple loop mechanism instead. In any case, adding retry to a network download makes it more resilient.

@sdarwin
Copy link
Contributor Author

sdarwin commented Jun 16, 2023

By the way, the update in this pull request was definitely necessary for me. Please consider these reasons to merge the changes:

  1. wget will retry any downloads, by default. curl won't retry, without adding the retry flag. In order to make CI generally more robust, it can be a good idea to search for any instances of curl and add retries. Everywhere. Why have failed CI jobs due to sporadic network problems that could easily be fixed by using retries. So the idea is to retry all the time.

  2. The issue seems mostly related to Windows 2019. In your environment, are you extensively using Windows 2019 with many pre-installed packages? If not, it's within the realm of possibility the issue is affecting windows users, but it's not getting much attention because Linux is more prevalent. But it's still worthwhile to fix a small windows bug.

Thanks.

@npalm npalm self-requested a review June 30, 2023 11:25
@npalm
Copy link
Member

npalm commented Jun 30, 2023

Thanks for creating the PR. I will check the PR asap. Sorry for the delay

Copy link
Member

@npalm npalm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, tested with the mult-runners

@npalm npalm merged commit 5537474 into philips-labs:main Jul 20, 2023
@sdarwin
Copy link
Contributor Author

sdarwin commented Aug 1, 2023

@npalm thanks for merging. We are starting to deploy runners, and generally everything is working. However, there have been a couple cases where a workflow gets stuck because it's missing one (or two) runners. All the jobs will be finished except for one job that doesn't have a runner. I investigated today and found in the scale-down logs: "Runner 'i-04f4184eb5f7660d6' is orphaned and will be removed." I believe that was a similar symptom addressed in this pull request. However, now it's Ubuntu 22, not Windows 2019. Going back to this pull request, all of the focus was on fixing Windows 2019. Linux had been working. Adding the "--retry" on curl, while safe, and a good idea, didn't really get stress tested because Ubuntu was always succeeding. Now what seems to be happening is 1% of the time, infrequently, Ubuntu is failing. I believe curl "--retry" doesn't retry everything. Some errors will get retried, other errors will not get retried. And so, the fix may be to apply the full logic on Ubuntu. That is: check the results of the $token. If $token is empty, go into a loop, and re-request the token. What was done for Windows. So, I will test that, and then send over another update, to review.

@npalm
Copy link
Member

npalm commented Aug 2, 2023

Feel free to make a PR for other scripts as well, thanks for the feedback.

npalm pushed a commit that referenced this pull request Aug 8, 2023
Adding more thorough logic to fully retry the aws metadata token
download, on Linux. Continuing
#3292.
See explanation there.

I have tested and it works, however Ubuntu usually succeeds. We will
observe further how this goes or if more changes are needed.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants