The self-hosted runner: xxx lost communication with the server #3539
Comments
We have also been getting the same error for the past couple of days.
I am facing the same error on AWS CodeBuild.
We are facing the same issue.
+1 with both v2.320.0 and v2.321.0
We have also started to encounter this in the past 24 hours.
We started having this issue this morning; upgrading to v2.321.0 seems to have resolved it.
We already upgraded to v2.321.0 a few days ago but still run into this issue from time to time.
We have also kept encountering it over the past 3-4 days.
This sounds similar to what we're experiencing, and it started happening this week. We're not using CodeBuild, but runners running on ECS (https://github.com/CloudSnorkel/cdk-github-runners/): jobs are picked up and completed, but nothing is being reported back to GitHub. The symptoms first showed up as a hanging step that never terminated and would hit a timeout. I have since taken down all resources and deployed everything from scratch, and now nothing is being reported back to GitHub at all. All runs just get stuck at "Waiting for a runner to pick up this job...", but the jobs still run. We'll try switching back to GitHub-hosted runners in the morning and see if we have time to troubleshoot a bit more.
It's super weird with the step that never terminates, we tried to drop in a
We operate in
Adding a bit more context: to my surprise, I woke up to a run that was retriggered automatically 9 hours later, and everything was reported as intended. The logs for both runs are here:
They look nearly identical (don't mind them being canceled; the run is intended to fail). The first run was never detected as being picked up, but the second run worked as intended.
My issue is the same as the one reported in issue 2624. That issue was closed without a solution. We use AWS CodeBuild as the self-hosted platform.
This issue is happening on my repository too, frequently. I don't think the EC2 instance is starving and dying because the issue happens in different steps. We use EC2 large (8vCPUs 15GB of memory) to run the workflow.
Sometimes a step runs to the end; sometimes it is aborted in the middle.
The workflow is below:
It may be just a coincidence, but when I saw this issue happen, one of the workflows had finished with an error (it ran to completion, but finished with an error, say, because some unit test failed). Then the other job that was running on the other runner stopped executing in the middle. And then in the "Annotations" section I get this message: "The self-hosted runner: b4ac7d30-8387-4499-a899-f75d06e2941f lost communication with the server."
When I go check the logs of that runner on AWS, there is no error message. The build just stops running in the middle.
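Since the runner's own log shows nothing, one way to cross-check is to look at what GitHub itself reports for the runner. Below is a minimal sketch, assuming the JSON from the REST endpoint `GET /repos/{owner}/{repo}/actions/runners` has already been fetched (it returns a `runners` array whose entries carry `name`, `status`, and `busy`). The sample payload, the runner names, and the helper function names are made up for illustration. This only helps diagnose the state; it does not fix the disconnect.

```python
import json

# Hypothetical sample of a list-runners response; the names are invented.
SAMPLE = json.loads("""
{
  "total_count": 2,
  "runners": [
    {"id": 1, "name": "runner-a", "status": "online", "busy": true},
    {"id": 2, "name": "runner-b", "status": "offline", "busy": false}
  ]
}
""")


def offline_runners(payload):
    """Return the names of runners whose status is not "online"."""
    return [
        r["name"]
        for r in payload.get("runners", [])
        if r.get("status") != "online"
    ]


def busy_runners(payload):
    """Return the names of runners GitHub believes are executing a job."""
    return [r["name"] for r in payload.get("runners", []) if r.get("busy")]


if __name__ == "__main__":
    print("offline:", offline_runners(SAMPLE))
    print("busy:", busy_runners(SAMPLE))
```

If a runner shows up as offline here while its build process is still alive on the AWS side, that would suggest the connection to GitHub was lost rather than the job crashing.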
The error is not limited to a single job; it can happen in any of the jobs in that workflow. Some jobs complete successfully, one completes with an error (such as a unit test, type checking, or linting error), and another job is apparently aborted in the middle.
This was the log on the AWS runner (for one of the cases where this issue happened; this time, on the linting job):