Losing network for a while can end up with the runner running forever (GH at least) #1014
Comments
```
info: Launching github runner
info: runner status {"date":"2022-05-24T09:25:02.621Z","repo":"https://github.com/DavidGOrtega/fashion_mnist"}
info: runner status √ Connected to GitHub {"date":"2022-05-24T09:25:02.623Z","repo":"https://github.com/DavidGOrtega/fashion_mnist"}
warn: SpotNotifier can not be started.
info: runner status Current runner version: '2.292.0' {"date":"2022-05-24T09:25:03.503Z","repo":"https://github.com/DavidGOrtega/fashion_mnist"}
info: runner status Listening for Jobs {"date":"2022-05-24T09:25:03.504Z","repo":"https://github.com/DavidGOrtega/fashion_mnist","status":"ready"}
info: runner status Running job: train {"date":"2022-05-24T09:25:33.331Z","job":"gh","repo":"https://github.com/DavidGOrtega/fashion_mnist","status":"job_started"}
info: runner status Runner connect error: The HTTP request timed out after 00:01:00.. Retrying until reconnected. {"date":"2022-05-24T09:29:05.389Z","repo":"https://github.com/DavidGOrtega/fashion_mnist"}
info: runner status Runner reconnected. {"date":"2022-05-24T09:30:22.225Z","repo":"https://github.com/DavidGOrtega/fashion_mnist"}
```
From other experiences with self-hosted runners, this is not really unusual; the only change I would recommend would be to start an idle-timeout check after detecting the "reconnected" event.
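A minimal sketch of that idea, assuming a readable stream of runner output and a caller-supplied shutdown callback; the function names, matched strings, and the 5-minute value are illustrative, not CML's actual implementation:

```js
// Hypothetical watchdog: arm an idle timeout once a "reconnected" line is
// seen, and disarm it when the runner reports normal job activity again.
const IDLE_TIMEOUT_MS = 5 * 60 * 1000; // assumed 5-minute grace period

function watchRunnerOutput(logStream, shutdownRunner) {
  let idleTimer = null;

  logStream.on('data', (chunk) => {
    const line = chunk.toString();

    if (line.includes('Runner reconnected')) {
      // After a reconnect, the job status reported upstream may be stale,
      // so start a countdown instead of trusting it blindly.
      clearTimeout(idleTimer);
      idleTimer = setTimeout(
        () => shutdownRunner('no job activity observed after reconnect'),
        IDLE_TIMEOUT_MS
      );
    } else if (line.includes('Running job') || line.includes('completed')) {
      // Normal activity observed; disarm the watchdog.
      clearTimeout(idleTimer);
      idleTimer = null;
    }
  });
}
```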
The crux is that we can't really trust the GH API's job status after something like this.
IDK, this is not an easy case to really handle in the context of cml/what we have right now. If there are network problems then we will likely have problems making the API calls to delete the instance, unless the connectivity issue is directly with GitHub, in which case I think the instance should just terminate since it can't get jobs or finish any that it has (in a meaningful way)...
I agree. If the connection is lost, the runner must shut down to avoid having a cloud runner running forever. Having a flag to force it to continue could be an option.
Instance termination on error seems sensible (to avoid spiralling costs). Not sure about debugging, though (some users may prefer not to have runners shut down? Maybe add a flag to prevent auto-termination?)
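For illustration only, a sketch of how such an opt-out could gate the shutdown path; the `noAutoTerminate` option and the `terminateInstance` helper are hypothetical, not part of CML's CLI:

```js
// Hypothetical error handler: terminate the cloud instance by default,
// but let users opt out (e.g. to SSH in and debug a stuck runner).
async function handleRunnerError(error, opts, terminateInstance) {
  console.error(`runner error: ${error.message}`);

  if (opts.noAutoTerminate) {
    console.warn('auto-termination disabled; leaving the instance running');
    return;
  }

  // Default behaviour: tear the instance down to avoid paying for an
  // orphaned runner that will never pick up another job.
  await terminateInstance();
  process.exit(1);
}
```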
Just out of curiosity: my pipeline still displays the job as running.
I think it will for a while 🙈 😆
While reviewing the 72h -> 35d GHA self-hosted runner timeout change (#1064), I stumbled on:
So even GH eventually shuns unreachable runners. We should certainly shut them down.
I don't think we have anything actionable here. When the connection is lost and the agent can't resume it, the process should exit and the runner should terminate. If the network is completely disconnected, I think we can safely say that is beyond the scope of what can be handled.
Disagree; network disconnects are clearly sometimes possible, and when (not if) they occur, the result is orphaned forever-running cloud instances. That's bad. The cloud instances should be able to self-terminate. Related: iterative/terraform-provider-iterative#289. But I have a feeling I haven't understood correctly.
Correct, and that shouldn't break it. I have observed GitHub Actions handle minor network interruptions just fine.
Hiccups are not the issue; the only time I've seen the described behavior is when something went very wrong with the instance. So my point is that seg-faulting, OOM (to a lesser extent), and pulling the network plug are not things we can recover from. If no network connection exists, we can't make the API calls to delete the machine.
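One hedged idea for the fully disconnected case: even without reaching the provider's API, the instance can power itself off locally, which at least stops accruing compute charges (storage may still bill, depending on the provider). The helper below is a sketch and assumes the runner process can invoke `shutdown` via sudo:

```js
const { execFile } = require('child_process');

// Last-resort local shutdown for when provider API calls are impossible.
function selfShutdown(reason) {
  console.error(`powering off instance locally: ${reason}`);
  // Halting the machine requires no network connectivity at all.
  execFile('sudo', ['shutdown', '-h', 'now'], (err) => {
    if (err) console.error(`local shutdown failed: ${err.message}`);
  });
}
```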
I suspect we're having multiple different conversations here xD
I think from our internal discussion we can close this as not planned. If someone disagrees, go ahead and re-open (preferably with a clear case or something reproducible 😉).
Even though we have added the job check on idleTimeout, the repo can get stuck with the job showing as running and the runner status as busy until the job times out at least, or forever in the worst-case scenario (confirming...)
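As a sketch of what that idleTimeout job check could look like, using `@octokit/rest`; the token, owner/repo, and job id are placeholders, and as noted above the reported status can lag reality after an outage:

```js
const { Octokit } = require('@octokit/rest');

// Ask GitHub whether the job the runner claims to be busy with is still
// queued or in progress; if not, the runner is probably stuck.
async function jobStillRunning({ token, owner, repo, jobId }) {
  const octokit = new Octokit({ auth: token });
  const { data: job } = await octokit.rest.actions.getJobForWorkflowRun({
    owner,
    repo,
    job_id: jobId,
  });
  return job.status === 'queued' || job.status === 'in_progress';
}
```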