
Losing network for a while can endup with the runner running forever (GH at least) #1014

Closed
DavidGOrtega opened this issue May 24, 2022 · 17 comments
Assignees
Labels
blocked (Dependent on something else), ci-bitbucket, ci-github, ci-gitlab, cml-runner (Subcommand), p1-important (High priority)

Comments

@DavidGOrtega (Contributor) commented May 24, 2022

Even though we have added the job check on idleTimeout, the repo can get stuck with the job running and the runner status busy, at least until the job timeout, or forever in the worst-case scenario (confirming...)
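For illustration, a minimal sketch (not CML's actual implementation) of an idle-timeout loop that defers shutdown while the forge reports the runner as busy; the owner/repo values come from the logs below, and the timeout and shutdown() stub are placeholders. If the API keeps reporting busy for a dead job after a network drop, this loop never tears the instance down:

```ts
import { Octokit } from "@octokit/rest";

const octokit = new Octokit({ auth: process.env.GITHUB_TOKEN });
const owner = "DavidGOrtega"; // example values taken from the logs in this issue
const repo = "fashion_mnist";
const IDLE_TIMEOUT_MS = 5 * 60 * 1000; // hypothetical idle timeout

// Ask GitHub whether the named self-hosted runner is currently running a job.
async function runnerIsBusy(runnerName: string): Promise<boolean> {
  const { data } = await octokit.rest.actions.listSelfHostedRunnersForRepo({ owner, repo });
  return data.runners.some((r) => r.name === runnerName && r.busy);
}

// Shut down only after the runner has been idle for IDLE_TIMEOUT_MS.
// If GitHub keeps reporting busy=true for a job that is actually dead,
// this waits forever, which is the behavior described in this issue.
async function idleWatch(runnerName: string): Promise<void> {
  let idleSince = Date.now();
  for (;;) {
    if (await runnerIsBusy(runnerName)) {
      idleSince = Date.now(); // busy: reset the idle clock
    } else if (Date.now() - idleSince > IDLE_TIMEOUT_MS) {
      return shutdown(); // idle long enough: unregister and terminate
    }
    await new Promise((resolve) => setTimeout(resolve, 30_000));
  }
}

function shutdown(): void {
  /* placeholder: unregister the runner and terminate the cloud instance */
}
```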

@DavidGOrtega self-assigned this May 24, 2022
@DavidGOrtega added p1-important (High priority), cml-runner (Subcommand), p0-critical (Max priority, ASAP), ci-gitlab, ci-github, ci-bitbucket and removed p1-important (High priority) labels May 24, 2022
@DavidGOrtega (Contributor, Author)

The job was meant to be sleep 120.

[screenshots]

@DavidGOrtega (Contributor, Author)

info: Launching github runner
info: runner status {"date":"2022-05-24T09:25:02.621Z","repo":"https://github.com/DavidGOrtega/fashion_mnist"}
info: runner status √ Connected to GitHub {"date":"2022-05-24T09:25:02.623Z","repo":"https://github.com/DavidGOrtega/fashion_mnist"}
warn: SpotNotifier can not be started.
info: runner status Current runner version: '2.292.0' {"date":"2022-05-24T09:25:03.503Z","repo":"https://github.com/DavidGOrtega/fashion_mnist"}
info: runner status Listening for Jobs {"date":"2022-05-24T09:25:03.504Z","repo":"https://github.com/DavidGOrtega/fashion_mnist","status":"ready"}
info: runner status Running job: train {"date":"2022-05-24T09:25:33.331Z","job":"gh","repo":"https://github.com/DavidGOrtega/fashion_mnist","status":"job_started"}
info: runner status Runner connect error: The HTTP request timed out after 00:01:00.. Retrying until reconnected. {"date":"2022-05-24T09:29:05.389Z","repo":"https://github.com/DavidGOrtega/fashion_mnist"}
info: runner status Runner reconnected. {"date":"2022-05-24T09:30:22.225Z","repo":"https://github.com/DavidGOrtega/fashion_mnist"}

@dacbd (Contributor) commented May 24, 2022

From other experiences with self-hosted runners, this is not really a cml runner issue.

The only change I would recommend is to start an idle-timeout check after detecting the "reconnected" event.
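A rough sketch of that idea, assuming the wrapper process has access to the runner's stdout; the log patterns are taken from the output above, while the path and timeout are placeholders:

```ts
import { spawn } from "node:child_process";
import { createInterface } from "node:readline";

const IDLE_TIMEOUT_MS = 5 * 60 * 1000; // placeholder idle timeout
let idleTimer: NodeJS.Timeout | undefined;

// Launch the GitHub Actions runner and watch its log output line by line.
const runner = spawn("./run.sh", { cwd: "/path/to/actions-runner" }); // placeholder path
const lines = createInterface({ input: runner.stdout });

// (Re)start the idle clock; if nothing else happens, stop the runner process.
function armIdleTimeout(): void {
  if (idleTimer) clearTimeout(idleTimer);
  idleTimer = setTimeout(() => runner.kill("SIGTERM"), IDLE_TIMEOUT_MS);
}

lines.on("line", (line) => {
  if (/Running job:/.test(line) && idleTimer) clearTimeout(idleTimer); // a job is active
  if (/Runner reconnected\./.test(line)) armIdleTimeout(); // don't trust the API: restart the clock
});
```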

@dacbd (Contributor) commented May 24, 2022

The crux is that we can't really trust the GH API's job status after something like this.

@DavidGOrtega (Contributor, Author)

Indeed, right now the pipeline has failed, but GH says that the job is still running (I also checked via the API).

[screenshot]

So our solution for making the runner more stable is not worthwhile, or at least very unreliable, since we depend on those statuses.
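The kind of API cross-check being described, sketched with Octokit (the run_id is a placeholder): the workflow run can already report a failure while one of its jobs is still listed as in_progress.

```ts
import { Octokit } from "@octokit/rest";

const octokit = new Octokit({ auth: process.env.GITHUB_TOKEN });
const owner = "DavidGOrtega"; // repo from the logs in this issue
const repo = "fashion_mnist";
const run_id = 123456789; // placeholder: the run shown in the screenshot above

async function crossCheck(): Promise<void> {
  // The workflow run itself can already report a conclusion...
  const run = await octokit.rest.actions.getWorkflowRun({ owner, repo, run_id });
  console.log("run:", run.data.status, run.data.conclusion); // e.g. "completed" / "failure"

  // ...while one of its jobs is still reported as "in_progress".
  const jobs = await octokit.rest.actions.listJobsForWorkflowRun({ owner, repo, run_id });
  for (const job of jobs.data.jobs) {
    console.log("job:", job.name, job.status, job.conclusion);
  }
}

crossCheck().catch(console.error);
```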

@DavidGOrtega (Contributor, Author)

[screenshot]

In this particular case the runner is idle, so checking the runner status would be more effective than looking at the job that stays running.

[screenshot]

@dacbd (Contributor) commented May 24, 2022

IDK, this is not an easy case to really handle in the context of cml / what we have right now. If there are network problems, then we will likely have problems making the API calls to delete the instance, unless the connectivity issue is directly with GitHub, in which case I think the instance should just terminate, since it can't get jobs or finish any that it has (in a meaningful way)...

@DavidGOrtega (Contributor, Author)

think the instance should just terminate since it can't get jobs or finish any that it has

I agree. If the connection is lost, the runner must shut down to avoid having a cloud runner running forever. We could have a flag to force it to continue.

@casperdcl (Contributor)

Instance termination on error seems sensible (to avoid spiralling costs).

Not sure about debugging, though (some users may prefer not to have runners shut down? Maybe add a flag to prevent auto-termination?).

@dacbd changed the title Lossing network for a while can endup with the runner running forever (GH at least) Losing network for a while can endup with the runner running forever (GH at least) May 25, 2022
@DavidGOrtega (Contributor, Author)

Just out of curiosity: my pipeline still displays the job as running.

@dacbd (Contributor) commented Jun 2, 2022

Just out of curiosity: my pipeline still displays the job as running.

I think it will for a while 🙈 😆

@casperdcl (Contributor) commented Jun 20, 2022

While reviewing the 72h -> 35d GHA self-hosted runner timeout change (#1064), I stumbled on:

A self-hosted runner is automatically removed from GitHub if it has not connected to GitHub Actions for more than 30 days. [1]

So even GH eventually shuns unreachable runners. We should certainly shut them down.

Footnotes

[1] https://docs.github.com/en/actions/hosting-your-own-runners/about-self-hosted-runners#about-self-hosted-runners

@casperdcl added p1-important (High priority) and removed p0-critical (Max priority, ASAP) labels Jun 28, 2022
@dacbd (Contributor) commented Oct 24, 2022

I don't think we have anything actionable here. When the connection is lost and the agent can't resume it, the process should exit and the runner should terminate. If the network is completely disconnected, I think we can safely say that is beyond the scope of what can be handled.

@casperdcl (Contributor) commented Oct 28, 2022

safely say that is beyond the scope

Disagree; network disconnects are clearly sometimes possible, and when (not if) they occur they result in orphaned, forever-running cloud instances? That's bad. The cloud instances should be able to self-terminate. Related: iterative/terraform-provider-iterative#289.

But I have a feeling I haven't understood correctly.
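One possible self-termination path that does not depend on outbound API calls, sketched below under the assumption that the cloud instance's shutdown behaviour is configured to terminate on power-off (as it can be on AWS EC2, for example). This is a hypothetical escalation path, not what cml or TPI currently do:

```ts
import { execFile } from "node:child_process";

const MAX_DISCONNECTED_MS = 30 * 60 * 1000; // placeholder grace period
let lastSeenOnline = Date.now();

// Last-resort self-termination: if the instance's shutdown behaviour is set to
// "terminate", powering off locally destroys it even when no cloud API is reachable.
function selfDestruct(): void {
  execFile("shutdown", ["-h", "now"], (error) => {
    if (error) console.error("local shutdown failed:", error);
  });
}

// Probe GitHub periodically (global fetch, Node 18+); if it stays unreachable
// for too long, give up and power off instead of running forever.
setInterval(async () => {
  try {
    await fetch("https://api.github.com", { method: "HEAD" });
    lastSeenOnline = Date.now();
  } catch {
    if (Date.now() - lastSeenOnline > MAX_DISCONNECTED_MS) selfDestruct();
  }
}, 60_000);
```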

@dacbd (Contributor) commented Oct 31, 2022

network disconnects are clearly sometimes possible

Correct, and that shouldn't break it. I have observed GitHub Actions handle minor network interruptions just fine.

and when (not if) they occur it results in orphaned forever-running cloud instances? That's bad. The cloud instances should be able to self-terminate.

Hiccups are not the issue; the only time I've seen the described behavior is when something went very wrong with the instance. So my point is that seg-faulting, OOM (to a lesser extent), and pulling the network plug are not things we can recover from. If no network connection exists, we can't make the API calls to delete the machine.

@casperdcl (Contributor)

I suspect we're having multiple different conversations here xD

@casperdcl added the blocked (Dependent on something else) label Nov 8, 2022
@dacbd (Contributor) commented Nov 9, 2022

I think from our internal discussion we can close this as not planned. If someone disagrees, go ahead and re-open (preferably with a clear case or something reproducible 😉).

@dacbd closed this as not planned (won't fix, can't repro, duplicate, stale) Nov 9, 2022