
Losing network for a while can endup with the runner running forever (GH at least) #1014

Closed
DavidGOrtega opened this issue May 24, 2022 · 17 comments
Assignees
Labels
blocked (Dependent on something else), ci-bitbucket, ci-github, ci-gitlab, cml-runner (Subcommand), p1-important (High priority)

Comments

@DavidGOrtega (Contributor) commented May 24, 2022

Even though we have added the job check on idleTimeout, the repo can get stuck with the job running and the runner status busy, at least until the job timeout, or forever in the worst-case scenario (confirming...)
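For illustration, a minimal sketch (not CML's actual implementation) of an idle-timeout loop that defers shutdown while the forge reports the runner as busy; the owner/repo values come from the logs below, and the timeout and shutdown() stub are placeholders. If the API keeps reporting busy for a dead job after a network drop, this loop never tears the instance down:

```ts
import { Octokit } from "@octokit/rest";

const octokit = new Octokit({ auth: process.env.GITHUB_TOKEN });
const owner = "DavidGOrtega"; // example values taken from the logs in this issue
const repo = "fashion_mnist";
const IDLE_TIMEOUT_MS = 5 * 60 * 1000; // hypothetical idle timeout

// Ask GitHub whether the named self-hosted runner is currently running a job.
async function runnerIsBusy(runnerName: string): Promise<boolean> {
  const { data } = await octokit.rest.actions.listSelfHostedRunnersForRepo({ owner, repo });
  return data.runners.some((r) => r.name === runnerName && r.busy);
}

// Shut down only after the runner has been idle for IDLE_TIMEOUT_MS.
// If GitHub keeps reporting busy=true for a job that is actually dead,
// this waits forever, which is the behavior described in this issue.
async function idleWatch(runnerName: string): Promise<void> {
  let idleSince = Date.now();
  for (;;) {
    if (await runnerIsBusy(runnerName)) {
      idleSince = Date.now(); // busy: reset the idle clock
    } else if (Date.now() - idleSince > IDLE_TIMEOUT_MS) {
      return shutdown(); // idle long enough: unregister and terminate
    }
    await new Promise((resolve) => setTimeout(resolve, 30_000));
  }
}

function shutdown(): void {
  /* placeholder: unregister the runner and terminate the cloud instance */
}
```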

@DavidGOrtega self-assigned this May 24, 2022
@DavidGOrtega added p1-important (High priority), cml-runner (Subcommand), p0-critical (Max priority, ASAP), ci-gitlab, ci-github, ci-bitbucket and removed p1-important (High priority) labels May 24, 2022
@DavidGOrtega (Contributor, Author)

The job was meant to be sleep 120.

[screenshots]

@DavidGOrtega (Contributor, Author)

info: Launching github runner
info: runner status {"date":"2022-05-24T09:25:02.621Z","repo":"https://github.com/DavidGOrtega/fashion_mnist"}
info: runner status √ Connected to GitHub {"date":"2022-05-24T09:25:02.623Z","repo":"https://github.com/DavidGOrtega/fashion_mnist"}
warn: SpotNotifier can not be started.
info: runner status Current runner version: '2.292.0' {"date":"2022-05-24T09:25:03.503Z","repo":"https://github.com/DavidGOrtega/fashion_mnist"}
info: runner status Listening for Jobs {"date":"2022-05-24T09:25:03.504Z","repo":"https://github.com/DavidGOrtega/fashion_mnist","status":"ready"}
info: runner status Running job: train {"date":"2022-05-24T09:25:33.331Z","job":"gh","repo":"https://github.com/DavidGOrtega/fashion_mnist","status":"job_started"}
info: runner status Runner connect error: The HTTP request timed out after 00:01:00.. Retrying until reconnected. {"date":"2022-05-24T09:29:05.389Z","repo":"https://github.com/DavidGOrtega/fashion_mnist"}
info: runner status Runner reconnected. {"date":"2022-05-24T09:30:22.225Z","repo":"https://github.com/DavidGOrtega/fashion_mnist"}

@dacbd (Contributor) commented May 24, 2022

From other experiences with self-hosted runners, this is not really a cml runner issue.

The only change I would recommend is to start an idle-timeout check after detecting the "reconnected" event.
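A rough sketch of that idea, assuming the wrapper process has access to the runner's stdout; the log patterns are taken from the output above, while the path and timeout are placeholders:

```ts
import { spawn } from "node:child_process";
import { createInterface } from "node:readline";

const IDLE_TIMEOUT_MS = 5 * 60 * 1000; // placeholder idle timeout
let idleTimer: NodeJS.Timeout | undefined;

// Launch the GitHub Actions runner and watch its log output line by line.
const runner = spawn("./run.sh", { cwd: "/path/to/actions-runner" }); // placeholder path
const lines = createInterface({ input: runner.stdout });

// (Re)start the idle clock; if nothing else happens, stop the runner process.
function armIdleTimeout(): void {
  if (idleTimer) clearTimeout(idleTimer);
  idleTimer = setTimeout(() => runner.kill("SIGTERM"), IDLE_TIMEOUT_MS);
}

lines.on("line", (line) => {
  if (/Running job:/.test(line) && idleTimer) clearTimeout(idleTimer); // a job is active
  if (/Runner reconnected\./.test(line)) armIdleTimeout(); // don't trust the API: restart the clock
});
```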

@dacbd (Contributor) commented May 24, 2022

The crux is that we can't really trust the GH API's job status after something like this.

@DavidGOrtega (Contributor, Author)

Indeed, right now the pipeline has failed, but GH says that the job is still running (I also checked via the API).

[screenshot]

So our solution for making the runner more stable is not worthwhile, or at least very unreliable, since we depend on those statuses.
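The kind of API cross-check being described, sketched with Octokit (the run_id is a placeholder): the workflow run can already report a failure while one of its jobs is still listed as in_progress.

```ts
import { Octokit } from "@octokit/rest";

const octokit = new Octokit({ auth: process.env.GITHUB_TOKEN });
const owner = "DavidGOrtega"; // repo from the logs in this issue
const repo = "fashion_mnist";
const run_id = 123456789; // placeholder: the run shown in the screenshot above

async function crossCheck(): Promise<void> {
  // The workflow run itself can already report a conclusion...
  const run = await octokit.rest.actions.getWorkflowRun({ owner, repo, run_id });
  console.log("run:", run.data.status, run.data.conclusion); // e.g. "completed" / "failure"

  // ...while one of its jobs is still reported as "in_progress".
  const jobs = await octokit.rest.actions.listJobsForWorkflowRun({ owner, repo, run_id });
  for (const job of jobs.data.jobs) {
    console.log("job:", job.name, job.status, job.conclusion);
  }
}

crossCheck().catch(console.error);
```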

@DavidGOrtega (Contributor, Author)

[screenshot]

In this particular case the runner is idle, so checking the runner status would be more effective than looking at the job that stays running.

[screenshot]

@dacbd (Contributor) commented May 24, 2022

IDK, this is not an easy case to really handle in the context of cml / what we have right now. If there are network problems, then we will likely have problems making the API calls to delete the instance, unless the connectivity issue is directly with GitHub, in which case I think the instance should just terminate, since it can't get jobs or finish any that it has (in a meaningful way)...

@DavidGOrtega (Contributor, Author)

think the instance should just terminate since it can't get jobs or finish any that it has

I agree. If the connection is lost, the runner must shut down to avoid having a cloud runner running forever. We could have a flag to force it to continue.

@casperdcl (Contributor)

Instance termination on error seems sensible (to avoid spiralling costs).

Not sure about debugging, though (some users may prefer not to have runners shut down? Maybe add a flag to prevent auto-termination?).

@dacbd changed the title Lossing network for a while can endup with the runner running forever (GH at least) Losing network for a while can endup with the runner running forever (GH at least) May 25, 2022
@DavidGOrtega (Contributor, Author)

Just out of curiosity: my pipeline still displays the job as running.

@dacbd (Contributor) commented Jun 2, 2022

Just out of curiosity: my pipeline still displays the job as running.

I think it will for a while 🙈 😆

@casperdcl (Contributor) commented Jun 20, 2022

While reviewing the 72h -> 35d GHA self-hosted runner timeout change (#1064), I stumbled on:

A self-hosted runner is automatically removed from GitHub if it has not connected to GitHub Actions for more than 30 days. [1]

So even GH eventually shuns unreachable runners. We should certainly shut them down.

Footnotes

[1] https://docs.github.com/en/actions/hosting-your-own-runners/about-self-hosted-runners#about-self-hosted-runners

@casperdcl added p1-important (High priority) and removed p0-critical (Max priority, ASAP) labels Jun 28, 2022
@dacbd (Contributor) commented Oct 24, 2022

I don't think we have anything actionable here. When the connection is lost and the agent can't resume it, the process should exit and the runner should terminate. If the network is completely disconnected, I think we can safely say that is beyond the scope of what can be handled.

@casperdcl (Contributor) commented Oct 28, 2022

safely say that is beyond the scope

Disagree; network disconnects are clearly sometimes possible, and when (not if) they occur they result in orphaned, forever-running cloud instances? That's bad. The cloud instances should be able to self-terminate. Related: iterative/terraform-provider-iterative#289.

But I have a feeling I haven't understood correctly.
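One possible self-termination path that does not depend on outbound API calls, sketched below under the assumption that the cloud instance's shutdown behaviour is configured to terminate on power-off (as it can be on AWS EC2, for example). This is a hypothetical escalation path, not what cml or TPI currently do:

```ts
import { execFile } from "node:child_process";

const MAX_DISCONNECTED_MS = 30 * 60 * 1000; // placeholder grace period
let lastSeenOnline = Date.now();

// Last-resort self-termination: if the instance's shutdown behaviour is set to
// "terminate", powering off locally destroys it even when no cloud API is reachable.
function selfDestruct(): void {
  execFile("shutdown", ["-h", "now"], (error) => {
    if (error) console.error("local shutdown failed:", error);
  });
}

// Probe GitHub periodically (global fetch, Node 18+); if it stays unreachable
// for too long, give up and power off instead of running forever.
setInterval(async () => {
  try {
    await fetch("https://api.github.com", { method: "HEAD" });
    lastSeenOnline = Date.now();
  } catch {
    if (Date.now() - lastSeenOnline > MAX_DISCONNECTED_MS) selfDestruct();
  }
}, 60_000);
```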

@dacbd (Contributor) commented Oct 31, 2022

network disconnects are clearly sometimes possible

Correct, and that shouldn't break it. I have observed GitHub Actions handle minor network interruptions just fine.

and when (not if) they occur it results in orphaned forever-running cloud instances? That's bad. The cloud instances should be able to self-terminate.

Hiccups are not the issue; the only time I've seen the described behavior is when something went very wrong with the instance. So my point is that seg-faulting, OOM (to a lesser extent), and pulling the network plug are not things we can recover from. If no network connection exists, we can't make the API calls to delete the machine.

@casperdcl (Contributor)

I suspect we're having multiple different conversations here xD

@casperdcl added the blocked (Dependent on something else) label Nov 8, 2022
@dacbd (Contributor) commented Nov 9, 2022

I think from our internal discussion we can close this as not planned. If someone disagrees, go ahead and re-open (preferably with a clear case or something reproducible 😉).

@dacbd closed this as not planned (won't fix, can't repro, duplicate, stale) Nov 9, 2022