How do you debug jobs that are hanging? #2931

Fgerthoffert · 2023-02-06T20:56:25Z

Hi,

We have some jobs that are staying in this state and never get picked up by a runner.

Requested labels: self-hosted
Job defined at: [REDACTED]/.github/workflows/weekly-manual-run.yml@refs/heads/main
Waiting for a runner to pick up this job...

If I follow the trail in CloudWatch (webhook, then scale-up) for these hanging jobs, I always see a line such as:

2023-02-06 19:27:45.033  INFO  [runners:6cdca10f-0e82-597a-93c7-fd54585bc200 index.js:134170  createRunner] Created instance(s):  i-01e4cf11d2e43e8d3 
{
    "runnerType": "Org",
    "runnerOwner": "Jahia",
    "event": "workflow_job",
    "id": "11146663170"
}

So an instance is indeed created, and when I look at that runner's jobs, it does seem to have picked up another job (which I understand can be expected behavior), but no other runner seems to be made available to start the initial job. In the end, we end up in a situation where no runners are started in AWS, but some jobs are still waiting for runners.
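To follow that trail offline, one option is to export the log lines and parse them. This is a minimal sketch, assuming every line has the same shape as the excerpt above; the regular expressions, field names, and helper functions are illustrative assumptions and not part of the runner module.

```python
import re
from collections import defaultdict

# Assumed line shape, inferred only from the excerpt quoted in this post:
# "<date> <time>  <LEVEL>  [<lambda>:<request-id> <file>:<line>  <function>] <message>"
LOG_LINE = re.compile(
    r"^(?P<timestamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3})\s+"
    r"(?P<level>[A-Z]+)\s+"
    r"\[(?P<source>[\w-]+):(?P<request_id>[0-9a-f-]+)\s+"
    r"(?P<location>\S+)\s+(?P<function>\w+)\]\s*"
    r"(?P<message>.*)$"
)
# EC2 instance ids look like "i-" followed by 8 or 17 hex characters.
INSTANCE_ID = re.compile(r"\bi-[0-9a-f]{8,17}\b")


def parse_line(line):
    """Return the fields of one log line as a dict, or None if it does not match."""
    match = LOG_LINE.match(line.strip())
    return match.groupdict() if match else None


def instances_by_request(lines):
    """Map each request id to the EC2 instance ids mentioned in its messages."""
    found = defaultdict(list)
    for line in lines:
        fields = parse_line(line)
        if fields is None:
            continue  # JSON payload lines and anything else that does not match
        found[fields["request_id"]].extend(INSTANCE_ID.findall(fields["message"]))
    return {request_id: ids for request_id, ids in found.items() if ids}


sample = [
    "2023-02-06 19:27:45.033  INFO  [runners:6cdca10f-0e82-597a-93c7-fd54585bc200 "
    "index.js:134170  createRunner] Created instance(s):  i-01e4cf11d2e43e8d3 ",
    "{",
    '    "id": "11146663170"',
    "}",
]
print(instances_by_request(sample))
```

Grouping by request id keeps the messages from one Lambda invocation together, which makes it easier to see which invocation created which instance.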

This seems to happen more frequently these days, but we haven't been able to identify a pattern. Re-running the job does fix the issue.

Running v2.1.1 with ephemeral multi-runners.

What would be your recommendation for where to investigate next? Any tips?
One thing that has not been clear to me is how to identify, by looking at a runner's logs (in CloudWatch), exactly which GitHub workflow/job is being executed. I can connect to the runner itself and see what workflow is running there, but I cannot do that once the runner has been terminated.

Thanks,

Answered by Fgerthoffert

Mar 10, 2023

I believe I finally found the cause of the issue, which was a misconfiguration (or incorrect understanding of a setting) on my end.

In short, the message to look for is:

2023-03-10 15:29:09.728  INFO  [scale-up:a0502c68-f9b9-5d2d-afd6-70edb7486316 index.js:135249  isJobQueued] Job not queued

When multiple jobs were triggered at the same time, a race condition between the runners was causing some jobs not to see runners being created.

You'll find more details about the issue and a suggested improvement in this PR: #3046
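To check how often this message shows up, exported log lines can be scanned for it. This is a minimal sketch, assuming the bracketed prefix always looks like the line quoted above; the pattern and the `skipped_scale_ups` helper are hypothetical names introduced for illustration only.

```python
import re

# Assumed shape of the bracketed prefix, based only on the line quoted above:
# "[<lambda>:<request-id> <file>:<line>  <function>] <message>"
PREFIX = re.compile(
    r"\[(?P<source>[\w-]+):(?P<request_id>[0-9a-f-]+)\s+\S+\s+(?P<function>\w+)\]"
    r"\s*(?P<message>.*)$"
)


def skipped_scale_ups(lines):
    """Return request ids of scale-up invocations that logged 'Job not queued'."""
    skipped = []
    for line in lines:
        match = PREFIX.search(line)
        if not match:
            continue  # line does not carry the expected prefix
        if match["source"] == "scale-up" and match["message"].strip() == "Job not queued":
            skipped.append(match["request_id"])
    return skipped


sample = [
    "2023-03-10 15:29:09.728  INFO  [scale-up:a0502c68-f9b9-5d2d-afd6-70edb7486316 "
    "index.js:135249  isJobQueued] Job not queued",
    "2023-02-06 19:27:45.033  INFO  [runners:6cdca10f-0e82-597a-93c7-fd54585bc200 "
    "index.js:134170  createRunner] Created instance(s):  i-01e4cf11d2e43e8d3 ",
]
print(skipped_scale_ups(sample))
```

Comparing the timestamps of these invocations with the times the hanging jobs were triggered may help confirm whether the same situation applies in another environment.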

Fgerthoffert · 2023-02-14T19:50:37Z

Hi,

Any tips on the above?

It's pretty annoying to have jobs hanging due to no runners started.

It would be awesome to get some clues on where to look to address this. If the issue is not with our env, I'd be more than happy to contribute back with a PR, but the challenge here is not knowing where to look next (I couldn't find a clue in the CloudWatch logs; I searched for things like "error", "exception", and various other combinations in all log groups without success).

Thanks a lot

Fgerthoffert · 2023-03-09T09:56:44Z

This is still very relevant to us. We end up using a standalone runner to clear the queue, but we definitely still have some jobs hanging for lack of available runners.

But we're really unsure how to proceed further to find the root cause of this issue; suggestions/tips would really be appreciated.
