ARM runner has been stuck for multiple days #80
Comments
It managed to hang up right as I was leaving on a long weekend trip, and I didn't notice until I got back. I wanted to get a fresh runner going anyway, for the latest Windows Updates, but was waiting until I got back to try to avoid any issues while I was gone 😁. New runner is going now |
@jeremyd2019 Would you like to check if this CI job is stuck again https://github.com/msys2-arm/msys2-autobuild/actions/runs/6570301888 ? |
I've been ruminating on the idea of some sort of 'watchdog' to detect and kill stuck pacman processes automatically, but I haven't settled on the best language/technology to do so. Python seems most convenient since autobuild is already Python: I could add a background thread, like the one I used to try polling the token, but I'm not familiar with the process querying/killing modules. What I've got so far is a cygwin command to get the cygwin pid of the process I want to kill (what I really want is the child pacman process; this gets the newest pacman process older than 1800 seconds)
coupled with the script I already had (because when stuck in this state cygwin kill is not sufficient) |
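For illustration, the selection rule described above (the newest pacman process older than 1800 seconds) could be sketched in Python roughly as follows. This is a minimal sketch, not the command or script from the comment: `ProcInfo` and `newest_stuck_pacman` are hypothetical names, and gathering the process list (for example with a module such as psutil) is assumed to happen elsewhere.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class ProcInfo:
    """Hypothetical record for one running process."""
    pid: int
    name: str
    start_time: float  # seconds since the epoch


def newest_stuck_pacman(
    procs: Iterable[ProcInfo], now: float, min_age: float = 1800.0
) -> Optional[ProcInfo]:
    """Return the most recently started pacman process older than min_age.

    The newest long-running pacman is assumed to be the stuck child,
    as described in the comment above. Returns None if nothing qualifies.
    """
    candidates = [
        p
        for p in procs
        if p.name.lower() in ("pacman", "pacman.exe")
        and now - p.start_time > min_age
    ]
    return max(candidates, key=lambda p: p.start_time, default=None)
```

Killing the selected process is deliberately left out, since the comment notes that a plain cygwin kill is not sufficient in this state.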
It would be clearer if the reason for such CI failures were explained. |
lost power, so any lack of runner in the near future will be due to that. Update: power is back |
unstuck it. the powershell variant in git-for-windows/git-for-windows-automation#61 (comment) was intriguing, it seems like it could be close to being turned into a 'watchdog', would just need to also query CreationDate field to see any pacman processes that have been running a long time (like a half hour? or hour?), and then arrange for it to run continuously (scheduled task?). Of course, I'd much rather get whatever bug is causing this fixed... |
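A rough Python sketch of the 'watchdog' idea above: check the creation time of each pacman process, kill any that have been running too long, and repeat on a timer. This is not the PowerShell variant from the linked comment. `watchdog_pass`, `start_watchdog`, and the `list_procs` and `kill` callables are hypothetical, and the one-hour threshold and five-minute interval are placeholder assumptions.

```python
import threading
import time
from typing import Callable, Iterable, List, Optional, Tuple

# (pid, process name, creation time in seconds since the epoch)
ProcRow = Tuple[int, str, float]


def watchdog_pass(
    list_procs: Callable[[], Iterable[ProcRow]],
    kill: Callable[[int], None],
    max_age: float = 3600.0,
    now: Optional[float] = None,
) -> List[int]:
    """Kill every pacman process older than max_age; return the killed pids."""
    now = time.time() if now is None else now
    killed = []
    for pid, name, created in list_procs():
        if name.lower() in ("pacman", "pacman.exe") and now - created > max_age:
            kill(pid)  # caller supplies whatever kill mechanism actually works
            killed.append(pid)
    return killed


def start_watchdog(
    list_procs: Callable[[], Iterable[ProcRow]],
    kill: Callable[[int], None],
    interval: float = 300.0,
    max_age: float = 3600.0,
) -> threading.Event:
    """Run watchdog_pass every `interval` seconds in a daemon thread.

    Returns an Event; setting it stops the loop.
    """
    stop = threading.Event()

    def loop() -> None:
        # Event.wait returns False on timeout, True once stop is set.
        while not stop.wait(interval):
            watchdog_pass(list_procs, kill, max_age)

    threading.Thread(target=loop, daemon=True, name="pacman-watchdog").start()
    return stop
```

Passing the process listing and kill step in as callables keeps the loop independent of how processes are actually queried on the runner; a scheduled task calling `watchdog_pass` once per run would be an alternative to the thread.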
There's a stuck job now, but it doesn't seem to be the runner this time. Probably something on Github's end. |
This seems to be a different issue. I think maybe the machine rebooted. I did a quick check and didn't notice any excess packages installed. |
"echo: write error: No space left on device" |
What?!? I deleted some of the cruft under |
Is #76 related? Otherwise, try good old WinDirStat :) |
seems to have gotten unstuck and errored out after some hours. (or anyone poked at it?) Unrelated note: Runner groups are now available for everyone it seems. Not that it makes much difference with the current setup with a separate org, but good to know: https://github.blog/changelog/2024-10-17-actions-runner-groups-now-available-for-organizations-on-free-plan/ |
I killed the child pacman process, as usual.
Yeah, I could do away with the extra labels to differentiate between autobuild and CI instances and enforce it with runner groups instead, presumably. |
It looks like it stuck again at |
Is it stuck again? This time with |
yep, killed |
gpg is weirdly bugged right now it seems |
I killed dirmngr and keyboxd processes, and deleted ~/.gnupg, hopefully will make it happier. I'm clearing all the failed builds and am starting a build. I also ran pacman -Scc since there has been a lot of package thrashing lately, maybe the disk was getting full? Hmm, wasn't there a gnupg update recently? Maybe the dirmngr/keyboxd processes were still from the old version (since the same install is reused for each run), and that was causing issues? |
Looks like it happened again. Killed pacman and dirmngr/keyboxd again. |
yeah, it aligns with the update.. sadly |
It looks like the runner is stuck again. This time |
During setup-msys2... that's different... |
This CI job has been running for days: https://github.com/msys2-arm/msys2-autobuild/actions/runs/6508662089
@jeremyd2019