CI tests failing with Julia nightly / Ubuntu #2187
Ah, the first two failing runs I linked above used Julia built from JuliaLang/julia@8b19f5fa276 and JuliaLang/julia@1eee6ef7c83, respectively. The last good one I could locate used JuliaLang/julia@cd394eb68d4, which directly precedes JuliaLang/julia@8b19f5fa276 ... That means the failures started with JuliaLang/julia@8b19f5fa276, which in turn comes from JuliaLang/julia#49185
(Though this could still be a coincidence; even if Julia causes it, it could be a previous PR that introduced the problem and just doesn't always fail. Or I might have screwed up the analysis somehow.)
These errors look very much like a memory problem; there were also some runs where some of the schemes tests took extremely long, which might also have been due to memory starvation. The commit you mentioned (or at least parts of it) was backported to 1.9-rc2, which was tagged yesterday but did not show up in the GitHub Actions runs yet. So if this is to blame, we should see this happen for 1.9 as well soon. Edit: I found the extremely long one again:
In a successful run, the toric schemes file, which runs after the K3 tests, also took a lot less time:
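(To make such per-file timing comparisons easy to reproduce, here is a hedged sketch of how one could time a single test file and record its allocations; the file path is only a guess and not necessarily the actual layout of the Oscar repository.)

```julia
# Hedged sketch: time a single test file to compare a fast and a slow run.
# The path is a guess and may not match the actual Oscar test layout.
using Oscar, Test

stats = @timed include(joinpath("test", "AlgebraicGeometry", "ToricVarieties", "toric_schemes.jl"))
println("elapsed:   ", round(stats.time; digits = 1), " s")
println("allocated: ", round(stats.bytes / 2^20; digits = 1), " MiB")
println("GC time:   ", round(stats.gctime; digits = 1), " s")
```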
I did some tests in a memory-restricted cgroup (limited to about 9.5GB) and got this crash:
This does match one of the crashes from the CI failures, except that the error is different; mine was killed by the cgroup manager, like this:
In the CI we get:
I did some more testing:
With Julia 1.9-rc2 I can run the full Oscar testsuite in a cgroup that is limited to 6GB; here are some stats from close to the end of the tests:
If I try the same with julia master it fails like this:
And even if I increase the cgroup limit to 10GB it still fails:
Still on julia master but once I revert the second commit of JuliaLang/julia#49185 the tests pass again:
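(As an aside, the kind of "stats from close to the end of the tests" mentioned above could be collected from within Julia with something like the following sketch; this is an illustration, not the actual instrumentation used for these runs.)

```julia
# Hedged sketch (not the actual instrumentation used for the logs above):
# print peak RSS, the live heap as seen by the GC, and cumulative GC time.
function print_memory_stats()
    println("maxrss:        ", round(Sys.maxrss() / 2^30; digits = 2), " GiB")
    println("gc_live_bytes: ", round(Base.gc_live_bytes() / 2^30; digits = 2), " GiB")
    println("gc_total_time: ", round(Base.gc_num().total_time / 1e9; digits = 1), " s")
end

print_memory_stats()
```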
This is quite odd because rc2 has both commits, so maybe there are two things going on; see JuliaLang/julia#48935, it's the last two commits there. It does look like we might be swapping or something like that.
@benlorenz can you try both cases with ...
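(Assuming the request here was for GC logging output, which the follow-up logs do include, a minimal sketch of how that could be produced with Julia ≥ 1.8 is:)

```julia
# Hedged sketch: enable verbose GC logging (Julia >= 1.8) and run the Oscar
# test suite; every collection then prints its pause time and heap sizes.
using Pkg

GC.enable_logging(true)
Pkg.test("Oscar")
```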
@gbaraldi Sorry for the delay, I have now created logs including the requested output: https://gist.github.com/benlorenz/379cb72e6760d2f0d6f71e1e10fa082a
I hope the filenames are self-explanatory; all runs that were killed by the oom-killer should have ...

cgroups
The 6gb logs were run with:
The 10gb logs were run with 10000000000 (10500000000) as the limit.

other stats output in the logs
For each file in the Oscar testsuite the output contains a short memory summary from before the retries.

1.9rc1 and 1.9rc2
For both 1.9 versions the tests oom-crashed during the first try with 6gb but succeeded on the second try.

failed runs on master
For master I tried 6gb and 10gb each 6 times and all of them failed, the 6gb ones quite early (about 1000 lines of log output, instead of about 3000 for a successful run).

master+revert
For the current tests I reverted only the second commit of JuliaLang/julia#49185, as before.

v1.8
I also tried running it in 1.8 but that didn't work with my cgroup setup, as it seems to use the free physical memory instead of the constrained memory:
Then the job is killed after only a few minutes.

github actions
I also ran the Oscar testsuite with GC logging on GitHub Actions as well.
Edit: Until I can run the 1.8 tests locally, here is a run with GC logging on GitHub Actions for 1.8.5, 1.9-rc2, and nightly. (Nightly is still running, with very long GC pauses...)
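(Regarding the note above that 1.8 seems to use the free physical memory instead of the constrained memory, a quick sanity check from within Julia could look like the sketch below; the cgroup v2 path and name are assumptions that depend on the actual setup.)

```julia
# Hedged sketch: compare what the running Julia reports as total/free memory
# with the cgroup v2 limit. The cgroup path below is hypothetical and depends
# on how the constrained cgroup was set up.
cgroup_limit = try
    strip(read("/sys/fs/cgroup/oscar-tests/memory.max", String))  # hypothetical cgroup name
catch
    "unknown (memory.max not found)"
end

println("Sys.total_memory(): ", round(Sys.total_memory() / 2^30; digits = 2), " GiB")
println("Sys.free_memory():  ", round(Sys.free_memory() / 2^30; digits = 2), " GiB")
println("cgroup memory.max:  ", cgroup_limit)
```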
It seems my heuristic might be a bit too greedy, which can be adjusted, but I'm not sure why it's so much worse on master. I would look into why master is allocating so much more memory, because it seems to be a separate regression.
Well, we bisected it to that commit, but of course it is always possible that the issue we are experiencing is caused by multiple commits combined; then the commit we flagged is just the last puzzle piece, but maybe not the core cause?
Of course there is also always the possibility that there were API/ABI changes in the Julia kernel and we need to recompile all our JLLs for the latest Julia (and for that we need a new libjulia_jll). However, the fact that Julia nightly on macOS is not regressed is peculiar and kinda contradicts that theory. Hmm.
The change did make the GC a bit more greedy, because in a specific workload it was being too aggressive; maybe here it's the opposite and it's being a bit too lazy. I will take a further look.
The latest changes to the GC on master seem to have fixed (or at least significantly improved) the issue, maybe JuliaLang/julia#49315? From last night, this one took 1h 44min
while a few commits later with
the tests ran in 59 minutes. All following runs were similar, with no crashes anymore so far. That approximately one-hour duration matches the other Julia versions (only 1.6 is significantly slower since it includes the doctests). I have run it in my memory-constrained cgroup again and it also succeeded 3 times in a row with the low limit of 6GB.
I don't understand the details of how this works, but this PR did fix the issue. I haven't seen any timeouts in the past week.
Sometimes it is timeouts (with the Ubuntu run taking muuuch longer than the macOS run). Sometimes it is crashes, e.g. in this run or this one:
For yet another run we have this crash log:
Perhaps there was a change in Julia master that requires a rebuild of libjulia_jll and all the JLLs made with it.
Perhaps something worse changed, sigh.
It was still passing in
The first failures I see are in:
that could narrow down which Julia changes are responsible (if any), though of course it might also be that we were just lucky those times...
It seems that for Julia's Nanosoldier the package tests are limited to 45 minutes. As a result, Oscar is currently failing Nanosoldier testing anyway. That's obviously bad for us, because it means the Julia developers won't notice if they break Oscar. I guess that's yet another strong reason to make our (default) test suite faster out of the box...