
CI tests failing with Julia nightly / Ubuntu #2187

Closed
fingolfin opened this issue Apr 2, 2023 · 14 comments
Labels
testsuite Everything concerning testsuite

Comments

@fingolfin
Member

Sometimes it is timeouts (with the Ubuntu run taking much longer than the macOS run). Sometimes it is crashes, e.g. in this run or this one:

Test Summary:                          | Pass  Total  Time
primary decomposition of ideal sheaves |   14     14  7.7s
-> Testing AlgebraicGeometry/Schemes/IdealSheaves.jl took: runtime 23.529 seconds + compilation 13.797 seconds + recompilation 0.000 seconds, 740.78 MB

[1864] signal (15): Terminated
in expression starting at /home/runner/work/_actions/julia-actions/julia-runtest/latest/test_harness.jl:7
Error: The operation was canceled.

For yet another run we have this crash log:

Test Summary:       | Pass  Total  Time
Oscar.blow_up_chart |    6      6  4.6s

[1866] signal (15): Terminated
in expression starting at /home/runner/work/_actions/julia-actions/julia-runtest/latest/test_harness.jl:7
Error: The operation was canceled.

Perhaps there was a change in Julia master that requires a rebuild of libjulia_jll and all the JLLs made with it.

Perhaps something worse changed, sigh.

It was still passing in

The first failures I see are in:

That could narrow down which Julia changes are responsible (if any), though of course it might also be that we were just lucky those times...


It seems that for Julia's Nanosoldier the package tests are limited to 45 minutes. As a result, Oscar is currently failing Nanosoldier testing anyway. That's obviously bad for us, because it means the Julia developers won't notice if they break Oscar. I guess that's yet another strong reason to make our (default) test suite faster out of the box....

@fingolfin
Member Author

Ah, the first two failing runs I linked above used Julia built from JuliaLang/julia@8b19f5fa276 resp. JuliaLang/julia@1eee6ef7c83. The last good one I could locate used JuliaLang/julia@cd394eb68d4 which directly precedes JuliaLang/julia@8b19f5fa276 ...

That means the failures started with JuliaLang/julia@8b19f5fa276, which in turn comes from JuliaLang/julia#49185.

@fingolfin
Member Author

(Though this could still be a coincidence; even if Julia causes it, it could be a previous PR that introduced the problem and just doesn't always fail. Or I might have screwed up the analysis somehow)

@benlorenz
Member

benlorenz commented Apr 2, 2023

These errors look very much like a memory problem; there were also some runs where some of the schemes tests took extremely long, which might also have been due to memory starvation.

The commit you mentioned (or at least parts of it) was backported to 1.9-rc2, which was tagged yesterday but has not shown up in the GitHub Actions runs yet. So if this is to blame, we should see this happen for 1.9 as well soon.

Edit: I found the extremely long one again:

Test Summary:                                      | Pass  Total      Time
resolution of singularities of elliptic fibrations |    3      3  50m09.2s
-> Testing AlgebraicGeometry/Schemes/K3.jl took: runtime 2935.774 seconds + compilation 73.669 seconds + recompilation 0.000 seconds, 39549.45 MB
Test Summary: | Pass  Total     Time
Toric schemes |   13     13  5m51.4s
-> Testing AlgebraicGeometry/Schemes/ToricSchemes.jl took: runtime 130.279 seconds + compilation 224.928 seconds + recompilation 0.000 seconds, 1141.45 MB

In a successful run, the toric schemes file, which runs after the K3 tests, also finished much faster:

Test Summary:                                      | Pass  Total     Time
resolution of singularities of elliptic fibrations |    3      3  7m03.1s
-> Testing AlgebraicGeometry/Schemes/K3.jl took: runtime 387.557 seconds + compilation 35.557 seconds + recompilation 0.000 seconds, 39559.17 MB
Test Summary: | Pass  Total     Time
Toric schemes |   13     13  1m50.0s
-> Testing AlgebraicGeometry/Schemes/ToricSchemes.jl took: runtime 87.058 seconds + compilation 22.924 seconds + recompilation 0.000 seconds, 964.97 MB

@benlorenz
Member

I did some tests in a cgroup with memory restricted to about 9.5GB and got this crash:

-> Testing AlgebraicGeometry/Schemes/WeilDivisor.jl took: runtime 6.341 seconds + compilation 9.645 seconds + recompilation 0.000 seconds, 1839.21 MB
Test Summary: | Pass  Total  Time
blowups       |    2      2  2.8s
Test Summary:       | Pass  Total  Time
Oscar.blow_up_chart |    6      6  0.6s
GC: pause 4853.59ms. collected 2331.602373MB. full
Killed

This matches one of the crashes in the CI failures, except that the error message is different: mine was killed by the cgroup OOM killer:

[23582.938724] Memory cgroup out of memory: Killed process 15123 (julia) total-vm:14826076kB, anon-rss:9201796kB, file-rss:372256kB, shmem-rss:0kB, UID:2779 pgtables:21704kB oom_score_adj:0

In the CI we get Terminated messages instead, but the cause might be similar. Note that the Ubuntu runners only have 7GB of physical memory (the macOS runners have 14GB, which might explain why we don't see this there).
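
For reference, a small Julia snippet (purely illustrative, an assumption rather than part of the CI setup) that one could run at the start of a job to record how much memory a runner actually provides:

# Log the runner's total and currently free memory in GiB.
println("total memory: ", round(Sys.total_memory() / 2^30; digits = 2), " GiB")
println("free memory:  ", round(Sys.free_memory() / 2^30; digits = 2), " GiB")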

@benlorenz
Member

I did some more testing:

  • The good news is that even though these commits were backported to 1.9, version 1.9.0-rc2 seems unaffected (both in my local tests and in the CI).
  • The bad news is that this means there are more commits involved in triggering this, and I don't really know which ones.

With julia 1.9-rc2 I can run the full Oscar testsuite in a cgroup that is limited to 6GB, stats from close to the end of the tests:

GC live:    1440.386 MiB 
JIT:          74.373 MiB 
Max. RSS:   5675.137 MiB 
Free mem:     96.613 MiB 
...
     Testing Oscar tests passed

If I try the same with julia master it fails like this:

GC live:    3460.915 MiB
JIT:          19.764 MiB
Max. RSS:   5959.289 MiB
Free mem:      0.090 MiB
...
ERROR: Package Oscar errored during testing (received signal: KILL)

And even if I increase the cgroup limit to 10GB it still fails:

GC live:    4345.790 MiB
JIT:          56.893 MiB
Max. RSS:   9572.812 MiB
Free mem:     99.105 MiB
Test Summary:                | Pass  Total  Time
fraction fields of varieties |    6      6  0.3s
ERROR: Package Oscar errored during testing (received signal: KILL)

Still on julia master but once I revert the second commit of JuliaLang/julia#49185 the tests pass again:

GC live:    1400.914 MiB
JIT:          74.460 MiB
Max. RSS:   5743.348 MiB
Free mem:     80.625 MiB
...
     Testing Oscar tests passed

@gbaraldi

gbaraldi commented Apr 6, 2023

This is quite odd because rc2 has both of the commits, so maybe there are two things going on: JuliaLang/julia#48935 (it's the last two commits there). It does look like we might be swapping or something like that.

@gbaraldi

gbaraldi commented Apr 6, 2023

@benlorenz can you try both cases with GC.enable_logging() and possibly also 1.9-rc1? Maybe even 1.8?
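
A minimal sketch of how that could look, assuming the tests are run via include in the same Julia process so the setting applies to them (Pkg.test spawns a separate process, so the flag would not carry over); the test entry point path below is the standard layout and an assumption here:

using Oscar

# Every garbage collection now prints a line like
# "GC: pause 12.34ms. collected 56.789MB. full" to stderr.
GC.enable_logging(true)

# Run the test suite in the same process so the logging setting applies to it.
include(joinpath(pkgdir(Oscar), "test", "runtests.jl"))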

@benlorenz
Member

benlorenz commented Apr 8, 2023

@gbaraldi Sorry for the delay, I have now created logs including the GC: pause ... output for master, master with revert of 450c1faf, 1.9.0-rc2, and 1.9.0-rc1:

https://gist.github.com/benlorenz/379cb72e6760d2f0d6f71e1e10fa082a
(I recommend downloading them as a zip and reading them locally; the GitHub web page seems to struggle with the large log files.)

I hope the filenames are self-explanatory; all runs that were killed by the OOM killer should have Killed in the filename and include the corresponding line from the kernel log at the end.

cgroups

The 6gb logs were run with:

echo 6000000000 >> /sys/fs/cgroup/memory/julialimit/memory.limit_in_bytes
echo 6500000000 >> /sys/fs/cgroup/memory/julialimit/memory.memsw.limit_in_bytes

The 10gb logs were run with 10000000000 (10500000000) as the limit.
I added the memsw limit to avoid too much swapping. More swap is probably the reason why the GitHub CI jobs don't die as quickly but struggle along until they are killed after the 2.5-hour timeout. The usual runtime for the full testsuite on GitHub runners is around 1-1.5 hours.

other stats output in the logs

For each file in the Oscar testsuite the output contains a short memory summary from before the include and a line similar to @time after the file was done.
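
Purely as an illustration of how such per-file stats could be gathered (the helper function and the testfiles variable below are assumptions, not the code behind the linked logs):

# Print a short memory summary before each test file and an @timed-style
# line after it, similar in spirit to the output in the logs above.
function print_memory_summary(label::AbstractString)
    println(label)
    println("GC live:  ", round(Base.gc_live_bytes() / 2^20; digits = 3), " MiB")
    println("Max. RSS: ", round(Sys.maxrss() / 2^20; digits = 3), " MiB")
    println("Free mem: ", round(Sys.free_memory() / 2^20; digits = 3), " MiB")
end

for file in testfiles                      # stand-in for the list of Oscar test files
    print_memory_summary("before $file")
    stats = @timed include(file)
    println("-> Testing $file took: runtime ", round(stats.time; digits = 3),
            " seconds, ", round(stats.bytes / 2^20; digits = 2), " MB allocated")
end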

retries for 1.9rc1 and 1.9rc2

For both 1.9 versions the tests OOM-crashed on the first try with 6gb but succeeded on the second try.

failed runs on master

For master I tried 6gb and 10gb six times each, and all of them failed; the 6gb runs failed quite early (1000 lines of log output, instead of about 3000 for a successful run).

master+revert

For the current tests I reverted only the "Fix interval for many pointers" commit, as this sufficed to avoid the issues.
I could also repeat the tests with both commits from that PR reverted, or with only the other one ("Add bit to the GC tag") reverted.

v1.8

I also tried running it on 1.8, but that didn't work with my cgroup setup, as it seems to use the free physical memory instead of the cgroup-constrained memory:

Free mem:  116958.266 MiB
Free pmem: 116958.266 MiB

Then the job is killed after only a few minutes.
Maybe I can get it to run if I manage to switch to a proper cgroupv2 configuration instead of the unified one that my system seems to be using currently. I will try to add this in the next few days.

github actions

I also ran the Oscar testsuite with GC.enable_logging and the memory stats on GitHub Actions here: https://github.com/oscar-system/Oscar.jl/actions/runs/4591003902/jobs/8107508082
This failed not directly due to running out of memory, but probably because it was swapping hard and then some timing test failed (horribly) with 140 seconds instead of less than 10.
I have attached the file here to make sure the log doesn't disappear:
github-actions-oscar-julia-nightly-swapping.txt
This contains some very big pauses:

GC: pause 128272.24ms. collected 2852.374413MB. full 

Edit:

Until I can run the 1.8 tests locally, here is a run with GC logging on GitHub Actions for 1.8.5, 1.9-rc2, and nightly. (Nightly is still running, with very long GC pauses...)

@gbaraldi

It seems my heuristic might be a bit too greedy, which can be adjusted, but I'm not sure why it's so much worse on master. I would look into why master is allocating so much more memory, because that seems to be a separate regression.

@fingolfin
Member Author

Well, we bisected it to that commit, but of course it is always possible that the issue we are experiencing is caused by multiple commits combined; then the commit we flagged is just the last puzzle piece, but maybe not the core cause?

@fingolfin
Member Author

Of course there is also always the possibility that there were API/ABI changes in the Julia kernel and we need to recompile all our JLLs for latest Julia (and for that we need a new libjulia_jll).

However, the fact that Julia nightly on macOS is not regressed is peculiar and kinda contradicts that theory. Hmm

@gbaraldi

gbaraldi commented Apr 21, 2023

The change did make the GC a bit more greedy, because in a specific workload it was being too aggressive; maybe here it's the opposite and it's being a bit too lazy. I will take a further look.

Regarding "GC: pause 128272.24ms. collected 2852.374413MB. full": we probably should have run a GC earlier there.
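
A minimal sketch of that idea as a user-side mitigation (an assumption for illustration, not something adopted in this thread): force a full collection between test files so the heap is trimmed before the next memory-heavy file starts.

for file in testfiles   # stand-in for the list of Oscar test files
    include(file)
    GC.gc(true)         # force a full (non-incremental) collection
end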

@benlorenz
Member

The latest changes to the GC on master seem to have fixed (or at least significantly improved) the issue, maybe JuliaLang/julia#49315?

From last night, this one took 1h 44min:

Julia Version 1.10.0-DEV.1081
Commit 499647e95d7 (2023-04-20 15:53 UTC)

while a few commits later with

Julia Version 1.10.0-DEV.1084
Commit f84fb5b26f8 (2023-04-20 16:52 UTC)

the tests ran in 59 minutes. All following runs were similar; no crashes anymore so far. That approximately one-hour duration matches the other Julia versions (only 1.6 is significantly slower since it includes the doctests).

I have run it in my memory-constrained cgroup again, and it also succeeded 3 times in a row with the low limit of 6GB.

@benlorenz
Member

The latest changes to the GC on master seem to have fixed (or at least significantly improved) the issue, maybe JuliaLang/julia#49315 ?

I don't understand the details of how this works, but this PR did fix the issue. I haven't seen any timeouts in the past week.
