Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Testing in parallel can make GAP crash #3089

Closed
aaruni96 opened this issue Dec 8, 2023 · 13 comments
Closed

Testing in parallel can make GAP crash #3089

aaruni96 opened this issue Dec 8, 2023 · 13 comments
Assignees
Labels
bug Something isn't working package: GAP

Comments

@aaruni96
Copy link
Member

aaruni96 commented Dec 8, 2023

When running testing in parallel, sometimes GAP will fail to load on one of the workers, which throws an error stopping all the workers. Issue appears to not manifest in regular testing.

Note that this does not always happen. But it still happens often enough to be a problem. Using more workers increases the likelihood of this appearing.

To Reproduce

using Pkg
Pkg.test("Oscar", test_args=["-j8"]) # might use a lot of memory, execute with care

Gets the error:

     Testing Running tests...
Adding worker processes
[ Info: /home/tester/.julia/dev/Oscar/test/runtests.jl -- SEED 3618563996
ERROR: LoadError: On worker 5:
InitError: Error thrown by GAP: Error, Error thrown by GAP: Error, dirname must be a directory at /home/tester/.julia/artifacts/b5c2f0f824457e5c391fb24916f94d5d91c62c4f/share/gap/lib/files.gi:374 called from
RemoveDirectoryRecursively( dir ); at /home/tester/.julia/scratchspaces/c863536a-3901-11e9-33e7-d5cd0df7b904/gap_11985453110266219935_1.9/pkg/packagemanager/gap/PackageManager.gi:1209 called from
Julia.Core._apply( julia_obj, args ) at /home/tester/.julia/packages/GAP/WWchg/pkg/JuliaInterface/gap/calls.gi:15 called from
PKGMAN_RemoveDir( dir ); at /home/tester/.julia/scratchspaces/c863536a-3901-11e9-33e7-d5cd0df7b904/gap_11985453110266219935_1.9/pkg/packagemanager/gap/PackageManager.gi:306 called from
InstallPackageFromArchive( url ) at /home/tester/.julia/scratchspaces/c863536a-3901-11e9-33e7-d5cd0df7b904/gap_11985453110266219935_1.9/pkg/packagemanager/gap/PackageManager.gi:236 called from
InstallPackageFromInfo( urls.(name), version ) at /home/tester/.julia/scratchspaces/c863536a-3901-11e9-33e7-d5cd0df7b904/gap_11985453110266219935_1.9/pkg/
not in any function at *defin*:0

Stacktrace:
  [1] error
    @ ./error.jl:35
  [2] ThrowObserver
    @ ~/.julia/packages/GAP/WWchg/src/GAP.jl:89
  [3] _call_gap_func
    @ ~/.julia/packages/GAP/WWchg/src/ccalls.jl:332
  [4] call_gap_func_nokw
    @ ~/.julia/packages/GAP/WWchg/src/ccalls.jl:284
  [5] #call_gap_func#9
    @ ~/.julia/packages/GAP/WWchg/src/ccalls.jl:273
  [6] #_#10
    @ ~/.julia/packages/GAP/WWchg/src/ccalls.jl:292
  [7] #install#8
    @ ~/.julia/packages/GAP/WWchg/src/packages.jl:236
  [8] install
    @ ~/.julia/packages/GAP/WWchg/src/packages.jl:223 [inlined]
  [9] #load#7
    @ ~/.julia/packages/GAP/WWchg/src/packages.jl:183
 [10] load (repeats 2 times)
    @ ~/.julia/packages/GAP/WWchg/src/packages.jl:93 [inlined]
 [11] #29
    @ ~/.julia/dev/Oscar/src/Oscar.jl:83 [inlined]
 [12] withenv
    @ ./env.jl:197
 [13] __init__
    @ ~/.julia/dev/Oscar/src/Oscar.jl:82
 [14] register_restored_modules
    @ ./loading.jl:1115
 [15] _include_from_serialized
    @ ./loading.jl:1061
 [16] _require_search_from_serialized
    @ ./loading.jl:1506
 [17] _require
    @ ./loading.jl:1783
 [18] _require_prelocked
    @ ./loading.jl:1660
 [19] macro expansion
    @ ./loading.jl:1648 [inlined]
 [20] macro expansion
    @ ./lock.jl:267 [inlined]
 [21] require
    @ ./loading.jl:1611
 [22] eval
    @ ./boot.jl:370
 [23] #invokelatest#2
    @ ./essentials.jl:819
 [24] invokelatest
    @ ./essentials.jl:816
 [25] #114
    @ ~/runners/runner-1/_work/_tool/julia/1.9.4/x64/share/julia/stdlib/v1.9/Distributed/src/process_messages.jl:302
 [26] run_work_thunk
    @ ~/runners/runner-1/_work/_tool/julia/1.9.4/x64/share/julia/stdlib/v1.9/Distributed/src/process_messages.jl:70
 [27] run_work_thunk
    @ ~/runners/runner-1/_work/_tool/julia/1.9.4/x64/share/julia/stdlib/v1.9/Distributed/src/process_messages.jl:79
 [28] #100
    @ ./task.jl:514
during initialization of module Oscar
Stacktrace:
 [1] sync_end(c::Channel{Any})
   @ Base ./task.jl:445
 [2] macro expansion
   @ ./task.jl:477 [inlined]
 [3] remotecall_eval(m::Module, procs::Vector{Int64}, ex::Expr)
   @ Distributed ~/runners/runner-1/_work/_tool/julia/1.9.4/x64/share/julia/stdlib/v1.9/Distributed/src/macros.jl:219
 [4] top-level scope
   @ ~/runners/runner-1/_work/_tool/julia/1.9.4/x64/share/julia/stdlib/v1.9/Distributed/src/macros.jl:203
 [5] include(fname::String)
   @ Base.MainInclude ./client.jl:478
 [6] top-level scope
   @ none:6
in expression starting at /home/tester/.julia/dev/Oscar/test/runtests.jl:37
ERROR: Package Oscar errored during testing
Stacktrace:
 [1] pkgerror(msg::String)
   @ Pkg.Types ~/runners/runner-1/_work/_tool/julia/1.9.4/x64/share/julia/stdlib/v1.9/Pkg/src/Types.jl:69
 [2] test(ctx::Pkg.Types.Context, pkgs::Vector{Pkg.Types.PackageSpec}; coverage::Bool, julia_args::Cmd, test_args::Cmd, test_fn::Nothing, force_latest_compatible_version::Bool, allow_earlier_backwards_compatible_versions::Bool, allow_reresolve::Bool)
   @ Pkg.Operations ~/runners/runner-1/_work/_tool/julia/1.9.4/x64/share/julia/stdlib/v1.9/Pkg/src/Operations.jl:2019
 [3] test
   @ ~/runners/runner-1/_work/_tool/julia/1.9.4/x64/share/julia/stdlib/v1.9/Pkg/src/Operations.jl:1900 [inlined]
 [4] test(ctx::Pkg.Types.Context, pkgs::Vector{Pkg.Types.PackageSpec}; coverage::Bool, test_fn::Nothing, julia_args::Cmd, test_args::Vector{String}, force_latest_compatible_version::Bool, allow_earlier_backwards_compatible_versions::Bool, allow_reresolve::Bool, kwargs::Base.Pairs{Symbol, Base.TTY, Tuple{Symbol}, NamedTuple{(:io,), Tuple{Base.TTY}}})
   @ Pkg.API ~/runners/runner-1/_work/_tool/julia/1.9.4/x64/share/julia/stdlib/v1.9/Pkg/src/API.jl:441
 [5] test(pkgs::Vector{Pkg.Types.PackageSpec}; io::Base.TTY, kwargs::Base.Pairs{Symbol, Vector{String}, Tuple{Symbol}, NamedTuple{(:test_args,), Tuple{Vector{String}}}})
   @ Pkg.API ~/runners/runner-1/_work/_tool/julia/1.9.4/x64/share/julia/stdlib/v1.9/Pkg/src/API.jl:156
 [6] #test#75
   @ ~/runners/runner-1/_work/_tool/julia/1.9.4/x64/share/julia/stdlib/v1.9/Pkg/src/API.jl:144 [inlined]
 [7] test
   @ ~/runners/runner-1/_work/_tool/julia/1.9.4/x64/share/julia/stdlib/v1.9/Pkg/src/API.jl:144 [inlined]
 [8] #test#74
   @ ~/runners/runner-1/_work/_tool/julia/1.9.4/x64/share/julia/stdlib/v1.9/Pkg/src/API.jl:143 [inlined]
 [9] top-level scope
   @ REPL[6]:1

Expected behavior
Oscar should be able to load on all workers, and testing should continue

System (please complete the following information):

julia> Oscar.versioninfo(full=true)
OSCAR version 0.14.0-DEV - #master, 32b7e89448 -- 2023-12-08 10:17:27 +0100
  combining:
    AbstractAlgebra.jl   v0.33.0
    GAP.jl               v0.10.0
    Hecke.jl             v0.22.9
    Nemo.jl              v0.37.5
    Polymake.jl          v0.11.9
    Singular.jl          v0.20.0
  building on:
    Antic_jll               v0.201.500+0
    Arb_jll                 v200.2300.0+0
    Calcium_jll             v0.401.100+0
    FLINT_jll               v200.900.7+0
    GAP_jll                 v400.1200.200+7
    Singular_jll            v403.210.1000+0
    libpolymake_julia_jll   v0.11.2+0
    libsingular_julia_jll   v0.40.7+0
    polymake_jll            v400.1100.1+0
See `]st -m` for a full list of dependencies.

Julia Version 1.9.4
Commit 8e5136fa297 (2023-11-14 08:46 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 50 × QEMU Virtual CPU version 2.5+
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-14.0.6 (ORCJIT, nocona)
  Threads: 1 on 50 virtual cores
Official https://julialang.org/ release
...
@aaruni96 aaruni96 added the bug Something isn't working label Dec 8, 2023
@benlorenz
Copy link
Member

Source line src/Oscar.jl:83 is:

      GAP.Packages.load("browse"; install=true) # needed for all_character_table_names doctest

I think what happens here is that the installation of browse fails (maybe due to issues with the system-ncurses installation (I have seen that happen as well). If this does not work on the main process then this usually silently soft-fails and Oscar continues loading.
But now all workers will also try to reinstall browse and probably operate on the same files at the same time.
Maybe that operation could get some kind of pid-locking?

@aaruni96
Copy link
Member Author

aaruni96 commented Dec 8, 2023

Alright, installing libncurses-dev seems to have fixed it for now. Maybe we should also complain whenever GAP.Packages.load() returns false, instead of failing silently ?

@fingolfin
Copy link
Member

So there are multiple issues here (all of which should be fixed)

  1. Oscar.jl: we don't check the return value of GAP.Packages.load but should
  2. GAP.jl should do some kind of PID locking when installing packages (OK, might not be a problem here right now, but it's still something we should do, and not hard)
  3. there is a surprising requirement on libncurses-dev ...

Points 1 and 2 are quick to do. Point 3: the first part (installing libncurses-dev) also is easy to do, but then we also should do some or all of the rest I described...

@ThomasBreuer
Copy link
Member

@fingolfin
Currently there are four places in Oscar where GAP.Packages.load is called:

  • Oscar's __init__ loads several GAP packages and signals an error if this fails. The install=true parameter is not used here because the packages in question do not need to be compiled. This is fine.
  • experimental/StandardFiniteFields/test/runtests.jl tries to load the standardff package, without the install=true parameter. Here I see no problem, except that we could check for the result and put this check into a @test, in order to get a better message about test failures.
  • Oscar's __init__ (for the browse package) as well the __init__ function of the GaloisGrp module (for the ferret package) call the function with the install=true parameter, and do not check whether the installation is successful. My interpretation is that Oscar is asked to make the packages in question available if possible, but that it is not a problem if the package cannot be loaded. Note that the installation of a package is impossible if one is offline, and it is fatal if one cannot start Oscar in such a situation, for example in a train.
    Thus is it really desirable to check for the result and signal an error in the case of failure?

Concerning the occasional errors, I think the point is not the failure of the installation but the fact that several instances try to install the package in the same place, at the same time.
If I understand the suggestion right then the first instance that wants to install a package should set some lock in order to tell other instances that they should not try to install this package; what should they do then: wait until the lock disappears, then check that the package is in fact installed, and then return true?
(Do we have already situations where such a strategy is used? What shall happen if a lock does not get removed due to a crash? And would the right place for such a mechanism be GAP.jl or rather GAP's PackageManager package?)

Concerning the dependency on libncurses-dev, this fact is stated informally in the PackageInfo.g file of the browse package, but currently we have no mechanism to formalize such external dependencies on the GAP side.

@lgoettgens
Copy link
Member

Oscar's init (for the browse package) as well the init function of the GaloisGrp module (for the ferret package) call the function with the install=true parameter, and do not check whether the installation is successful. My interpretation is that Oscar is asked to make the packages in question available if possible, but that it is not a problem if the package cannot be loaded. Note that the installation of a package is impossible if one is offline, and it is fatal if one cannot start Oscar in such a situation, for example in a train.
Thus is it really desirable to check for the result and signal an error in the case of failure?

I think that we shouldn't terminate the session with an error in these cases due to the reasons already stated by you. But we could signal the situation to the user with e.g. a @warn.

@aaruni96
Copy link
Member Author

Maybe we should decide that child processes must never try to install gap packages? (Either the main process already successfully installed it and we don't need to install anymore, or the main process has already failed to install these, and we know in advance that the child processes will also fail to install these).

@ThomasBreuer
Copy link
Member

Maybe we should decide that child processes must never try to install gap packages? (Either the main process already successfully installed it and we don't need to install anymore, or the main process has already failed to install these, and we know in advance that the child processes will also fail to install these).

This sounds like a good idea.
What is the recommended way to distinguish the main process from child processes?

@benlorenz
Copy link
Member

Making sure child processes don't install GAP packages seems reasonable but is not really a solution.
It would work for Distributed.jl but when running Oscar with something like GNU parallel there are no "child" processes and users could still run into the same problem.

I think the quickest solution would be to use Pidfile.jl to make sure only one process can try to install packages at the same time.

Regarding ncurses, we do have Ncurses_jll in our dependencies so we could avoid the system package if we could pass the correct build flags for the compilation.

PS: I noticed that there is already another ticket about the same issue: oscar-system/GAP.jl#561

@ThomasBreuer
Copy link
Member

O.k., I am going to change GAP.Packages.install such that Pidfile.mkpidlock is used.
(There is already an analogous situation in GAP.Setup.regenerate_gaproot.)

@thofma
Copy link
Collaborator

thofma commented Dec 30, 2023

I have recently seen a lot of the following, when running the Oscar tests in parallel:

ERROR: LoadError: On worker 7:
InitError: GAP variable _JULIAINTERFACE_ERROR_BUFFER not bound
Stacktrace:
 [1] error
   @ ./error.jl:35
 [2] getproperty
   @ ~/repositories/bla/packages/GAP/WWchg/src/globals.jl:42
 [3] error_handler
   @ ~/repositories/bla/packages/GAP/WWchg/src/GAP.jl:72
during initialization of module GAP

Is this error included in the list above?

@benlorenz
Copy link
Member

The original error should not appear anymore with GAP 0.10.1 (oscar-system/GAP.jl#956). So I think this can be closed.

I haven't seen that _JULIAINTERFACE_ERROR_BUFFER so far, can't really say if this is related. Do you still get that error with the new GAP version?

Regarding the GAP Browse / ncurses stuff, there is oscar-system/GAP.jl#614.

@fingolfin
Copy link
Member

OK closing this for now

@fingolfin
Copy link
Member

Multi threading in OSCAR unfortunately still is not really supported. That said, I still wouldn't expect it to fail like that. But here is the wrong place to discuss this: please open a new issue at the GAP.jl repository including precise steps for reproducing (as in: how did you start Julia, what did you enter, and on which OS / setup).

If you can repro it with just using GAP, that'd be best, but it's also OK with using Oscar

Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working package: GAP
Projects
None yet
Development

No branches or pull requests

6 participants