Nondeterministic failures (Julia crashes) on Base Julia CI on tester_win64 #147

DilumAluthge opened this issue Oct 15, 2021 · 125 comments


Example log:

@DilumAluthge DilumAluthge added the bug Something isn't working label Oct 15, 2021
@DilumAluthge DilumAluthge changed the title Nondeterministic failures on Base Julia CI Nondeterministic failures on Base Julia CI on tester_win64 Oct 15, 2021

Is it possible to find a few other builds where this might have happened? It's hard to tell which test it is happening in too.

@Gnimuc Might this be due to some of our recent ccall changes where we are generating clang wrappers?

Gnimuc commented Oct 15, 2021

I'm not sure. This commit might be related: JuliaSparse/SuiteSparse.jl@bb068bb. But it should be fixed.

Is SuiteSparse.jl the only user of detect_ambiguities? I guess not. To me, this looks like detect_ambiguities itself is broken.

Contributor Author

It looks like the SuiteSparse.jl stdlib was last bumped on August 24:

So why are we just starting to see these failures now?

Contributor Author

@Gnimuc Where are you seeing the issue with detect_ambiguities?

Gnimuc commented Oct 15, 2021

caused by: failed process: ...
compiler/inference compiler/validation compiler/ssair compiler/irpasses compiler/codegen compiler/inline compiler/contextual subarray strings/basic strings/search strings/util strings/io strings/types unicode/utf8 core worlds atomics
keywordargs numbers subtype char triplequote intrinsics dict hashing iobuffer staged offsetarray arrayops tuple reduce reducedim abstractarray intfuncs simdloop vecelement rational bitarray copy math fastmath functional iterators operators ordering
path ccall parse loading gmp sorting spawn backtrace exceptions file read version namedtuple mpfr broadcast complex floatapprox reflection regex float16 combinatorics sysinfo env rounding ranges mod2pi euler show client errorshow sets goto llvmcall
llvmcall2 ryu some meta stacktraces docs misc threads stress binaryplatforms atexit enums cmdlineargs int interpreter checked bitset floatfuncs precompile boundscheck error cartesian osutils channels iostream secretbuffer specificity reinterpretarray
syntax corelogging missing asyncmap smallarrayshrink opaque_closure filesystem download SparseArrays/higherorderfns SparseArrays/sparse SparseArrays/sparsevector LinearAlgebra/triangular LinearAlgebra/qr LinearAlgebra/dense LinearAlgebra/matmul LinearAlgebra/schur
LinearAlgebra/special LinearAlgebra/eigen LinearAlgebra/bunchkaufman LinearAlgebra/svd LinearAlgebra/lapack LinearAlgebra/tridiag LinearAlgebra/bidiag LinearAlgebra/diagonal LinearAlgebra/cholesky LinearAlgebra/lu LinearAlgebra/symmetric LinearAlgebra/generic
LinearAlgebra/uniformscaling LinearAlgebra/lq LinearAlgebra/hessenberg LinearAlgebra/blas LinearAlgebra/adjtrans LinearAlgebra/pinv LinearAlgebra/givens LinearAlgebra/structuredbroadcast LinearAlgebra/addmul LinearAlgebra/ldlt LinearAlgebra/factorization
LibGit2/libgit2 Dates/accessors Dates/adjusters Dates/query Dates/periods Dates/ranges Dates/rounding Dates/types Dates/io Dates/arithmetic Dates/conversions ArgTools Artifacts Base64 CRC32c CompilerSupportLibraries_jll DelimitedFiles Distributed Downloads
FileWatching Future GMP_jll InteractiveUtils LLVMLibUnwind_jll LazyArtifacts LibCURL LibCURL_jll LibGit2_jll LibSSH2_jll LibUV_jll LibUnwind_jll Libdl Logging MPFR_jll Markdown MbedTLS_jll Mmap MozillaCACerts_jll NetworkOptions OpenBLAS_jll OpenLibm_jll
PCRE2_jll Printf Profile REPL Random SHA Serialization SharedArrays Sockets Statistics SuiteSparse SuiteSparse_jll TOML Tar Test UUIDs Unicode Zlib_jll dSFMT_jll libLLVM_jll libblastrampoline_jll nghttp2_jll p7zip_jll LibGit2/online download

Gnimuc commented Oct 15, 2021

This is probably caused by some compiler internal changes related to inference.

Contributor Author

That line just lists all of the test sets that Buildbot ran. E.g. Buildbot ran the ambiguous test set, the compiler/inference test set, etc.

If you scroll up in the log, you can see which test sets passed and which test sets failed.

All of the test sets are passing except for SuiteSparse. For example, the ambiguous test set is passing, the compiler/inference test set is passing, etc.

Contributor Author

DilumAluthge commented Oct 15, 2021

The issue here is specifically that the Julia process is crashing sometime during the SuiteSparse test set.

Scroll up to, e.g., line 525 of the log, where it says "Worker 7 terminated". That's where the Julia process is crashing.

@DilumAluthge DilumAluthge changed the title Nondeterministic failures on Base Julia CI on tester_win64 Nondeterministic failures (Julia crashes) on Base Julia CI on tester_win64 Oct 15, 2021

Gnimuc commented Oct 15, 2021

Maybe these lines are no longer valid.

This cannot be the reason. If the size were wrong, the tests would fail every time.

Gnimuc commented Oct 15, 2021

How could I reproduce this locally?

Contributor Author

Do you have a Windows machine locally?

If so, you could maybe try running the SuiteSparse tests in a while loop and waiting for it to crash?

Gnimuc commented Oct 15, 2021

Yes, I have a Windows machine. Should I build Julia in a cygwin environment or just use the nightly?

Contributor Author

I would build from source. Then, repeatedly run Base.runtests(["SuiteSparse"]).

Gnimuc commented Oct 15, 2021

Should I add any other configuration? I've been running the test suite with both julia -t 12 and julia -t 1 for about 20 minutes and haven't gotten a crash yet.

Contributor Author

It seems relatively rare. You may need to run it for a long time.

Should we disable the detect_ambiguities test for now? @KristofferC Do you know about this function?

Contributor Author

DilumAluthge commented Oct 15, 2021

This doesn't have anything to do with detect_ambiguities.

Gnimuc commented Oct 16, 2021

sorry, I misread those logs.

Gnimuc commented Oct 16, 2021

It seems relatively rare. You may need to run it for a long time.

I'll try it again today.

Gnimuc commented Oct 16, 2021

OK, another question. If Base.runtests(["SuiteSparse"]) could trigger a crash, how could I take a snapshot of the current stack for debugging later? Does rr support Windows now?

The crash happens in the first SuiteSparse test, which is detect_ambiguities.

Contributor Author

The crash happens in the first SuiteSparse test, which is detect_ambiguities.

How do you know the crash is happening in the first SuiteSparse test?

Gnimuc commented Oct 16, 2021

Looks like the first failure is triggered by JuliaLang/julia@a512f1a.

Could this OPENBLAS_MAIN_FREE env variable affect suitesparse or its testset?

It shows only 1 error and then crashes (in the tests table). So guessing it is the first or second test.

Contributor Author

Looks like the first failure is triggered by JuliaLang/julia@a512f1a. Could this OPENBLAS_MAIN_FREE env variable affect suitesparse or its testset?

Just FYI, I'm not sure if that's the first occurrence. I stopped once I had a few examples.

ViralBShah commented Jul 13, 2022

@Wimmerer It is worth reviewing the codebase in light of this comment by @Gnimuc:

#181 (comment)

but $_free_numeric(Ptr{Ptr{Cvoid}}(lu.numeric)) causes segfault,

This is equivalent to the following C code:

void *x;
void **y;
x = lu.numeric;
y = x;
// then use y as if it were a pointer to a pointer,
// which can trigger a segmentation fault because y does not hold the address of a void * variable

But what we should do is:

void *x;
void **y;
x = lu.numeric;
y = &x;

In Julia, as @Wimmerer mentioned above, we should use Ref(lu.numeric) to initialize the pointer.
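
The difference between `y = x` and `y = &x` above can be sketched in plain C. This is only an illustration: `demo_free_numeric` and `demo_correct_usage` are hypothetical names, not SuiteSparse functions, and `malloc` merely stands in for whatever creates the real Numeric object. The sketch assumes a "free" routine that takes the address of the caller's handle variable, releases the object, and resets the handle.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for a SuiteSparse-style free routine (not a real API):
   it receives the ADDRESS of the caller's handle, frees the object it refers
   to, and resets the caller's handle to NULL. */
void demo_free_numeric(void **handle)
{
    assert(handle != NULL);
    free(*handle);
    *handle = NULL;
}

/* Correct call pattern: pass &x, the address of a variable holding the handle.
   Returns 1 if the callee reset the handle, -1 if allocation failed. */
int demo_correct_usage(void)
{
    void *x = malloc(64);      /* plays the role of lu.numeric */
    if (x == NULL) {
        return -1;
    }
    demo_free_numeric(&x);     /* corresponds to y = &x in the snippet above */
    return x == NULL;
}

/* Wrong pattern (deliberately not executed): demo_free_numeric((void **)x)
   would make the callee read the first bytes of the object as if they were a
   pointer and free that value, which is undefined behavior. */
```

On the Julia side, the analogous move is to wrap the handle in a `Ref` so that the C routine receives the address of a slot containing the pointer, rather than reinterpreting the pointer value itself.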

UMFPACK win64 CI failure is back in

It would be great to address this asap. @Gnimuc Any help would be appreciated - you chimed in here earlier as well, but could you take another look?

Gnimuc commented Jul 13, 2022

      From worker 4:	umfpack_di_numeric at C:\buildbot\worker-tabularasa\tester_win64\build\bin\libumfpack.DLL (unknown line)
      From worker 4:	umfpack_di_numeric at C:\buildbot\worker-tabularasa\tester_win64\build\share\julia\stdlib\v1.9\SparseArrays\src\solvers\lib\x86_64-w64-mingw32.jl:1748
      From worker 4:	#umfpack_numeric!#12 at C:\buildbot\worker-tabularasa\tester_win64\build\share\julia\stdlib\v1.9\SparseArrays\src\solvers\umfpack.jl:490
      From worker 4:	umfpack_numeric! at C:\buildbot\worker-tabularasa\tester_win64\build\share\julia\stdlib\v1.9\SparseArrays\src\solvers\umfpack.jl:483 [inlined]
      From worker 4:	#lu#1 at C:\buildbot\worker-tabularasa\tester_win64\build\share\julia\stdlib\v1.9\SparseArrays\src\solvers\umfpack.jl:280
      From worker 4:	lu at C:\buildbot\worker-tabularasa\tester_win64\build\share\julia\stdlib\v1.9\SparseArrays\src\solvers\umfpack.jl:269
      From worker 4:	unknown function (ip: 0000000118b85c76)
      From worker 3:	umfpack_dl_numeric at C:\buildbot\worker-tabularasa\tester_win64\build\bin\libumfpack.DLL (unknown line)
      From worker 3:	umfpack_dl_numeric at C:\buildbot\worker-tabularasa\tester_win64\build\share\julia\stdlib\v1.9\SparseArrays\src\solvers\lib\x86_64-w64-mingw32.jl:1752
      From worker 3:	#umfpack_numeric!#14 at C:\buildbot\worker-tabularasa\tester_win64\build\share\julia\stdlib\v1.9\SparseArrays\src\solvers\umfpack.jl:490
      From worker 3:	umfpack_numeric! at C:\buildbot\worker-tabularasa\tester_win64\build\share\julia\stdlib\v1.9\SparseArrays\src\solvers\umfpack.jl:483 [inlined]
      From worker 3:	#lu#1 at C:\buildbot\worker-tabularasa\tester_win64\build\share\julia\stdlib\v1.9\SparseArrays\src\solvers\umfpack.jl:280
      From worker 3:	lu at C:\buildbot\worker-tabularasa\tester_win64\build\share\julia\stdlib\v1.9\SparseArrays\src\solvers\umfpack.jl:269 [inlined]

I noticed that some of these nondeterministic errors are triggered by umfpack_di_numeric (for Int32) or umfpack_dl_numeric (for Int64).

It feels like the code path is randomly chosen on the buildbot machine. When the right version of the C function is called, no error occurs. If not, we get these nondeterministic errors. However, I can't find anything wrong in the code that will cause this behavior, so this is just a guess.

      From worker 3:	Please submit a bug report with steps to reproduce this fault, and any error messages that follow (in their entirety). Thanks.
      From worker 3:	Exception: EXCEPTION_ACCESS_VIOLATION at 0xe1a03914 -- .text at C:\buildbot\worker-tabularasa\tester_win64\build\bin\libcholmod.DLL (unknown line)
      From worker 3:	in expression starting at C:\buildbot\worker-tabularasa\tester_win64\build\share\julia\stdlib\v1.9\SparseArrays\test\cholmod.jl:18
      From worker 3:	.text at C:\buildbot\worker-tabularasa\tester_win64\build\bin\libcholmod.DLL (unknown line)
      From worker 3:	read_triplet at C:\buildbot\worker-tabularasa\tester_win64\build\bin\libcholmod.DLL (unknown line)
      From worker 3:	cholmod_l_read_triplet at C:\buildbot\worker-tabularasa\tester_win64\build\bin\libcholmod.DLL (unknown line)
      From worker 3:	cholmod_l_read_sparse at C:\buildbot\worker-tabularasa\tester_win64\build\bin\libcholmod.DLL (unknown line)
      From worker 3:	cholmod_l_read_sparse at C:\buildbot\worker-tabularasa\tester_win64\build\share\julia\stdlib\v1.9\SparseArrays\src\solvers\lib\x86_64-w64-mingw32.jl:932
      From worker 3:	unknown function (ip: 0000000067bf671c)
      From worker 3:	_jl_invoke at /cygdrive/c/buildbot/worker/package_win64/build/src\gf.c:2393 [inlined]
      From worker 3:	ijl_apply_generic at /cygdrive/c/buildbot/worker/package_win64/build/src\gf.c:2575
      From worker 3:	read_sparse at C:\buildbot\worker-tabularasa\tester_win64\build\share\julia\stdlib\v1.9\SparseArrays\src\solvers\cholmod.jl:646
      From worker 3:	read_sparse at C:\buildbot\worker-tabularasa\tester_win64\build\share\julia\stdlib\v1.9\SparseArrays\src\solvers\cholmod.jl:651

New nondeterministic errors related to CHOLMOD seem to have started happening recently. Not sure whether they are related or not.

ViralBShah commented Jul 13, 2022

What about this? Does it look good?

tmp = Ref{Ptr{Cvoid}}()

Clearly something is overwriting pointers somewhere. @SobhanMP Since you have been in the umfpack.jl codebase recently, perhaps you may have some ideas as well. I am unable to reproduce this on my win64 VM (just like everyone else here).

Gnimuc commented Jul 13, 2022

Better to use tmp = Ref{Ptr{Cvoid}}(C_NULL) to make sure the pointer is 0-initialized.

If SuiteSparse has some code blocks like:

if (tmp == NULL) {
    /* ... */
} else {
    /* ... */
}

then things could go wrong.
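
A minimal sketch of that failure mode, under stated assumptions: `demo_create_if_null` is a hypothetical callee invented for illustration, and nothing here claims that SuiteSparse actually contains such a branch. It only shows why zero-initializing the output slot is the defensive choice when the callee might inspect it.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical callee that inspects the incoming handle before writing it.
   Returns 0 if it allocated a new object, 1 if it left the handle untouched,
   and -1 if allocation failed. */
int demo_create_if_null(void **out)
{
    assert(out != NULL);
    if (*out == NULL) {
        *out = malloc(16);
        return (*out != NULL) ? 0 : -1;
    } else {
        /* The callee assumes *out already refers to a live object. If *out was
           uninitialized garbage, the caller is left holding a bogus pointer. */
        return 1;
    }
}

/* Caller that zero-initializes the slot first, the C analogue of
   Ref{Ptr{Cvoid}}(C_NULL). Returns 1 when a valid object was created. */
int demo_null_initialized_call(void)
{
    void *tmp = NULL;
    int status = demo_create_if_null(&tmp);
    int ok = (status == 0 && tmp != NULL);
    free(tmp);
    return ok;
}
```

With an uninitialized `tmp`, which branch runs would depend on whatever bytes happened to be in that slot, which is the kind of behavior that could look nondeterministic from the outside.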

ViralBShah commented Jul 13, 2022

@Gnimuc What about this comment: #147 (comment)? Is tmp = Vector{Ptr{Cvoid}}(undef, 1) somehow safer than tmp = Ref{Ptr{Cvoid}}()?

SobhanMP commented Jul 13, 2022

@Gnimuc afair the umfpack docs say that they don't care about the content of the pointers

**Numeric is the address of a (void *) pointer variable in the user’s
calling routine (see Syntax, above). On input, the contents of this
variable are not defined. On output, this variable holds a (void *)
pointer to the Numeric object (if successful), or (void *) NULL if
a failure occurred

ViralBShah commented Jul 13, 2022

Good to know. I think initializing them to NULL is probably a good idea anyways - but it probably won't solve the issue we are facing. I updated my comment #147 (comment) about the contents of this pointer.

How do we find out if the right int/long versions are being called - since we can't reproduce anywhere else. Perhaps throw in some printfs?

@DilumAluthge @staticfloat It would be great if there were a way for us to bisect on the CI infrastructure, and see if this can be reproduced there.

at least in the case of
the failing test is

a = SparseMatrixCSC(2, 2, [1, 3, 5], [1, 2, 1, 2], [1.0, 0.0, 0.0, 1.0])
@test lu(a)\[2.0, 3.0] ≈ [2.0, 3.0]

It's calling the dl function, which looks correct. Maybe we have memory corruption? One idea would be to call umfpack_report_numeric/umfpack_report_symbolic after every test (with the level set to zero for no output). This would confirm that the library is still working, though it will require merging my pull request.

Merging and then bumping SparseArrays in julia so that the CI picks it up.

rayegun commented Jul 13, 2022

Do we have an RR trace of these crashes? I won't claim to be able to step through one even if we did, but it would give us a method to exactly replicate on our machines.

No RR on windows.

rayegun commented Jul 13, 2022

Okay let me try this on the actual buildbot machine.

So alloc_solve! in the tests was still using the global controls/info. Maybe this will solve part of the nondeterministic bugs? Also, the config initialization was wrong (it used the default config for dl for all matrices), so maybe it's worth bumping the stdlib after #181.

rayegun commented Jul 13, 2022

As soon as that passes again we'll merge. Hopefully the changes will fix it, although now that we're getting this in Cholesky too, I'm worried.

SobhanMP commented Jul 13, 2022

but why only windows? what's special about it? do they have better memory checks than linux?

There's also the issue of Clonglong on Win64, where long is 32 bit and you need long long for 64 bit. Although I did carefully check the interfaces generated and the way we compile suitesparse on win64 - and I didn't find any issue in the current codebase related to that.
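
The width difference mentioned here can be checked directly in C. On 64-bit Windows (the LLP64 data model) `long` is 32 bits, whereas on typical 64-bit Linux and macOS (LP64) it is 64 bits, and `long long` is at least 64 bits everywhere. The sketch below uses a hypothetical `demo_index_t` typedef to show one way to pin an index type to 64 bits regardless of platform; it is not the typedef SuiteSparse itself uses.

```c
#include <assert.h>
#include <limits.h>
#include <stdint.h>

/* Hypothetical 64-bit index type for illustration; fixed-width on every platform. */
typedef int64_t demo_index_t;

/* Width of `long` in bits: 32 on Win64 (LLP64), 64 on typical LP64 systems. */
int demo_bits_long(void)
{
    return (int)(sizeof(long) * CHAR_BIT);
}

/* Width of `long long` in bits: at least 64 per the C standard. */
int demo_bits_long_long(void)
{
    return (int)(sizeof(long long) * CHAR_BIT);
}

/* Width of the fixed-width index type: exactly 64 bits. */
int demo_bits_index(void)
{
    assert(sizeof(demo_index_t) * CHAR_BIT == 64);
    return (int)(sizeof(demo_index_t) * CHAR_BIT);
}
```

A binding that declares a 64-bit index argument as the platform `long` would pass only 32 bits on Win64, which is why the generated interfaces were worth checking even though no mismatch was found.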

@Wimmerer Even if the bump fixes the issue temporarily, we still have the recent binaries that fail to use to bisect on the buildbot.

@SobhanMP Either in your existing PR or a separate one, can you also initialize all those ** pointers to C_NULL as @Gnimuc suggested so that we can get that into the bump as well?

will do another patch, this one is already too big

rayegun commented Jul 13, 2022

Manually running the UMFPACK tests on the Windows buildbot is going just fine for me right now. I'll keep trying some things.

Run the whole Julia testsuite?

rayegun commented Jul 13, 2022

That's what I'm doing now.

ViralBShah commented Jul 14, 2022

The latest bump PR is failing in cholmod, but the umfpack tests are all passing. Also, we should get all the Refs in there to initialize to 0 etc.

Here's another cholmod failure:

With a more detailed stacktrace than we usually get.

Note that this has not happened for a while, but it is good to keep this open in case we see it again.

@vtjnash vtjnash closed this as completed Mar 17, 2023