Nondeterministic failures (Julia crashes) on Base Julia CI on tester_win64 #147

Closed
DilumAluthge opened this issue Oct 15, 2021 · 125 comments

Comments

@DilumAluthge
Contributor

Example log: https://build.julialang.org/#/builders/65/builds/4081

@DilumAluthge added the "bug" (Something isn't working) label on Oct 15, 2021
@DilumAluthge changed the title from "Nondeterministic failures on Base Julia CI" to "Nondeterministic failures on Base Julia CI on tester_win64" on Oct 15, 2021
@ViralBShah
Member

Is it possible to find a few other builds where this might have happened? It's hard to tell which test it is happening in too.

@Gnimuc Might this be due to some of our recent ccall changes where we are generating clang wrappers?

@Gnimuc
Member

Gnimuc commented Oct 15, 2021

I'm not sure. This commit might be related: JuliaSparse/SuiteSparse.jl@bb068bb. But it should be fixed.

Is SuiteSparse.jl the only user of detect_ambiguities? I guess not. To me, it looks like detect_ambiguities itself is broken.

@DilumAluthge
Contributor Author

It looks like the SuiteSparse.jl stdlib was last bumped on August 24: https://github.com/JuliaLang/julia/commits/master/stdlib/SuiteSparse.version

So why are we just starting to see these failures now?

@DilumAluthge
Contributor Author

@Gnimuc Where are you seeing the issue with detect_ambiguities?

@Gnimuc
Member

Gnimuc commented Oct 15, 2021

caused by: failed process: ...
ambiguous
compiler/inference compiler/validation compiler/ssair compiler/irpasses compiler/codegen compiler/inline compiler/contextual subarray strings/basic strings/search strings/util strings/io strings/types unicode/utf8 core worlds atomics
keywordargs numbers subtype char triplequote intrinsics dict hashing iobuffer staged offsetarray arrayops tuple reduce reducedim abstractarray intfuncs simdloop vecelement rational bitarray copy math fastmath functional iterators operators ordering
path ccall parse loading gmp sorting spawn backtrace exceptions file read version namedtuple mpfr broadcast complex floatapprox reflection regex float16 combinatorics sysinfo env rounding ranges mod2pi euler show client errorshow sets goto llvmcall
llvmcall2 ryu some meta stacktraces docs misc threads stress binaryplatforms atexit enums cmdlineargs int interpreter checked bitset floatfuncs precompile boundscheck error cartesian osutils channels iostream secretbuffer specificity reinterpretarray
syntax corelogging missing asyncmap smallarrayshrink opaque_closure filesystem download SparseArrays/higherorderfns SparseArrays/sparse SparseArrays/sparsevector LinearAlgebra/triangular LinearAlgebra/qr LinearAlgebra/dense LinearAlgebra/matmul LinearAlgebra/schur
LinearAlgebra/special LinearAlgebra/eigen LinearAlgebra/bunchkaufman LinearAlgebra/svd LinearAlgebra/lapack LinearAlgebra/tridiag LinearAlgebra/bidiag LinearAlgebra/diagonal LinearAlgebra/cholesky LinearAlgebra/lu LinearAlgebra/symmetric LinearAlgebra/generic
LinearAlgebra/uniformscaling LinearAlgebra/lq LinearAlgebra/hessenberg LinearAlgebra/blas LinearAlgebra/adjtrans LinearAlgebra/pinv LinearAlgebra/givens LinearAlgebra/structuredbroadcast LinearAlgebra/addmul LinearAlgebra/ldlt LinearAlgebra/factorization
LibGit2/libgit2 Dates/accessors Dates/adjusters Dates/query Dates/periods Dates/ranges Dates/rounding Dates/types Dates/io Dates/arithmetic Dates/conversions ArgTools Artifacts Base64 CRC32c CompilerSupportLibraries_jll DelimitedFiles Distributed Downloads
FileWatching Future GMP_jll InteractiveUtils LLVMLibUnwind_jll LazyArtifacts LibCURL LibCURL_jll LibGit2_jll LibSSH2_jll LibUV_jll LibUnwind_jll Libdl Logging MPFR_jll Markdown MbedTLS_jll Mmap MozillaCACerts_jll NetworkOptions OpenBLAS_jll OpenLibm_jll
PCRE2_jll Printf Profile REPL Random SHA Serialization SharedArrays Sockets Statistics SuiteSparse SuiteSparse_jll TOML Tar Test UUIDs Unicode Zlib_jll dSFMT_jll libLLVM_jll libblastrampoline_jll nghttp2_jll p7zip_jll LibGit2/online download

@Gnimuc
Member

Gnimuc commented Oct 15, 2021

This is probably caused by some compiler internal changes related to inference.

@DilumAluthge
Contributor Author

That line just lists all of the test sets that Buildbot ran. E.g. Buildbot ran the ambiguous test set, the compiler/inference test set, etc.

If you scroll up in the log, you can see which test sets passed and which test sets failed.

All of the test sets are passing except for SuiteSparse. For example, the ambiguous test set is passing, the compiler/inference test set is passing, etc.

@DilumAluthge
Contributor Author

DilumAluthge commented Oct 15, 2021

The issue here is specifically that the Julia process is crashing sometime during the SuiteSparse test set.

Scroll up to e.g. line 525 of https://build.julialang.org/#/builders/65/builds/4081/steps/5/logs/stdio, where it says "Worker 7 terminated". That's where the Julia process is crashing.

@DilumAluthge changed the title from "Nondeterministic failures on Base Julia CI on tester_win64" to "Nondeterministic failures (Julia crashes) on Base Julia CI on tester_win64" on Oct 15, 2021
@Gnimuc
Member

Gnimuc commented Oct 15, 2021

Maybe these lines are no longer valid.

This cannot be the reason. If the size were wrong, the tests would fail every time.

@Gnimuc
Member

Gnimuc commented Oct 15, 2021

How could I reproduce this locally?

@DilumAluthge
Contributor Author

Do you have a Windows machine locally?

If so, you could maybe try running the SuiteSparse tests in a while loop and waiting for it to crash?

@Gnimuc
Member

Gnimuc commented Oct 15, 2021

Yes, I have a Windows machine. Should I build Julia in a cygwin environment or just use the nightly?

@DilumAluthge
Contributor Author

I would build from source. Then, repeatedly run Base.runtests(["SuiteSparse"]).
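
The "run it repeatedly until it crashes" idea can be sketched as a small driver. This is only an illustration, written in C to match the other snippets in this thread: `run_until_failure` is a hypothetical helper, not part of Julia or its CI, and it assumes the command exits with a nonzero status when the tests fail or the process crashes.

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper: run `cmd` through the system shell up to `max_runs`
 * times. Returns the 1-based index of the first run that reports a nonzero
 * status (a failing or crashed run), or 0 if every run succeeded. */
int run_until_failure(const char *cmd, int max_runs)
{
    for (int run = 1; run <= max_runs; ++run) {
        int status = system(cmd);
        if (status != 0) {
            fprintf(stderr, "run %d ended with status %d\n", run, status);
            return run;
        }
    }
    return 0;
}
```

The command string would be whatever launches the locally built Julia and evaluates Base.runtests(["SuiteSparse"]); the exact quoting depends on the shell. Since the crash appears to be rare, `max_runs` probably needs to be large.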

@Gnimuc
Member

Gnimuc commented Oct 15, 2021

Should I add any other configuration? I've been running the test suite with both julia -t 12 and julia -t 1 for about 20 minutes and haven't gotten a crash yet.

@DilumAluthge
Contributor Author

It seems relatively rare. You may need to run it for a long time.

@ViralBShah
Member

Should we disable the detect_ambiguities test for now? @KristofferC Do you know about this function?

@DilumAluthge
Contributor Author

DilumAluthge commented Oct 15, 2021

This doesn't have anything to do with detect_ambiguities.

@Gnimuc
Member

Gnimuc commented Oct 16, 2021

Sorry, I misread those logs.

@Gnimuc
Member

Gnimuc commented Oct 16, 2021

> It seems relatively rare. You may need to run it for a long time.

I'll try it again today.

@Gnimuc
Member

Gnimuc commented Oct 16, 2021

OK, another question. If Base.runtests(["SuiteSparse"]) does trigger a crash, how could I take a snapshot of the current stack for debugging later? Does rr support Windows now?

@ViralBShah
Member

No

@ViralBShah
Member

The crash happens in the first SuiteSparse test, which is detect_ambiguities.

@DilumAluthge
Contributor Author

> The crash happens in the first SuiteSparse test, which is detect_ambiguities.

How do you know the crash is happening in the first SuiteSparse test?

@Gnimuc
Member

Gnimuc commented Oct 16, 2021

Looks like the first failure is triggered by JuliaLang/julia@a512f1a.

Could this OPENBLAS_MAIN_FREE env variable affect SuiteSparse or its testset?

@ViralBShah
Member

It shows only 1 error and then crashes (in the tests table). So I'm guessing it is the first or second test.

@DilumAluthge
Contributor Author

> Looks like the first failure is triggered by JuliaLang/julia@a512f1a. Could this OPENBLAS_MAIN_FREE env variable affect SuiteSparse or its testset?

Just FYI, I'm not sure if that's the first occurrence. I stopped once I had a few examples.

@ViralBShah
Member

ViralBShah commented Jul 13, 2022

@Wimmerer It is worth reviewing the codebase in light of this comment by @Gnimuc:

#181 (comment)

but $_free_numeric(Ptr{Ptr{Cvoid}}(lu.numeric)) causes segfault,

This is equivalent to the following C code:

void *x;
void **y;
x = lu.numeric;
y = x;
// then use y as if it's a pointer to a pointer,
// which will trigger a segmentation fault because y does not point to a pointer variable

But what we should do is:

void *x;
void **y;
x = lu.numeric;
y = &x;

In Julia, as @Wimmerer mentioned above, we should use Ref(lu.numeric) to initialize the pointer.
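
To make the difference between the two C snippets above concrete, here is a small self-contained sketch. The `mock_*` functions are hypothetical stand-ins and not SuiteSparse code; the only assumption is that the free routine takes the address of the handle, as the `void **` usage above implies.

```c
#include <stdlib.h>

/* Hypothetical stand-in for a routine that creates an opaque object. */
void *mock_make_numeric(void)
{
    return malloc(64);
}

/* Hypothetical stand-in for a "free" routine that takes the ADDRESS of a
 * handle, releases the object it refers to, and resets the handle. */
void mock_free_numeric(void **handle)
{
    if (handle != NULL && *handle != NULL) {
        free(*handle);
        *handle = NULL;
    }
}

/* Correct call pattern: pass &numeric (y = &x), never the handle value
 * itself reinterpreted as a void ** (y = x). */
int release_numeric_correctly(void)
{
    void *numeric = mock_make_numeric();
    if (numeric == NULL) return -1;   /* allocation failed */
    mock_free_numeric(&numeric);
    return numeric == NULL ? 0 : 1;   /* 0: the callee reset the handle */
}
```

Passing `(void **)numeric` instead would make the callee read `*handle` from inside the object's own memory and treat whatever it finds there as a pointer, which is the failure mode described above.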

@ViralBShah
Member

UMFPACK win64 CI failure is back in https://build.julialang.org/#/builders/63/builds/6776/steps/5/logs/stdio

It would be great to address this asap. @Gnimuc Any help would be appreciated - you chimed in here earlier as well, but could you take another look?

@Gnimuc
Member

Gnimuc commented Jul 13, 2022

  1. https://build.julialang.org/#/builders/63/builds/6789/steps/5/logs/stdio
      From worker 4:	umfpack_di_numeric at C:\buildbot\worker-tabularasa\tester_win64\build\bin\libumfpack.DLL (unknown line)
      From worker 4:	umfpack_di_numeric at C:\buildbot\worker-tabularasa\tester_win64\build\share\julia\stdlib\v1.9\SparseArrays\src\solvers\lib\x86_64-w64-mingw32.jl:1748
      From worker 4:	#umfpack_numeric!#12 at C:\buildbot\worker-tabularasa\tester_win64\build\share\julia\stdlib\v1.9\SparseArrays\src\solvers\umfpack.jl:490
      From worker 4:	umfpack_numeric! at C:\buildbot\worker-tabularasa\tester_win64\build\share\julia\stdlib\v1.9\SparseArrays\src\solvers\umfpack.jl:483 [inlined]
      From worker 4:	#lu#1 at C:\buildbot\worker-tabularasa\tester_win64\build\share\julia\stdlib\v1.9\SparseArrays\src\solvers\umfpack.jl:280
      From worker 4:	lu at C:\buildbot\worker-tabularasa\tester_win64\build\share\julia\stdlib\v1.9\SparseArrays\src\solvers\umfpack.jl:269
      From worker 4:	unknown function (ip: 0000000118b85c76)
  2. https://build.julialang.org/#/builders/63/builds/6319/steps/5/logs/stdio
      From worker 3:	umfpack_dl_numeric at C:\buildbot\worker-tabularasa\tester_win64\build\bin\libumfpack.DLL (unknown line)
      From worker 3:	umfpack_dl_numeric at C:\buildbot\worker-tabularasa\tester_win64\build\share\julia\stdlib\v1.9\SparseArrays\src\solvers\lib\x86_64-w64-mingw32.jl:1752
      From worker 3:	#umfpack_numeric!#14 at C:\buildbot\worker-tabularasa\tester_win64\build\share\julia\stdlib\v1.9\SparseArrays\src\solvers\umfpack.jl:490
      From worker 3:	umfpack_numeric! at C:\buildbot\worker-tabularasa\tester_win64\build\share\julia\stdlib\v1.9\SparseArrays\src\solvers\umfpack.jl:483 [inlined]
      From worker 3:	#lu#1 at C:\buildbot\worker-tabularasa\tester_win64\build\share\julia\stdlib\v1.9\SparseArrays\src\solvers\umfpack.jl:280
      From worker 3:	lu at C:\buildbot\worker-tabularasa\tester_win64\build\share\julia\stdlib\v1.9\SparseArrays\src\solvers\umfpack.jl:269 [inlined]

I noticed some of these nondeterministic errors are triggered by umfpack_di_numeric (for Int32) or umfpack_dl_numeric (for Int64).

It feels like the code path is randomly chosen on the buildbot machine. When the right version of the C function is called, no error occurs. If not, we get these nondeterministic errors. However, I can't find anything wrong in the code that will cause this behavior, so this is just a guess.

  1. https://build.julialang.org/#/builders/63/builds/6775/steps/5/logs/stdio
      From worker 3:	Please submit a bug report with steps to reproduce this fault, and any error messages that follow (in their entirety). Thanks.
      From worker 3:	Exception: EXCEPTION_ACCESS_VIOLATION at 0xe1a03914 -- .text at C:\buildbot\worker-tabularasa\tester_win64\build\bin\libcholmod.DLL (unknown line)
      From worker 3:	in expression starting at C:\buildbot\worker-tabularasa\tester_win64\build\share\julia\stdlib\v1.9\SparseArrays\test\cholmod.jl:18
      From worker 3:	.text at C:\buildbot\worker-tabularasa\tester_win64\build\bin\libcholmod.DLL (unknown line)
      From worker 3:	read_triplet at C:\buildbot\worker-tabularasa\tester_win64\build\bin\libcholmod.DLL (unknown line)
      From worker 3:	cholmod_l_read_triplet at C:\buildbot\worker-tabularasa\tester_win64\build\bin\libcholmod.DLL (unknown line)
      From worker 3:	cholmod_l_read_sparse at C:\buildbot\worker-tabularasa\tester_win64\build\bin\libcholmod.DLL (unknown line)
      From worker 3:	cholmod_l_read_sparse at C:\buildbot\worker-tabularasa\tester_win64\build\share\julia\stdlib\v1.9\SparseArrays\src\solvers\lib\x86_64-w64-mingw32.jl:932
      From worker 3:	unknown function (ip: 0000000067bf671c)
      From worker 3:	_jl_invoke at /cygdrive/c/buildbot/worker/package_win64/build/src\gf.c:2393 [inlined]
      From worker 3:	ijl_apply_generic at /cygdrive/c/buildbot/worker/package_win64/build/src\gf.c:2575
      From worker 3:	read_sparse at C:\buildbot\worker-tabularasa\tester_win64\build\share\julia\stdlib\v1.9\SparseArrays\src\solvers\cholmod.jl:646
      From worker 3:	read_sparse at C:\buildbot\worker-tabularasa\tester_win64\build\share\julia\stdlib\v1.9\SparseArrays\src\solvers\cholmod.jl:651

New nondeterministic errors related to CHOLMOD seem to have started happening recently. Not sure whether they are related or not.

@ViralBShah
Member

ViralBShah commented Jul 13, 2022

What about this? Does it look good?

tmp = Ref{Ptr{Cvoid}}()

Clearly something is overwriting pointers somewhere. @SobhanMP Since you have been in the umfpack.jl codebase recently, perhaps you may have some ideas as well. I am unable to reproduce this on my win64 VM (just like everyone else here).

@Gnimuc
Member

Gnimuc commented Jul 13, 2022

Better to use tmp = Ref{Ptr{Cvoid}}(C_NULL) to make sure the pointer is 0-initialized.

If SuiteSparse has some code blocks like:

if (tmp == NULL) {
   ...
}
else {
...
}

then things could go wrong.
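
A minimal sketch of that concern, assuming a purely hypothetical routine (`mock_get_or_create` is not SuiteSparse code) that inspects the handle it is given:

```c
#include <stdlib.h>

/* Hypothetical library routine that looks at the incoming handle: it
 * allocates only when *handle is NULL and otherwise assumes that *handle
 * already points to a valid object. */
int mock_get_or_create(void **handle)
{
    if (*handle == NULL) {
        *handle = calloc(1, 32);
        return *handle != NULL ? 1 : -1;   /* 1: created, -1: out of memory */
    }
    return 0;                              /* 0: existing object reused */
}
```

With `void *tmp = NULL;` the first call creates the object. With an uninitialized `tmp`, the routine would take the "reuse" branch on a garbage address, and whether that crashes would depend on whatever happened to be in memory.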

@ViralBShah
Member

ViralBShah commented Jul 13, 2022

@Gnimuc What about this comment: #147 (comment)? Is tmp = Vector{Ptr{Cvoid}}(undef, 1) somehow safer than tmp = Ref{Ptr{Cvoid}}()?

@SobhanMP
Member

SobhanMP commented Jul 13, 2022

@Gnimuc afair the umfpack docs say that they don't care about the content of the pointers

@SobhanMP
Member

@ViralBShah

**Numeric is the address of a (void *) pointer variable in the user’s
calling routine (see Syntax, above). On input, the contents of this
variable are not defined. On output, this variable holds a (void *)
pointer to the Numeric object (if successful), or (void *) NULL if
a failure occurred
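
The contract quoted above can be illustrated with a hypothetical routine (`mock_numeric` is a stand-in and does not reproduce UMFPACK's real argument list): the incoming contents of `*Numeric` are never read, and on return it holds either a new object or NULL.

```c
#include <stdlib.h>

/* Hypothetical routine following the quoted contract: the value of *Numeric
 * on input is ignored; on output it holds the new object, or NULL if a
 * failure occurred. A size of 0 is used here to simulate a failure. */
int mock_numeric(size_t n, void **Numeric)
{
    *Numeric = NULL;                  /* overwrite whatever the caller had */
    if (n == 0) return -1;            /* simulated failure */
    *Numeric = malloc(n);
    return *Numeric != NULL ? 0 : -1;
}
```

Under such a contract, initializing the handle to NULL before the call is harmless but not required; what matters is that the caller passes the address of a real pointer variable and checks the result afterwards.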

@ViralBShah
Member

ViralBShah commented Jul 13, 2022

Good to know. I think initializing them to NULL is probably a good idea anyways - but it probably won't solve the issue we are facing. I updated my comment #147 (comment) about the contents of this pointer.

How do we find out whether the right int/long versions are being called, given that we can't reproduce this anywhere else? Perhaps throw in some printfs?

@DilumAluthge @staticfloat It would be great if there were a way for us to bisect on the CI infrastructure, and see if this can be reproduced there.

@SobhanMP
Member

at least in the case of https://build.julialang.org/#/builders/63/builds/6319/steps/5/logs/stdio
the failing test is

 a = SparseMatrixCSC(2, 2, [1, 3, 5], [1, 2, 1, 2], [1.0, 0.0, 0.0, 1.0])
@test lu(a)\[2.0, 3.0] ≈ [2.0, 3.0]

It's calling the dl function, which looks correct. Maybe we have memory corruption? One idea would be to call
umfpack_report_numeric/umfpack_report_symbolic after every test (with level set to zero for no output). This will ensure that the library is working, though it will require merging my pull request.

@ViralBShah
Member

Merging and then bumping SparseArrays in julia so that the CI picks it up.

@rayegun
Member

rayegun commented Jul 13, 2022

Do we have an RR trace of these crashes? I won't claim to be able to step through one even if we did, but it would give us a method to exactly replicate on our machines.

@ViralBShah
Member

No RR on windows.

@rayegun
Member

rayegun commented Jul 13, 2022

Okay let me try this on the actual buildbot machine.

@SobhanMP
Member

So alloc_solve! in the tests was still using the global controls/info. Maybe this will solve part of the nondeterministic bugs? Also, the config initialization was wrong (it used the default config for dl for all matrices), so maybe it's worth bumping the stdlib after #181.

@rayegun
Member

rayegun commented Jul 13, 2022

As soon as that passes again we'll merge. Hopefully the changes will fix it. Although now that we're getting this in Cholesky too, I'm worried.

@SobhanMP
Member

SobhanMP commented Jul 13, 2022

But why only Windows? What's special about it? Does it have better memory checks than Linux?

@ViralBShah
Member

There's also the issue of Clonglong on Win64, where long is 32 bit and you need long long for 64 bit. Although I did carefully check the interfaces generated and the way we compile suitesparse on win64 - and I didn't find any issue in the current codebase related to that.
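
For reference, the width difference can be checked with a few lines of C. This is a generic sketch and not code from SuiteSparse or its Julia wrappers; the helper names are made up for illustration.

```c
#include <limits.h>
#include <stddef.h>

/* Width in bits of C `long` on the platform this is compiled for:
 * 32 on 64-bit Windows (LLP64), 64 on typical 64-bit Linux/macOS (LP64). */
int long_bits(void)
{
    return (int)(sizeof(long) * CHAR_BIT);
}

/* `long long` is at least 64 bits on every conforming implementation. */
int long_long_bits(void)
{
    return (int)(sizeof(long long) * CHAR_BIT);
}

/* Returns 1 if an index type of `index_bits` bits has the same width as
 * `long` on this platform, 0 if the widths disagree. */
int index_width_matches_long(int index_bits)
{
    return index_bits == long_bits();
}
```

On Win64, `index_width_matches_long(64)` returns 0, which is why a wrapper that passes 64-bit indices has to target entry points built with a 64-bit integer type rather than plain `long`. Julia's Clong follows the platform's C `long`, while Clonglong is 64 bits.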

@ViralBShah
Member

@Wimmerer Even if the bump fixes the issue temporarily, we still have the recent failing binaries, which we can use to bisect on the buildbot.

@ViralBShah
Member

@SobhanMP Either in your existing PR or a separate one, can you also initialize all those ** pointers to C_NULL as @Gnimuc suggested so that we can get that into the bump as well?

@SobhanMP
Member

will do another patch, this one is already too big

@rayegun
Member

rayegun commented Jul 13, 2022

Manually running the UMFPACK tests on the Windows buildbot is going just fine for me right now. I'll keep trying some things.

@ViralBShah
Member

Run the whole Julia testsuite?

@rayegun
Member

rayegun commented Jul 13, 2022

That's what I'm doing now.

@ViralBShah
Member

ViralBShah commented Jul 14, 2022

The latest bump PR is failing in cholmod, but the umfpack tests are all passing. Also, we should get all the Refs in there to initialize to 0 etc.

@ViralBShah
Member

Here's another cholmod failure: https://build.julialang.org/#/builders/63/builds/6826/steps/5/logs/stdio

With a more detailed stacktrace than we usually get.

@ViralBShah
Member

Note that this has not happened for a while, but it is good to keep this open in case we see it again.

@vtjnash closed this as completed on Mar 17, 2023