Nondeterministic failures (Julia crashes) on Base Julia CI on tester_win64
#147
tester_win64
Is it possible to find a few other builds where this might have happened? It's hard to tell which test it is happening in too. @Gnimuc Might this be due to some of our recent |
I'm not sure. This commit might be related: JuliaSparse/SuiteSparse.jl@bb068bb. But it should be fixed.
It looks like the SuiteSparse.jl stdlib was last bumped on August 24: https://github.com/JuliaLang/julia/commits/master/stdlib/SuiteSparse.version So why are we just starting to see these failures now? |
@Gnimuc Where are you seeing the issue with |
That line just lists all of the test sets that Buildbot ran. E.g. Buildbot ran the If you scroll up in the log, you can see which test sets passed and which test sets failed. All of the test sets are passing except for SuiteSparse. For example, the |
The issue here is specifically that the Julia process is crashing sometime during the SuiteSparse test set. Scroll up to e.g. 525 of https://build.julialang.org/#/builders/65/builds/4081/steps/5/logs/stdio, where it says "Worker 7 terminated". That's where the Julia process is crashing. |
This cannot be the reason. If the size were wrong, the tests would fail every time.
How could I reproduce this locally? |
Do you have a Windows machine locally? If so, you could maybe try running the SuiteSparse tests in a |
Yes, I have a Windows machine. Should I build Julia in a cygwin environment or just use the nightly? |
I would build from source. Then, repeatedly run |
Should I add any other configuration? I'm running the test suite with both |
It seems relatively rare. You may need to run it for a long time. |
Should we disable the |
This doesn't have anything to do with |
sorry, I misread those logs. |
I'll try it again today. |
OK, another question. If |
No |
The crash happens in the first SuiteSparse test, which is detect_ambiguities.
How do you know the crash is happening in the first SuiteSparse test? |
Looks like Could this |
The tests table shows only 1 error before the crash, so I'm guessing it is the first or second test.
Just FYI, I'm not sure if that's the first occurrence. I stopped once I had a few examples. |
@Wimmerer It is worth reviewing the codebase in light of this comment by @Gnimuc:
This is equivalent to the following C code:

    void *x;
    void **y;
    x = lu.numeric;
    y = x;
    // then use y as if it's a pointer to a pointer,
    // which will trigger a segmentation fault because y is not correctly initialized

But what we should do is:

    void *x;
    void **y;
    x = lu.numeric;
    y = &x;

In Julia, as @Wimmerer mentioned above, we should use |
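The difference between the two snippets above can be made concrete with a small, self-contained C sketch. The names `make_handle`, `handle_free`, and `free_correctly` are hypothetical stand-ins for a library routine that takes a `void **` argument; they are not SuiteSparse functions, and the sketch only illustrates the `y = &x` pattern.

```c
#include <stdlib.h>

/* Hypothetical stand-in for a library routine that takes the ADDRESS of an
 * opaque handle, releases the object, and resets the caller's handle. */
static void handle_free(void **handle)
{
    if (handle == NULL || *handle == NULL) {
        return;              /* nothing to release */
    }
    free(*handle);           /* release the object the handle refers to */
    *handle = NULL;          /* clear the caller's copy of the handle */
}

/* Hypothetical constructor: returns an opaque handle to a small allocation. */
static void *make_handle(void)
{
    return calloc(1, 64);
}

/* Correct usage: pass &x, so the callee sees a real void* slot.
 * Returns 1 if the caller's handle is NULL afterwards, 0 otherwise. */
int free_correctly(void)
{
    void *x = make_handle();
    void **y = &x;           /* y points at the variable that holds the handle */
    handle_free(y);
    return x == NULL;
}

/* The broken variant would be `void **y = x;`. The callee would then read
 * the first bytes of the object itself as if they were a pointer and pass
 * them to free(), which is undefined behaviour. It is deliberately not
 * compiled here. */
```

With `y = &x`, the callee dereferences `y` and finds the handle itself. With `y = x`, it dereferences the handle and finds whatever the object happens to contain, which may or may not crash on a given run.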
UMFPACK win64 CI failure is back in https://build.julialang.org/#/builders/63/builds/6776/steps/5/logs/stdio It would be great to address this asap. @Gnimuc Any help would be appreciated - you chimed in here earlier as well, but could you take another look? |
I noticed some of these nondeterministic errors are triggered by It feels like the code path is randomly chosen on the buildbot machine. When the right version of the C function is called, no error occurs. If not, we get these nondeterministic errors. However, I can't find anything wrong in the code that will cause this behavior, so this is just a guess.
New nondeterministic errors related to CHOLMOD seem to have started happening recently. I'm not sure whether they are related.
What about this? Does it look good? SparseArrays.jl/src/solvers/umfpack.jl Line 533 in 7974069
Clearly something is overwriting pointers somewhere. @SobhanMP Since you have been in the umfpack.jl codebase recently, perhaps you may have some ideas as well. I am unable to reproduce this on my win64 VM (just like everyone else here). |
Better to use If SuiteSparse has some code blocks like:
then things could go wrong. |
@Gnimuc What about this comment: #147 (comment)? Is |
@Gnimuc afair the umfpack docs say that they don't care about the content of the pointers |
Good to know. I think initializing them to NULL is probably a good idea anyways - but it probably won't solve the issue we are facing. I updated my comment #147 (comment) about the contents of this pointer. How do we find out if the right int/long versions are being called - since we can't reproduce anywhere else. Perhaps throw in some printfs? @DilumAluthge @staticfloat It would be great if there were a way for us to bisect on the CI infrastructure, and see if this can be reproduced there. |
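To illustrate why the int/long question matters, here is a minimal C sketch of what a routine expecting 32-bit indices would effectively see if it were handed an array of 64-bit indices. The helper names `view_as_int32` and `trace_index_width` are hypothetical and exist only for this illustration; the tracing helper is one possible shape for the "throw in some printfs" idea, not an existing SuiteSparse facility.

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Copy the raw bytes of n 64-bit indices into a buffer of 2*n 32-bit
 * integers. This mimics what a 32-bit-index routine would read if it were
 * given an array of 64-bit indices. memcpy avoids aliasing problems. */
void view_as_int32(const int64_t *src, size_t n, int32_t *dst)
{
    memcpy(dst, src, n * sizeof(int64_t));
}

/* Hypothetical tracing helper: report which index width a call path uses. */
void trace_index_width(const char *name, size_t index_size)
{
    fprintf(stderr, "%s: index type is %zu bytes\n", name, index_size);
}
```

For example, the zero-based column pointers `{0, 2, 4}` stored as `int64_t` read back as `{0, 0, 2, 0, 4, 0}` through `view_as_int32` on a little-endian machine such as win64, so a routine built for 32-bit indices would see a very different matrix. Calling `trace_index_width("dl", sizeof(int64_t))` at the call site is one way such a check could be logged.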
at least in the case of https://build.julialang.org/#/builders/63/builds/6319/steps/5/logs/stdio

    a = SparseMatrixCSC(2, 2, [1, 3, 5], [1, 2, 1, 2], [1.0, 0.0, 0.0, 1.0])
    @test lu(a)\[2.0, 3.0] ≈ [2.0, 3.0]

it's calling the dl function which looks correct. maybe we have corruptions? one idea would be to call |
Merging and then bumping SparseArrays in julia so that the CI picks it up.
Do we have an RR trace of these crashes? I won't claim to be able to step through one even if we did, but it would give us a method to exactly replicate on our machines. |
No RR on windows. |
Okay let me try this on the actual buildbot machine. |
So alloc_solve! in the tests was still using the global controls/info. Maybe this will solve part of the nondeterministic bugs? Also, the config initialization was wrong (it used the default config for dl for all matrices), so maybe it's worth bumping the stdlib after #181.
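As a rough illustration of why shared control/info buffers can be surprising, here is a small C sketch that contrasts a single global info array with per-workspace arrays. Everything in it is hypothetical: `fake_solve`, the `workspace` struct, and the array sizes are made up for the example, and the thread does not confirm that this is the cause of the crashes.

```c
#include <string.h>

#define N_CONTROL 20   /* illustrative sizes, not taken from the thread */
#define N_INFO    90

/* Shared global state: every solve writes its status here. */
static double g_info[N_INFO];

/* Per-call state: each workspace carries its own arrays. */
typedef struct {
    double control[N_CONTROL];
    double info[N_INFO];
} workspace;

/* Hypothetical solve that records a status code in slot 0 of `info`. */
static void fake_solve(double status, double *info)
{
    info[0] = status;
}

/* Zero-initialise a workspace before use. */
void workspace_init(workspace *w)
{
    memset(w, 0, sizeof *w);
}

/* Two solves through the global buffer: the second overwrites the first. */
double global_status_after_two_solves(void)
{
    fake_solve(1.0, g_info);
    fake_solve(2.0, g_info);
    return g_info[0];        /* the status of the first solve is lost */
}

/* Two solves with separate workspaces: each status is preserved. */
double first_status_with_workspaces(void)
{
    workspace a, b;
    workspace_init(&a);
    workspace_init(&b);
    fake_solve(1.0, a.info);
    fake_solve(2.0, b.info);
    return a.info[0];
}
```

With the global buffer, whoever reads the status last sees only the most recent solve. With one workspace per factorization, each solve keeps its own controls and info.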
As soon as that passes again we'll merge. Hopefully the changes will fix it. Although now that we're getting this in Cholesky too, I'm worried.
but why only windows? what's special about it? do they have better memory checks than linux? |
There's also the issue of |
@Wimmerer Even if the bump fixes the issue temporarily, we still have the recent failing binaries, which we can use to bisect on the buildbot.
will do another patch, this one is already too big |
Manually running the UMFPACK tests on the Windows buildbot is going just fine for me right now. I'll keep trying some things. |
Run the whole Julia testsuite? |
That's what I'm doing now. |
The latest bump PR is failing in cholmod, but the umfpack tests are all passing. Also, we should get all the |
Here's another cholmod failure: https://build.julialang.org/#/builders/63/builds/6826/steps/5/logs/stdio With a more detailed stacktrace than we usually get. |
Note that this has not happened for a while, but it is good to keep this open in case we see it again.
Example log: https://build.julialang.org/#/builders/65/builds/4081