lapack/netlib: Dgesvd fails sporadically on travis #58
Comments
I have been unable to repeat this locally, so I suspect we are tickling an unfortunate interaction between travis' virtualisation and OpenBLAS; possibly the kernel is killing it or it is aborting itself. I will try running this on a VM when I get some time. @martin-frbg I know how you love Gonum/OpenBLAS interactor bugs (and this is worse since it probably is also a travis interactor), but have you seen anything like this or have any ideas? |
Need more information - which version of OpenBLAS, what CPU is provided by your CI ? |
They seem to have settled down, but I'll add some additional instrumentation to the OpenBLAS install script to give that information for when we see it next. We currently log the OpenBLAS version only (OpenMathLib/OpenBLAS@4fc17d0). |
Hmm. If you are actually tracking current |
I have just caught one in adding the instrumentation. The complete log is here (raw), but for the version and cpuinfo:
|
cpuid corresponds to HASWELL target (and I assume your OpenBLAS is either DYNAMIC_ARCH or built on/for this host). |
Yes, built on the host.
|
Any idea when you saw this first (i.e., is it a relatively recent regression) ? DGESVD must be one of the "worst" functions in terms of BLAS usage so this will probably take some time to track down. Is the "gonum implementation" you mentioned above completely self-contained or does it use stock netlib LAPACK and BLAS ? |
We haven't been touching this repo for a while, but we have been seeing a lot of it since @vladimir-ch started fixing up our sanity assertion for entry into the C code from Go. I can see errors like this going back about a month, but they may have existed for longer. The Go Dgesvd is a pure Go implementation ported from the Fortran. |
If it is sporadic and not reproducible outside of Travis, I am more inclined to think of it as a Travis bug (hardware or whatever) - unless the random component in your matrix generation (if I read your test code correctly) makes it happen only with very specific and relatively rare combinations of data. I assume dumping the original matrix to a file in the failure case to allow re-runs with the exact same data is not practicable ? |
I think it is very likely to be either hardware bugs or concurrency bugs somewhere (maybe interactions between Go and OpenBLAS threading). We use defined random seeds, so the tests always run with the same data, but we could dump the matrix since we know the test case that fails. This would make it possible to write a pure C reproducer. My suspicion about the cause of the failure is that there is some memory corruption going on, and this leads to test failures and then crashing. This means that dumping the input on failure is not guaranteed to work; it also means that the strategy above may not work. Before I try dumping out the failure locally, I'm going to try running on a VM here to see if that is the cause. |
memcheck on a simple testcase (stolen from an old ATLAS bug) is clean, but helgrind does show some possible races between the dgemm microkernel and the dgemm_otcopy routine (at least when built without OpenMP support). You can build OpenBLAS with USE_SIMPLE_THREADED_LEVEL3=1 to work around this (at the cost of reduced performance). |
I have just tried running the blas and lapack test suite from this repo (and the pure Go implementations from the gonum/gonum repo) on a VirtualBox VM (KVM, VT-x/AMD-V and nested paging enabled) running ubuntu 16.04 hosted on my laptop. The pure Go implementations pass, but the Cgo implementations calling OpenBLAS fail (including the blas test suite for complex float64 - which I have not seen before). The lapack tests fail in new ways and don't complete before they are killed for timeout (I suspect they are trapped in a loop for some reason). It's looking like there is at least some kind of interaction between virtualisation, OpenBLAS and the Go runtime. I'll try with the USE_SIMPLE_THREADED_LEVEL3=1 and see what that brings. |
With that build option, I still see travis failures. |
Strange. I do not remember seeing hangs or other misbehaviour with OpenBLAS in VirtualBox (or qemu either). There was one somewhat recent issue with LAPACK calls looping indefinitely while trying to scale NaN values but this was before 3.8.0 - might help to know where they got killed. The other explanation for a hang would obviously be a thread deadlock. |
Bisecting this will be a nightmare due to iteration delays. I have sent a ticket to travis to see if they can suggest something. |
I am a step closer to understanding this I think. There are two issues here, the logging output truncation and the deviation from expectations from That leaves just the deviation from expectation, I will attempt to write a C reproducer for that from the failing input (there is only one which fails, though it does so in three separate test cases). |
Now that I have a clearer view, it seems that the deviation is really just a matter of a relative error of 1e-13 which is where we set our tolerance. I suspect that maybe the sporadic nature of it comes from occasional changes of ordering of float operations in the concurrent code. Why this happens only on travis and not locally is still confusing, but less troubling. |
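To illustrate how the order of float operations alone can move a result, here is a small sketch that sums the same values forwards and backwards and compares them with a relative tolerance. The helper names (`relErr`, `sumForward`, `sumBackward`) are made up for this example, and the 1e-13 tolerance is simply the figure mentioned above; this is not the actual test code.

```go
package main

import (
	"fmt"
	"math"
)

// relErr returns |got-want|/|want|, or the absolute error when want is zero.
func relErr(got, want float64) float64 {
	if want == 0 {
		return math.Abs(got)
	}
	return math.Abs(got-want) / math.Abs(want)
}

// sumForward adds the values from first to last.
func sumForward(x []float64) float64 {
	var s float64
	for _, v := range x {
		s += v
	}
	return s
}

// sumBackward adds the same values from last to first.
func sumBackward(x []float64) float64 {
	var s float64
	for i := len(x) - 1; i >= 0; i-- {
		s += x[i]
	}
	return s
}

func main() {
	const tol = 1e-13 // assumed tolerance, taken from the discussion
	x := []float64{0.1, 0.2, 0.3}
	f, b := sumForward(x), sumBackward(x)
	fmt.Println("bitwise equal:", f == b)
	fmt.Println("within tolerance:", relErr(f, b) < tol)
}
```

With only three terms the two sums already differ in the last bit while staying far inside the tolerance. A long chain of such operations, as in an SVD, can accumulate enough of these differences to land on either side of a tight threshold.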
Thanks. I suspect the 1e-13 deviations could come from the (slowly) increasing use of FMA operations in the BLAS kernels. (OpenMathLib/OpenBLAS#1332 is an example where error propagation in a spurious feedback loop led to quite dramatic differences) |
What are you trying to do?
Run tests on travis
What did you do?
This is sporadic, so just run tests.
What did you expect to happen?
Tests pass.
What actually happened?
We never see this with the gonum implementation, also note that the output truncates (this happens elsewhen as well - without this marked failure - and it suggests to me that OpenBLAS is doing something very bad behind the scenes).
What version of Go, Gonum, Gonum/netlib and C implementation are you using?
This was in the go1.9.x tests for #56, but I have restarted it.
Does this issue reproduce with the current master?
Yes.