lapack/netlib: Dgesvd fails sporadically on travis #58

Closed
kortschak opened this issue Mar 13, 2019 · 19 comments · Fixed by #62

Comments

@kortschak
Member

kortschak commented Mar 13, 2019

What are you trying to do?

Run tests on travis

What did you do?

This is sporadic, so just run tests.

What did you expect to happen?

Tests pass.

What actually happened?

=== RUN   TestDgesvd
--- FAIL: TestDgesvd (22.21s)
	dgesvd.go:280: Case m=300,n=150,work=minimum,mtype=5,job=NoneU-NoneVT: singular values differ between full and partial SVD
		[9.979201547673596e+291 9.912673537355772e+291 9.846145527037948e+291 9.779617516720131e+291 9.713089506402294e+291 9.646561496084475e+291 9.58003348576665e+291 9.513505475448824e+291 9.446977465131015e+291 9.380449454813184e+291 9.313921444495359e+291 9.247393434177529e+291 9.180865423859704e+291 9.114337413541888e+291 9.047809403224055e+291 8.981281392906233e+291 8.914753382588414e+291 8.848225372270582e+291 8.78169736195276e+291 8.715169351634937e+291 8.648641341317112e+291 8.582113330999292e+291 8.515585320681462e+291 8.449057310363651e+291 8.38252930004582e+291 8.316001289727993e+291 8.249473279410173e+291 8.182945269092347e+291 8.116417258774523e+291 8.049889248456703e+291 7.983361238138882e+291 7.916833227821053e+291 7.850305217503229e+291 7.783777207185413e+291 7.717249196867579e+291 7.650721186549755e+291 7.584193176231936e+291 7.517665165914115e+291 7.451137155596282e+291 7.384609145278459e+291 7.31808113496064e+291 7.251553124642813e+291 7.18502511432499e+291 7.118497104007166e+291 7.051969093689343e+291 6.985441083371521e+291 6.918913073053692e+291 6.852385062735869e+291 6.785857052418048e+291 6.719329042100225e+291 6.652801031782399e+291 6.586273021464578e+291 6.51974501114675e+291 6.45321700082893e+291 6.3866889905111e+291 6.320160980193282e+291 6.25363296987546e+291 6.187104959557632e+291 6.120576949239806e+291 6.054048938921988e+291 5.987520928604158e+291 5.920992918286338e+291 5.854464907968514e+291 5.787936897650688e+291 5.721408887332862e+291 5.654880877015039e+291 5.58835286669722e+291 5.521824856379398e+291 5.455296846061568e+291 5.388768835743747e+291 5.322240825425919e+291 5.255712815108098e+291 5.189184804790275e+291 5.122656794472448e+291 5.056128784154621e+291 4.9896007738368e+291 4.923072763518977e+291 4.8565447532011496e+291 4.7900167428833285e+291 4.7234887325655013e+291 4.6569607222476785e+291 4.5904327119298547e+291 4.5239047016120347e+291 4.457376691294204e+291 4.3908486809763825e+291 4.3243206706585586e+291 4.257792660340736e+291 
4.19126465002291e+291 4.1247366397050876e+291 4.058208629387263e+291 3.9916806190694426e+291 3.925152608751615e+291 3.85862459843379e+291 3.792096588115969e+291 3.7255685777981444e+291 3.659040567480318e+291 3.592512557162495e+291 3.525984546844669e+291 3.4594565365268483e+291 3.392928526209022e+291 3.326400515891201e+291 3.259872505573379e+291 3.193344495255551e+291 3.126816484937728e+291 3.0602884746199035e+291 2.99376046430208e+291 2.9272324539842563e+291 2.8607044436664324e+291 2.794176433348609e+291 2.727648423030785e+291 2.6611204127129597e+291 2.5945924023951364e+291 2.528064392077313e+291 2.4615363817594898e+291 2.395008371441664e+291 2.3284803611238404e+291 2.261952350806016e+291 2.195424340488192e+291 2.1288963301703688e+291 2.0623683198525446e+291 1.9958403095347208e+291 1.9293122992168955e+291 1.8627842888990716e+291 1.7962562785812486e+291 1.7297282682634242e+291 1.6632002579456e+291 1.5966722476277756e+291 1.5301442373099517e+291 1.4636162269921276e+291 1.397088216674304e+291 1.330560206356479e+291 1.2640321960386543e+291 1.1975041857208318e+291 1.130976175403008e+291 1.064448165085184e+291 9.979201547673598e+290 9.313921444495358e+290 8.648641341317117e+290 7.983361238138882e+290 7.318081134960634e+290 6.652801031782398e+290 5.98752092860416e+290 5.322240825425917e+290 4.656960722247682e+290 3.991680619069446e+290 3.326400515891196e+290 2.6611204127129575e+290 1.9958403095347247e+290 1.330560206356472e+290 6.652801031782416e+289]
		[9.979201547673592e+291 9.912673537355789e+291 9.846145527037951e+291 9.779617516720115e+291 9.713089506402288e+291 9.646561496084495e+291 9.580033485766665e+291 9.513505475448798e+291 9.446977465130997e+291 9.380449454813188e+291 9.313921444495348e+291 9.247393434177519e+291 9.180865423859719e+291 9.114337413541878e+291 9.047809403224083e+291 8.981281392906244e+291 8.914753382588416e+291 8.8482253722706e+291 8.781697361952769e+291 8.715169351634926e+291 8.648641341317105e+291 8.582113330999301e+291 8.515585320681462e+291 8.449057310363647e+291 8.382529300045821e+291 8.316001289728003e+291 8.249473279410178e+291 8.182945269092372e+291 8.116417258774521e+291 8.049889248456695e+291 7.983361238138865e+291 7.916833227821076e+291 7.850305217503233e+291 7.783777207185418e+291 7.717249196867592e+291 7.650721186549764e+291 7.584193176231934e+291 7.51766516591411e+291 7.451137155596281e+291 7.384609145278462e+291 7.318081134960634e+291 7.25155312464282e+291 7.185025114324988e+291 7.118497104007171e+291 7.051969093689353e+291 6.985441083371527e+291 6.918913073053709e+291 6.85238506273588e+291 6.785857052418045e+291 6.719329042100219e+291 6.6528010317824e+291 6.586273021464575e+291 6.519745011146751e+291 6.453217000828928e+291 6.386688990511104e+291 6.3201609801933e+291 6.25363296987546e+291 6.187104959557637e+291 6.1205769492398e+291 6.054048938921985e+291 5.987520928604175e+291 5.920992918286336e+291 5.854464907968505e+291 5.787936897650686e+291 5.721408887332857e+291 5.654880877015043e+291 5.588352866697216e+291 5.521824856379395e+291 5.455296846061567e+291 5.388768835743743e+291 5.322240825425919e+291 5.255712815108099e+291 5.189184804790271e+291 5.122656794472444e+291 5.05612878415462e+291 4.989600773836802e+291 4.9230727635189707e+291 4.8565447532011496e+291 4.790016742883332e+291 4.7234887325655e+291 4.656960722247673e+291 4.590432711929848e+291 4.5239047016120347e+291 4.457376691294208e+291 4.3908486809763764e+291 4.32432067065856e+291 4.257792660340731e+291 
4.191264650022911e+291 4.1247366397050876e+291 4.058208629387262e+291 3.9916806190694393e+291 3.9251526087516115e+291 3.858624598433784e+291 3.792096588115969e+291 3.7255685777981455e+291 3.6590405674803183e+291 3.5925125571624966e+291 3.525984546844667e+291 3.459456536526841e+291 3.3929285262090206e+291 3.3264005158911973e+291 3.2598725055733734e+291 3.1933444952555457e+291 3.12681648493773e+291 3.0602884746199007e+291 2.99376046430208e+291 2.927232453984253e+291 2.8607044436664335e+291 2.7941764333486097e+291 2.727648423030785e+291 2.6611204127129614e+291 2.594592402395137e+291 2.5280643920773136e+291 2.461536381759483e+291 2.3950083714416623e+291 2.3284803611238384e+291 2.2619523508060112e+291 2.1954243404881896e+291 2.128896330170368e+291 2.062368319852541e+291 1.9958403095347208e+291 1.9293122992168974e+291 1.8627842888990725e+291 1.7962562785812453e+291 1.7297282682634225e+291 1.6632002579455984e+291 1.596672247627775e+291 1.5301442373099515e+291 1.463616226992128e+291 1.3970882166743046e+291 1.330560206356478e+291 1.2640321960386552e+291 1.1975041857208321e+291 1.1309761754030083e+291 1.0644481650851831e+291 9.979201547673584e+290 9.313921444495355e+290 8.648641341317111e+290 7.983361238138878e+290 7.318081134960634e+290 6.652801031782395e+290 5.9875209286041606e+290 5.322240825425924e+290 4.6569607222476846e+290 3.9916806190694446e+290 3.3264005158911934e+290 2.6611204127129568e+290 1.995840309534721e+290 1.3305602063564732e+290 6.65280103178235e+289]
	dgesvd.go:280: Case m=300,n=150,work=medium,mtype=5,job=NoneU-NoneVT: singular values differ between full and partial SVD
		[9.9792015476736e+291 9.912673537355772e+291 9.84614552703796e+291 9.779617516720141e+291 9.713089506402309e+291 9.646561496084488e+291 9.580033485766658e+291 9.513505475448833e+291 9.446977465131016e+291 9.380449454813189e+291 9.313921444495371e+291 9.24739343417753e+291 9.18086542385971e+291 9.114337413541885e+291 9.047809403224058e+291 8.981281392906233e+291 8.914753382588416e+291 8.848225372270585e+291 8.781697361952766e+291 8.715169351634938e+291 8.648641341317114e+291 8.582113330999295e+291 8.515585320681465e+291 8.449057310363648e+291 8.382529300045816e+291 8.316001289727993e+291 8.249473279410171e+291 8.182945269092349e+291 8.116417258774523e+291 8.0498892484567e+291 7.983361238138877e+291 7.916833227821048e+291 7.850305217503224e+291 7.78377720718541e+291 7.717249196867574e+291 7.650721186549754e+291 7.584193176231931e+291 7.517665165914112e+291 7.451137155596279e+291 7.384609145278455e+291 7.318081134960635e+291 7.251553124642812e+291 7.185025114324987e+291 7.118497104007164e+291 7.051969093689344e+291 6.985441083371521e+291 6.918913073053692e+291 6.852385062735868e+291 6.785857052418047e+291 6.719329042100223e+291 6.652801031782397e+291 6.586273021464578e+291 6.519745011146751e+291 6.453217000828927e+291 6.386688990511099e+291 6.320160980193281e+291 6.253632969875458e+291 6.187104959557632e+291 6.120576949239806e+291 6.054048938921988e+291 5.987520928604159e+291 5.920992918286338e+291 5.854464907968513e+291 5.787936897650688e+291 5.721408887332865e+291 5.65488087701504e+291 5.58835286669722e+291 5.521824856379395e+291 5.45529684606157e+291 5.388768835743747e+291 5.32224082542592e+291 5.255712815108099e+291 5.189184804790275e+291 5.122656794472448e+291 5.056128784154621e+291 4.9896007738368e+291 4.923072763518977e+291 4.8565447532011496e+291 4.790016742883328e+291 4.723488732565503e+291 4.6569607222476785e+291 4.5904327119298547e+291 4.5239047016120313e+291 4.457376691294207e+291 4.390848680976381e+291 4.324320670658558e+291 4.2577926603407353e+291 
4.19126465002291e+291 4.1247366397050865e+291 4.058208629387266e+291 3.991680619069442e+291 3.9251526087516143e+291 3.8586245984337904e+291 3.792096588115969e+291 3.725568577798143e+291 3.659040567480318e+291 3.592512557162494e+291 3.525984546844669e+291 3.4594565365268483e+291 3.3929285262090234e+291 3.326400515891201e+291 3.2598725055733773e+291 3.1933444952555495e+291 3.126816484937728e+291 3.060288474619903e+291 2.99376046430208e+291 2.927232453984257e+291 2.8607044436664324e+291 2.794176433348608e+291 2.727648423030785e+291 2.66112041271296e+291 2.5945924023951375e+291 2.5280643920773136e+291 2.4615363817594906e+291 2.3950083714416637e+291 2.3284803611238406e+291 2.2619523508060154e+291 2.1954243404881912e+291 2.128896330170368e+291 2.062368319852544e+291 1.9958403095347194e+291 1.9293122992168952e+291 1.862784288899071e+291 1.7962562785812475e+291 1.729728268263423e+291 1.6632002579455992e+291 1.5966722476277748e+291 1.530144237309951e+291 1.4636162269921268e+291 1.3970882166743023e+291 1.3305602063564785e+291 1.2640321960386543e+291 1.1975041857208314e+291 1.1309761754030078e+291 1.0644481650851829e+291 9.979201547673597e+290 9.313921444495354e+290 8.648641341317113e+290 7.98336123813888e+290 7.318081134960632e+290 6.652801031782392e+290 5.987520928604153e+290 5.3222408254259115e+290 4.656960722247676e+290 3.991680619069444e+290 3.326400515891196e+290 2.661120412712955e+290 1.9958403095347188e+290 1.3305602063564744e+290 6.652801031782429e+289]
		[9.979201547673623e+291 9.912673537355772e+291 9.846145527037935e+291 9.77961751672012e+291 9.713089506402302e+291 9.646561496084487e+291 9.580033485766671e+291 9.513505475448816e+291 9.44697746513101e+291 9.380449454813158e+291 9.313921444495355e+291 9.247393434177519e+291 9.1808654238597e+291 9.114337413541882e+291 9.047809403224076e+291 8.981281392906227e+291 8.914753382588417e+291 8.848225372270592e+291 8.781697361952758e+291 8.715169351634923e+291 8.648641341317102e+291 8.582113330999294e+291 8.51558532068145e+291 8.449057310363632e+291 8.382529300045824e+291 8.316001289727993e+291 8.249473279410174e+291 8.182945269092349e+291 8.116417258774524e+291 8.049889248456707e+291 7.983361238138884e+291 7.916833227821051e+291 7.850305217503221e+291 7.783777207185395e+291 7.717249196867579e+291 7.65072118654975e+291 7.584193176231924e+291 7.517665165914109e+291 7.451137155596269e+291 7.384609145278451e+291 7.318081134960647e+291 7.25155312464281e+291 7.185025114324981e+291 7.118497104007172e+291 7.051969093689343e+291 6.985441083371524e+291 6.918913073053686e+291 6.852385062735867e+291 6.785857052418038e+291 6.71932904210022e+291 6.652801031782395e+291 6.586273021464579e+291 6.519745011146748e+291 6.453217000828933e+291 6.386688990511086e+291 6.320160980193271e+291 6.253632969875448e+291 6.187104959557635e+291 6.120576949239814e+291 6.054048938921989e+291 5.987520928604148e+291 5.920992918286341e+291 5.854464907968516e+291 5.787936897650685e+291 5.721408887332871e+291 5.654880877015036e+291 5.588352866697227e+291 5.521824856379403e+291 5.45529684606157e+291 5.388768835743743e+291 5.322240825425913e+291 5.255712815108099e+291 5.189184804790274e+291 5.122656794472447e+291 5.056128784154617e+291 4.989600773836794e+291 4.923072763518973e+291 4.856544753201146e+291 4.790016742883332e+291 4.7234887325655e+291 4.6569607222476796e+291 4.590432711929865e+291 4.5239047016120286e+291 4.4573766912942047e+291 4.3908486809763803e+291 4.324320670658553e+291 4.257792660340729e+291 
4.1912646500229087e+291 4.1247366397050826e+291 4.058208629387271e+291 3.991680619069445e+291 3.92515260875162e+291 3.8586245984337916e+291 3.7920965881159705e+291 3.7255685777981405e+291 3.6590405674803166e+291 3.592512557162496e+291 3.5259845468446694e+291 3.4594565365268483e+291 3.3929285262090234e+291 3.3264005158912045e+291 3.25987250557338e+291 3.193344495255549e+291 3.126816484937728e+291 3.060288474619904e+291 2.9937604643020835e+291 2.927232453984253e+291 2.860704443666435e+291 2.7941764333486074e+291 2.727648423030784e+291 2.6611204127129586e+291 2.594592402395135e+291 2.5280643920773114e+291 2.4615363817594867e+291 2.3950083714416629e+291 2.3284803611238384e+291 2.2619523508060135e+291 2.1954243404881901e+291 2.1288963301703657e+291 2.0623683198525466e+291 1.9958403095347208e+291 1.929312299216895e+291 1.8627842888990697e+291 1.7962562785812467e+291 1.7297282682634228e+291 1.6632002579455962e+291 1.5966722476277762e+291 1.530144237309948e+291 1.463616226992124e+291 1.3970882166743012e+291 1.330560206356479e+291 1.264032196038656e+291 1.1975041857208336e+291 1.1309761754030081e+291 1.0644481650851837e+291 9.979201547673612e+290 9.313921444495354e+290 8.648641341317124e+290 7.983361238138884e+290 7.318081134960634e+290 6.6528010317824e+290 5.987520928604157e+290 5.322240825425916e+290 4.6569607222476756e+290 3.991680619069448e+290 3.326400515891198e+290 2.6611204127129554e+290 1.9958403095347185e+290 1.3305602063564767e+290 6.652801031782502e+289]
	dgesvd.go:280: Case m=300,n=150,work=optimum,mtype=5,job=NoneU-NoneVT: singular values differ between full and partial SVD
		[9.9792015476736e+291 9.912673537355772e+291 9.84614552703796e+291 9.779617516720141e+291 9.713089506402309e+291 9.646561496084488e+291 9.580033485766658e+291 9.513505475448833e+291 9.446977465131016e+291 9.380449454813189e+291 9.313921444495371e+291 9.24739343417753e+291 9.18086542385971e+291 9.114337413541885e+291 9.047809403224058e+291 8.981281392906233e+291 8.914753382588416e+291 8.848225372270585e+291 8.781697361952766e+291 8.715169351634938e+291 8.648641341317114e+291 8.582113330999295e+291 8.515585320681465e+291 8.449057310363648e+291 8.382529300045816e+291 8.316001289727993e

We never see this with the Gonum implementation. Also note that the output is truncated (this happens at other times as well, without this marked failure, and it suggests to me that OpenBLAS is doing something very bad behind the scenes).

What version of Go, Gonum, Gonum/netlib and C implementation are you using?

This was in the go1.9.x tests for #56, but I have restarted it.

Does this issue reproduce with the current master?

Yes.

@kortschak changed the title from "Dg" to "lapack/netlib: Dgesvd fails sporadically on travis" on Mar 13, 2019
@kortschak
Member Author

I have been unable to reproduce this locally, so I suspect we are tickling an unfortunate interaction between travis' virtualisation and OpenBLAS; possibly the kernel is killing it or it is aborting itself.

I will try running this on a VM when I get some time.

@martin-frbg I know how you love Gonum/OpenBLAS interaction bugs (and this is worse since it probably also involves travis), but have you seen anything like this or do you have any ideas?

@martin-frbg

Need more information - which version of OpenBLAS, and what CPU is provided by your CI?
(What I really love is how all the cute little buglets from up to ten years ago appear to have matured and become assertive thanks to more advanced compilers.)

@kortschak
Member Author

They seem to have settled down, but I'll add some additional instrumentation to the OpenBLAS install script to give that information for when we see it next. We currently log the OpenBLAS version only (OpenMathLib/OpenBLAS@4fc17d0).

@martin-frbg

Hmm. If you are actually tracking current develop you should already have all the recent fixes for register overwrites in the inline assembly. "Our" Travis setup uses Nehalem kernels but I have no idea if that is what everybody gets there.

@kortschak
Member Author

I have just caught one in adding the instrumentation. The complete log is here (raw), but for the version and cpuinfo:

OpenBLAS version:4fc17d0d754b7905667fb84a68cf37a0d28a93bd

cat /proc/cpuinfo
processor	: 0
vendor_id	: GenuineIntel
cpu family	: 6
model		: 63
model name	: Intel(R) Xeon(R) CPU @ 2.30GHz
stepping	: 0
microcode	: 0x1
cpu MHz		: 2300.000
cache size	: 46080 KB
physical id	: 0
siblings	: 2
core id		: 0
cpu cores	: 1
apicid		: 0
initial apicid	: 0
fpu		: yes
fpu_exception	: yes
cpuid level	: 13
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology nonstop_tsc eagerfpu pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm abm fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid xsaveopt arat
bugs		:
bogomips	: 4600.00
clflush size	: 64
cache_alignment	: 64
address sizes	: 46 bits physical, 48 bits virtual
power management:

processor	: 1
vendor_id	: GenuineIntel
cpu family	: 6
model		: 63
model name	: Intel(R) Xeon(R) CPU @ 2.30GHz
stepping	: 0
microcode	: 0x1
cpu MHz		: 2300.000
cache size	: 46080 KB
physical id	: 0
siblings	: 2
core id		: 0
cpu cores	: 1
apicid		: 1
initial apicid	: 1
fpu		: yes
fpu_exception	: yes
cpuid level	: 13
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology nonstop_tsc eagerfpu pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm abm fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid xsaveopt arat
bugs		:
bogomips	: 4600.00
clflush size	: 64
cache_alignment	: 64
address sizes	: 46 bits physical, 48 bits virtual
power management:

@martin-frbg

cpuid corresponds to HASWELL target (and I assume your OpenBLAS is either DYNAMIC_ARCH or built on/for this host).

@kortschak
Member Author

Yes, built on the host.

+++make FC=gfortran
+++make PREFIX=/home/travis/gopath/src/gonum.org/v1/netlib/.travis/OpenBLAS.cache install
make -j 2 -f Makefile.install install
make[1]: Entering directory `/home/travis/gopath/src/gonum.org/v1/netlib/OpenBLAS'
Generating openblas_config.h in /home/travis/gopath/src/gonum.org/v1/netlib/.travis/OpenBLAS.cache/include
Generating f77blas.h in /home/travis/gopath/src/gonum.org/v1/netlib/.travis/OpenBLAS.cache/include
Generating cblas.h in /home/travis/gopath/src/gonum.org/v1/netlib/.travis/OpenBLAS.cache/include
Copying LAPACKE header files to /home/travis/gopath/src/gonum.org/v1/netlib/.travis/OpenBLAS.cache/include
Copying the static library to /home/travis/gopath/src/gonum.org/v1/netlib/.travis/OpenBLAS.cache/lib
Copying the shared library to /home/travis/gopath/src/gonum.org/v1/netlib/.travis/OpenBLAS.cache/lib
Generating openblas.pc in /home/travis/gopath/src/gonum.org/v1/netlib/.travis/OpenBLAS.cache/lib/pkgconfig
Generating OpenBLASConfig.cmake in /home/travis/gopath/src/gonum.org/v1/netlib/.travis/OpenBLAS.cache/lib/cmake/openblas
Generating OpenBLASConfigVersion.cmake in /home/travis/gopath/src/gonum.org/v1/netlib/.travis/OpenBLAS.cache/lib/cmake/openblas
Install OK!

@martin-frbg

Any idea when you first saw this (i.e., is it a relatively recent regression)? DGESVD must be one of the "worst" functions in terms of BLAS usage, so this will probably take some time to track down. Is the "gonum implementation" you mentioned above completely self-contained, or does it use stock netlib LAPACK and BLAS?

@kortschak
Member Author

We haven't been touching this repo for a while, but we have been seeing a lot of it since @vladimir-ch started fixing up our sanity assertion for entry into the C code from Go. I can see errors like this going back about a month, but they may have existed for longer.

The Go Dgesvd is a pure Go implementation ported from the Fortran.

@martin-frbg

If it is sporadic and not reproducible outside of Travis, I am more inclined to think of it as a Travis bug (hardware or whatever) - unless the random component in your matrix generation (if I read your test code correctly) makes it happen only with very specific and relatively rare combinations of data. I assume dumping the original matrix to a file in the failure case to allow re-runs with the exact same data is not practicable?

@kortschak
Member Author

I think it is very likely to be either hardware bugs or concurrency bugs somewhere (maybe interactions between Go and OpenBLAS threading). We use defined random seeds, so the tests always run with the same data, but we could dump the matrix since we know the test case that fails. This would make it possible to write a pure C reproducer. My suspicion about the cause of the failure is that there is some memory corruption going on, and that this leads to test failures and then crashing. This means that dumping the input on failure is not guaranteed to work; it also means that the strategy above may not work.

Before I do try dumping out the failure locally, I'm going to try running on a VM here to see if that is the cause.

@martin-frbg

martin-frbg commented Mar 15, 2019

memcheck on a simple testcase (stolen from an old ATLAS bug) is clean, but helgrind does show some possible races between the dgemm microkernel and the dgemm_otcopy routine (at least when built without OpenMP support). You can build OpenBLAS with USE_SIMPLE_THREADED_LEVEL3=1 to work around this (at the cost of reduced performance).
(If this is the cause, it has been around "since forever")

@kortschak
Member Author

I have just tried running the blas and lapack test suite from this repo (and the pure Go implementations from the gonum/gonum repo) on a VirtualBox VM (KVM, VT-x/AMD-V and nested paging enabled) running ubuntu 16.04 hosted on my laptop.

The pure Go implementations pass, but the Cgo implementations calling OpenBLAS fail (including the blas test suite for complex float64 - which I have not seen before). The lapack tests fail in new ways and don't complete before they are killed for timeout (I suspect they are trapped in a loop for some reason).

It's looking like there is at least some kind of interaction between virtualisation, OpenBLAS and the Go runtime.

I'll try with the USE_SIMPLE_THREADED_LEVEL3=1 and see what that brings.

@kortschak
Member Author

With that build option, I still see travis failures.

@martin-frbg

Strange. I do not remember seeing hangs or other misbehaviour with OpenBLAS in VirtualBox (or qemu either). There was one somewhat recent issue with LAPACK calls looping indefinitely while trying to scale NaN values but this was before 3.8.0 - might help to know where they got killed. The other explanation for a hang would obviously be a thread deadlock.
If USE_SIMPLE_THREADED_LEVEL3 did not help it is much less likely to be a threading issue. Would probably need to do a git bisect, or at least try with the Haswell DGEMM changes from OpenMathLib/OpenBLAS#1921 reverted.

@kortschak
Member Author

Bisecting this will be a nightmare due to iteration delays. I have sent a ticket to travis to see if they can suggest something.

@kortschak
Member Author

I am a step closer to understanding this I think. There are two issues here, the logging output truncation and the deviation from expectations from Dgesvd. I am fairly sure that the truncation results from the use of t.Fatalf and its interaction with how we invoke the test suite; I think it is exiting travis' shell. I'll check this later today.

That leaves just the deviation from expectation, I will attempt to write a C reproducer for that from the failing input (there is only one which fails, though it does so in three separate test cases).

@kortschak
Member Author

Now that I have a clearer view, it seems that the deviation is really just a matter of a relative error of 1e-13 which is where we set our tolerance. I suspect that maybe the sporadic nature of it comes from occasional changes of ordering of float operations in the concurrent code. Why this happens only on travis and not locally is still confusing, but less troubling.

@martin-frbg

martin-frbg commented Mar 30, 2019

Thanks. I suspect the 1e-13 deviations could come from the (slowly) increasing use of FMA operations in the BLAS kernels. (OpenMathLib/OpenBLAS#1332 is an example where error propagation in a spurious feedback loop led to quite dramatic differences)
