Stokhos tests failing in Trilinos-atdm-white-ride-cuda-9.2-debug-pt build #3542
Comments
CC: @trilinos/framework Stokhos is not enabled in any of the app ATDM builds of Trilinos, so we could disable Stokhos in a Trilinos CUDA PR build and not lose anything (w.r.t. ATDM). Therefore, it is up to you whether to include Stokhos in Trilinos PR testing with a CUDA build.
I will look at the test failures. I would like to keep it enabled for testing if possible. Given the number of failures, something very low-level is breaking. My guess is BLAS/LAPACK and/or MPI, given the problems we have had with those on that architecture.
As I suspected, all of these tests are failing in the same place, inside the DSTEQR LAPACK function. If I replace the calls to this function with a local version from Netlib built alongside Stokhos, all of the tests pass.
From the CMake output, it looks like this build is linking in the Netlib BLAS/LAPACK, so it isn't clear to me why it is failing. Was that module possibly compiled with a different version of the compiler than the one used to build Trilinos? Note that the CUDA version shouldn't matter, as none of this is in CUDA code.
@etphipp, this env was likely installed by @nmhamster, so I will ask him for those details.
@etphipp wrote:
This looks familiar: #3338 (comment)
The LAPACK STEQR function, which computes eigenvalues/eigenvectors of symmetric tridiagonal matrices, segfaults on the IBM Power systems for some unknown reason. However, PTEQR, which does the same for symmetric positive definite (SPD) tridiagonal matrices, appears to work. So I replaced the call to STEQR in the recurrence basis implementation with PTEQR. This was more complicated than it sounds because the recurrence matrix is not positive definite, so I had to put in a shift to make it positive definite for PTEQR. This required loosening a few tight numerical tolerances in some tests.
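The shift described above can be illustrated with a small sketch. This is not the Stokhos change itself: the function names, the Gershgorin-based choice of shift, the `margin` parameter, and the Hermite-style example matrix are all assumptions made for illustration. The idea is that adding s to every diagonal entry of a symmetric tridiagonal matrix T moves every eigenvalue up by s and leaves the eigenvectors unchanged, so an SPD-only routine such as PTEQR can be applied to T + sI and the shift subtracted from the resulting eigenvalues.

```python
import math


def gershgorin_shift(diag, offdiag, margin=1.0):
    """Hypothetical helper: return s >= 0 such that T + s*I is positive definite.

    T is symmetric tridiagonal with main diagonal `diag` and off-diagonal
    `offdiag`. Gershgorin's theorem bounds every eigenvalue below by
    min_i(diag[i] - radius_i), so shifting by (margin - bound) places all
    eigenvalues at or above `margin`.
    """
    n = len(diag)
    lower = math.inf
    for i in range(n):
        radius = 0.0
        if i > 0:
            radius += abs(offdiag[i - 1])
        if i < n - 1:
            radius += abs(offdiag[i])
        lower = min(lower, diag[i] - radius)
    return max(0.0, margin - lower)


def is_positive_definite(diag, offdiag):
    """Check positive definiteness via the pivots of the LDL^T factorization."""
    pivot = diag[0]
    if pivot <= 0.0:
        return False
    for i in range(1, len(diag)):
        pivot = diag[i] - offdiag[i - 1] ** 2 / pivot
        if pivot <= 0.0:
            return False
    return True


# Assumed example: a 4x4 recurrence (Jacobi) matrix with zero diagonal,
# as arises for symmetric weight functions. It is indefinite.
n = 4
diag = [0.0] * n
offdiag = [math.sqrt(k) for k in range(1, n)]

shift = gershgorin_shift(diag, offdiag)
shifted_diag = [d + shift for d in diag]

print("original is SPD:", is_positive_definite(diag, offdiag))
print("shifted is SPD: ", is_positive_definite(shifted_diag, offdiag))
# Eigenvalues of the original matrix are those of the shifted matrix minus `shift`.
```

One plausible reason for the loosened tolerances is that subtracting the shift afterwards costs some relative accuracy in eigenvalues that are small compared with the shift.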
Automatically Merged using Trilinos Pull Request AutoTester
PR Title: Replace STEQR with PTEQR in Stokhos for issue #3542
PR Author: etphipp
PR #3741 was merged, which replaces STEQR with PTEQR in Stokhos. Based on my testing, the tests should now pass on white/ride for this build. Can someone verify and then update or close the issue?
As shown here, all 84 Stokhos tests are passing today, including the 66 newly passing Stokhos tests. Closing this as complete. @etphipp, thanks for fixing!
CC: @trilinos/stokhos, @rppawlo (Trilinos Nonlinear Solvers Product Area Lead)
Next Action Status
PR #3741, merged on 10/26/2018, replaced STEQR with PTEQR in Stokhos, and on 10/27/2018 all 84 Stokhos tests passed in the
Trilinos-atdm-white-ride-cuda-9.2-debug-pt
build on 'ride', including these 66 tests that went from failing to passing.
Description
The Stokhos package has 66 failing tests in the build
Trilinos-atdm-white-ride-cuda-9.2-debug-pt
on 'white' and 'ride' as shown here, which shows the failing tests:
The first failing test
Stokhos_AdaptivityToolsUnitTest_MPI_1
with detailed output shown here shows:
The output of several other tests that I looked at randomly all shows segfaults like the one above.
This is an important build because we are targeting it on 'white' and 'ride' as a Trilinos PR testing build (see #2464).
Steps to reproduce
One should be able to reproduce these build errors on either 'white' or 'ride' by cloning the Trilinos git repo, checking out the 'develop' branch, creating a build directory, and then doing: