Stokhos tests failing in Trilinos-atdm-white-ride-cuda-9.2-debug-pt build #3542

Closed
bartlettroscoe opened this issue Oct 2, 2018 · 7 comments
Labels
  • ATDM Env Issue: Issue with ATDM build or test caused (at least partly) by the env, not a bug in Trilinos
  • client: ATDM: Any issue primarily impacting the ATDM project
  • PA: Nonlinear Solvers: Issues that fall under the Trilinos Nonlinear Linear Solvers Product Area
  • pkg: Stokhos
  • type: bug: The primary issue is a bug in Trilinos code or tests

Comments

@bartlettroscoe
Member

bartlettroscoe commented Oct 2, 2018

CC: @trilinos/stokhos, @rppawlo (Trilinos Nonlinear Solvers Product Area Lead)

Next Action Status

PR #3741, merged on 10/26/2018, replaced STEQR with PTEQR in Stokhos. On 10/27/2018, all 84 Stokhos tests passed in the Trilinos-atdm-white-ride-cuda-9.2-debug-pt build on 'ride', including the 66 tests that went from failing to passing.

Description

The Stokhos package has 66 failing tests in the Trilinos-atdm-white-ride-cuda-9.2-debug-pt build on 'white' and 'ride', as shown here. The failing tests are:

  • Stokhos_AdaptivityToolsUnitTest_MPI_1
  • Stokhos_AlgebraicExpansionUnitTest_MPI_1
  • Stokhos_BasisInteractionGraphUnitTest_MPI_1
  • Stokhos_division_example_MPI_1
  • Stokhos_DivisionOperatorUnitTest_MPI_1
  • Stokhos_ExponentialRandomFieldUnitTest_MPI_1
  • Stokhos_GramSchmidtUnitTest_MPI_1
  • Stokhos_hermite_example_MPI_1
  • Stokhos_HermiteBasisUnitTest_MPI_1
  • Stokhos_InterlacedMapUnitTest_MPI_2
  • Stokhos_InterlacedOpUnitTest_MPI_2
  • Stokhos_JacobiBasisUnitTest_MPI_1
  • Stokhos_KokkosArrayKernelsUnitTest_Cuda_MPI_1
  • Stokhos_KokkosArrayKernelsUnitTest_Serial_MPI_1
  • Stokhos_KokkosCrsMatrixUQPCEUnitTest_Cuda_MPI_1
  • Stokhos_KokkosCrsMatrixUQPCEUnitTest_Serial_MPI_1
  • Stokhos_KokkosViewUQPCEUnitTest_Cuda_MPI_1
  • Stokhos_KokkosViewUQPCEUnitTest_Serial_MPI_1
  • Stokhos_LanczosUnitTest_MPI_1
  • Stokhos_LegendreBasisUnitTest_MPI_1
  • Stokhos_LexicographicTreeBasisUnitTest_MPI_1
  • Stokhos_Linear2D_Diffusion_CG_AGS_MPI_2
  • Stokhos_Linear2D_Diffusion_GMRES_AGS_MPI_2
  • Stokhos_Linear2D_Diffusion_GMRES_AJ_MPI_2
  • Stokhos_Linear2D_Diffusion_GMRES_FA_MPI_2
  • Stokhos_Linear2D_Diffusion_GMRES_GS_MPI_2
  • Stokhos_Linear2D_Diffusion_GMRES_KL_MPI_2
  • Stokhos_Linear2D_Diffusion_GMRES_KLR_MPI_2
  • Stokhos_Linear2D_Diffusion_GMRES_KP_MPI_2
  • Stokhos_Linear2D_Diffusion_GMRES_Mean_Based_MPI_2
  • Stokhos_Linear2D_Diffusion_GS_MPI_2
  • Stokhos_Linear2D_Diffusion_GSLN_MPI_2
  • Stokhos_Linear2D_Diffusion_JA_MPI_2
  • Stokhos_Linear2D_Diffusion_LN_MPI_2
  • Stokhos_Linear2D_Diffusion_PCE_Example_MPI_2
  • Stokhos_Linear2D_Diffusion_PCE_Interlaced_Example_MPI_2
  • Stokhos_Linear2D_Diffusion_PCE_NOX_Example_MPI_2
  • Stokhos_LogNormalUnitTest_MPI_1
  • Stokhos_MatrixFreeOperatorUnitTest_MPI_1
  • Stokhos_NormalizedHermiteBasisUnitTest_MPI_1
  • Stokhos_NormalizedLegendreBasisUnitTest_MPI_1
  • Stokhos_nox_example_MPI_1
  • Stokhos_ProductBasisUtilsUnitTest_MPI_1
  • Stokhos_QuadExpansionUnitTest_MPI_1
  • Stokhos_QuadraturePseudoSpectralExpansionUnitTest_MPI_1
  • Stokhos_sacado_ensemble_example_MPI_1
  • Stokhos_sacado_example_MPI_1
  • Stokhos_SacadoETPCEUnitTest_MPI_1
  • Stokhos_SacadoPCECommTests_MPI_1
  • Stokhos_SacadoPCESerializationTests_MPI_1
  • Stokhos_SacadoPCEUnitTest_MPI_1
  • Stokhos_SacadoUQPCECommTests_MPI_1
  • Stokhos_SacadoUQPCESerializationTests_MPI_1
  • Stokhos_SacadoUQPCEUnitTest_MPI_1
  • Stokhos_SmolyakBasisUnitTest_MPI_1
  • Stokhos_SmolyakPseudoSpectralExpansionUnitTest_MPI_1
  • Stokhos_Sparse3TensorUnitTest_MPI_1
  • Stokhos_SparseGridQuadratureUnitTest_MPI_1
  • Stokhos_StieltjesUnitTest_MPI_1
  • Stokhos_TensorProductBasisUnitTest_MPI_1
  • Stokhos_TensorProductPseudoSpectralExpansionUnitTest_MPI_1
  • Stokhos_TensorProductPseudoSpectralOperatorUnitTest_MPI_1
  • Stokhos_TotalOrderBasisUnitTest_MPI_1
  • Stokhos_TpetraCrsMatrixUQPCEUnitTest_Cuda_MPI_4
  • Stokhos_TpetraCrsMatrixUQPCEUnitTest_Serial_MPI_4
  • Stokhos_uq_handbook_nonlinear_sg_example_MPI_1

The detailed output for the first failing test, Stokhos_AdaptivityToolsUnitTest_MPI_1 (shown here), shows:

Sorting tests by group name then by the order they were added ... (time = 7.44e-06)

Running unit tests ...

--------------------------------------------------------------------------
mpiexec noticed that process rank 0 with PID 0 on node white27 exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------

The output of several other tests that I spot-checked shows segfaults like the one above.

This build is important because we are targeting it on 'white' and 'ride' as a Trilinos PR testing build (see #2464).

Steps to reproduce

One should be able to reproduce these test failures on either 'white' or 'ride' by cloning the Trilinos git repo, checking out the 'develop' branch, creating a build directory, and then running:

$ cd <some_build_dir>/

$ source $TRILINOS_DIR/cmake/std/atdm/load-env.sh cuda-9.2-debug

$ cmake \
  -GNinja \
  -DTrilinos_CONFIGURE_OPTIONS_FILE:STRING=cmake/std/atdm/ATDMDevEnvAllPtPackages.cmake \
  -DTrilinos_ENABLE_TESTS=ON -DTrilinos_ENABLE_Stokhos=ON \
  $TRILINOS_DIR

$ make NP=16

$ bsub -x -Is -q rhel7F -n 16 ctest -j16
@bartlettroscoe bartlettroscoe added type: bug The primary issue is a bug in Trilinos code or tests pkg: Stokhos client: ATDM Any issue primarily impacting the ATDM project labels Oct 2, 2018
@bartlettroscoe
Member Author

CC: @trilinos/framework

@etphipp and @rppawlo,

Stokhos is not enabled in any of the ATDM app builds of Trilinos, so we could disable Stokhos in a Trilinos CUDA PR build and not lose anything (w.r.t. ATDM). Therefore, it is up to you whether to include Stokhos in Trilinos PR testing with a CUDA build.

@etphipp
Contributor

etphipp commented Oct 2, 2018

I will look at the test failures. I would like to keep it enabled for testing if possible. Given the number of failures, something very low-level is breaking. My guess is BLAS/LAPACK and/or MPI given the problems with those we have had on that architecture.

@etphipp
Contributor

etphipp commented Oct 2, 2018

As I suspected, all of these tests are failing in the same place, inside the DSTEQR LAPACK function. If I replace the calls to this function with a local version from Netlib built alongside Stokhos, all of the tests pass:

100% tests passed, 0 tests failed out of 84

Subproject Time Summary:
Stokhos    = 282.64 sec*proc (84 tests)

Total Test time (real) =  37.30 sec

It looks like from the CMake output that this build is linking in the Netlib BLAS/LAPACK, so it isn't clear to me why it is failing. Was that module possibly compiled with a different version of the compiler than is used to build Trilinos? Note that the CUDA version shouldn't matter as none of this is in CUDA code.

@bartlettroscoe
Member Author

> It looks like from the CMake output that this build is linking in the Netlib BLAS/LAPACK, so it isn't clear to me why it is failing. Was that module possibly compiled with a different version of the compiler than is used to build Trilinos?

@etphipp, this env was likely installed by @nmhamster so I will ask him those details.

@mhoemmen
Contributor

mhoemmen commented Oct 2, 2018

@etphipp wrote:

As I suspected, all of these tests are failing in the same place, inside the DSTEQR LAPACK function.

This looks familiar: #3338 (comment)

etphipp added a commit to etphipp/Trilinos that referenced this issue Oct 25, 2018
The LAPACK STEQR function, which computes eigenvalues/vectors of
symmetric, tridiagonal matrices, segfaults on the IBM Power systems for
some unknown reason.  However, PTEQR, which does the same for SPD
systems, appears to work.  So I replaced the call to STEQR in the
recurrence basis implementation with PTEQR.  This was more complicated
than it sounds because the recurrence matrix is not positive definite.
So I had to put in a shift to make it so for PTEQR.  This required
loosening a few tight numerical tolerances in some tests.
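
The shift described in the commit message can be illustrated with a small, self-contained sketch. This is not the Stokhos code and it does not call LAPACK. It assumes a Gershgorin-based shift (the commit does not say how the shift was chosen), uses the Legendre recurrence (Jacobi) matrix purely as an example of an indefinite symmetric tridiagonal matrix, and stands in a plain Sturm-count bisection for the eigenvalue solve. All function names are hypothetical.

```python
import math

def legendre_jacobi(n):
    """Diagonal and off-diagonal of the n x n recurrence (Jacobi) matrix
    for Legendre polynomials. Chosen only as an illustration."""
    diag = [0.0] * n
    off = [k / math.sqrt(4.0 * k * k - 1.0) for k in range(1, n)]
    return diag, off

def gershgorin_shift(diag, off):
    """Return sigma so that T + sigma*I is strictly diagonally dominant with
    a positive diagonal, hence SPD. Assumed strategy, not taken from Stokhos."""
    n = len(diag)
    lowest = min(
        diag[i]
        - (abs(off[i - 1]) if i > 0 else 0.0)
        - (abs(off[i]) if i < n - 1 else 0.0)
        for i in range(n)
    )
    return max(0.0, -lowest) + 1.0  # the +1.0 margin is arbitrary

def count_eigs_below(diag, off, x):
    """Sturm count: number of eigenvalues of T strictly below x, read off
    from the signs of the LDL^T pivots of T - x*I."""
    count = 0
    pivot = 1.0
    for i in range(len(diag)):
        sub = off[i - 1] ** 2 / pivot if i > 0 else 0.0
        pivot = diag[i] - x - sub
        if pivot == 0.0:
            pivot = 1e-300  # nudge an exact zero to avoid division by zero
        if pivot < 0.0:
            count += 1
    return count

def eigenvalues(diag, off, lo, hi, tol=1e-12):
    """All eigenvalues of T, assumed to lie in [lo, hi], in ascending order,
    found by bisection on the Sturm count (a stand-in for PTEQR)."""
    result = []
    for k in range(len(diag)):
        a, b = lo, hi
        while b - a > tol:
            mid = 0.5 * (a + b)
            if count_eigs_below(diag, off, mid) > k:
                b = mid
            else:
                a = mid
        result.append(0.5 * (a + b))
    return result

diag, off = legendre_jacobi(5)
sigma = gershgorin_shift(diag, off)
shifted = [d + sigma for d in diag]

# The original matrix has negative eigenvalues; the shifted one has none.
negatives_before = count_eigs_below(diag, off, 0.0)
negatives_after = count_eigs_below(shifted, off, 0.0)

# Solve the shifted (SPD) problem, then undo the shift.
eigs_shifted = eigenvalues(shifted, off, 0.0, 2.0 * sigma)
eigs = [lam - sigma for lam in eigs_shifted]

print("sigma =", sigma)
print("negative eigenvalues before/after shift:", negatives_before, negatives_after)
print("recovered eigenvalues:", eigs)
```

Subtracting the shift afterwards costs some relative accuracy for eigenvalues close to zero, which is at least consistent with the loosened test tolerances mentioned in the commit message.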
trilinos-autotester added a commit that referenced this issue Oct 26, 2018
Automatically Merged using Trilinos Pull Request AutoTester
PR Title: Replace STEQR with PTEQR in Stokhos for issue #3542
PR Author: etphipp
@etphipp
Contributor

etphipp commented Oct 26, 2018

PR #3741, which replaces STEQR with PTEQR in Stokhos, was merged. Based on my testing, the tests should now pass on white/ride for this build. Can someone verify and then update or close the issue?

@bartlettroscoe
Member Author

As shown here, all of the 84 Stokhos tests are passing today, including the 66 newly passing Stokhos tests (see the -66 subscript and +66 superscript by the number of passing and failing tests, respectively). You can also see these newly passing tests flagged here and here.

Closing this as complete.

@etphipp, thanks for fixing!

masterleinad pushed a commit to masterleinad/Trilinos that referenced this issue Nov 9, 2018
@bartlettroscoe bartlettroscoe added the ATDM Env Issue Issue with ATDM build or test caused (at least partly) by the env, not a bug in Trilinos label Nov 13, 2018
@bartlettroscoe bartlettroscoe added the PA: Nonlinear Solvers Issues that fall under the Trilinos Nonlinear Linear Solvers Product Area label Nov 30, 2018
tjfulle pushed a commit to tjfulle/Trilinos that referenced this issue Dec 6, 2018