Intrepid2 tests failing on several ATDM Trilinos builds on 'mutrino' (HSW and KNL), 'ride-cuda', 'sems-rhel7', 'serrano', and 'waterman-cuda', 'cts1' starting 2019-11-07 #6246

Closed
bartlettroscoe opened this issue Nov 7, 2019 · 17 comments
Assignees
Labels
ATDM Sev: Blocker Problems that make Trilinos unfit to be adopted by one or more ATDM APPs client: ATDM Any issue primarily impacting the ATDM project client: EMPIRE All issues that most directly target the ATDM EMPIRE code impacting: tests The defect (bug) is primarily a test failure (vs. a build failure) PA: Discretizations Issues that fall under the Trilinos Discretizations Product Area pkg: Intrepid2 type: bug The primary issue is a bug in Trilinos code or tests

Comments

@bartlettroscoe
Member

bartlettroscoe commented Nov 7, 2019

CC: @trilinos/intrepid2, @mperego (Trilinos Discretizations Product Lead), @bartlettroscoe, @fryeguy52

Next Action Status

Description

As shown in this query the tests:

  • Intrepid2_unit-test_Discretization_Basis_HierarchicalBases_Hierarchical_Basis_Tests_MPI_1
  • Intrepid2_unit-test_Discretization_Basis_HVOL_HEX_Cn_FEM_Serial_Test_01_SLFadDFad_MPI_1
  • Intrepid2_unit-test_Shared_Polylib_LegendreJacobiPolynomials_JacobiLegendrePolynomial_Tests_MPI_1

are failing in the builds:

  • Trilinos-atdm-mutrino-intel-opt-openmp-HSW
  • Trilinos-atdm-mutrino-intel-opt-openmp-KNL
  • Trilinos-atdm-sems-rhel7-cuda-9.2-Volta70-complex-shared-release-debug
  • Trilinos-atdm-serrano-intel-debug-openmp
  • Trilinos-atdm-serrano-intel-opt-openmp
  • Trilinos-atdm-waterman-cuda-9.2-debug
  • Trilinos-atdm-white-ride-cuda-10.1-gnu-7.2.0-release-debug
  • Trilinos-atdm-white-ride-cuda-9.2-gnu-7.2.0-debug

starting 2019-11-07 where the list of failing tests and builds were:

| Build Name | Test Name | Status | Time |
|---|---|---|---|
| Trilinos-atdm-mutrino-intel-opt-openmp-HSW | Intrepid2_unit-test_Discretization_Basis_HierarchicalBases_Hierarchical_Basis_Tests_MPI_1 | Failed | 9s 500ms |
| Trilinos-atdm-mutrino-intel-opt-openmp-HSW | Intrepid2_unit-test_Shared_Polylib_LegendreJacobiPolynomials_JacobiLegendrePolynomial_Tests_MPI_1 | Failed | 1s 590ms |
| Trilinos-atdm-mutrino-intel-opt-openmp-KNL | Intrepid2_unit-test_Discretization_Basis_HierarchicalBases_Hierarchical_Basis_Tests_MPI_1 | Failed | 55s 100ms |
| Trilinos-atdm-mutrino-intel-opt-openmp-KNL | Intrepid2_unit-test_Shared_Polylib_LegendreJacobiPolynomials_JacobiLegendrePolynomial_Tests_MPI_1 | Failed | 5s 570ms |
| Trilinos-atdm-sems-rhel7-cuda-9.2-Volta70-complex-shared-release-debug | Intrepid2_unit-test_Shared_Polylib_LegendreJacobiPolynomials_JacobiLegendrePolynomial_Tests_MPI_1 | Failed | 5s 20ms |
| Trilinos-atdm-waterman-cuda-9.2-debug | Intrepid2_unit-test_Shared_Polylib_LegendreJacobiPolynomials_JacobiLegendrePolynomial_Tests_MPI_1 | Failed | 4s 810ms |
| Trilinos-atdm-white-ride-cuda-10.1-gnu-7.2.0-release-debug | Intrepid2_unit-test_Discretization_Basis_HVOL_HEX_Cn_FEM_Serial_Test_01_SLFadDFad_MPI_1 | Failed | 10m 100ms |
| Trilinos-atdm-white-ride-cuda-9.2-gnu-7.2.0-debug | Intrepid2_unit-test_Shared_Polylib_LegendreJacobiPolynomials_JacobiLegendrePolynomial_Tests_MPI_1 | Failed | 3s 680ms |

As shown in this query, most of the test failures appear to be due to diffs failing to meet tolerances. The failing unit tests are:

Intrepid2_unit-test_Discretization_Basis_HierarchicalBases_Hierarchical_Basis_Tests_MPI_1:

The following tests FAILED:
    4. AnalyticPolynomialsMatch_double_double_Hierarchical_HGRAD_LINE_UnitTest ... 

Intrepid2_unit-test_Shared_Polylib_LegendreJacobiPolynomials_JacobiLegendrePolynomial_Tests_MPI_1:

The following tests FAILED:
    2. IntegratedLegendre_TwoPathsMatch_UnitTest ... 
    3. IntegratedLegendre_dtTwo

It looks like there are some pretty big diffs, for example in the unit test AnalyticPolynomialsMatch_double_double_Hierarchical_HGRAD_LINE_UnitTest, as shown here:

 Check: rel_err(actual, expected)
        = rel_err(1.11022e-17, 0) = 0.047619
          <= tol = 2.22045e-14 : FAILED
 values for -1 differ for field ordinal 3: expected 0; actual 1.11022e-17 (diff: -1.11022e-17)
 
 Check: rel_err(actual, expected)
        = rel_err(-1.58603e-17, 0) = 0.0666667
          <= tol = 2.22045e-14 : FAILED
 values for -1 differ for field ordinal 4: expected 0; actual -1.58603e-17 (diff: 1.58603e-17)
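
For reference, the logged numbers are consistent with a relative error of the form |actual - expected| / (eps + max(|actual|, |expected|)), with eps the double-precision machine epsilon and tol = 100 * eps. That formula is an assumption inferred from the output above, not something taken from the Trilinos source. Under that assumption, a minimal Python sketch shows why a value of about 1e-17 compared against an expected 0 produces a relative error many orders of magnitude above the tolerance:

```python
import sys

EPS = sys.float_info.epsilon   # double-precision machine epsilon, about 2.22045e-16
TOL = 100 * EPS                # matches the logged tol = 2.22045e-14


def rel_err(actual, expected, small=EPS):
    """Assumed relative-error formula (inferred from the log, not from Trilinos)."""
    return abs(actual - expected) / (small + max(abs(actual), abs(expected)))


# The two "actual" values reported for field ordinals 3 and 4, expected value 0.
for actual in (1.11022e-17, -1.58603e-17):
    err = rel_err(actual, 0.0)
    print(f"rel_err({actual:g}, 0) = {err:.6g}  passes: {err <= TOL}")
```

With an expected value of exactly 0, the denominator is dominated by eps, so any nonzero result larger than roughly 100 * eps * eps fails the check even though the absolute difference is far below double-precision round-off for O(1) quantities.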

And then there is some hard-to-see-what-is-failing test output for the unit test IntegratedLegendre_TwoPathsMatch_UnitTest, as shown here:

2. IntegratedLegendre_TwoPathsMatch_UnitTest ... 
 for polyOrder 2, x = 0, t = 0.2: -1.15648e-18 != 0 (diff = 1.15648e-18; tol = 2.22045e-14)
 for polyOrder 3, x = 0, t = 0.2: 1.83187e-19 != 9.71445e-21 (diff = 1.73472e-19; tol = 2.22045e-14)
 for polyOrder 4, x = 0, t = 0.2: -3.45953e-20 != -7.95495e-21 (diff = 2.66404e-20; tol = 2.22045e-14)
 for polyOrder 5, x = 0, t = 0.2: 8.39317e-21 != 2.01421e-22 (diff = 8.19175e-21; tol = 2.22045e-14)
 for polyOrder 6, x = 0, t = 0.2: -1.67405e-21 != -1.00874e-21 (diff = 6.65306e-22; tol = 2.22045e-14)
 for polyOrder 7, x = 0, t = 0.2: 3.82338e-22 != 3.25e-22 (diff = 5.73376e-23; tol = 2.22045e-14)
 for polyOrder 8, x = 0, t = 0.2: -8.77301e-23 != -5.72369e-23 (diff = 3.04932e-23; tol = 2.22045e-14)
 for polyOrder 9, x = 0, t = 0.2: 1.73751e-23 != 1.37628e-23 (diff = 3.61235e-24; tol = 2.22045e-14)
 for polyOrder 10, x = 0, t = 0.2: -3.53275e-24 != -3.4213e-24 (diff = 1.11452e-25; tol = 2.22045e-14)
 for polyOrder 2, x = 0, t = 0.4: -4.62593e-18 != 0 (diff = 4.62593e-18; tol = 2.22045e-14)
 for polyOrder 3, x = 0, t = 0.4: 1.46549e-18 != 7.77156e-20 (diff = 1.38778e-18; tol = 2.22045e-14)
 for polyOrder 4, x = 0, t = 0.4: -5.53525e-19 != -1.27279e-19 (diff = 4.26246e-19; tol = 2.22045e-14)
 for polyOrder 5, x = 0, t = 0.4: 2.68581e-19 != 6.44546e-21 (diff = 2.62136e-19; tol = 2.22045e-14)
 for polyOrder 6, x = 0, t = 0.4: -1.07139e-19 != -6.45595e-20 (diff = 4.25796e-20; tol = 2.22045e-14)
 for polyOrder 7, x = 0, t = 0.4: 4.89393e-20 != 4.16001e-20 (diff = 7.33921e-21; tol = 2.22045e-14)
 for polyOrder 8, x = 0, t = 0.4: -2.24589e-20 != -1.46526e-20 (diff = 7.80626e-21; tol = 2.22045e-14)
 for polyOrder 9, x = 0, t = 0.4: 8.89608e-21 != 7.04656e-21 (diff = 1.84952e-21; tol = 2.22045e-14)
 for polyOrder 10, x = 0, t = 0.4: -3.61754e-21 != -3.50341e-21 (diff = 1.14127e-22; tol = 2.22045e-14)
 for polyOrder 3, x = 0, t = 0.6: 1.42109e-18 != -1.35447e-18 (diff = 2.77556e-18; tol = 2.22045e-14)
 for polyOrder 4, x = 0, t = 0.6: -1.79856e-18 != 1.8398e-19 (diff = 1.98254e-18; tol = 2.22045e-14)
 for polyOrder 5, x = 0, t = 0.6: 1.01203e-18 != 2.51651e-20 (diff = 9.86865e-19; tol = 2.22045e-14)
 for polyOrder 6, x = 0, t = 0.6: -7.92033e-19 != 1.54018e-20 (diff = 8.07435e-19; tol = 2.22045e-14)
 for polyOrder 7, x = 0, t = 0.6: 5.68643e-19 != 1.88687e-20 (diff = 5.49774e-19; tol = 2.22045e-14)
 for polyOrder 8, x = 0, t = 0.6: -2.61462e-19 != -1.6288e-20 (diff = 2.45174e-19; tol = 2.22045e-14)
 for polyOrder 9, x = 0, t = 0.6: 1.42095e-19 != -8.92826e-21 (diff = 1.51023e-19; tol = 2.22045e-14)
 for polyOrder 10, x = 0, t = 0.6: -7.99345e-20 != 6.80169e-21 (diff = 8.67362e-20; tol = 2.22045e-14)
 for polyOrder 2, x = 0, t = 0.8: -1.85037e-17 != 0 (diff = 1.85037e-17; tol = 2.22045e-14)
 for polyOrder 3, x = 0, t = 0.8: 1.1724e-17 != 6.21725e-19 (diff = 1.11022e-17; tol = 2.22045e-14)
 for polyOrder 4, x = 0, t = 0.8: -8.85641e-18 != -2.03647e-18 (diff = 6.81994e-18; tol = 2.22045e-14)
 for polyOrder 5, x = 0, t = 0.8: 8.59461e-18 != 2.06255e-19 (diff = 8.38835e-18; tol = 2.22045e-14)
 for polyOrder 6, x = 0, t = 0.8: -6.8569e-18 != -4.13181e-18 (diff = 2.72509e-18; tol = 2.22045e-14)
 for polyOrder 7, x = 0, t = 0.8: 6.26423e-18 != 5.32481e-18 (diff = 9.39419e-19; tol = 2.22045e-14)
 for polyOrder 8, x = 0, t = 0.8: -5.74948e-18 != -3.75108e-18 (diff = 1.9984e-18; tol = 2.22045e-14)
 for polyOrder 9, x = 0, t = 0.8: 4.55479e-18 != 3.60784e-18 (diff = 9.46955e-19; tol = 2.22045e-14)
 for polyOrder 10, x = 0, t = 0.8: -3.70436e-18 != -3.5875e-18 (diff = 1.16866e-19; tol = 2.22045e-14)
 [FAILED]  (0.00569 sec) IntegratedLegendre_TwoPathsMatch_UnitTest
 Location: /lscratch1/jenkins/mutrino-slave/workspace/Trilinos-atdm-mutrino-intel-opt-openmp-HSW/SRC_AND_BUILD/Trilinos/packages/intrepid2/unit-test/Shared/Polylib/LegendreJacobiPolynomials/IntegratedLegendreTests.cpp:233

(I can't see why this failed because it looks like all of the diffs are smaller than the tols.)
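
To check that observation mechanically, here is a small Python sketch that parses lines in the format shown above and compares each logged diff against its logged tol. The function name `check_lines` and the extra `rel_diff` column are illustrative only. The relative difference is included as one possible hint at why a comparison might fail; what the test at IntegratedLegendreTests.cpp:233 actually checks is not shown in this issue.

```python
import re

NUM = r"[-+0-9.eE]+"
LINE_RE = re.compile(
    rf"for polyOrder (\d+), x = ({NUM}), t = ({NUM}): "
    rf"({NUM}) != ({NUM}) \(diff = ({NUM}); tol = ({NUM})\)"
)


def check_lines(lines):
    """Parse log lines and report absolute and relative differences per line."""
    records = []
    for line in lines:
        m = LINE_RE.search(line)
        if m is None:
            continue  # skip lines that are not in the expected format
        order = int(m.group(1))
        x, t, actual, expected, diff, tol = (float(g) for g in m.groups()[1:])
        scale = max(abs(actual), abs(expected))
        records.append({
            "polyOrder": order,
            "x": x,
            "t": t,
            "abs_ok": diff <= tol,  # is the logged diff within the logged tol?
            "rel_diff": abs(actual - expected) / scale if scale > 0 else 0.0,
        })
    return records


# Two lines copied from the output above.
SAMPLE = [
    " for polyOrder 2, x = 0, t = 0.2: -1.15648e-18 != 0 (diff = 1.15648e-18; tol = 2.22045e-14)",
    " for polyOrder 3, x = 0, t = 0.2: 1.83187e-19 != 9.71445e-21 (diff = 1.73472e-19; tol = 2.22045e-14)",
]

for rec in check_lines(SAMPLE):
    print(rec)
```

For these sample lines the absolute diffs are well within tol, while the relative differences are of order 1, which would fail any relative comparison against the same tolerance.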

And then the test Intrepid2_unit-test_Discretization_Basis_HVOL_HEX_Cn_FEM_Serial_Test_01_SLFadDFad_MPI_1 failed in the cuda-10.1 build as shown here with the error output:

[Intrepid2] Error in file /home/jenkins/ride/workspace/Trilinos-atdm-white-ride-cuda-10.1-gnu-7.2.0-release-debug/SRC_AND_BUILD/Trilinos/packages/intrepid2/src/Discretization/Basis/Intrepid2_BasisDef.hpp, line 781
            Test that evaluated to true: ( (spaceDim == 3) && ( (operatorType == OPERATOR_DIV) || (operatorType == OPERATOR_CURL) ) )
            >>> ERROR: (Intrepid2::getValues_HGRAD_Args) DIV and CURL are invalid operators for rank-0 (scalar) fields in 3D. 

Looking at the new commits pulled on 2019-11-07 here, it seems pretty likely that the commit 5d2a1c6:

5d2a1c6:  Intrepid2: implemented hierarchical bases on hexahedron, quadrilateral, and line, along with components to allow efficient implementation of tensor-product bases. (#5996)
Author: Nate Roberts <nvrober@sandia.gov>
Date:   Wed Nov 6 14:41:06 2019 -0700

triggered these failures.

Current Status on CDash

Steps to Reproduce

One should be able to reproduce these failures on any of the above listed machines described in:

More specifically, the commands given for the system are provided at:

The exact commands to reproduce this issue on any of these machines should be:

$ cd <some_build_dir>/

$ source $TRILINOS_DIR/cmake/std/atdm/load-env.sh <build-name>

$ cmake \
 -GNinja \
 -DTrilinos_CONFIGURE_OPTIONS_FILE:STRING=cmake/std/atdm/ATDMDevEnv.cmake \
 -DTrilinos_ENABLE_TESTS=ON -DTrilinos_ENABLE_Intrepid2=ON \
 $TRILINOS_DIR

$ make NP=16

$ <command-to-run-on-compute-node> ctest -j4

where you just fill in the <build-name> and the <command-to-run-on-compute-node> for the given build and machine as described in:
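
To run only the three failing tests rather than the full Intrepid2 suite, one option is to pass a regular expression to `ctest -R`. The Python helper below is a hypothetical convenience sketch, not part of the ATDM scripts. It only builds the command line from the test names listed in this issue; the `<command-to-run-on-compute-node>` prefix still has to be added for the given machine.

```python
import shlex

# Test names copied from the description above.
FAILING_TESTS = [
    "Intrepid2_unit-test_Discretization_Basis_HierarchicalBases_Hierarchical_Basis_Tests_MPI_1",
    "Intrepid2_unit-test_Discretization_Basis_HVOL_HEX_Cn_FEM_Serial_Test_01_SLFadDFad_MPI_1",
    "Intrepid2_unit-test_Shared_Polylib_LegendreJacobiPolynomials_JacobiLegendrePolynomial_Tests_MPI_1",
]


def ctest_filter_command(test_names, jobs=4):
    """Build a ctest argument list that selects exactly the given tests.

    The names here contain only letters, digits, '_' and '-', so they are
    joined into the regex without escaping.
    """
    pattern = "^(" + "|".join(test_names) + ")$"
    return ["ctest", f"-j{jobs}", "-R", pattern]


cmd = ctest_filter_command(FAILING_TESTS)
# Print a shell-quoted form that can be pasted after the compute-node prefix.
print(" ".join(shlex.quote(part) for part in cmd))
```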

@bartlettroscoe bartlettroscoe added type: bug The primary issue is a bug in Trilinos code or tests pkg: Intrepid2 client: ATDM Any issue primarily impacting the ATDM project client: EMPIRE All issues that most directly target the ATDM EMPIRE code ATDM Sev: Blocker Problems that make Trilinos unfit to be adopted by one or more ATDM APPs PA: Data Services Issues that fall under the Trilinos Data Services Product Area labels Nov 7, 2019
@bartlettroscoe
Member Author

FYI: Also failing on 'serrano'.

@bartlettroscoe bartlettroscoe changed the title Intrepid2 tests failing on several ATDM Trilinos builds on 'mutrino' (HSW and KNL), 'ride-cuda', 'sems-rhel7', and 'waterman-cuda' starting 2019-11-07 Intrepid2 tests failing on several ATDM Trilinos builds on 'mutrino' (HSW and KNL), 'ride-cuda', 'sems-rhel7', 'serrano', and 'waterman-cuda' starting 2019-11-07 Nov 8, 2019
@CamelliaDPG CamelliaDPG self-assigned this Nov 8, 2019
@CamelliaDPG
Contributor

Thanks, @bartlettroscoe!

@CamelliaDPG CamelliaDPG mentioned this issue Nov 8, 2019
@bartlettroscoe
Member Author

FYI: PR #6425 disabled these Intrepid2 tests for the build:

  • Trilinos-atdm-sems-rhel7-cuda-9.2-Volta70-complex-shared-release-debug

@bartlettroscoe bartlettroscoe added the impacting: tests The defect (bug) is primarily a test failure (vs. a build failure) label Dec 12, 2019
@bartlettroscoe
Member Author

FYI: Looks like the test:

  • Intrepid2_unit-test_Discretization_Basis_HVOL_HEX_Cn_FEM_Serial_Test_01_SLFadDFad_MPI_1

in the build:

  • Trilinos-atdm-white-ride-cuda-10.1-gnu-7.2.0-release-debug

is randomly hanging and timing out. When it passes, it passes quickly (7s). Otherwise, it is failing. It looks like it is hanging about every other day.

@bartlettroscoe
Member Author

FYI: As shown in this query, we are still seeing these three tests frequently failing on some platforms. Therefore, this issue is not resolved yet.

@bartlettroscoe
Member Author

FYI: I updated the links in the section "Current Status on CDash" above.

@CamelliaDPG
Contributor

@bartlettroscoe Thanks for letting us know. I assigned myself this issue and wrote PR #6248 to fix, and probably I remain the right person to address this, but I won't get to it immediately. I am making a note to look into this early in the new year.

@mperego
Contributor

mperego commented Dec 20, 2019

@CamelliaDPG thanks!

@bartlettroscoe
Member Author

bartlettroscoe commented Jan 7, 2020

As shown in this query, the test:

  • Intrepid2_unit-test_Shared_Polylib_LegendreJacobiPolynomials_JacobiLegendrePolynomial_Tests_MPI_1

is also failing in the new builds:

  • Trilinos-atdm-cts1-intel-18.0.2_openmpi-2.0.3_openmp_static_opt
  • Trilinos-atdm-cts1-intel-19.0.5_openmpi-4.0.1_openmp_static_opt

in a similar way to that shown above, again with some diffs, but in this case the unit tests that failed were:

The following tests FAILED:
    2. IntegratedLegendre_TwoPathsMatch_UnitTest ... 
    3. IntegratedLegendre_dtTwoPathsMatch_UnitTest ... 

@bartlettroscoe
Member Author

As shown in this query, this query, and this query, the test:

  • Intrepid2_unit-test_Discretization_Basis_HierarchicalBases_Hierarchical_Basis_Tests_MPI_1

is failing in the new build:

  • Trilinos-atdm-cts1-intel-19.0.5_openmpi-4.0.1_openmp_static_opt

showing the failing unit test:

 [FAILED]  (<sec> sec) AnalyticPolynomialsMatch_double_double_HierarchicalNodalComparisons_UnitTest

which appears to show diffs:

        = rel_err(2.77556e-17, 0) = 0.111111
          <= tol = 2.22045e-14 : FAILED

Therefore, this looks to be a diffing test, like reported for the other builds above.

NOTE: I can't see any more because when you try to look at the detailed test output like here, my browser just freezes. But you can get the raw test data here and it appears to be a massive amount of data. This must be a defect in CTest or something :-(

@bartlettroscoe bartlettroscoe changed the title Intrepid2 tests failing on several ATDM Trilinos builds on 'mutrino' (HSW and KNL), 'ride-cuda', 'sems-rhel7', 'serrano', and 'waterman-cuda' starting 2019-11-07 Intrepid2 tests failing on several ATDM Trilinos builds on 'mutrino' (HSW and KNL), 'ride-cuda', 'sems-rhel7', 'serrano', and 'waterman-cuda', 'cts1' starting 2019-11-07 Jan 8, 2020
@CamelliaDPG
Contributor

Closed by #6594.

@bartlettroscoe
Member Author

@CamelliaDPG, this query still shows the following failing tests:

| Site | Build Name | Test Name | Status | Time | Proc Time | Details | Build Time | Processors |
|---|---|---|---|---|---|---|---|---|
| ride | Trilinos-atdm-white-ride-cuda-10.1-gnu-7.2.0-release-debug | Intrepid2_unit-test_Discretization_Basis_HVOL_HEX_Cn_FEM_Serial_Test_01_SLFadDFad_MPI_1 | Failed | 10m 70ms | 10m 70ms | Completed (Timeout) | 2020-01-23T03:02:41 MST | 1 |
| attaway | Trilinos-atdm-cts1-intel-18.0.2_openmpi-2.0.3_openmp_static_opt | Intrepid2_unit-test_Shared_Polylib_LegendreJacobiPolynomials_JacobiLegendrePolynomial_Tests_MPI_1 | Failed | 1s 390ms | 1s 390ms | Completed (Failed) | 2020-01-23T02:08:40 MST | 1 |
| eclipse | Trilinos-atdm-cts1-intel-19.0.5_openmpi-4.0.1_openmp_static_opt | Intrepid2_unit-test_Shared_Polylib_LegendreJacobiPolynomials_JacobiLegendrePolynomial_Tests_MPI_1 | Failed | 2s 140ms | 2s 140ms | Completed (Failed) | 2020-01-23T02:08:46 MST | 1 |
| mutrino | Trilinos-atdm-mutrino-intel-opt-openmp-HSW | Intrepid2_unit-test_Shared_Polylib_LegendreJacobiPolynomials_JacobiLegendrePolynomial_Tests_MPI_1 | Failed | 4s 280ms | 4s 280ms | Completed (Failed) | 2020-01-23T03:16:57 MST | 1 |
| mutrino | Trilinos-atdm-mutrino-intel-opt-openmp-KNL | Intrepid2_unit-test_Shared_Polylib_LegendreJacobiPolynomials_JacobiLegendrePolynomial_Tests_MPI_1 | Failed | 5s 680ms | 5s 680ms | Completed (Failed) | 2020-01-23T02:06:53 MST | 1 |

Reopening.

@bartlettroscoe bartlettroscoe added the PA: Discretizations Issues that fall under the Trilinos Discretizations Product Area label Jan 23, 2020
@bartlettroscoe bartlettroscoe removed the PA: Data Services Issues that fall under the Trilinos Data Services Product Area label Jan 23, 2020
@mperego
Contributor

mperego commented Feb 1, 2020

@CamelliaDPG can you please look into the failing hierarchical tests?

jmgate pushed a commit to tcad-charon/Trilinos that referenced this issue Feb 2, 2020
…s:develop' (2bfd2c7).

* trilinos-develop: (129 commits)
  Intrepid2: Temporarily disabling tests failing on some machines (Issue trilinos#6246)
  ATDM: cee-rhel6: Change from openmpi-4.0.1 to openmpi-4.0.2 (ATDV-289)
  Add test for periodic parallel fpp decomposition
  IOSS: cgns - fix handling of periodic single-block models in parallel
  Tpetra: fixing small typo
  Moertel: fix warnings due to signed/unsigned comparison, see issue trilinos#6698
  Incoprporating changes suggested in PR
  Tpetra: adding benchmark for CrsMatrix::apply, see issue trilinos#6692
  Add clang-7.0.1 explicitly in parsing (ATDV-291)
  Tpetra::CrsGraph: Fix build errors relating to unique_ptr
  MueLu ParameterListInterpreter test: Remove and regenerate xml
  Tpetra::CrsGraph::makeIndicesLocal: Add verbose output
  Tpetra::CrsMatrix::globalAssemble: Add verbose debugging output
  Tpetra::CrsMatrix::fillLocalMatrix: Add verbose output
  Tpetra::CrsMatrix::fillLocalGraphAndMatrix: Add verbose output
  Tpetra::CrsMatrix: Add more verbose output on allocation
  Tpetra: Add verbose debugging output to padCrsArrays
  PyTrilinos: Update swig interface files for SWIG 4.0
  PyTrilinos: Update build system for SWIG 4.0
  PyTrilinos: Replace SWIGEMPTYHACK with PYTRILINOS_NULLSTR
  ...
jmgate pushed a commit to tcad-charon/Trilinos that referenced this issue Feb 2, 2020
…s:develop' (2bfd2c7).

* trilinos-develop: (130 commits)
  Intrepid2: Temporarily disabling tests failing on some machines (Issue trilinos#6246)
  Tempus: Add Improved Doc for DIRK
  ATDM: cee-rhel6: Change from openmpi-4.0.1 to openmpi-4.0.2 (ATDV-289)
  Add test for periodic parallel fpp decomposition
  IOSS: cgns - fix handling of periodic single-block models in parallel
  Tpetra: fixing small typo
  Moertel: fix warnings due to signed/unsigned comparison, see issue trilinos#6698
  Incoprporating changes suggested in PR
  Tpetra: adding benchmark for CrsMatrix::apply, see issue trilinos#6692
  Add clang-7.0.1 explicitly in parsing (ATDV-291)
  Tpetra::CrsGraph: Fix build errors relating to unique_ptr
  MueLu ParameterListInterpreter test: Remove and regenerate xml
  Tpetra::CrsGraph::makeIndicesLocal: Add verbose output
  Tpetra::CrsMatrix::globalAssemble: Add verbose debugging output
  Tpetra::CrsMatrix::fillLocalMatrix: Add verbose output
  Tpetra::CrsMatrix::fillLocalGraphAndMatrix: Add verbose output
  Tpetra::CrsMatrix: Add more verbose output on allocation
  Tpetra: Add verbose debugging output to padCrsArrays
  PyTrilinos: Update swig interface files for SWIG 4.0
  PyTrilinos: Update build system for SWIG 4.0
  ...
@CamelliaDPG
Contributor

@mperego, yes, I'll take a look.

@CamelliaDPG
Contributor

@mperego, the Hierarchical Basis tests look like they only failed once recently on white, and not on any other testbeds, and when they did (2020-01-25 07:59:09), every Intrepid2 test failed with errors like the following:

terminate called after throwing an instance of 'std::runtime_error'
  what():  cudaGetDeviceCount( & m_cudaDevCount ) error( cudaErrorInsufficientDriver): CUDA driver version is insufficient for CUDA runtime version /home/jenkins/white/workspace/Trilinos-atdm-white-ride-cuda-10.1-gnu-7.2.0-release-debug/SRC_AND_BUILD/Trilinos/packages/kokkos/core/src/Cuda/Kokkos_Cuda_Instance.cpp:204
Traceback functionality not available

--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpiexec noticed that process rank 0 with PID 0 on node white35 exited on signal 6 (Aborted).
--------------------------------------------------------------------------

This suggests a problem with the execution environment, likely a transient problem, not an issue with the tests.

On the other hand, the JacobiLegendrePolynomial_Tests do still appear to have some architectural sensitivities; I'll work on eliminating those and work up a PR to re-enable both sets of tests.

jmgate pushed a commit to tcad-charon/Trilinos that referenced this issue Feb 4, 2020
…s:develop' (2bfd2c7).

* trilinos-develop: (177 commits)
  Add a fix for a stk cmake file
  Promote atdm ats2 gnu+dbg and cuda+gnu+dbg to 'Specialized' (CDOFA-72)
  Intrepid2: remove unnecessary finalize calls in unit tests
  Disable STEQR() LAPACK test on ats2 deug builds (trilinos#2410, trilinos#6166)
  Disable some timing out ROL tests (trilinos#6124)
  Disable timing out Tempus tests on ats2 (trilinos#6009)
  fixed some broken teuchos unit tests and removed missed deprecated methods
  Promoting ats2+gnu+opt build which is 100% clean (CDOFA-27)
  removed deprecated overload of << in SerialDenseMatrix, SerialBandDenseMatrix, SerialSymDenseMatrix, and SerialDenseVector
  removed deprecated Teuchos::Comm helpers reduceAll and scan that take pointers to return arguments
  removed deprecated MPITraits class
  removed deprecated ArrayArg class
  removed deprecated LAPACK::GEBAL method that takes ilo and ihi by value
  removed deprecated LAPACK::POSVX and LAPACK::GESVX methods that take EQUED by value
  removed deprecated LAPACK::TREXC method that takes ifst and ilst by value
  removed deprecated count method in ArrayRCP, RCP, and RCPNode
  removed deprecated PerformanceMonitorBase::clearTimer methods
  Intrepid2: Temporarily disabling tests failing on some machines (Issue trilinos#6246)
  Remove misspelled RTop_HIDE_DEPRECATED_CODE (trilinos#6217)
  Disable/hide deprecated code (trilinos#6217)
  ...
@mperego
Contributor

mperego commented Feb 10, 2020

I think this can be safely closed. Thanks @bartlettroscoe for reporting the issue and @CamelliaDPG for fixing it.

@mperego mperego closed this as completed Feb 10, 2020
@bartlettroscoe
Member Author

As shown on CDash yesterday, all of these tests do seem to be passing.

Thanks
