Intrepid2 tests failing on several ATDM Trilinos builds on 'mutrino' (HSW and KNL), 'ride-cuda', 'sems-rhel7', 'serrano', and 'waterman-cuda', 'cts1' starting 2019-11-07 #6246
Comments
FYI: These tests are also failing on 'serrano'. |
Thanks, @bartlettroscoe! |
FYI: PR #6425 disabled these Intrepid2 tests for the build:
|
FYI: Looks like the test:
in the build:
is randomly hanging and timing out. When it passes, it passes quickly (7s). Otherwise, it is failing. It looks like it is hanging about every other day. |
FYI: As shown in this query, we are still seeing these three tests frequently failing on some platforms. Therefore, this issue is not resolved yet. |
FYI: I updated the links in the section "Current Status on CDash" above. |
@bartlettroscoe Thanks for letting us know. I assigned myself this issue and wrote PR #6248 to fix, and probably I remain the right person to address this, but I won't get to it immediately. I am making a note to look into this early in the new year. |
@CamelliaDPG thanks! |
As shown in this query, the test:
is also failing in the new builds:
in a way similar to that shown above, with some diffs, but in this case the unit tests that failed were:
|
As shown in this query, this query, and this query, the test:
is failing in the new build:
showing the failing unit test:
which appears to show diffs:
Therefore, this looks to be a test failing due to diffs, as reported for the other builds above. NOTE: I can't see any more because when you try to look at the detailed test output like here, my browser just freezes. But you can get the raw test data here and it appears to be a massive amount of data. This must be a defect in CTest or something :-( |
Closed by #6594. |
@CamelliaDPG, this query still shows the following failing tests:
Reopening. |
@CamelliaDPG can you please look into the failing hierarchical tests? |
…s:develop' (2bfd2c7). * trilinos-develop: (129 commits) Intrepid2: Temporarily disabling tests failing on some machines (Issue trilinos#6246) ATDM: cee-rhel6: Change from openmpi-4.0.1 to openmpi-4.0.2 (ATDV-289) Add test for periodic parallel fpp decomposition IOSS: cgns - fix handling of periodic single-block models in parallel Tpetra: fixing small typo Moertel: fix warnings due to signed/unsigned comparison, see issue trilinos#6698 Incoprporating changes suggested in PR Tpetra: adding benchmark for CrsMatrix::apply, see issue trilinos#6692 Add clang-7.0.1 explicitly in parsing (ATDV-291) Tpetra::CrsGraph: Fix build errors relating to unique_ptr MueLu ParameterListInterpreter test: Remove and regenerate xml Tpetra::CrsGraph::makeIndicesLocal: Add verbose output Tpetra::CrsMatrix::globalAssemble: Add verbose debugging output Tpetra::CrsMatrix::fillLocalMatrix: Add verbose output Tpetra::CrsMatrix::fillLocalGraphAndMatrix: Add verbose output Tpetra::CrsMatrix: Add more verbose output on allocation Tpetra: Add verbose debugging output to padCrsArrays PyTrilinos: Update swig interface files for SWIG 4.0 PyTrilinos: Update build system for SWIG 4.0 PyTrilinos: Replace SWIGEMPTYHACK with PYTRILINOS_NULLSTR ...
…s:develop' (2bfd2c7). * trilinos-develop: (130 commits) Intrepid2: Temporarily disabling tests failing on some machines (Issue trilinos#6246) Tempus: Add Improved Doc for DIRK ATDM: cee-rhel6: Change from openmpi-4.0.1 to openmpi-4.0.2 (ATDV-289) Add test for periodic parallel fpp decomposition IOSS: cgns - fix handling of periodic single-block models in parallel Tpetra: fixing small typo Moertel: fix warnings due to signed/unsigned comparison, see issue trilinos#6698 Incoprporating changes suggested in PR Tpetra: adding benchmark for CrsMatrix::apply, see issue trilinos#6692 Add clang-7.0.1 explicitly in parsing (ATDV-291) Tpetra::CrsGraph: Fix build errors relating to unique_ptr MueLu ParameterListInterpreter test: Remove and regenerate xml Tpetra::CrsGraph::makeIndicesLocal: Add verbose output Tpetra::CrsMatrix::globalAssemble: Add verbose debugging output Tpetra::CrsMatrix::fillLocalMatrix: Add verbose output Tpetra::CrsMatrix::fillLocalGraphAndMatrix: Add verbose output Tpetra::CrsMatrix: Add more verbose output on allocation Tpetra: Add verbose debugging output to padCrsArrays PyTrilinos: Update swig interface files for SWIG 4.0 PyTrilinos: Update build system for SWIG 4.0 ...
@mperego, yes, I'll take a look. |
@mperego, the Hierarchical Basis tests look like they only failed once recently on white, and not on any other testbeds, and when they did (2020-01-25 07:59:09), every Intrepid2 test failed with errors like the following:
This suggests a problem with the execution environment, likely a transient problem, not an issue with the tests. On the other hand, the |
…s:develop' (2bfd2c7). * trilinos-develop: (177 commits) Add a fix for a stk cmake file Promote atdm ats2 gnu+dbg and cuda+gnu+dbg to 'Specialized' (CDOFA-72) Intrepid2: remove unnecessary finalize calls in unit tests Disable STEQR() LAPACK test on ats2 deug builds (trilinos#2410, trilinos#6166) Disable some timing out ROL tests (trilinos#6124) Disable timing out Tempus tests on ats2 (trilinos#6009) fixed some broken teuchos unit tests and removed missed deprecated methods Promoting ats2+gnu+opt build which is 100% clean (CDOFA-27) removed deprecated overload of << in SerialDenseMatrix, SerialBandDenseMatrix, SerialSymDenseMatrix, and SerialDenseVector removed deprecated Teuchos::Comm helpers reduceAll and scan that take pointers to return arguments removed deprecated MPITraits class removed deprecated ArrayArg class removed deprecated LAPACK::GEBAL method that takes ilo and ihi by value removed deprecated LAPACK::POSVX and LAPACK::GESVX methods that take EQUED by value removed deprecated LAPACK::TREXC method that takes ifst and ilst by value removed deprecated count method in ArrayRCP, RCP, and RCPNode removed deprecated PerformanceMonitorBase::clearTimer methods Intrepid2: Temporarily disabling tests failing on some machines (Issue trilinos#6246) Remove misspelled RTop_HIDE_DEPRECATED_CODE (trilinos#6217) Disable/hide deprecated code (trilinos#6217) ...
I think this can be safely closed. Thanks @bartlettroscoe for reporting the issue and @CamelliaDPG for fixing it. |
As shown on CDash yesterday, all of these tests do seem to be passing. Thanks |
CC: @trilinos/intrepid2, @mperego (Trilinos Discretizations Product Lead), @bartlettroscoe, @fryeguy52
Next Action Status
Description
As shown in this query the tests:
Intrepid2_unit-test_Discretization_Basis_HierarchicalBases_Hierarchical_Basis_Tests_MPI_1
Intrepid2_unit-test_Discretization_Basis_HVOL_HEX_Cn_FEM_Serial_Test_01_SLFadDFad_MPI_1
Intrepid2_unit-test_Shared_Polylib_LegendreJacobiPolynomials_JacobiLegendrePolynomial_Tests_MPI_1
are failing in the builds:
Trilinos-atdm-mutrino-intel-opt-openmp-HSW
Trilinos-atdm-mutrino-intel-opt-openmp-KNL
Trilinos-atdm-sems-rhel7-cuda-9.2-Volta70-complex-shared-release-debug
Trilinos-atdm-serrano-intel-debug-openmp
Trilinos-atdm-serrano-intel-opt-openmp
Trilinos-atdm-waterman-cuda-9.2-debug
Trilinos-atdm-white-ride-cuda-10.1-gnu-7.2.0-release-debug
Trilinos-atdm-white-ride-cuda-9.2-gnu-7.2.0-debug
starting 2019-11-07, where the list of failing tests and builds was:
As shown in this query, most of the test failures appear to be due to diffs failing to meet tolerances. The failing unit tests are:
Intrepid2_unit-test_Discretization_Basis_HierarchicalBases_Hierarchical_Basis_Tests_MPI_1:
Intrepid2_unit-test_Shared_Polylib_LegendreJacobiPolynomials_JacobiLegendrePolynomial_Tests_MPI_1:
It looks like there are some pretty big diffs, as with the unit test
AnalyticPolynomialsMatch_double_double_Hierarchical_HGRAD_LINE_UnitTest
shown here, showing:
And then there is some test output where it is hard to see what is failing, for the unit test
IntegratedLegendre_TwoPathsMatch_UnitTest
as shown here, showing:
(I can't see why this failed because it looks like all of the diffs are smaller than the tols.)
And then the test
Intrepid2_unit-test_Discretization_Basis_HVOL_HEX_Cn_FEM_Serial_Test_01_SLFadDFad_MPI_1
failed in the cuda-10.1 build as shown here with the error output:
Looking at the new commits pulled on 2019-11-07 here, it seems pretty likely that the commit 5d2a1c6:
triggered these failures.
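For anyone reading along who is less familiar with diff-versus-tolerance checks, here is a minimal sketch of a combined relative/absolute comparison. The function name approxEqual and its tolerance parameters are hypothetical and are not taken from the Intrepid2 or Teuchos test code. This is only an illustration of the general technique, not a diagnosis of the failures above.

```cpp
#include <algorithm>
#include <cmath>

// Hypothetical helper (NOT the Intrepid2/Teuchos implementation).
// Returns true when a and b agree to within absTol plus relTol scaled
// by the larger of the two magnitudes.
inline bool approxEqual(double a, double b, double relTol, double absTol)
{
  const double diff  = std::abs(a - b);
  const double scale = std::max(std::abs(a), std::abs(b));
  // Any comparison involving NaN evaluates to false, so a NaN input is
  // reported as a mismatch rather than silently passing.
  return diff <= absTol + relTol * scale;
}
```

One general point the sketch makes: a check like this can report a mismatch that is not obvious from printed output, for example when a value is NaN or when the printed diff and tol are rounded to fewer digits than the comparison actually uses. Whether either applies to these tests is not established here.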
Current Status on CDash
Failures in these three tests in all ATDM Trilinos builds over last 30 days
Status of these tests in all ATDM Trilinos builds for current testing day
Steps to Reproduce
One should be able to reproduce these failures on any of the machines listed above, as described in:
More specifically, the commands given for the system are provided at:
The exact commands to reproduce this issue on any of these machines should be:
where you just fill in the <build-name> and the <command-to-run-on-compute-node> for the given build and machine as described in: