Reduce all CUDA builds to use ctest -j4 to avoid flaky UVM out of memory failures #6052
Comments
FYI: As shown in this query, the test:
is randomly failing in the build:
showing the error:
Therefore, this could be due to overloading memory in this build. Need to reduce parallel level on all builds to |
FYI: All of the ATDM Trilinos test failures showing errors like:
in the last month are shown in this query for 2019-09-08 through 2019-10-08 which only impacts several 4-proc MueLu tests in the following 'waterman' builds:
and the single STK test
(This STK test failure was reported long ago in #4551.) If these CUDA errors are really being caused by running out of GPU memory, then why are we not seeing large Panzer tests failing as well? Is it more likely that these errors are caused by UVM memory usage problems in MueLu and STK? In any case, I will reduce the parallel test level for these CUDA builds from |
As described in trilinos#6052, it is possible that running with ctest -j8 could cause out-of-memory errors on the GPU, and when using UVM the errors are delayed and hard to diagnose. Therefore, to be safe, we need to drop to ctest -j4. That will cause all of the 4-proc MPI tests to run by themselves and will allow four 1-proc MPI tests to run at the same time. Hopefully this will eliminate any possibility of running out of memory on the GPU. Hopefully Kitware can help us create a system where we can more effectively use the GPUs as part of trilinos#2422.
PR is #6069 |
…ctest-j-4 Automatically Merged using Trilinos Pull Request AutoTester PR Title: Reduce ctest -j8 to -j4 for all CUDA builds (#6052) PR Author: bartlettroscoe
FYI: Looks like there have been some tests failing showing errors (here) like:
But looking across all ATDM Trilinos builds between 9/1/2019 and 10/10/2019 that showed this error in this query there were only two such failing tests:
Therefore, this error is not common at all. This move to |
…s:develop' (a9ace5b). * trilinos-develop: Belos: treats cases where numResTests is zero Belos: removes a int cast warning and removes one unnecessary StatusTestCombo cast Belos: removes a int cast warning and removes one unnecessary StatusTestCombo cast Reduced from ctest -j8 to -j4 for all CUDA builds (trilinos#6052) Belos: solves Issue trilinos#6059 Belos: solves Issue trilinos#6059 Remove unused variable Remove commented out code. Address roundoff error in mesh centroid. Avoid a divide by zero error for the 1d mesh.
…artition-of-unity * 'develop' of https://github.com/searhein/Trilinos: (100 commits) Change initializer list order to make g++ happy Plumbing for BelosBlockCGSolMgr so that Assert Positive Definiteness flag gets passed to BelosCGIter Change Spack module from SuperLUDist 6.2.0 to 5.4.0 (CDOFA-66) Testing: Tpetra: make nightly build faster Update CrsMatrix_UnitTests4.cpp Tpetra: Fixes so that trilinos#6076 can pass MueLu: rebasing the gold files for phase 3 refactored outputs MueLu: refactor of Phase3 aggregation, see issue trilinos#5838 KokkosKernels - trsm transpose does not solve the problem correctly. Belos: treats cases where numResTests is zero Belos: removes a int cast warning and removes one unnecessary StatusTestCombo cast Belos: removes a int cast warning and removes one unnecessary StatusTestCombo cast Reduced from ctest -j8 to -j4 for all CUDA builds (trilinos#6052) Belos: solves Issue trilinos#6059 Belos: solves Issue trilinos#6059 ML: Code cleanup to RefMaxwell ML: Removing supperfluous import constructor ML: Adding new support routine Xpetra: disable Map cloner test if no Tpetra deprecated Phalanx - wrong number of kokkos allocation arguments. ...
To be safe, and to match with all of the CUDA builds, I reduced the default parallel level for ctest to ctest -j4. This should avoid problems with GPU overloading.
…ctest-j-4-doc Automatically Merged using Trilinos Pull Request AutoTester PR Title: Reduce to ctest -j4 in all documentation (#6052) PR Author: bartlettroscoe
…s:develop' (1683f23). * trilinos-develop: (21 commits) MueLu: fixing the count of aggregated nodes in refactored Phase2a of aggregation, see trilinos#6325 Teuchos StackedTimer: add unit test for 'proc_minmax' Teuchos StackedTimer: Print min time over all ranks that are active only Teuchos StackedTimer: add option to print rank with min/max time SEACAS: Bug fix since snapshot Reduce to ctest -j4 in all documentation (trilinos#6052) MueLu RefMaxwell: Print more matrix stats MueLu: gold file rebase and change of logic for issue trilinos#6269 MueLu: refactor of Dirichlet conditions handling and changes in UncoupledAggregation see issue trilinos#6269 Intrepid2: tweaks to OrientationTools::modifyBasisByOrientation() to allow reference-space inputs. Tempus: Add Cleanup of Error Tolerances. SEACAS: kluge to get new lib::fmt maybe working with nvcc Intrepid2: fixing issues with hierarchical parallelism policies, revealed when Trilinos is built with KOKKOS_ENABLE_DEPRECATED_CODE=OFF. (trilinos#6310) SEACAS: Another try at fixing nvcc build MueLu: fix type handling in regionMG unit test MueLu: remove debug output from regionMG unit test SEACAS: Attempt to fix CUDA compile errors Attempt to fix NVCC / INTEL compiler errors MueLu: fix misleading comments in regionMG unit tests Automatic snapshot commit from seacas at a34490f ...
…ne-to-one * 'develop' of https://github.com/trilinos/Trilinos: (38 commits) MueLu RefMaxwell: Fix bug in ParameterList interface tpetra: converted constant to scalar_type; needed to enable compilation when scalar_type = std::complex<float> (as noted in trilinos#6314) tpetra: loosened tolerance that was too tight when scalar=float MueLu PerfUtils: Show ranks with min/max MueLu RefMaxwell: More fine grained control over sub-solve placement MueLu: fixing the count of aggregated nodes in refactored Phase2a of aggregation, see trilinos#6325 Teuchos StackedTimer: add unit test for 'proc_minmax' Teuchos StackedTimer: Print min time over all ranks that are active only Teuchos StackedTimer: add option to print rank with min/max time SEACAS: Bug fix since snapshot Reduce to ctest -j4 in all documentation (trilinos#6052) MueLu RefMaxwell: Print more matrix stats MueLu: gold file rebase and change of logic for issue trilinos#6269 MueLu: refactor of Dirichlet conditions handling and changes in UncoupledAggregation see issue trilinos#6269 Intrepid2: tweaks to OrientationTools::modifyBasisByOrientation() to allow reference-space inputs. Tempus: Add Cleanup of Error Tolerances. SEACAS: kluge to get new lib::fmt maybe working with nvcc Intrepid2: fixing issues with hierarchical parallelism policies, revealed when Trilinos is built with KOKKOS_ENABLE_DEPRECATED_CODE=OFF. (trilinos#6310) SEACAS: Another try at fixing nvcc build MueLu: fix type handling in regionMG unit test ...
Looking at the wall-clock times for some of the CUDA builds like for:
the wall-clock times for running the tests for these builds did not go up by all that much from before 2019-11-22 to after 2019-11-22 (when PR #6632 was merged). To the naked eye it looks like they might have gone up 10% or 20%, but not more than that. We would have to compute the means, but that is hard to do because new tests are being added and expanded, so it is hard to make a fair comparison. Therefore, my initial observation is that turning down the parallel testing level to Closing as complete. |
Closing as complete for real :-) |
As described in kokkos/kokkos#2330 tests on GPUs may randomly fail with errors like:
This may be caused by the memory on the GPUs being overloaded when running tests.
It was recommended by the Kokkos developers to use a smaller parallel testing level.
I think it is appropriate to use
ctest -j4
since that will allow just one 4-process test to run at a time but will allow four 1-process tests (or two 2-process tests) to run at a time.
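The arithmetic behind that choice can be sketched in a few lines of Python. This is only an illustration, not how ctest is implemented: it assumes each test declares its MPI process count (for example through CTest's PROCESSORS test property) and that ctest counts those processes against the -j level. The function name `max_concurrent` is hypothetical.

```python
def max_concurrent(parallel_level: int, procs_per_test: int) -> int:
    """Number of equally sized tests that fit under a given -j level.

    Hypothetical helper for illustration: it assumes a test that
    declares N processes consumes N slots of the parallel level.
    """
    if parallel_level <= 0 or procs_per_test <= 0:
        raise ValueError("parallel_level and procs_per_test must be positive")
    return parallel_level // procs_per_test


if __name__ == "__main__":
    # Compare the old level (-j8) with the new one (-j4).
    for level in (8, 4):
        for procs in (1, 2, 4):
            count = max_concurrent(level, procs)
            print(f"ctest -j{level}: {count} concurrent {procs}-process test(s)")
```

Under these assumptions, -j4 caps the GPU at one 4-process test at a time, where -j8 would have let two of them share it.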