Reduce all CUDA builds to use ctest -j4 to avoid flaky UVM out of memory failures #6052

bartlettroscoe · 2019-10-05T20:32:00Z

As described in kokkos/kokkos#2330, tests on GPUs may randomly fail with errors like:

terminate called after throwing an instance of 'std::runtime_error'
  what():  cudaDeviceSynchronize() error( cudaErrorIllegalAddress): an illegal memory access was encountered <file-name>:<line-number>

This may be caused by the memory on the GPUs being overloaded when running tests.

It was recommended by the Kokkos developers to use a smaller parallel testing level.

I think it is appropriate to use ctest -j4, since that allows just one 4-process test to run at a time but still allows four 1-process tests (or two 2-process tests) to run concurrently.
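The arithmetic behind that choice can be sketched as follows. This is a simplified model, assuming each test declares its process count (as with CTest's PROCESSORS test property) and that ctest keeps the combined process count of running tests at or below the -j level. The function name is hypothetical and not part of any Trilinos or CTest API.

```python
def max_concurrent_tests(parallel_level: int, procs_per_test: int) -> int:
    """Number of identical tests that fit under a given `ctest -j<N>` level.

    Simplified model (an assumption): a test occupies `procs_per_test`
    slots out of `parallel_level` for as long as it runs.
    """
    if parallel_level <= 0 or procs_per_test <= 0:
        raise ValueError("both arguments must be positive")
    return parallel_level // procs_per_test


for procs in (1, 2, 4):
    count = max_concurrent_tests(4, procs)
    print(f"ctest -j4, {procs}-process tests: {count} at a time")
```

Under the same model, ctest -j8 would let two 4-process tests share the GPUs at once, which is the situation this change is meant to avoid.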


bartlettroscoe · 2019-10-08T16:49:39Z

FYI: As shown in this query, the test:

MueLu_DriverTpetraSingleReduceCG_MPI_4

is randomly failing in the build:

Trilinos-atdm-waterman_cuda-9.2_shared_opt

showing the error:

Laplace2D: MxM: A x P
Laplace2D: MxM: P' x (AP) (implicit)
terminate called after throwing an instance of 'std::runtime_error'
  what():  cudaDeviceSynchronize() error( cudaErrorIllegalAddress): an illegal memory access was encountered /home/atdm-devops-admin/jenkins/waterman/Trilinos-atdm-waterman_cuda-9.2_shared_opt/SRC_AND_BUILD/Trilinos/packages/kokkos/core/src/Cuda/Kokkos_Cuda_Instance.cpp:120
Traceback functionality not available

[waterman3:24936] *** Process received signal ***
[waterman3:24936] Signal: Aborted (6)
[waterman3:24936] Signal code:  (-6)

Therefore, this could be due to overloading memory in this build.

Need to reduce parallel level on all builds to ctest -j4 before I create new ATDM Trilinos GitHub issues for failures like this.

bartlettroscoe · 2019-10-08T19:55:12Z

FYI: All of the ATDM Trilinos test failures showing errors like:

terminate called after throwing an instance of 'std::runtime_error'
  what():  cudaDeviceSynchronize() error( cudaErrorIllegalAddress): an illegal memory access was encountered /home/atdm-devops-admin/jenkins/waterman/Trilinos-atdm-waterman_cuda-9.2_shared_opt/SRC_AND_BUILD/Trilinos/packages/kokkos/core/src/Cuda/Kokkos_Cuda_Instance.cpp:120
Traceback functionality not available

[waterman7:46691] *** Process received signal ***
[waterman7:46691] Signal: Aborted (6)
[waterman7:46691] Signal code:  (-6)

in the last month are shown in this query for 2019-09-08 through 2019-10-08. They impact only several 4-proc MueLu tests in the following 'waterman' builds:

Trilinos-atdm-waterman-cuda-9.2-debug
Trilinos-atdm-waterman-cuda-9.2-release-debug
Trilinos-atdm-waterman_cuda-9.2_shared_opt

and the single STK test STKUnit_tests_stk_ngp_test_utest_MPI_4 and only in CUDA builds on 'ride':

Trilinos-atdm-white-ride-cuda-9.2-gnu-7.2.0-release-debug-pt

(This STK test failure was reported long ago in #4551.)

If these CUDA errors are really being caused by running out of GPU memory, then why are we not seeing large Panzer tests failing as well? Is it more likely that these errors are caused by UVM memory usage problems in MueLu and STK?

In any case, I will reduce the parallel test level for these CUDA builds from ctest -j8 to ctest -j4 and see what happens. NOTE: Since all of the failing MueLu tests shown in the above query are 4-proc MPI tests, running with ctest -j4 means that only one of these 4-proc tests will be running at a time. Therefore, w.r.t. these tests, this should be equivalent to running ctest -j1.
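To illustrate the note above, here is a small simulation of how many tests overlap at each parallel level. It is a greedy, in-order model written for this comment and is not CTest's actual scheduling algorithm. The function name and the test durations are made up for illustration.

```python
import heapq


def peak_concurrency(parallel_level, tests):
    """Peak number of tests running at once under a greedy in-order model.

    `tests` is a list of (procs, duration_seconds) pairs. Each test starts
    as soon as enough of the `parallel_level` slots are free.
    """
    running = []  # min-heap of (finish_time, procs) for tests in flight
    now = 0.0
    free = parallel_level
    peak = 0
    for procs, duration in tests:
        need = min(procs, parallel_level)  # an oversized test runs alone
        while free < need:
            # Advance time to the next finishing test and reclaim its slots.
            now, released = heapq.heappop(running)
            free += released
            # Also reclaim every other test that finishes at the same time.
            while running and running[0][0] <= now:
                free += heapq.heappop(running)[1]
        heapq.heappush(running, (now + duration, need))
        free -= need
        peak = max(peak, len(running))
    return peak


# Three hypothetical 4-proc MPI tests of 10 seconds each.
four_proc_tests = [(4, 10.0)] * 3
print("ctest -j8 peak concurrent tests:", peak_concurrency(8, four_proc_tests))
print("ctest -j4 peak concurrent tests:", peak_concurrency(4, four_proc_tests))
```

In this model, the 4-proc tests overlap in pairs at -j8 and run strictly one at a time at -j4, which matches the "equivalent to ctest -j1" observation for these tests.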

As described in trilinos#6052, it is possible that running with ctest -j8 could cause out-of-memory errors on the GPU, and when using UVM the errors are delayed and hard to diagnose. Therefore, to be safe, we need to drop to ctest -j4. That will cause all of the 4-proc MPI tests to run by themselves and will allow four 1-proc MPI tests to run. Hopefully this will eliminate any possibility of running out of memory on the GPU. Hopefully Kitware can help us create a system where we can more effectively use the GPUs as part of trilinos#2422.

bartlettroscoe · 2019-10-09T16:50:38Z

PR is #6069

…tes (#6052, #6069) Merge to 'atdm-nightly' to make sure this gets run in ATDM Trilinos builds tomorrow even if it can't get merged due to Trilinos PR testing issues.

bartlettroscoe · 2019-10-09T18:09:11Z

FYI: I manually merged the topic branch in PR #6069 to 'atdm-nightly' in commit 2317c4d. Therefore, this will be running in ATDM Trilinos builds starting tomorrow.

…ctest-j-4 Automatically Merged using Trilinos Pull Request AutoTester PR Title: Reduce ctest -j8 to -j4 for all CUDA builds (#6052) PR Author: bartlettroscoe

bartlettroscoe · 2019-10-10T14:33:18Z

FYI: Looks like there have been some tests failing showing errors (here) like:

terminate called after throwing an instance of 'std::runtime_error'
  what():  cudaDeviceSynchronize() error( cudaErrorMemoryAllocation): out of memory /home/atdm-devops-admin/jenkins/waterman/Trilinos-atdm-waterman_cuda-9.2_shared_opt/SRC_AND_BUILD/Trilinos/packages/kokkos/core/src/Cuda/Kokkos_Cuda_Instance.cpp:120

But looking across all ATDM Trilinos builds between 9/1/2019 and 10/10/2019 that showed this error in this query there were only two such failing tests:

Build Name	Test Name	Status	Time	Details	Build Time	Processors
Trilinos-atdm-waterman_cuda-9.2_shared_opt	MueLu_Aggregation_MPI_4	Failed	6s 980ms	Completed (Failed)	2019-09-12T03:06:13 MDT	4
Trilinos-atdm-waterman_cuda-9.2_shared_opt	MueLu_Driver_TogglePFactory-_semi_tent_line_Tpetra_MPI_4	Failed	29s 230ms	Completed (Failed)	2019-09-19T03:06:57 MDT	4

Therefore, this error is not common at all. This move to ctest -j4 should eliminate these since these are 4-proc MPI tests and each of these will run by themselves after this change (which is active today).


To be safe, and to match with all of the CUDA builds, I reduced the default parallel level for ctest to ctest -j4. This should avoid problems with GPU overloading.

…ctest-j-4-doc Automatically Merged using Trilinos Pull Request AutoTester PR Title: Reduce to ctest -j4 in all documentation (#6052) PR Author: bartlettroscoe


bartlettroscoe · 2020-01-03T20:11:26Z

Looking at the wall-clock times for running the tests in some of the CUDA builds, they did not go up by all that much from before 2019-11-22 to after 2019-11-22 (when PR #6332 was merged). To the naked eye, it looks like they might have gone up 10% or 20%, but not more than that. We would have to compute the means, but that is hard to do because new tests are being added and expanded, so it is hard to make a fair comparison.

Therefore, my initial observation is that turning down the parallel testing level to ctest -j4 did not slow down the test suite by much for most builds. Therefore, if this helps to improve the robustness of the test suite when running on the GPU, then this is a win-win situation.
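For anyone who wants to go beyond eyeballing, a before/after comparison of mean test wall-clock time could be sketched like this. The function name is hypothetical and the sample numbers are placeholders, NOT measurements from any ATDM Trilinos build. It also does not address the fairness problem of tests being added or expanded over time.

```python
from datetime import date
from statistics import mean


def percent_change_in_mean(records, cutoff):
    """Percent change in mean wall-clock time across a cutoff date.

    `records` is a list of (build_date, test_wall_clock_seconds) pairs.
    Builds dated on or after `cutoff` count as "after".
    """
    before = [secs for day, secs in records if day < cutoff]
    after = [secs for day, secs in records if day >= cutoff]
    if not before or not after:
        raise ValueError("need samples on both sides of the cutoff")
    return 100.0 * (mean(after) - mean(before)) / mean(before)


# Placeholder values for illustration only (not real build data).
samples = [
    (date(2019, 11, 20), 100.0),
    (date(2019, 11, 21), 100.0),
    (date(2019, 11, 22), 110.0),
    (date(2019, 11, 23), 120.0),
]
change = percent_change_in_mean(samples, date(2019, 11, 22))
print(f"mean wall-clock change: {change:+.1f}%")
```

A fairer comparison would restrict both sides to the set of tests that exist unchanged before and after the cutoff.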

Closing as complete.

bartlettroscoe · 2020-01-03T20:15:07Z

Closing as complete for real :-)

bartlettroscoe self-assigned this Oct 5, 2019

bartlettroscoe mentioned this issue Oct 9, 2019

Reduce ctest -j8 to -j4 for all CUDA builds (#6052) #6069

Merged

bartlettroscoe changed the title ~~Reduce all CUDA builds to use ctest -j4 to avoid out of memory due to flaky UVM~~ Reduce all CUDA builds to use ctest -j4 to avoid flaky UVM out of memory failures Oct 14, 2019

bartlettroscoe mentioned this issue Nov 22, 2019

Reduce to ctest -j4 in all documentation (#6052) #6332

Merged

bartlettroscoe closed this as completed Jan 3, 2020
