Reduce all CUDA builds to use ctest -j4 to avoid flaky UVM out of memory failures #6052

Closed
bartlettroscoe opened this issue Oct 5, 2019 · 7 comments
Assignees
Labels
  • ATDM Config (Issues that are specific to the ATDM configuration settings)
  • ATDM DevOps (Issues that will be worked by the Coordinated ATDM DevOps teams)
  • client: ATDM (Any issue primarily impacting the ATDM project)
  • type: bug (The primary issue is a bug in Trilinos code or tests)

Comments

@bartlettroscoe
Member

As described in kokkos/kokkos#2330, tests on GPUs may randomly fail with errors like:

terminate called after throwing an instance of 'std::runtime_error'
  what():  cudaDeviceSynchronize() error( cudaErrorIllegalAddress): an illegal memory access was encountered <file-name>:<line-number>

This may be caused by the memory on the GPUs being overloaded when running tests.

It was recommended by the Kokkos developers to use a smaller parallel testing level.

I think it is appropriate to use ctest -j4, since that will allow just one 4-process test to run at a time but will still allow four 1-process tests (or two 2-process tests) to run at a time.
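A minimal sketch of the arithmetic behind this choice. It assumes that ctest counts each test's declared process count against the -j budget; the function name `max_concurrent_tests` is made up for illustration and is not part of ctest or Trilinos.

```python
def max_concurrent_tests(parallel_level: int, procs_per_test: int) -> int:
    """Return how many same-sized tests fit in a `ctest -j<parallel_level>` budget.

    Assumption: each test consumes `procs_per_test` slots of the budget.
    """
    if parallel_level < 1 or procs_per_test < 1:
        raise ValueError("both arguments must be positive")
    return parallel_level // procs_per_test


# Compare the old (-j8) and proposed (-j4) parallel levels.
for level in (8, 4):
    for procs in (1, 2, 4):
        count = max_concurrent_tests(level, procs)
        print(f"ctest -j{level}: up to {count} concurrent {procs}-process test(s)")
```

Under this assumption, -j8 lets two 4-process tests share the GPUs at once, while -j4 lets only one run at a time.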

@bartlettroscoe bartlettroscoe added type: bug The primary issue is a bug in Trilinos code or tests client: ATDM Any issue primarily impacting the ATDM project ATDM Config Issues that are specific to the ATDM configuration settings ATDM DevOps Issues that will be worked by the Coordinated ATDM DevOps teams labels Oct 5, 2019
@bartlettroscoe bartlettroscoe self-assigned this Oct 5, 2019
@bartlettroscoe
Member Author

FYI: As shown in this query, the test:

  • MueLu_DriverTpetraSingleReduceCG_MPI_4

is randomly failing in the build:

  • Trilinos-atdm-waterman_cuda-9.2_shared_opt

showing the error:

Laplace2D: MxM: A x P
Laplace2D: MxM: P' x (AP) (implicit)
terminate called after throwing an instance of 'std::runtime_error'
  what():  cudaDeviceSynchronize() error( cudaErrorIllegalAddress): an illegal memory access was encountered /home/atdm-devops-admin/jenkins/waterman/Trilinos-atdm-waterman_cuda-9.2_shared_opt/SRC_AND_BUILD/Trilinos/packages/kokkos/core/src/Cuda/Kokkos_Cuda_Instance.cpp:120
Traceback functionality not available

[waterman3:24936] *** Process received signal ***
[waterman3:24936] Signal: Aborted (6)
[waterman3:24936] Signal code:  (-6)

Therefore, this could be due to overloading memory in this build.

Need to reduce parallel level on all builds to ctest -j4 before I create new ATDM Trilinos GitHub issues for failures like this.

@bartlettroscoe
Member Author

FYI: All of the ATDM Trilinos test failures showing errors like:

terminate called after throwing an instance of 'std::runtime_error'
  what():  cudaDeviceSynchronize() error( cudaErrorIllegalAddress): an illegal memory access was encountered /home/atdm-devops-admin/jenkins/waterman/Trilinos-atdm-waterman_cuda-9.2_shared_opt/SRC_AND_BUILD/Trilinos/packages/kokkos/core/src/Cuda/Kokkos_Cuda_Instance.cpp:120
Traceback functionality not available

[waterman7:46691] *** Process received signal ***
[waterman7:46691] Signal: Aborted (6)
[waterman7:46691] Signal code:  (-6)

in the last month are shown in this query for 2019-09-08 through 2019-10-08. They affect only several 4-proc MueLu tests in the following 'waterman' builds:

  • Trilinos-atdm-waterman-cuda-9.2-debug
  • Trilinos-atdm-waterman-cuda-9.2-release-debug
  • Trilinos-atdm-waterman_cuda-9.2_shared_opt

and the single STK test STKUnit_tests_stk_ngp_test_utest_MPI_4, and that one only in the CUDA build on 'ride':

  • Trilinos-atdm-white-ride-cuda-9.2-gnu-7.2.0-release-debug-pt

(This STK test failure was reported long ago in #4551.)

If these CUDA errors are really being caused by running out of GPU memory, then why are we not seeing large Panzer tests failing as well? Is it more likely that these errors are caused by UVM memory usage problems in MueLu and STK?

In any case, I will reduce the parallel test level for these CUDA builds from ctest -j8 to ctest -j4 and see what happens. NOTE: Since all of the failing MueLu tests shown in the above query are 4-proc MPI tests, running with ctest -j4 means that only one of these 4-proc tests will be running at a time. Therefore, w.r.t. these tests, this should be equivalent to running ctest -j1.

bartlettroscoe added a commit to bartlettroscoe/Trilinos that referenced this issue Oct 9, 2019
As described in trilinos#6052, it is possible that running with ctest -j8 could cause
out-of-memory errors on the GPU and when using UVM the errors are delayed and
hard to diagnose.  Therefore, to be safe, we need to drop to ctest -j4.  That
will cause all of the 4-proc MPI tests to run by themselves and will allow
four 1-proc MPI tests to run.  Hopefully this will eliminate any possibility
of running out of memory on the GPU.

Hopefully Kitware can help us create a system where we can more effectively
use the GPUs as part of trilinos#2422.
@bartlettroscoe
Member Author

PR is #6069

bartlettroscoe added a commit that referenced this issue Oct 9, 2019
…tes (#6052, #6069)

Merge to 'atdm-nightly' to make sure this gets run in ATDM Trilinos builds
tomorrow even if it can't get merged due to Trilinos PR testing issues.
@bartlettroscoe
Member Author

FYI: I manually merged the topic branch in PR #6069 to 'atdm-nightly' in commit 2317c4d. Therefore, this will be running in ATDM Trilinos builds starting tomorrow.

trilinos-autotester added a commit that referenced this issue Oct 10, 2019
…ctest-j-4

Automatically Merged using Trilinos Pull Request AutoTester
PR Title: Reduce ctest -j8 to -j4 for all CUDA builds (#6052)
PR Author: bartlettroscoe
@bartlettroscoe
Member Author

FYI: Looks like there have been some tests failing showing errors (here) like:

terminate called after throwing an instance of 'std::runtime_error'
  what():  cudaDeviceSynchronize() error( cudaErrorMemoryAllocation): out of memory /home/atdm-devops-admin/jenkins/waterman/Trilinos-atdm-waterman_cuda-9.2_shared_opt/SRC_AND_BUILD/Trilinos/packages/kokkos/core/src/Cuda/Kokkos_Cuda_Instance.cpp:120

But looking across all ATDM Trilinos builds between 9/1/2019 and 10/10/2019 that showed this error (see this query), there were only two such failing tests:

| Build Name | Test Name | Status | Time | Details | Build Time | Processors |
|---|---|---|---|---|---|---|
| Trilinos-atdm-waterman_cuda-9.2_shared_opt | MueLu_Aggregation_MPI_4 | Failed | 6s 980ms | Completed (Failed) | 2019-09-12T03:06:13 MDT | 4 |
| Trilinos-atdm-waterman_cuda-9.2_shared_opt | MueLu_Driver_TogglePFactory-_semi_tent_line_Tpetra_MPI_4 | Failed | 29s 230ms | Completed (Failed) | 2019-09-19T03:06:57 MDT | 4 |

Therefore, this error is not common at all. This move to ctest -j4 should eliminate these since these are 4-proc MPI tests and each of these will run by themselves after this change (which is active today).
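The two error signatures quoted in this thread (cudaErrorIllegalAddress and cudaErrorMemoryAllocation) can be told apart mechanically when scanning test output. The following is only a sketch: the regular expression is written against the error lines quoted above, the function name `count_cuda_errors` is hypothetical, and the sample log reuses the quoted messages with the file path replaced by a placeholder.

```python
import re

# Matches the error line quoted in this thread and captures the CUDA error name.
CUDA_ERROR_RE = re.compile(r"cudaDeviceSynchronize\(\) error\(\s*(cudaError\w+)\s*\)")


def count_cuda_errors(log_text: str) -> dict:
    """Count occurrences of each CUDA error name found in the test output."""
    counts = {}
    for match in CUDA_ERROR_RE.finditer(log_text):
        name = match.group(1)
        counts[name] = counts.get(name, 0) + 1
    return counts


# Sample text assembled from the messages quoted in this issue (path elided).
sample_log = """\
terminate called after throwing an instance of 'std::runtime_error'
  what():  cudaDeviceSynchronize() error( cudaErrorIllegalAddress): an illegal memory access was encountered <file-name>:<line-number>
terminate called after throwing an instance of 'std::runtime_error'
  what():  cudaDeviceSynchronize() error( cudaErrorMemoryAllocation): out of memory <file-name>:<line-number>
"""

print(count_cuda_errors(sample_log))
```

Separating the counts this way makes it easier to see whether a build is hitting explicit out-of-memory errors or only the delayed illegal-address errors.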

jmgate pushed a commit to tcad-charon/Trilinos that referenced this issue Oct 11, 2019
…s:develop' (a9ace5b).

* trilinos-develop:
  Belos: treats cases where numResTests is zero
  Belos: removes a int cast warning and removes one unnecessary StatusTestCombo cast
  Belos: removes a int cast warning and removes one unnecessary StatusTestCombo cast
  Reduced from ctest -j8 to -j4 for all CUDA builds (trilinos#6052)
  Belos: solves Issue trilinos#6059
  Belos: solves Issue trilinos#6059
  Remove unused variable
  Remove commented out code.
  Address roundoff error in mesh centroid. Avoid a divide by zero error for the 1d mesh.
@bartlettroscoe bartlettroscoe changed the title Reduce all CUDA builds to use ctest -j4 to avoid out of memory due to flaky UVM Reduce all CUDA builds to use ctest -j4 to avoid flaky UVM out of memory failures Oct 14, 2019
searhein pushed a commit to searhein/Trilinos that referenced this issue Oct 15, 2019
…artition-of-unity

* 'develop' of https://github.com/searhein/Trilinos: (100 commits)
  Change initializer list order to make g++ happy
  Plumbing for BelosBlockCGSolMgr so that Assert Positive Definiteness flag gets passed to BelosCGIter
  Change Spack module from SuperLUDist 6.2.0 to 5.4.0 (CDOFA-66)
  Testing: Tpetra: make nightly build faster
  Update CrsMatrix_UnitTests4.cpp
  Tpetra: Fixes so that trilinos#6076 can pass
  MueLu: rebasing the gold files for phase 3 refactored outputs
  MueLu: refactor of Phase3 aggregation, see issue trilinos#5838
  KokkosKernels - trsm transpose does not solve the problem correctly.
  Belos: treats cases where numResTests is zero
  Belos: removes a int cast warning and removes one unnecessary StatusTestCombo cast
  Belos: removes a int cast warning and removes one unnecessary StatusTestCombo cast
  Reduced from ctest -j8 to -j4 for all CUDA builds (trilinos#6052)
  Belos: solves Issue trilinos#6059
  Belos: solves Issue trilinos#6059
  ML: Code cleanup to RefMaxwell
  ML: Removing supperfluous import constructor
  ML: Adding new support routine
  Xpetra: disable Map cloner test if no Tpetra deprecated
  Phalanx - wrong number of kokkos allocation arguments.
  ...
bartlettroscoe added a commit to bartlettroscoe/Trilinos that referenced this issue Nov 22, 2019
To be safe, and to match with all of the CUDA builds, I reduced the default
parallel level for ctest to ctest -j4.  This should avoid problems with GPU
overloading.
trilinos-autotester added a commit that referenced this issue Nov 22, 2019
…ctest-j-4-doc

Automatically Merged using Trilinos Pull Request AutoTester
PR Title: Reduce to ctest -j4 in all documentation (#6052)
PR Author: bartlettroscoe
jmgate pushed a commit to tcad-charon/Trilinos that referenced this issue Nov 23, 2019
…s:develop' (1683f23).

* trilinos-develop: (21 commits)
  MueLu: fixing the count of aggregated nodes in refactored Phase2a of aggregation, see trilinos#6325
  Teuchos StackedTimer: add unit test for 'proc_minmax'
  Teuchos StackedTimer: Print min time over all ranks that are active only
  Teuchos StackedTimer: add option to print rank with min/max time
  SEACAS: Bug fix since snapshot
  Reduce to ctest -j4 in all documentation (trilinos#6052)
  MueLu RefMaxwell: Print more matrix stats
  MueLu: gold file rebase and change of logic for issue trilinos#6269
  MueLu: refactor of Dirichlet conditions handling and changes in UncoupledAggregation see issue trilinos#6269
  Intrepid2: tweaks to OrientationTools::modifyBasisByOrientation() to allow reference-space inputs.
  Tempus: Add Cleanup of Error Tolerances.
  SEACAS: kluge to get new lib::fmt maybe working with nvcc
  Intrepid2: fixing issues with hierarchical parallelism policies, revealed when Trilinos is built with KOKKOS_ENABLE_DEPRECATED_CODE=OFF. (trilinos#6310)
  SEACAS: Another try at fixing nvcc build
  MueLu: fix type handling in regionMG unit test
  MueLu: remove debug output from regionMG unit test
  SEACAS: Attempt to fix CUDA compile errors
  Attempt to fix NVCC / INTEL compiler errors
  MueLu: fix misleading comments in regionMG unit tests
  Automatic snapshot commit from seacas at a34490f
  ...
searhein pushed a commit to searhein/Trilinos that referenced this issue Nov 25, 2019
…ne-to-one

* 'develop' of https://github.com/trilinos/Trilinos: (38 commits)
  MueLu RefMaxwell: Fix bug in ParameterList interface
  tpetra:  converted constant to scalar_type; needed to enable compilation when scalar_type = std::complex<float>  (as noted in trilinos#6314)
  tpetra:  loosened tolerance that was too tight when scalar=float
  MueLu PerfUtils: Show ranks with min/max
  MueLu RefMaxwell: More fine grained control over sub-solve placement
  MueLu: fixing the count of aggregated nodes in refactored Phase2a of aggregation, see trilinos#6325
  Teuchos StackedTimer: add unit test for 'proc_minmax'
  Teuchos StackedTimer: Print min time over all ranks that are active only
  Teuchos StackedTimer: add option to print rank with min/max time
  SEACAS: Bug fix since snapshot
  Reduce to ctest -j4 in all documentation (trilinos#6052)
  MueLu RefMaxwell: Print more matrix stats
  MueLu: gold file rebase and change of logic for issue trilinos#6269
  MueLu: refactor of Dirichlet conditions handling and changes in UncoupledAggregation see issue trilinos#6269
  Intrepid2: tweaks to OrientationTools::modifyBasisByOrientation() to allow reference-space inputs.
  Tempus: Add Cleanup of Error Tolerances.
  SEACAS: kluge to get new lib::fmt maybe working with nvcc
  Intrepid2: fixing issues with hierarchical parallelism policies, revealed when Trilinos is built with KOKKOS_ENABLE_DEPRECATED_CODE=OFF. (trilinos#6310)
  SEACAS: Another try at fixing nvcc build
  MueLu: fix type handling in regionMG unit test
  ...
@bartlettroscoe
Member Author

Looking at the wall-clock times for some of the CUDA builds, the time spent running the tests did not go up by all that much from before 2019-11-22 to after 2019-11-22 (when PR #6632 was merged). To the naked eye it looks like it might have gone up 10% or 20%, but not more than that. We would have to compute the means, but that is hard to do because new tests are being added and expanded, so it is hard to make a fair comparison.

Therefore, my initial observation is that turning down the parallel testing level to ctest -j4 does not appear to have slowed down the test suite by much for most builds. If this helps to improve the robustness of the test suite when running on the GPU, then it is a win-win situation.
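For reference, the mean comparison mentioned above could be sketched as follows. This is only an illustration: the function name `relative_change_in_mean` is hypothetical, the sample numbers are made-up placeholders rather than measured build times, and the caveat above still applies (a changing set of tests makes the comparison unfair).

```python
from datetime import date
from statistics import mean


def relative_change_in_mean(samples, cutoff):
    """Relative change in mean wall-clock time from before `cutoff` to on/after it.

    `samples` is an iterable of (build_date, test_seconds) pairs.
    """
    before = [secs for day, secs in samples if day < cutoff]
    after = [secs for day, secs in samples if day >= cutoff]
    if not before or not after:
        raise ValueError("need samples on both sides of the cutoff")
    return mean(after) / mean(before) - 1.0


# Placeholder values for illustration only; NOT real build times.
placeholder_samples = [
    (date(2019, 11, 20), 1000.0),
    (date(2019, 11, 21), 1100.0),
    (date(2019, 11, 23), 1200.0),
    (date(2019, 11, 24), 1215.0),
]

change = relative_change_in_mean(placeholder_samples, date(2019, 11, 22))
print(f"relative change in mean test time: {change:+.0%}")
```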

Closing as complete.

@bartlettroscoe
Member Author

Closing as complete for real :-)
