Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add RUN_SERIAL test property to many ATDM Trilinos tests on CUDA builds to address timeouts #7112

Closed
bartlettroscoe opened this issue Apr 2, 2020 · 7 comments
Assignees
Labels
ATDM Config Issues that are specific to the ATDM configuration settings ATDM DevOps Issues that will be worked by the Coordinated ATDM DevOps teams ATDM Sev: Nonblocker Problems with Trilinos that should not block ATDM APPs from getting updates client: ATDM Any issue primarily impacting the ATDM project CLOSED_DUE_TO_INACTIVITY Issue or PR has been closed by the GitHub Actions bot due to inactivity. MARKED_FOR_CLOSURE Issue or PR is marked for auto-closure by the GitHub Actions bot. type: bug The primary issue is a bug in Trilinos code or tests type: enhancement Issue is an enhancement, not a bug

Comments

@bartlettroscoe
Copy link
Member

bartlettroscoe commented Apr 2, 2020

I need to add option <fullTestName>_SET_RUN_SERIAL=[ON|OFF] to TRIBITS_ADD[ADVANCED_]TEST() to add the CTest RUN_SERIAL test property and use to mark many CUDA tests in Trilinos GitHub issues #7090, #6805, #6804, #6801

But first, we need to finish up and merge #6840 to first get the work spread out over the GPUs to see if that resolves some of these timeouts.

@bartlettroscoe bartlettroscoe added type: bug The primary issue is a bug in Trilinos code or tests type: enhancement Issue is an enhancement, not a bug client: ATDM Any issue primarily impacting the ATDM project ATDM Config Issues that are specific to the ATDM configuration settings ATDM DevOps Issues that will be worked by the Coordinated ATDM DevOps teams ATDM Sev: Nonblocker Problems with Trilinos that should not block ATDM APPs from getting updates labels Apr 2, 2020
@bartlettroscoe bartlettroscoe self-assigned this Apr 2, 2020
@bartlettroscoe
Copy link
Member Author

bartlettroscoe commented Jun 3, 2020

The issues with timeouts in the ATDM Trilinos builds was resolved by using the new ctest GPU allocation approach in #6840 so we don't need to add support for <fullTestName>_SET_RUN_SERIAL for that ATDM Trilinos builds. But we may need to add it for the Trilinos PR builds (see #7464).

@bartlettroscoe
Copy link
Member Author

NOTE: Using the CTest resource allocation approach does not help on 'ride' for some reason so we are still getting quite a few timeouts there. Therefore, I will go ahead and implement this option and apply RUN_SREIAL to several of the tests on 'ride' to avoid those timeouts.

bartlettroscoe added a commit to bartlettroscoe/TriBITS that referenced this issue Sep 18, 2020
TriBITSPub#173, trilinos/Trilinos#7112)

Turns out that TRIBITS_ADD_TEST() had bad logic for reading in
<fullTestName>_SET_RUN_SERIAL and <fullTestName>_SET_DISABLED_AND_MSG
when the test name is modifed by setting TPL_ENABLE_MPI=ON or passing in
NAME_POSTFIX or POSTFIX_AND_ARGS_<IDX> arguments.

These are obvious use cases that I missed and they are needed for
trilinos/Trilinos#7112 and in general, obviously.

This means that <fullTestName>_SET_DISABLED_AND_MSG was not implemented
correctly for TRIBITS_ADD_TEST() as part of TriBITSPub#173.  Turns out the driving
use case for that was using TRIBITS_ADD_ADVACED_TEST() so we never noticed
the bug.

The next commit will fix this (which I already prototyped).
bartlettroscoe added a commit to bartlettroscoe/Trilinos that referenced this issue Sep 19, 2020
…trilinos#7112)

This makes the testing workflow with TriBITS so much easiler.
bartlettroscoe added a commit to TriBITSPub/TriBITS that referenced this issue Sep 19, 2020
…nos/Trilinos#7112)

Build/Test Cases Summary
Enabled Packages:
Enabled all Packages
0) MPI_DEBUG => passed: passed=381,notpassed=0 (2.15 min)
1) SERIAL_RELEASE => passed: passed=381,notpassed=0 (1.91 min)
2) MPI_DEBUG_CMake-3.17.0 => passed: passed=386,notpassed=0 (2.23 min)
3) SERIAL_RELEASE_CMake-3.17.0 => passed: passed=386,notpassed=0 (2.18 min)
Other local commits for this build/test group: 1b00a3c, 32e3fbc, 92c8ac0, a648f77, c42eb66, ef76b3c, afb7f2e
bartlettroscoe added a commit that referenced this issue Sep 19, 2020
Origin repo remote tracking branch: 'github/master'
Origin repo remote repo URL: 'github = git@github.com:TriBITSPub/TriBITS.git'

At commit:

commit cee1980d053ec2c26301d2389c8b0a677fa262fe
Author:  Roscoe A. Bartlett <rabartl@sandia.gov>
Date:    Sat Sep 19 07:40:49 2020 -0600
Summary: Merge remote-tracking branch 'rab-github/tril-7112-run-serial' (#7112)

This represents the changes in the TriBITS PRs:

* TriBITSPub/TriBITS#327 : Fix install for MacOSX (#7881)

* TriBITSPub/TriBITS#328 : Add support for
  <fullTestName>_SET_RUN_SERIAL=[ON|OFF] (#7112)
bartlettroscoe added a commit to bartlettroscoe/TriBITS that referenced this issue Sep 22, 2020
…nos/Trilinos#7112)

Here, I make some of the asserts for the RUN_SERIAL tests hopefully a little
more clear (trilinos/Trilinos#7112).

As part of this I also reorgainized the unit tests some to group them better.
I broke off the RUN_SERIAL tests for TAAT() into thier own function.

However, more needs to be done to improve the organization of these tests and
make them more independent.  (What we really need is a proper unit test
harness for CMake code but that will take some work to create.)
bartlettroscoe added a commit to bartlettroscoe/Trilinos that referenced this issue Sep 22, 2020
bartlettroscoe added a commit to TriBITSPub/TriBITS that referenced this issue Sep 22, 2020
bartlettroscoe added a commit to bartlettroscoe/TriBITS that referenced this issue Sep 23, 2020
bartlettroscoe added a commit to bartlettroscoe/TriBITS that referenced this issue Sep 23, 2020
trilinos-autotester added a commit that referenced this issue Sep 23, 2020
…run-serial

Automatically Merged using Trilinos Pull Request AutoTester
PR Title: Add <fullTestName>_SET_RUN_SERIAL and use on 'ride' (#7112), fix MaxOSX install (#7881)
PR Author: bartlettroscoe
bartlettroscoe added a commit to TriBITSPub/TriBITS that referenced this issue Sep 23, 2020
Build/Test Cases Summary
Enabled Packages:
Enabled all Packages
0) MPI_DEBUG => passed: passed=382,notpassed=0 (2.05 min)
1) SERIAL_RELEASE => passed: passed=382,notpassed=0 (1.99 min)
2) MPI_DEBUG_CMake-3.17.0 => passed: passed=387,notpassed=0 (2.63 min)
3) SERIAL_RELEASE_CMake-3.17.0 => passed: passed=387,notpassed=0 (2.26 min)
Other local commits for this build/test group: 2850245
@bartlettroscoe
Copy link
Member Author

bartlettroscoe commented Sep 23, 2020

The updates from PR #8063 seem to be working well setting the RUN_SERIAL property. See evidence in the PR #8063 (which was merged to 'atdm-nightly' a few days ago). Also see no timeouts in the build Trilinos-atdm-white-ride-gnu-7.2.0-openmp-debug the last 3 days as shown here:

@bartlettroscoe
Copy link
Member Author

@brian-kelley, do you mind if we remove the hard-coded RUN_SERIAL options added with the commit 81746c0 and instead only set RUN_SERIAL for the CUDA builds on 'ride'? On the other machines like 'vortex', the CTest resource allocation approach (see commit 292e9ac and ATDV-333) of ensuring only a single test runs on each of the 4 GPUs runs at a time works really well and will allow all 4 of the TpetraCore_gemm_m_eq_<X>_MPI_1 tests, for example, to run at the same time on the 4 GPUs on ATS-2 'vortex'.

Note, we should leave RUN_SERIAL on any real PERFORMANCE tests since those always need to run by themselves on an otherwise unloaded machine.

@brian-kelley
Copy link
Contributor

@bartlettroscoe Yes, that's fine. I think adding RUN_SERIAL to non-performance tests was a temporary hack for RIDE.

jmgate pushed a commit to tcad-charon/Trilinos that referenced this issue Sep 24, 2020
…s:develop' (4557faa).

* trilinos-develop:
  Piro: fixing issue where Trapezoidal Rule Solver was not using the (trilinos#8090)
  Tpetra: fix warnings
  Tests: Fixing geminga test
  Phalanx: fix for gcc 5/6 lambda bug on cuda
  Automatic snapshot commit from tribits at 1ed3811
  ATDM: ride: Tempus_IMEX_RK_Partitioned_Staggered_FSA_Partitioned_IMEX_RK_ARS_233_MPI_1 RUN_SERIAL (trilinos#7112, trilinos#8063)
  Fixed name
  Make case for output names consistent
  Automatic snapshot commit from tribits at 362e27d
  MueLu: free fine comp operator as soon as possible
  Automatic snapshot commit from tribits at cee1980
  Allow pointing to a different TriBITS in ctest -S driver for Trilinos (trilinos#7112)
  ATDM: ride: Add RUN_SERIAL for several timing out tests (trilinos#7112)
  Zoltan2: Refactor directory to use Teuchos comm
jmgate pushed a commit to tcad-charon/Trilinos that referenced this issue Sep 24, 2020
…s:develop' (4557faa).

* trilinos-develop:
  Piro: fixing issue where Trapezoidal Rule Solver was not using the (trilinos#8090)
  Tpetra: fix warnings
  Tests: Fixing geminga test
  Phalanx: fix for gcc 5/6 lambda bug on cuda
  Automatic snapshot commit from tribits at 1ed3811
  ATDM: ride: Tempus_IMEX_RK_Partitioned_Staggered_FSA_Partitioned_IMEX_RK_ARS_233_MPI_1 RUN_SERIAL (trilinos#7112, trilinos#8063)
  Fixed name
  Make case for output names consistent
  Automatic snapshot commit from tribits at 362e27d
  MueLu: free fine comp operator as soon as possible
  Automatic snapshot commit from tribits at cee1980
  Allow pointing to a different TriBITS in ctest -S driver for Trilinos (trilinos#7112)
  ATDM: ride: Add RUN_SERIAL for several timing out tests (trilinos#7112)
  Zoltan2: Refactor directory to use Teuchos comm
@github-actions
Copy link

This issue has had no activity for 365 days and is marked for closure. It will be closed after an additional 30 days of inactivity.
If you would like to keep this issue open please add a comment and/or remove the MARKED_FOR_CLOSURE label.
If this issue should be kept open even with no activity beyond the time limits you can add the label DO_NOT_AUTOCLOSE.
If it is ok for this issue to be closed, feel free to go ahead and close it. Please do not add any comments or change any labels or otherwise touch this issue unless your intention is to reset the inactivity counter for an additional year.

@github-actions github-actions bot added the MARKED_FOR_CLOSURE Issue or PR is marked for auto-closure by the GitHub Actions bot. label Dec 19, 2021
@github-actions
Copy link

This issue was closed due to inactivity for 395 days.

@github-actions github-actions bot added the CLOSED_DUE_TO_INACTIVITY Issue or PR has been closed by the GitHub Actions bot due to inactivity. label Jan 19, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
ATDM Config Issues that are specific to the ATDM configuration settings ATDM DevOps Issues that will be worked by the Coordinated ATDM DevOps teams ATDM Sev: Nonblocker Problems with Trilinos that should not block ATDM APPs from getting updates client: ATDM Any issue primarily impacting the ATDM project CLOSED_DUE_TO_INACTIVITY Issue or PR has been closed by the GitHub Actions bot due to inactivity. MARKED_FOR_CLOSURE Issue or PR is marked for auto-closure by the GitHub Actions bot. type: bug The primary issue is a bug in Trilinos code or tests type: enhancement Issue is an enhancement, not a bug
Projects
None yet
Development

No branches or pull requests

2 participants