-
Notifications
You must be signed in to change notification settings - Fork 576
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Use RESOURCE_GROUPS property of CTest #6840
Use RESOURCE_GROUPS property of CTest #6840
Conversation
See also: kokkos/kokkos#2765. |
Status Flag 'Pre-Test Inspection' - - This Pull Request Requires Inspection... The code must be inspected by a member of the Team before Testing/Merging |
3328a6e
to
db84d56
Compare
Status Flag 'Pre-Test Inspection' - - This Pull Request Requires Inspection... The code must be inspected by a member of the Team before Testing/Merging |
2066cc2
to
500fba0
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@KyleFromKitware, thanks for putting this together. I have just a few small requests.
Can I request the following changes (consistent with my above comments in this review)?
-
Move the setting of the vars TRIBITS_TEST_EXTRA_ENVIRONMENT and TRIBITS_RESOURCES_PER_PROCESS into TriBITS proper and guard with
IF (TPL_ENABLE_CUDA)
. (This does not makes sense for non-CUDA builds.) -
Create a new function in
TribitsAddTestHelpers.cmake
called something likeTRIBITS_PRIVATE_ADD_CTEST_RESOURCE_USAGE_PROPERTIES()
and then use that to eliminate duplication related setting theENVIRONMENT
andRESOURCE_GROUPS
properties. -
Put code to generating CTest resource file into a new module
TribitsGenerateResourceSpecFile.cmake
. (This will allow for reuse ofTRIBITS_ADD[_ADVANCED]_TEST()
for non-TriBITS projects. Also, that file `TribitsGlobalMacros.cmake has gotten out of control in size.) -
Change the name of the generated resource from
resources.json
toctest_resources.json
and move it to the base build directory. (That way it will be more clear what this is and the user will see it clearly. We don't want to hide this away.) -
Add a comment describing why we need the code related to the command
GET_PROPERTY(CTEST_RESOURCE_SPEC_FILE_TYPE CACHE CTEST_RESOURCE_SPEC_FILE PROPERTY TYPE)
. (It is not clear to me.) -
Make sure and keep the TriBITS commits separate from the Kokkos commits and there should be no commits outside of
cmake/tribits
andpackages/kokkos
(which there will not be after doing the first item above). This will make it easy to apply these same commits to the TriBITS repo.
CMakeLists.txt
Outdated
SET(TRIBITS_TEST_EXTRA_ENVIRONMENT CTEST_KOKKOS_DEVICE_TYPE=gpus) | ||
SET(TRIBITS_RESOURCES_PER_PROCESS gpus:1) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I need to look at this more, but I think we need to move this into TriBITS proper (need to think about the right place) and we need to wrap this inside of:
IF (TPL_ENABLE_CUDA)
...
ENDIF()
It might be nice to keep this inside of the TribitsAddTestHelper.cmake
file so that TRIBITS_ADD[_ADVANCED]_TEST()
can be used by non-TriBITS projects.
ELSEIF(DEFINED CACHE{CTEST_RESOURCE_SPEC_FILE}) | ||
GET_PROPERTY(CTEST_RESOURCE_SPEC_FILE_TYPE CACHE CTEST_RESOURCE_SPEC_FILE PROPERTY TYPE) | ||
IF (CTEST_RESOURCE_SPEC_FILE_TYPE STREQUAL "INTERNAL") | ||
ADVANCED_SET(CTEST_RESOURCE_SPEC_FILE "" | ||
CACHE INTERNAL | ||
"Resource spec file for CTest." | ||
) | ||
ENDIF() | ||
ENDIF() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am curious about what is going on with this GET_PROPERTY() command. What is significant about "INTERNAL" here? Can we add a comment below this that explains this idiom and what this is trying to accomplish?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Normally, CTEST_RESOURCE_SPEC_FILE
is intended to be set by the user with a -D
argument to cmake
. However, since we are generating the file ourselves, it doesn't make sense to also provide the option to the user to change it, so I made it an INTERNAL
option to tell CMake about it while hiding it from the user in ccmake
and cmake-gui
. If the project then gets reconfigured with ${PROJECT_NAME}_AUTOGENERATE_TEST_RESOURCE_FILE=OFF
, we check if the cache entry exists and is INTERNAL
, meaning it was previously generated by ${PROJECT_NAME}_AUTOGENERATE_TEST_RESOURCE_FILE=ON
. If so, that means it was not manually specified by the user, so we remove it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead of this, could we not just give a default value to CTEST_RESOURCE_SPEC_FILE
if ${PROJECT_NAME}_AUTOGENERATE_TEST_RESOURCE_FILE=ON
but not hide the cache variable CTEST_RESOURCE_SPEC_FILE
? If the full path given in CTEST_RESOURCE_SPEC_FILE
is not the default value then ${PROJECT_NAME}_AUTOGENERATE_TEST_RESOURCE_FILE=ON
would be ignored with a "NOTE" printed to the screen.
You would have the following just below the definition of ${PROJECT_NAME}_AUTOGENERATE_TEST_RESOURCE_FILE
in the macro TRIBITS_DEFINE_GLOBAL_OPTIONS_AND_DEFINE_EXTRA_REPOS()
:
SET(CTEST_RESOURCE_SPEC_FILE_DOC_EXTRA "")
IF (${PROJECT_NAME}_AUTOGENERATE_TEST_RESOURCE_FILE)
SET(CTEST_RESOURCE_SPEC_FILE_DEFAULT ${CMAKE_BINARY_DIR}/ctest_resources.json)
IF ("${CTEST_RESOURCE_SPEC_FILE}" STREQUAL "")
SET(CTEST_RESOURCE_SPEC_FILE_DOC_EXTRA
" This file is autogenerated by default since ${PROJECT_NAME}_AUTOGENERATE_TEST_RESOURCE_FILE=${${PROJECT_NAME}_AUTOGENERATE_TEST_RESOURCE_FILE}!" )
ENDIF()
ELSE()
SET(CTEST_RESOURCE_SPEC_FILE_DEFAULT "")
ENDIF()
ADVANCED_SET(CTEST_RESOURCE_SPEC_FILE
"${CTEST_RESOURCE_SPEC_FILE_DEFAULT}"
CACHE FILEPATH
"Resource spec file for CTest.${CTEST_RESOURCE_SPEC_FILE_DOC_EXTRA}"
)
Then when you auto-generate that file in TribitsProjectImpl.cmake
you would have:
IF (${PROJECT_NAME}_AUTOGENERATE_TEST_RESOURCE_FILE)
IF ("${CTEST_RESOURCE_SPEC_FILE}" STREQUAL "${CTEST_RESOURCE_SPEC_FILE_DEFAULT}")
TRIBITS_GENERATE_RESOURCE_SPEC_FILE()
ELSE()
MESSAGE("NOTE: The test resource file CTEST_RESOURCE_SPEC_FILE='${CTEST_RESOURCE_SPEC_FILE}'"
" will not be auto-generated even through"
" ${PROJECT_NAME}_AUTOGENERATE_TEST_RESOURCE_FILE=${${PROJECT_NAME}_AUTOGENERATE_TEST_RESOURCE_FILE}"
" because its location does not match the default"
" location '${CTEST_RESOURCE_SPEC_FILE_DEFAULT}'."
" If you want to auto-generate this file, please clear CTEST_RESOURCE_SPEC_FILE and
" reconfigure or create that file on your own and clear"
" ${PROJECT_NAME}_AUTOGENERATE_TEST_RESOURCE_FILE."
)
ENDIF()
ELSE()
# Don't autogenerate the resource file
ENDIF()
That way, we don't have to hide the variable CTEST_RESOURCE_SPEC_FILE
and we allow the documentation for this variable to describe what is going on and what the relationship is between the variables CTEST_RESOURCE_SPEC_FILE
and ${PROJECT_NAME}_AUTOGENERATE_TEST_RESOURCE_FILE
.
How does hat sound?
NOTE: There is complex enough logic here that we should figure our how to create some automated tests for this in TirBITS to sustain this. This may be hard to do because of the need to set TPL_ENABLE_CUDA=ON
. I can't set that in TriBITS automated testing but I may be able to find to mock this and test the different use cases for the generation and and usage of this file. I will worry about how to add automated testing for this in TriBITS to make sure we can sustain this long term. But I can do that after we get some initial testing on CUDA/GPU systems to make sure we are on the right track with all of this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that sounds like a good idea. As far as testing it, we can use a mock variable and then put the logic inside IF (TPL_ENABLE_CUDA OR TPL_ENABLE_CUDA_MOCK)
. Does that sound good?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that sounds like a good idea. As far as testing it, we can use a mock variable and then put the logic inside IF (TPL_ENABLE_CUDA OR TPL_ENABLE_CUDA_MOCK). Does that sound good?
Right, something like that would work. Then we can set up different scenarios and make sure that the right resource file gets generated and the right test properties are getting set. We can work on that once I can do some testing on 'waterman'.
My intention in doing everything here first was that you could test it quickly to see if it helps any. I do of course plan to move the TriBITS stuff into the proper TriBITS repo - I will open a separate PR over there and make the changes you asked. Have my changes in Kokkos been merged here yet? |
@KyleFromKitware, I think yes because the PR kokkos/kokkos#2765 was already merged. Can you verify that? |
@KyleFromKitware, okay, that is fine. We can do more cleanups later once we do some testing with this. So I guess this is ready to test on CUDA systems? |
Yes, let's try it. |
I don't see my Kokkos changes in Trilinos |
@KyleFromKitware, this is what I would expect. Kokkos does regular releases (about ever 3 months) where they promote from Kokkos 'develop' to 'master' and then they snapshot Kokkos 'master' to Trilinos 'develop' (through a Trilinos PR). In between proper/offical Kokkos updates to Trilinos, we can make tweaks to Kokkos to address urgent needs. As long as equivalent changes are being made in Kokkos 'develop' and Trilinos 'develop', all will be fine on the next Kokkos snapshot to Trilinos 'develop'. Does that make sense? |
Yes, so we expect to not see the Kokkos changes here yet. Does that mean you want me to include the accepted Kokkos patch here as well so it doesn't have to wait for Kokkos |
@KyleFromKitware, there is enough nuance here that we should likely discuss this briefly over Skype just to make sure we are on the same page. I will send you an short meeting invite. |
6daff4a
to
336991c
Compare
@bartlettroscoe, per our discussion, I have cherry-picked the merged version of the Kokkos changes into this branch and rebased onto the latest |
336991c
to
370f6f0
Compare
This fixes the tests KokkosCore_UnitTest_DefaultInit_<x>_MPI_1 that fail when running with ctest GPU allocation feature that fail becuase they don't run on GPU device 0 (see #3040). Note: Since get_ctest_gpu() returns 0 if CTest has not provided anything, it is safe to always call it. This new function get_gpu() should really be unit tested on its own.
Extract and use get_gpu() (#3040, trilinos/Trilinos#6840)
Got bit by a randomly failing test (see #3276 (comment)) likely unrelated to any changes in this PR :-( I applied |
Status Flag 'Pull Request AutoTester' - User Requested Retest - Label AT: RETEST will be reset after testing. |
Status Flag 'Pull Request AutoTester' - Testing Jenkins Projects: Pull Request Auto Testing STARTING (click to expand)Build InformationTest Name: Trilinos_pullrequest_gcc_4.8.4
Jenkins Parameters
Build InformationTest Name: Trilinos_pullrequest_intel_17.0.1
Jenkins Parameters
Build InformationTest Name: Trilinos_pullrequest_gcc_4.9.3_SERIAL
Jenkins Parameters
Build InformationTest Name: Trilinos_pullrequest_gcc_7.2.0
Jenkins Parameters
Build InformationTest Name: Trilinos_pullrequest_gcc_8.3.0
Jenkins Parameters
Build InformationTest Name: Trilinos_pullrequest_cuda_9.2
Jenkins Parameters
Build InformationTest Name: Trilinos_pullrequest_clang_9.0.0
Jenkins Parameters
Build InformationTest Name: Trilinos_pullrequest_python_2
Jenkins Parameters
Build InformationTest Name: Trilinos_pullrequest_python_3
Jenkins Parameters
Using Repos:
Pull Request Author: KyleFromKitware |
Status Flag 'Pull Request AutoTester' - Jenkins Testing: 1 or more Jobs FAILED Note: Testing will normally be attempted again in approx. 2 Hrs 30 Mins. If a change to the PR source branch occurs, the testing will be attempted again on next available autotester run. Pull Request Auto Testing has FAILED (click to expand)Build InformationTest Name: Trilinos_pullrequest_gcc_4.8.4
Jenkins Parameters
Build InformationTest Name: Trilinos_pullrequest_intel_17.0.1
Jenkins Parameters
Build InformationTest Name: Trilinos_pullrequest_gcc_4.9.3_SERIAL
Jenkins Parameters
Build InformationTest Name: Trilinos_pullrequest_gcc_7.2.0
Jenkins Parameters
Build InformationTest Name: Trilinos_pullrequest_gcc_8.3.0
Jenkins Parameters
Build InformationTest Name: Trilinos_pullrequest_cuda_9.2
Jenkins Parameters
Build InformationTest Name: Trilinos_pullrequest_clang_9.0.0
Jenkins Parameters
Build InformationTest Name: Trilinos_pullrequest_python_2
Jenkins Parameters
Build InformationTest Name: Trilinos_pullrequest_python_3
Jenkins Parameters
Console Output (last 100 lines) : Trilinos_pullrequest_gcc_4.8.4 # 6755 (click to expand)
Console Output (last 100 lines) : Trilinos_pullrequest_intel_17.0.1 # 6567 (click to expand)
Console Output (last 100 lines) : Trilinos_pullrequest_gcc_4.9.3_SERIAL # 4992 (click to expand)
Console Output (last 100 lines) : Trilinos_pullrequest_gcc_7.2.0 # 4839 (click to expand)
Console Output (last 100 lines) : Trilinos_pullrequest_gcc_8.3.0 # 1029 (click to expand)
Console Output (last 100 lines) : Trilinos_pullrequest_cuda_9.2 # 4329 (click to expand)
Console Output (last 100 lines) : Trilinos_pullrequest_clang_9.0.0 # 704 (click to expand)
Console Output (last 100 lines) : Trilinos_pullrequest_python_2 # 2469 (click to expand)
Console Output (last 100 lines) : Trilinos_pullrequest_python_3 # 2480 (click to expand)
|
Status Flag 'Pull Request AutoTester' - Testing Jenkins Projects: Pull Request Auto Testing STARTING (click to expand)Build InformationTest Name: Trilinos_pullrequest_gcc_4.8.4
Jenkins Parameters
Build InformationTest Name: Trilinos_pullrequest_intel_17.0.1
Jenkins Parameters
Build InformationTest Name: Trilinos_pullrequest_gcc_4.9.3_SERIAL
Jenkins Parameters
Build InformationTest Name: Trilinos_pullrequest_gcc_7.2.0
Jenkins Parameters
Build InformationTest Name: Trilinos_pullrequest_gcc_8.3.0
Jenkins Parameters
Build InformationTest Name: Trilinos_pullrequest_cuda_9.2
Jenkins Parameters
Build InformationTest Name: Trilinos_pullrequest_clang_9.0.0
Jenkins Parameters
Build InformationTest Name: Trilinos_pullrequest_python_2
Jenkins Parameters
Build InformationTest Name: Trilinos_pullrequest_python_3
Jenkins Parameters
Using Repos:
Pull Request Author: KyleFromKitware |
Status Flag 'Pull Request AutoTester' - Jenkins Testing: all Jobs PASSED Pull Request Auto Testing has PASSED (click to expand)Build InformationTest Name: Trilinos_pullrequest_gcc_4.8.4
Jenkins Parameters
Build InformationTest Name: Trilinos_pullrequest_intel_17.0.1
Jenkins Parameters
Build InformationTest Name: Trilinos_pullrequest_gcc_4.9.3_SERIAL
Jenkins Parameters
Build InformationTest Name: Trilinos_pullrequest_gcc_7.2.0
Jenkins Parameters
Build InformationTest Name: Trilinos_pullrequest_gcc_8.3.0
Jenkins Parameters
Build InformationTest Name: Trilinos_pullrequest_cuda_9.2
Jenkins Parameters
Build InformationTest Name: Trilinos_pullrequest_clang_9.0.0
Jenkins Parameters
Build InformationTest Name: Trilinos_pullrequest_python_2
Jenkins Parameters
Build InformationTest Name: Trilinos_pullrequest_python_3
Jenkins Parameters
|
Status Flag 'Pre-Merge Inspection' - SUCCESS: The last commit to this Pull Request has been INSPECTED AND APPROVED by [ bartlettroscoe ]! |
Status Flag 'Pull Request AutoTester' - Pull Request will be Automerged |
Merge on Pull Request# 6840: IS A SUCCESS - Pull Request successfully merged |
…s:develop' (cd0d4eb). * trilinos-develop: Piro: cleaning/refactoring of steady adjoint sensitivities moved computation of adjoint sensitivities from Piro::NOXSolver into Piro::SteadyStateSolver ATDM: Set several CUDA disables (trilinos#6329, trilinos#6799, trilinos#7090) Tempus: Add TimeEvent Object amesos2: fix tests, examples with basker, cleanup amesos2/basker: fix memory leak Phalanx: remove all use of cuda uvm ATDM: ride: Spread out work over GPUs (trilinos#2422) Kokkos: Switch to use Kokkos::Cuda().cuda_device() for expected_device (kokkos/kokkos#3040, trilinos#6840) Kokkos: Extract and use get_gpu() (kokkos/kokkos#3040, trilinos#6840) ATDM: Update documentation for updated 'waterman' env (trilinos#2422) ATDM: waterman: Reduce from ctest -j4 to -j2 (trilinos#2422) ATDM: waterman: Use cmake 3.17.2 and ctest resource limits for GPUs (trilinos#2422) Allow pointing to a tribits outside of Trilinos (trilinos#2422) Automatic snapshot commit from tribits at 39a9591
…s:develop' (cd0d4eb). * trilinos-develop: Piro: cleaning/refactoring of steady adjoint sensitivities moved computation of adjoint sensitivities from Piro::NOXSolver into Piro::SteadyStateSolver ATDM: Set several CUDA disables (trilinos#6329, trilinos#6799, trilinos#7090) Tempus: Add TimeEvent Object amesos2: fix tests, examples with basker, cleanup amesos2/basker: fix memory leak Phalanx: remove all use of cuda uvm ATDM: ride: Spread out work over GPUs (trilinos#2422) Kokkos: Switch to use Kokkos::Cuda().cuda_device() for expected_device (kokkos/kokkos#3040, trilinos#6840) Kokkos: Extract and use get_gpu() (kokkos/kokkos#3040, trilinos#6840) ATDM: Update documentation for updated 'waterman' env (trilinos#2422) ATDM: waterman: Reduce from ctest -j4 to -j2 (trilinos#2422) ATDM: waterman: Use cmake 3.17.2 and ctest resource limits for GPUs (trilinos#2422) Allow pointing to a tribits outside of Trilinos (trilinos#2422) Automatic snapshot commit from tribits at 39a9591
This fixes the tests KokkosCore_UnitTest_DefaultInit_<x>_MPI_1 that fail when running with ctest GPU allocation feature that fail becuase they don't run on GPU device 0 (see kokkos#3040). Note: Since get_ctest_gpu() returns 0 if CTest has not provided anything, it is safe to always call it. This new function get_gpu() should really be unit tested on its own.
…s#3040, trilinos/Trilinos#6840) This seems to also fix the KokkosCore_UnitTest_DefaultInit_<x>_MPI_1 tests that where failing when running with ctest GPU allocation feature that fail because they don't run on GPU device 0 (see kokkos#3040).
Extract and use get_gpu() (#3040, trilinos/Trilinos#6840)
This will spread out the work to the different GPUs and help speed up the test suite and avoid some overloading. This is being driven by:
The changes to Kokkos in this PR have already been merged in kokkos/kokkos#2765.
Some more work needs to get done before this is ready for testing and merging (see https://gitlab.kitware.com/snl/project-1/-/issues/100).
The motivating issue is #7248
The matching TriBITS changes are in PR TriBITSPub/TriBITS#319.
To build Trilinos on this branch with the matching TriBITS, do:
Then to build Trilinos using the other TriBITS implementation, configure with:
Once we have TriBITS changes 100% done (including automated tests and documentation), we will merge TriBITSPub/TriBITS#319 to 'mastser' and then snapshot TriBITS to the Trilinos branch 'tribits_github_snapshot' and merge that branch into the Trilinos PR for the final test and merge to the 'develop' branch.