-
Notifications
You must be signed in to change notification settings - Fork 577
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Tpetra,Intrepid2: Adding RUN_SERIAL to tests #7476
Conversation
which were timing out on RIDE
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would be a shame to force these tests to run one-at-at-time on systems where we can make them spread out over the multiple GPUs like 'waterman' and 'vortex' (see #6840).
But you can merge this for now and then I will work on #7112 so we can only set the RUN_SERIAL
property for these test for just 'ride' in the PR build.
@brian-kelley Thanks for doing this, but I think we are going around the issue instead of addressing it. If you turn this off you might have other tests randomly timing out. Also we probably want to test their execution on Cuda. It really seems to be an issue with work load and using ctest -jN. @trilinos/framework . |
@mperego I agree this is just a bandaid over the problem, but it will only be temporary until we can put e.g.
in |
I see, thanks. B.t.w. we could probably run serially these tests : |
@mperego Good call, 3 out of 4 of the Intrepid2_unit-test_Projection_Cuda_Test_InterpolationProjection tests took longer than 8 minutes:
|
On RIDE PR testing, these are in danger of timing out (taking 8-9 minutes, where the limit is 10 minutes).
Yeah, the HEX test is the really expensive one, followed by the TET one. The timings you see for the other two are really a consequence of running them in parallel. |
@mperego Should I only do HEX and TET in serial? |
Yes, thanks. |
Status Flag 'Pre-Test Inspection' - Auto Inspected - Inspection Is Not Necessary for this Pull Request. |
Status Flag 'Pull Request AutoTester' - Testing Jenkins Projects: Pull Request Auto Testing STARTING (click to expand)Build InformationTest Name: Trilinos_pullrequest_gcc_4.8.4
Jenkins Parameters
Build InformationTest Name: Trilinos_pullrequest_intel_17.0.1
Jenkins Parameters
Build InformationTest Name: Trilinos_pullrequest_gcc_4.9.3_SERIAL
Jenkins Parameters
Build InformationTest Name: Trilinos_pullrequest_gcc_7.2.0
Jenkins Parameters
Build InformationTest Name: Trilinos_pullrequest_gcc_8.3.0
Jenkins Parameters
Build InformationTest Name: Trilinos_pullrequest_cuda_9.2
Jenkins Parameters
Build InformationTest Name: Trilinos_pullrequest_clang_9.0.0
Jenkins Parameters
Build InformationTest Name: Trilinos_pullrequest_python_2
Jenkins Parameters
Build InformationTest Name: Trilinos_pullrequest_python_3
Jenkins Parameters
Using Repos:
Pull Request Author: brian-kelley |
Status Flag 'Pull Request AutoTester' - Jenkins Testing: 1 or more Jobs FAILED Note: Testing will normally be attempted again in approx. 2 Hrs 30 Mins. If a change to the PR source branch occurs, the testing will be attempted again on next available autotester run. Pull Request Auto Testing has FAILED (click to expand)Build InformationTest Name: Trilinos_pullrequest_gcc_4.8.4
Jenkins Parameters
Build InformationTest Name: Trilinos_pullrequest_intel_17.0.1
Jenkins Parameters
Build InformationTest Name: Trilinos_pullrequest_gcc_4.9.3_SERIAL
Jenkins Parameters
Build InformationTest Name: Trilinos_pullrequest_gcc_7.2.0
Jenkins Parameters
Build InformationTest Name: Trilinos_pullrequest_gcc_8.3.0
Jenkins Parameters
Build InformationTest Name: Trilinos_pullrequest_cuda_9.2
Jenkins Parameters
Build InformationTest Name: Trilinos_pullrequest_clang_9.0.0
Jenkins Parameters
Build InformationTest Name: Trilinos_pullrequest_python_2
Jenkins Parameters
Build InformationTest Name: Trilinos_pullrequest_python_3
Jenkins Parameters
Console Output (last 100 lines) : Trilinos_pullrequest_gcc_4.8.4 # 6909 (click to expand)
Console Output (last 100 lines) : Trilinos_pullrequest_intel_17.0.1 # 6720 (click to expand)
Console Output (last 100 lines) : Trilinos_pullrequest_gcc_4.9.3_SERIAL # 5145 (click to expand)
Console Output (last 100 lines) : Trilinos_pullrequest_gcc_7.2.0 # 4992 (click to expand)
Console Output (last 100 lines) : Trilinos_pullrequest_gcc_8.3.0 # 1181 (click to expand)
Console Output (last 100 lines) : Trilinos_pullrequest_cuda_9.2 # 4477 (click to expand)
Console Output (last 100 lines) : Trilinos_pullrequest_clang_9.0.0 # 852 (click to expand)
Console Output (last 100 lines) : Trilinos_pullrequest_python_2 # 2600 (click to expand)
Console Output (last 100 lines) : Trilinos_pullrequest_python_3 # 2611 (click to expand)
|
@mperego You were right, running certain tests in serial just moved the timeout to a different test (MueLu_UnitTestsIntrepid2Tpetra_MPI_1). So we probably have to reduce the |
Status Flag 'Pull Request AutoTester' - Testing Jenkins Projects: Pull Request Auto Testing STARTING (click to expand)Build InformationTest Name: Trilinos_pullrequest_gcc_4.8.4
Jenkins Parameters
Build InformationTest Name: Trilinos_pullrequest_intel_17.0.1
Jenkins Parameters
Build InformationTest Name: Trilinos_pullrequest_gcc_4.9.3_SERIAL
Jenkins Parameters
Build InformationTest Name: Trilinos_pullrequest_gcc_7.2.0
Jenkins Parameters
Build InformationTest Name: Trilinos_pullrequest_gcc_8.3.0
Jenkins Parameters
Build InformationTest Name: Trilinos_pullrequest_cuda_9.2
Jenkins Parameters
Build InformationTest Name: Trilinos_pullrequest_clang_9.0.0
Jenkins Parameters
Build InformationTest Name: Trilinos_pullrequest_python_2
Jenkins Parameters
Build InformationTest Name: Trilinos_pullrequest_python_3
Jenkins Parameters
Using Repos:
Pull Request Author: brian-kelley |
Status Flag 'Pull Request AutoTester' - Jenkins Testing: all Jobs PASSED Pull Request Auto Testing has PASSED (click to expand)Build InformationTest Name: Trilinos_pullrequest_gcc_4.8.4
Jenkins Parameters
Build InformationTest Name: Trilinos_pullrequest_intel_17.0.1
Jenkins Parameters
Build InformationTest Name: Trilinos_pullrequest_gcc_4.9.3_SERIAL
Jenkins Parameters
Build InformationTest Name: Trilinos_pullrequest_gcc_7.2.0
Jenkins Parameters
Build InformationTest Name: Trilinos_pullrequest_gcc_8.3.0
Jenkins Parameters
Build InformationTest Name: Trilinos_pullrequest_cuda_9.2
Jenkins Parameters
Build InformationTest Name: Trilinos_pullrequest_clang_9.0.0
Jenkins Parameters
Build InformationTest Name: Trilinos_pullrequest_python_2
Jenkins Parameters
Build InformationTest Name: Trilinos_pullrequest_python_3
Jenkins Parameters
|
Status Flag 'Pre-Merge Inspection' - - This Pull Request Requires Inspection... The code must be inspected by a member of the Team before Testing/Merging |
All Jobs Finished; status = PASSED, However Inspection must be performed before merge can occur... |
@brian-kelley I think I also was wrong.. the QUAD and TRI tests took also a long time, while they took less than 30s when I run it on ride serially. Maybe we should also run them serially, if we want to continue with this approach,.. and the |
So is there a timeframe where this is expected to be resolved? Looking at all of the comments for this PR and #7464, it sure seems like there isn't. Some ideas:
Otherwise, the only options I see are to start force-pushing PR's that are only being held up by timeouts, or shutting down PR testing on Ride altogether. |
@brian-kelley, I would propose to use the RUN_SERIAL option also for
This is of course a temporary solution but I think it will help the several PRs currently on hold. @etphipp suggestion makes sense to me, but it would probably need to be discussed longer, and we have been on hold since several days now. |
(for long-running tests)
@mperego OK, I added those RUN_SERIALs. @prwolfe has opened #7473 to move RIDE's PR testing to the Pascal nodes (rhel7F -> rhel7G), which have 2 GPUs each, and they are individually about twice as fast. Hopefully that will be a permanent fix for these timeouts, and if this gets merged first, we can just roll this back after #7473. |
Status Flag 'Pre-Test Inspection' - Auto Inspected - Inspection Is Not Necessary for this Pull Request. |
Status Flag 'Pull Request AutoTester' - Testing Jenkins Projects: Pull Request Auto Testing STARTING (click to expand)Build InformationTest Name: Trilinos_pullrequest_gcc_4.8.4
Jenkins Parameters
Build InformationTest Name: Trilinos_pullrequest_intel_17.0.1
Jenkins Parameters
Build InformationTest Name: Trilinos_pullrequest_gcc_4.9.3_SERIAL
Jenkins Parameters
Build InformationTest Name: Trilinos_pullrequest_gcc_7.2.0
Jenkins Parameters
Build InformationTest Name: Trilinos_pullrequest_gcc_8.3.0
Jenkins Parameters
Build InformationTest Name: Trilinos_pullrequest_cuda_9.2
Jenkins Parameters
Build InformationTest Name: Trilinos_pullrequest_clang_9.0.0
Jenkins Parameters
Build InformationTest Name: Trilinos_pullrequest_python_2
Jenkins Parameters
Build InformationTest Name: Trilinos_pullrequest_python_3
Jenkins Parameters
Using Repos:
Pull Request Author: brian-kelley |
Status Flag 'Pull Request AutoTester' - Jenkins Testing: 1 or more Jobs FAILED Note: Testing will normally be attempted again in approx. 2 Hrs 30 Mins. If a change to the PR source branch occurs, the testing will be attempted again on next available autotester run. Pull Request Auto Testing has FAILED (click to expand)Build InformationTest Name: Trilinos_pullrequest_gcc_4.8.4
Jenkins Parameters
Build InformationTest Name: Trilinos_pullrequest_intel_17.0.1
Jenkins Parameters
Build InformationTest Name: Trilinos_pullrequest_gcc_4.9.3_SERIAL
Jenkins Parameters
Build InformationTest Name: Trilinos_pullrequest_gcc_7.2.0
Jenkins Parameters
Build InformationTest Name: Trilinos_pullrequest_gcc_8.3.0
Jenkins Parameters
Build InformationTest Name: Trilinos_pullrequest_cuda_9.2
Jenkins Parameters
Build InformationTest Name: Trilinos_pullrequest_clang_9.0.0
Jenkins Parameters
Build InformationTest Name: Trilinos_pullrequest_python_2
Jenkins Parameters
Build InformationTest Name: Trilinos_pullrequest_python_3
Jenkins Parameters
Console Output (last 100 lines) : Trilinos_pullrequest_gcc_4.8.4 # 6920 (click to expand)
Console Output (last 100 lines) : Trilinos_pullrequest_intel_17.0.1 # 6731 (click to expand)
Console Output (last 100 lines) : Trilinos_pullrequest_gcc_4.9.3_SERIAL # 5156 (click to expand)
Console Output (last 100 lines) : Trilinos_pullrequest_gcc_7.2.0 # 5003 (click to expand)
Console Output (last 100 lines) : Trilinos_pullrequest_gcc_8.3.0 # 1192 (click to expand)
Console Output (last 100 lines) : Trilinos_pullrequest_cuda_9.2 # 4488 (click to expand)
Console Output (last 100 lines) : Trilinos_pullrequest_clang_9.0.0 # 863 (click to expand)
Console Output (last 100 lines) : Trilinos_pullrequest_python_2 # 2609 (click to expand)
Console Output (last 100 lines) : Trilinos_pullrequest_python_3 # 2620 (click to expand)
|
It seems you added RUN_SERIAL to |
@mperego Oops, fixed. I left it on the 4-rank as well, just in case it's slow too. @trilinos/muelu There's probably something that can be done to the Intrepid2+Tpetra+MueLu unit tests to speed them up, since 8m36s is spent in just 2 of the unit tests: |
Status Flag 'Pre-Test Inspection' - Auto Inspected - Inspection Is Not Necessary for this Pull Request. |
Status Flag 'Pull Request AutoTester' - Testing Jenkins Projects: Pull Request Auto Testing STARTING (click to expand)Build InformationTest Name: Trilinos_pullrequest_gcc_4.8.4
Jenkins Parameters
Build InformationTest Name: Trilinos_pullrequest_intel_17.0.1
Jenkins Parameters
Build InformationTest Name: Trilinos_pullrequest_gcc_4.9.3_SERIAL
Jenkins Parameters
Build InformationTest Name: Trilinos_pullrequest_gcc_7.2.0
Jenkins Parameters
Build InformationTest Name: Trilinos_pullrequest_gcc_8.3.0
Jenkins Parameters
Build InformationTest Name: Trilinos_pullrequest_cuda_9.2
Jenkins Parameters
Build InformationTest Name: Trilinos_pullrequest_clang_9.0.0
Jenkins Parameters
Build InformationTest Name: Trilinos_pullrequest_python_2
Jenkins Parameters
Build InformationTest Name: Trilinos_pullrequest_python_3
Jenkins Parameters
Using Repos:
Pull Request Author: brian-kelley |
Thanks @brian-kelley, fingers crossed. |
Good. Cuda build is clean. :-) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good as a temporary fix. Thanks!
Status Flag 'Pull Request AutoTester' - Jenkins Testing: all Jobs PASSED Pull Request Auto Testing has PASSED (click to expand)Build InformationTest Name: Trilinos_pullrequest_gcc_4.8.4
Jenkins Parameters
Build InformationTest Name: Trilinos_pullrequest_intel_17.0.1
Jenkins Parameters
Build InformationTest Name: Trilinos_pullrequest_gcc_4.9.3_SERIAL
Jenkins Parameters
Build InformationTest Name: Trilinos_pullrequest_gcc_7.2.0
Jenkins Parameters
Build InformationTest Name: Trilinos_pullrequest_gcc_8.3.0
Jenkins Parameters
Build InformationTest Name: Trilinos_pullrequest_cuda_9.2
Jenkins Parameters
Build InformationTest Name: Trilinos_pullrequest_clang_9.0.0
Jenkins Parameters
Build InformationTest Name: Trilinos_pullrequest_python_2
Jenkins Parameters
Build InformationTest Name: Trilinos_pullrequest_python_3
Jenkins Parameters
|
Status Flag 'Pre-Merge Inspection' - SUCCESS: The last commit to this Pull Request has been INSPECTED AND APPROVED by [ mperego ]! |
Status Flag 'Pull Request AutoTester' - AutoMerge IS ENABLED, but the Label AT: AUTOMERGE is not set. Either set Label AT: AUTOMERGE or manually merge the PR... |
@brian-kelley can we merge this? |
These were timing out on RIDE PR testing. Even with Kokkos_ENABLE_DEBUG=ON, the Tpetra tests should take about 17 seconds each and the Intrepid2 test 0.8 seconds. So this won't increase the total time by much.
@trilinos/tpetra
@trilinos/intrepid2
Motivation
See test timeouts/discussion at #7464
Stakeholder Feedback
Testing