Panzer examples failures with new ATDM CUDA builds on hansen/shiller #2318
If you don't want to bother with this for now, I can just set But these examples do build and run just fine for the other GNU and Intel builds. Let me know. -Ross
To unclutter the dashboard and since EMPIRE is not building or running the Panzer examples, I am going to add a commit to disable the Panzer examples for these builds. After I do that, I will update the "Steps to Reproduce" section to explain how to re-enable them.
…L-171) EMPIRE does not build or test these examples and they don't currently build (see #2318). This just removes the clutter on CDash while these things can be fixed offline so that we can focus on the remaining failures.
I just pushed the commit 0be2ed4:
It disables all of the examples. Therefore, when these CUDA builds run again and post to CDash, we should see the Panzer examples gone along with the build errors. I updated the "Steps to Reproduce" section to allow these examples to be re-enabled in case someone wants to fix this. If you get these examples building, can you please just remove the contents of the files:
and leave the files otherwise in place? That way, the next set of tweaks will just involve adding set statements and not require recreating these files.
My commit 0be2ed4 successfully disabled the Panzer examples that are failing to build with CUDA, and this returned 100% passing Panzer tests for these CUDA builds as shown at: (scroll to the bottom to see.) I think we need a status for Trilinos GitHub Issues called something like "disabled_waiting_to_be_fixed".
…-171) The Panzer examples don't build on CUDA as noted in trilinos#2318.
* develop: (261 commits)
  Replace PHX_EVALUATOR_CLASS Macros (trilinos#2322)
  Change time-limit on 'ride' from 4 to 12 hours (TRIL-171)
  Disable Panzer examles for CUDA builds on 'ride' (trilinos#2318, TRIL-171)
  Move ATDMDevEnvUtils.cmake to utils/ subdir (TRIL-171)
  Change from ATDM_CONFIG_USE_MAKEIFLES to ATDM_CONFIG_USE_NINJA (TRIL-171)
  Add ATDM_CONFIG_KNOWN_HOSTNAME var for CTEST_SITE (TRIL-171)
  Misc improvements to checkin-test-atdm.sh (TRIL-171)
  Print the return code from 'ctest -S' command (TRIL-171)
  Tpetra: Sort and merge specializations for Tpetra::CrsGraph (trilinos#2354)
  Remove '#line 1' as workaround for Ninja + CUDA rebuilds (trilinos#2359)
  Added function to return the full vector of data in the indexer rather than elem by elem (trilinos#2355)
  Panzer: Fixing rotation matrix calculation in integration values
  Tpetra: Adding an additive Schwarz test
  checkin-test driver scripts for local testing ATDM builds of Trilinos
  Add list of all supported builds on ride (TRIL-171)
  Add list of all supported builds shiller (TRIL-171)
  Fix location of nvcc_wrapper and other fixups (TRIL-171)
  Improve the 'date' output and start and end (TRIL-171)
  Change name ATDM_HOSTNAME to ATDM_SYSTEM_NAME (TRIL-171)
  Print KOKKOS_ARCH to STDOUT (trilinos#1400)
  ...

Conflicts:
  packages/shylu/shylu_node/tacho/src/TachoExp_NumericTools.hpp
  packages/shylu/shylu_node/tacho/unit-test/Tacho_TestCrsMatrixBase.hpp
I was just informed by @rppawlo that the build and runtime issues that this issue covers actually show real problems for EMPIRE. Therefore, I will remove the disable of these Panzer examples and let them run in the Note that we have the builds Trilinos-atdm-hansen-shiller-cuda-debug-panzer
@rppawlo, are the problems you mentioned with Panzer and CUDA on shiller only runtime problems or also build problems? In any case, starting tomorrow morning, the builds NOTE: I am going to fix TriBITS so that
They are runtime failures unless you enable hierarchic parallelism in sacado and phalanx, then it is build time. At this time, we don't need hierarchic. |
@rppawlo, if we find there are build failures for some Panzer examples, do you want to disable them for now or get this fixed in a shorter time-frame? In any case, we should see how the Panzer example builds and testing looks tomorrow on CDash. |
It's fine either way. We are actively working on it so hopefully even runtime issues may be fixed today. |
@rppawlo, if there is anything different about the way you are configuring Trilinos or running on 'shiller' from what is listed in:
can you please comment on that here? That will include KOKKOS_ARCH ( |
Currently using what is in the empire build scripts repo. no modifications. |
Good news. The current build of Panzer tests and examples with the CUDA builds on 'shiller' passes, with only a few failing tests; details are shown below. I just pushed the commit 2327743 that will allow the Panzer examples to build and test on 'hansen' tomorrow.
The re-enabled Panzer examples now all build in the build But there are four failing Panzer tests shown at:

The tests are all timeouts:

| Name | Status | Time | Details |

Are these tests important for the work with ATDM and EMPIRE? Do they need to be fixed in short order? Otherwise, other than a single failing Anasazi test (the Teuchos test should be fixed now), these Panzer test failures are the only thing blocking the promotion of the build
I'm torn on this. These tests are important to verify order of accuracy of panzer discretizations. We can't really make the test smaller since it is run with a series of successively refined meshes. Can we temporarily disable just on this platform and I will take a look next week? Swamped with other cuda stuff today.
@rppawlo, you only ever disable tests on the platforms where they fail. You never disable a test on a platform where it is persistently passing. Therefore, we never lose any testing. We just remove persistently failing tests that condition people to seeing red so that they stop noticing new failures. Does that seem reasonable? (Once we get the CDash upgrade and update to CMake 3.10.x, we will see these disabled tests on CDash as well, but CDash will not send out emails for them.)

From the analysis below looking at data on CDash, it looks like there is a chance that these tests may be randomly timing out on the CUDA builds. Could this be due to the large startup time for the GPU computations as compared to running on the node threaded or not threaded? Anyway, it is interesting to see the runtimes for the test What do you make of that?

In any case, looking at this data, I think it might be worth experimenting with increasing the timeouts for these tests for CUDA builds and seeing what happens. Should I give that a try? Or is a 10m runtime for these tests in a CUDA build just not worth it when the intel-opt-serial builds run this test in about 40s to a minute? Let me know.

P.S. The data below also suggests that some of these tests may have gotten a lot slower over the last month (from 5m to 9m in some cases). This might be a warning sign of a major performance regression somewhere in Trilinos.

DETAILED NOTES: The tests that failed in these CUDA builds on 'hansen' ('shiller') last night are shown at:

which shows:
The test Other tests like And if you look at that above query, you see that yesterday the test passed in 9m14s for the cuda-debug build but timed out at 10m for the cuda-opt build. And in late February that test passed for the cuda-opt build in under 3m every time. Does this suggest some type of race condition with the CUDA threads or something?

The test It looks like that test got a lot more expensive for the cuda-opt build compared to what it was in early March, as shown in that build. Has something gotten slower in Kokkos, Tpetra, or one of the other packages that these tests rely on in that time?

Anyway, it looks like if we increase the timeout for the test Finally, looking at the test it looks like that test might be randomly taking longer to run and hitting the 10m time limit.
I ran the Panzer tests that are timing out at 10m manually with a much longer timeout, and some of them actually take nearly 40 minutes to complete! (See details below.) But amazingly, these tests actually run faster with the So, we can increase the timeout for these individual tests to something like 45 minutes for CUDA builds, and then we can get them to pass. But do we really need to be running these tests for the CUDA build of Trilinos? Taking 40 minutes for single tests seems like overkill. Do these tests really need to be run on CUDA as well? We are only talking about 4 tests here. They would continue to run in the other builds. But we might need to disable them for the GNU OpenMP builds on 'ride' and 'white' as well, as shown at:

In any case, I will create a PR to increase the timeouts for these tests for CUDA builds so that at least if anyone runs them, they will pass. That will allow these to pass in nightly testing (taking up an extra 30 minutes in wall-clock time to complete these builds). Then later we can perhaps disable them for CUDA (or make them run faster if there is a desire to do that).

DETAILED NOTES: First, reproduce the failing tests for the cuda-opt and cuda-debug builds on 'shiller':
That showed the more detailed results:
and
All of the timeouts are after 10m. Therefore, let's increase the timeout to say 30 min for a CUDA build and see what happens ... I will just run the tests manually increasing the timeout with:
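The command itself did not survive this archive; the following is a plausible sketch of rerunning a single timed-out test with a raised timeout (the test name and timeout value are assumptions, not the original command):

```shell
# From the CUDA build directory: rerun one timed-out test with a 45 min
# timeout instead of the default 10 min. The test name here is an example.
ctest -R PanzerAdaptersSTK_CurlLaplacianExample-ConvTest \
  --timeout 2700 -VV
```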
Bingo, it passed! But note the runtime of about 2300s, i.e. over 38 minutes! That means, to be safe, we would likely need to set timeouts of 45 minutes for these tests for CUDA builds! Now try the
Wow, the
so go figure.
…linos#2318) The tests that have the timeouts raised take nearly 40 minutes to complete for a CUDA buld on hansen/shiller. Therefore, I set the timeout to 45 minutes to be safe.
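The commit's actual diff is not shown here; below is a hypothetical CMake fragment of the kind of change the commit message describes (the test name and the CUDA guard variable are assumptions for illustration, not the real diff):

```cmake
# Sketch only: raise the ctest timeout to 45 min for a long-running
# convergence test when CUDA is enabled. Names are illustrative.
if (TPL_ENABLE_CUDA)
  set_tests_properties(
    PanzerAdaptersSTK_CurlLaplacianExample-ConvTest
    PROPERTIES TIMEOUT 2700)  # 45 minutes
endif()
```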
The Panzer examples have been re-enabled for a long time (since commit 2327743 was pushed to 'develop'), and all of the failing tests have either been addressed one way or another or are being addressed in other issues. The only remaining failing Panzer tests as shown in: are the test Therefore, there is nothing left to be done for this issue, so it is time to close it.

NOTE: Any Panzer tests that have been disabled as part of this and other issues can be found by doing:
which currently returns:
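Both the query and its output were elided from this archive; below is a sketch of the kind of grep that would list ATDM-disabled tests, demonstrated against a mock tweaks file (the directory, file name, and `ATDM_SET_ENABLE` pattern are assumptions standing in for the real `cmake/std/atdm/` tree):

```shell
# Stand-in for the real ATDM tweak files; names here are illustrative only.
mkdir -p /tmp/atdm-demo
cat > /tmp/atdm-demo/CUDA_Tweaks.cmake <<'EOF'
ATDM_SET_ENABLE(PanzerAdaptersSTK_CurlLaplacianExample-ConvTest_DISABLE ON)
EOF
# List every disabled test recorded in the tweak files:
grep -rn "_DISABLE ON" /tmp/atdm-demo/
```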
Summary
CC: @trilinos/nox, @fryeguy52
Next Action Status
Panzer examples build as of 3/29/2018 and any remaining test/example failures are being addressed in other issues #2454 and #2471.
Description
The Panzer examples don't currently build when building with the ATDM CUDA build configuration. There are build failures shown at:
for the builds:
- `Trilinos-atdm-hansen-shiller-cuda-debug`: https://testing.sandia.gov/cdash/index.php?project=Trilinos&parentid=3412693
- `Trilinos-atdm-hansen-shiller-cuda-opt`: https://testing.sandia.gov/cdash/index.php?project=Trilinos&parentid=3412702

This shows a build failure for the file
`packages/panzer/adapters-stk/tutorial/siamCse17/mySourceTerm.cpp`
the beginning of which looks like:

The rest of the build failures are link failures for the executables:
- `PanzerAdaptersSTK_step01.exe`
- `PanzerAdaptersSTK_me_main_driver.exe`
- `PanzerAdaptersSTK_main_driver.exe`
- `PanzerMiniEM_BlockPrec.exe`
All of these show similar link failures that look like:
The reason that the EMPIRE build of Panzer does not show this build failure is that it enables
`-DPanzer_ENABLE_TESTS=ON`
which does not enable the Panzer examples. But the option
`-DTrilinos_ENABLE_TESTS=ON`
causes the default enabling of both Panzer tests and examples (yes, that is confusing behavior, but that is what it is).

The options for addressing this are:
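As an illustration of the enable behavior described above (the two option names are real Trilinos CMake cache variables; everything else in these command sketches is an example, not an actual configure line from this issue):

```shell
# EMPIRE-style configure: turns on Panzer's tests but NOT its examples.
cmake ... -DTrilinos_ENABLE_Panzer=ON -DPanzer_ENABLE_TESTS=ON ...

# ATDM nightly configure: enables tests AND examples for every enabled package.
cmake ... -DTrilinos_ENABLE_Panzer=ON -DTrilinos_ENABLE_TESTS=ON ...
```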
Steps to Reproduce:
The instructions to reproduce these build failures can be found starting at:
and clicking "Reproducing ATDM builds locally" which takes you to:
Basically, on `hansen` or `shiller`, you just clone the Trilinos repo (with its location depicted as `$TRILINOS_DIR` below) and get on the `develop` branch. Then create a build directory and do the configure and build as:
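The configure and build commands themselves were lost from this archive; the following is a sketch of the standard ATDM workflow of that era, under the assumption that the `load-env.sh` script and `ATDMDevEnv.cmake` options file match the ATDM documentation (the build name, generator, and parallelism level are examples):

```shell
cd $TRILINOS_DIR && git checkout develop
mkdir -p ../BUILD && cd ../BUILD
# Load the compiler/TPL environment for one of the failing builds:
source $TRILINOS_DIR/cmake/std/atdm/load-env.sh cuda-debug
cmake -G Ninja \
  -D Trilinos_CONFIGURE_OPTIONS_FILE:STRING=cmake/std/atdm/ATDMDevEnv.cmake \
  -D Trilinos_ENABLE_TESTS=ON -D Trilinos_ENABLE_Panzer=ON \
  $TRILINOS_DIR
make NP=16   # wrapper over ninja; or run: ninja -j16
```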