Panzer examples failures with new ATDM CUDA builds on hansen/shiller #2318
If you don't want to bother with this for now, I can just set But these examples do build and run just fine for the other GNU and Intel builds. Let me know. -Ross
To unclutter the dashboard and since EMPIRE is not building or running the Panzer examples, I am going to add a commit to disable the Panzer examples for these builds. After I do that, I will update the "Steps to Reproduce" section to explain how to re-enable them.
…L-171) EMPIRE does not build or test these examples and they don't currently build (see #2318). This just removes the clutter on CDash while these things can be fixed offline so that we can focus on the remaining failures.
I just pushed the commit 0be2ed4:
It disables all of the examples. Therefore, when these CUDA builds run again and post to CDash, we should see the Panzer examples gone along with the build errors. I updated the "Steps to Reproduce" section to allow these examples to be re-enabled in case someone wants to fix this. If you get these examples building, can you please just remove the contents of the files:
and leave the files otherwise in place? That way, the next set of tweaks will just involve adding set statements and not require recreating these files.
My commit 0be2ed4 successfully disabled the Panzer examples that are failing to build with CUDA, and this returned 100% passing Panzer tests for these CUDA builds as shown at: (scroll to the bottom to see.) I think we need a status for Trilinos GitHub Issues called something like "disabled_waiting_to_be_fixed".
…-171) The Panzer examples don't build on CUDA as noted in trilinos#2318.
* develop: (261 commits)
  Replace PHX_EVALUATOR_CLASS Macros (trilinos#2322)
  Change time-limit on 'ride' from 4 to 12 hours (TRIL-171)
  Disable Panzer examles for CUDA builds on 'ride' (trilinos#2318, TRIL-171)
  Move ATDMDevEnvUtils.cmake to utils/ subdir (TRIL-171)
  Change from ATDM_CONFIG_USE_MAKEIFLES to ATDM_CONFIG_USE_NINJA (TRIL-171)
  Add ATDM_CONFIG_KNOWN_HOSTNAME var for CTEST_SITE (TRIL-171)
  Misc improvements to checkin-test-atdm.sh (TRIL-171)
  Print the return code from 'ctest -S' command (TRIL-171)
  Tpetra: Sort and merge specializations for Tpetra::CrsGraph (trilinos#2354)
  Remove '#line 1' as workaround for Ninja + CUDA rebuilds (trilinos#2359)
  Added function to return the full vector of data in the indexer rather than elem by elem (trilinos#2355)
  Panzer: Fixing rotation matrix calculation in integration values
  Tpetra: Adding an additive Schwarz test
  checkin-test driver scripts for local testing ATDM builds of Trilinos
  Add list of all supported builds on ride (TRIL-171)
  Add list of all supported builds shiller (TRIL-171)
  Fix location of nvcc_wrapper and other fixups (TRIL-171)
  Improve the 'date' output and start and end (TRIL-171)
  Change name ATDM_HOSTNAME to ATDM_SYSTEM_NAME (TRIL-171)
  Print KOKKOS_ARCH to STDOUT (trilinos#1400)
  ...

Conflicts:
  packages/shylu/shylu_node/tacho/src/TachoExp_NumericTools.hpp
  packages/shylu/shylu_node/tacho/unit-test/Tacho_TestCrsMatrixBase.hpp
I was just informed by @rppawlo that the build and runtime issues that this issue covers actually show real problems for EMPIRE. Therefore, I will remove the disable of these Panzer examples and let them run in the Note that we have the builds Trilinos-atdm-hansen-shiller-cuda-debug-panzer
@rppawlo, are the problems you mentioned with Panzer and CUDA on shiller only runtime problems or also build problems? In any case, starting tomorrow morning, the builds NOTE: I am going to fix TriBITS so that
They are runtime failures unless you enable hierarchic parallelism in sacado and phalanx, then it is build time. At this time, we don't need hierarchic. |
@rppawlo, if we find there are build failures for some Panzer examples, do you want to disable them for now or get this fixed in a shorter time-frame? In any case, we should see how the Panzer example builds and testing looks tomorrow on CDash. |
It's fine either way. We are actively working on it so hopefully even runtime issues may be fixed today. |
@rppawlo, if there is anything different about the way you are configuring Trilinos or running on 'shiller' from what is listed in:
can you please comment on that here? That will include KOKKOS_ARCH ( |
Currently using what is in the empire build scripts repo. no modifications. |
Good news. The current build of Panzer tests and examples with the CUDA builds on 'shiller' passes, with only a few failing tests; details are shown below. I just pushed the commit 2327743 that will allow the Panzer examples to build and test on 'hansen' tomorrow.
The re-enabled Panzer examples now all build in the build But there are four failing Panzer tests shown at:

The tests are all timeouts:

| Name | Status | Time | Details |

Are these tests important for the work with ATDM and EMPIRE? Do they need to be fixed in short order? Otherwise, other than a single failing Anasazi test (the Teuchos test should be fixed now), these Panzer test failures are the only thing blocking the promotion of the build
I'm torn on this. These tests are important to verify order of accuracy of panzer discretizations. We can't really make the test smaller since it is run with a series of successively refined meshes. Can we temporarily disable just on this platform and I will take a look next week? Swamped with other cuda stuff today.
@rppawlo, you only ever disable tests on the platforms where they fail. You never disable a test on a platform where it is persistently passing. Therefore, we never lose any testing. We just remove persistently failing tests that condition people to seeing red so that they stop noticing new failures. Does that seem reasonable? (Once we get the CDash upgrade and update to CMake 3.10.x, we will see these disabled tests on CDash as well, but CDash will not send out emails for them.)

From the analysis below looking at data on CDash, it looks like there is a chance that these tests may be randomly timing out on the CUDA builds. Could this be due to the large startup time for the GPU computations as compared to running on the node threaded or not threaded? Anyway, it is interesting to see the runtimes for the test What do you make of that?

In any case, looking at this data, I think it might be worth experimenting with increasing the timeouts for these tests for CUDA builds and seeing what happens. Should I give that a try? Or is a 10m runtime for these tests in a CUDA build just not worth it when the intel-opt-serial builds run this test in about 40s to a minute? Let me know.

P.S. The data below also suggests that some of these tests may have gotten a lot slower over the last month (from 5m to 9m in some cases). This might be a warning sign of a major performance regression somewhere in Trilinos.

DETAILED NOTES: The tests that failed in these CUDA builds on 'hansen' ('shiller') last night are shown at:

which shows:
The test Other tests like And if you look at that above query, you see that yesterday the test passed in 9m14s for the cuda-debug build but timed out at 10m for the cuda-opt build. And in late February that test passed for the cuda-opt build in under 3m every time. Does this suggest some type of race condition with the CUDA threads or something?

The test It looks like that test got a lot more expensive for the cuda-opt build compared to what it was in early March, as shown in that build. Has something gotten slower in Kokkos, Tpetra, or one of the other packages that these tests rely on in that time?

Anyway, it looks like if we increase the timeout for the test Finally, looking at the test it looks like that test might be randomly taking longer to run and hitting the 10m time limit.
I ran the Panzer tests that are timing out at 10m manually with a much longer timeout, and some of them actually take nearly 40 minutes to complete! (See details below.) But amazingly, these tests actually run faster with the So, we can increase the timeout for these individual tests to something like 45 minutes for CUDA builds, and then we can get them to pass. But do we really need to be running these tests for the CUDA build of Trilinos? Taking 40 minutes for single tests seems like overkill. Do these tests really need to be run on CUDA as well? We are only talking about 4 tests here. They would continue to run in the other builds. But we might need to disable them for the GNU OpenMP builds on 'ride' and 'white' as well, as shown at:

In any case, I will create a PR to increase the timeouts for these tests for CUDA builds so that at least if anyone runs them, they will pass. That will allow these to pass in nightly testing (taking up an extra 30 minutes in wall-clock time to complete these builds). Then later we can perhaps disable them for CUDA (or make them run faster if there is a desire to do that).

DETAILED NOTES: First, reproduce the failing tests for the cuda-opt and cuda-debug builds on 'shiller':
That showed the more detailed results:
and
All of the timeouts are after 10m. Therefore, let's increase the timeout to say 30 min for a CUDA build and see what happens ... I will just run the tests manually increasing the timeout with:
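The command itself did not survive this archive; the following is a plausible sketch of rerunning a single timed-out test with a raised timeout (the test name and timeout value are assumptions, not the original command):

```shell
# From the CUDA build directory: rerun one timed-out test with a 45 min
# timeout instead of the default 10 min. The test name here is an example.
ctest -R PanzerAdaptersSTK_CurlLaplacianExample-ConvTest \
  --timeout 2700 -VV
```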
Bingo, it passed! But note the runtime of about 2300s, i.e. over 38 minutes! That means, to be safe, we would likely need to set timeouts of 45 minutes for these tests for CUDA builds! Now try the
Wow, the
so go figure.
…linos#2318) The tests that have the timeouts raised take nearly 40 minutes to complete for a CUDA buld on hansen/shiller. Therefore, I set the timeout to 45 minutes to be safe.
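The commit's actual diff is not shown here; below is a hypothetical CMake fragment of the kind of change the commit message describes (the test name and the CUDA guard variable are assumptions for illustration, not the real diff):

```cmake
# Sketch only: raise the ctest timeout to 45 min for a long-running
# convergence test when CUDA is enabled. Names are illustrative.
if (TPL_ENABLE_CUDA)
  set_tests_properties(
    PanzerAdaptersSTK_CurlLaplacianExample-ConvTest
    PROPERTIES TIMEOUT 2700)  # 45 minutes
endif()
```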
The Panzer examples have been re-enabled for a long time (since commit 2327743 was pushed to 'develop'), and all of the failing tests have either been addressed one way or another or are being addressed in other issues. The only remaining failing Panzer tests as shown in: are the test Therefore, there is nothing left to be done for this issue, so it is time to close it.

NOTE: Any Panzer tests that have been disabled as part of this and other issues can be found by doing:
which currently returns:
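Both the query and its output were elided from this archive; below is a sketch of the kind of grep that would list ATDM-disabled tests, demonstrated against a mock tweaks file (the directory, file name, and `ATDM_SET_ENABLE` pattern are assumptions standing in for the real `cmake/std/atdm/` tree):

```shell
# Stand-in for the real ATDM tweak files; names here are illustrative only.
mkdir -p /tmp/atdm-demo
cat > /tmp/atdm-demo/CUDA_Tweaks.cmake <<'EOF'
ATDM_SET_ENABLE(PanzerAdaptersSTK_CurlLaplacianExample-ConvTest_DISABLE ON)
EOF
# List every disabled test recorded in the tweak files:
grep -rn "_DISABLE ON" /tmp/atdm-demo/
```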
Summary
CC: @trilinos/nox, @fryeguy52
Next Action Status
Panzer examples build as of 3/29/2018 and any remaining test/example failures are being addressed in other issues #2454 and #2471.
Description
The Panzer examples don't currently build when building with the ATDM CUDA build configuration. There are build failures shown at:
for the builds:
- `Trilinos-atdm-hansen-shiller-cuda-debug`: https://testing.sandia.gov/cdash/index.php?project=Trilinos&parentid=3412693
- `Trilinos-atdm-hansen-shiller-cuda-opt`: https://testing.sandia.gov/cdash/index.php?project=Trilinos&parentid=3412702

This shows a build failure for the file
`packages/panzer/adapters-stk/tutorial/siamCse17/mySourceTerm.cpp`
the beginning of which looks like:

The rest of the build failures are link failures for the executables:
- `PanzerAdaptersSTK_step01.exe`
- `PanzerAdaptersSTK_me_main_driver.exe`
- `PanzerAdaptersSTK_main_driver.exe`
- `PanzerMiniEM_BlockPrec.exe`
All of these show similar link failures that look like:
The reason that the EMPIRE build of Panzer does not show this build failure is that it enables
`-DPanzer_ENABLE_TESTS=ON`
which does not enable the Panzer examples. But the option
`-DTrilinos_ENABLE_TESTS=ON`
causes the default enabling of both Panzer tests and examples (yes, that is confusing behavior, but that is what it is).

The options for addressing this are:
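As an illustration of the enable behavior described above (the two option names are real Trilinos CMake cache variables; everything else in these command sketches is an example, not an actual configure line from this issue):

```shell
# EMPIRE-style configure: turns on Panzer's tests but NOT its examples.
cmake ... -DTrilinos_ENABLE_Panzer=ON -DPanzer_ENABLE_TESTS=ON ...

# ATDM nightly configure: enables tests AND examples for every enabled package.
cmake ... -DTrilinos_ENABLE_Panzer=ON -DTrilinos_ENABLE_TESTS=ON ...
```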
Steps to Reproduce:
The instructions to reproduce these build failures can be found starting at:
and clicking "Reproducing ATDM builds locally" which takes you to:
Basically, on `hansen` or `shiller`, you just clone the Trilinos repo (with its location depicted as `$TRILINOS_DIR` below) and get on the `develop` branch. Then create a build directory and do the configure and build as:
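The configure and build commands themselves were lost from this archive; the following is a sketch of the standard ATDM workflow of that era, under the assumption that the `load-env.sh` script and `ATDMDevEnv.cmake` options file match the ATDM documentation (the build name, generator, and parallelism level are examples):

```shell
cd $TRILINOS_DIR && git checkout develop
mkdir -p ../BUILD && cd ../BUILD
# Load the compiler/TPL environment for one of the failing builds:
source $TRILINOS_DIR/cmake/std/atdm/load-env.sh cuda-debug
cmake -G Ninja \
  -D Trilinos_CONFIGURE_OPTIONS_FILE:STRING=cmake/std/atdm/ATDMDevEnv.cmake \
  -D Trilinos_ENABLE_TESTS=ON -D Trilinos_ENABLE_Panzer=ON \
  $TRILINOS_DIR
make NP=16   # wrapper over ninja; or run: ninja -j16
```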