GPU reg tests crashing on Summit #816

Closed
PaulMullowney opened this issue Mar 4, 2021 · 5 comments

@PaulMullowney
Contributor

The issue occurs in the methods run_face_elem_algorithm, run_face_elem_par_reduce, and run_face_elem_algorithm_nosimd in include/ngp_utils/NgpLoopUtils.h.

In particular, calls like the following crash:
const int nodesPerFace = nodes_per_entity(faceDataNGP, METype::FACE);

in include/ngp_utils/NgpMEUtils.h at line 67, i.e.
Kokkos::parallel_reduce(
  1, KOKKOS_LAMBDA(int, int& n) {
    n = me->nodesPerElement_;
  }, npe);

However, if you replace the nodes_per_entity(faceDataNGP, METype::FACE) call with
const int nodesPerFace = nodes_per_entity(faceDataNGP);
which calls the API above under the hood, the code doesn't crash.

Neither Valgrind nor cuda-memcheck shows anything of note. I've been looking at this for days and cannot find the cause. Perhaps another set of eyes will help.
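
For readers unfamiliar with the pattern, here is a minimal host-only sketch of what the crashing call does: a one-iteration reduction whose body reads a single member through a pointer captured by the lambda. It is not the Nalu-Wind or Kokkos code. `MasterElement`, `DataRequests`, `METype`, and `reduce_on_host` are hypothetical stand-ins, and in the real code the pointer would refer to device memory and the loop would be a `Kokkos::parallel_reduce` launch.

```cpp
#include <cassert>

// Hypothetical stand-ins; these are NOT the real Nalu-Wind types.
struct MasterElement {
  int nodesPerElement_;
};

enum class METype { FACE, ELEM };

struct DataRequests {
  MasterElement* faceME;  // in the real code these would be device pointers
  MasterElement* elemME;
};

// Host-only analogue of a parallel_reduce over [0, n): calls the functor for
// each index and copies what it accumulated into `result`.
template <typename Functor>
void reduce_on_host(int n, Functor f, int& result)
{
  int local = 0;
  for (int i = 0; i < n; ++i) {
    f(i, local);
  }
  result = local;
}

// Pick the master element for the requested entity type, then fetch one
// member through the pointer inside a single-iteration reduction.
int nodes_per_entity(const DataRequests& req, METype type)
{
  MasterElement* me = (type == METype::FACE) ? req.faceME : req.elemME;
  assert(me != nullptr);  // host-side sanity check added for this sketch only
  int npe = 0;
  reduce_on_host(1, [=](int, int& n) { n = me->nodesPerElement_; }, npe);
  return npe;
}
```

The lambda captures `me` by value, so the pointer itself is copied into the functor and only dereferenced inside the loop body. In the sketch that is harmless because everything lives on the host.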

@PaulMullowney
Contributor Author

PaulMullowney commented Mar 5, 2021

  1. No Hypre. ablNeutralNGPTrilinos crashes
#0  0x0000000000000000 in ?? ()
#1  0x00000000110f552c in __nv_hdl_wrapper_t (in=..., this=<optimized out>) at nvcc_internal_extended_lambda_implementation:237
#2  CudaFunctorAdapter (f_=..., this=<optimized out>) at /ccs/proj/cfd116/shreyas/summit/exawind-2020-08/install/gcc-cuda10/trilinos-2021-03-03/include/Cuda/Kokkos_Cuda_Parallel.hpp:2699
#3  functor (functor_in=...) at /ccs/proj/cfd116/shreyas/summit/exawind-2020-08/install/gcc-cuda10/trilinos-2021-03-03/include/Cuda/Kokkos_Cuda_Parallel.hpp:2838
#4  Kokkos::Impl::ParallelReduceAdaptor<Kokkos::RangePolicy<Kokkos::Cuda>, __nv_hdl_wrapper_t<false, false, __nv_dl_tag<int (*)(sierra::nalu::ElemDataRequestsGPU const&, sierra::nalu::METype), &(int sierra::nalu::nodes_per_entity<sierra::nalu::ElemDataRequestsGPU>(sierra::nalu::ElemDataRequestsGPU const&, sierra::nalu::METype)), 1u>, void (int, int&), sierra::nalu::MasterElement*>, int>::execute(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, Kokkos::RangePolicy<Kokkos::Cuda> const&, __nv_hdl_wrapper_t<false, false, __nv_dl_tag<int (*)(sierra::nalu::ElemDataRequestsGPU const&, sierra::nalu::METype), &(int sierra::nalu::nodes_per_entity<sierra::nalu::ElemDataRequestsGPU>(sierra::nalu::ElemDataRequestsGPU const&, sierra::nalu::METype)), 1u>, void (int, int&), sierra::nalu::MasterElement*> const&, int&) (label=..., policy=..., functor=..., return_value=@0x7fffdd0addf0: 0)
    at /ccs/proj/cfd116/shreyas/summit/exawind-2020-08/install/gcc-cuda10/trilinos-2021-03-03/include/Kokkos_Parallel_Reduce.hpp:868
#5  0x0000000011248c40 in parallel_reduce<__nv_hdl_wrapper_t<false, false, __nv_dl_tag<int (*)(const sierra::nalu::ElemDataRequestsGPU&, sierra::nalu::METype), sierra::nalu::nodes_per_entity<sierra::nalu::ElemDataRequestsGPU>, 1>, void(int, int&), sierra::nalu::MasterElement*>, int> (return_value=<optimized out>, functor=..., policy=<optimized out>) at /ccs/proj/cfd116/shreyas/summit/exawind-2020-08/install/gcc-cuda10/trilinos-2021-03-03/include/Kokkos_Parallel_Reduce.hpp:1030
#6  sierra::nalu::nodes_per_entity<sierra::nalu::ElemDataRequestsGPU> (dataReq=..., meType=sierra::nalu::FACE) at ../include/ngp_utils/NgpMEUtils.h:77
#7  0x0000000011258bd8 in sierra::nalu::nalu_ngp::run_face_elem_algorithm<stk::mesh::DeviceMesh, sierra::nalu::nalu_ngp::FieldManager, sierra::nalu::ElemDataRequests, __nv_hdl_wrapper_t<false, false, __nv_dl_tag<void (sierra::nalu::WallFuncGeometryAlg<sierra::nalu::AlgTraitsFaceElem<sierra::nalu::AlgTraitsQuad4, sierra::nalu::AlgTraitsHex8> >::*)(), &sierra::nalu::WallFuncGeometryAlg<sierra::nalu::AlgTraitsFaceElem<sierra::nalu::AlgTraitsQuad4, sierra::nalu::AlgTraitsHex8> >::execute, 1u>, void (sierra::nalu::nalu_ngp::FaceElemSimdData<stk::mesh::DeviceMesh>&), unsigned int const, unsigned int const, sierra::nalu::MasterElement*, sierra::nalu::MasterElement*, bool, double, sierra::nalu::nalu_ngp::impl::ElemFieldOp<stk::mesh::DeviceMesh, stk::mesh::DeviceField<double, stk::mesh::EmptyNgpFieldSyncDebugger>, sierra::nalu::nalu_ngp::FaceElemSimdData<stk::mesh::DeviceMesh> > const, sierra::nalu::nalu_ngp::impl::NodeFieldOp<stk::mesh::DeviceMesh, stk::mesh::DeviceField<double, stk::mesh::EmptyNgpFieldSyncDebugger>, sierra::nalu::nalu_ngp::FaceElemSimdData<stk::mesh::DeviceMesh> > const, sierra::nalu::nalu_ngp::impl::NodeFieldOp<stk::mesh::DeviceMesh, stk::mesh::DeviceField<double, stk::mesh::EmptyNgpFieldSyncDebugger>, sierra::nalu::nalu_ngp::FaceElemSimdData<stk::mesh::DeviceMesh> > const> >(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, sierra::nalu::nalu_ngp::MeshInfo<stk::mesh::DeviceMesh, sierra::nalu::nalu_ngp::FieldManager> const&, sierra::nalu::ElemDataRequests const&, sierra::nalu::ElemDataRequests const&, stk::mesh::Selector const&, __nv_hdl_wrapper_t<false, false, __nv_dl_tag<void (sierra::nalu::WallFuncGeometryAlg<sierra::nalu::AlgTraitsFaceElem<sierra::nalu::AlgTraitsQuad4, sierra::nalu::AlgTraitsHex8> >::*)(), &sierra::nalu::WallFuncGeometryAlg<sierra::nalu::AlgTraitsFaceElem<sierra::nalu::AlgTraitsQuad4, sierra::nalu::AlgTraitsHex8> >::execute, 1u>, void 
(sierra::nalu::nalu_ngp::FaceElemSimdData<stk::mesh::DeviceMesh>&), unsigned int const, unsigned int const, sierra::nalu::MasterElement*, sierra::nalu::MasterElement*, bool, double, sierra::nalu::nalu_ngp::impl::ElemFieldOp<stk::mesh::DeviceMesh, stk::mesh::DeviceField<double, stk::mesh::EmptyNgpFieldSyncDebugger>, sierra::nalu::nalu_ngp::FaceElemSimdData<stk::mesh::DeviceMesh> > const, sierra::nalu::nalu_ngp::impl::NodeFieldOp<stk::mesh::DeviceMesh, stk::mesh::DeviceField<double, stk::mesh::EmptyNgpFieldSyncDebugger>, sierra::nalu::nalu_ngp::FaceElemSimdData<stk::mesh::DeviceMesh> > const, sierra::nalu::nalu_ngp::impl::NodeFieldOp<stk::mesh::DeviceMesh, stk::mesh::DeviceField<double, stk::mesh::EmptyNgpFieldSyncDebugger>, sierra::nalu::nalu_ngp::FaceElemSimdData<stk::mesh::DeviceMesh> > const>) (algName=..., meshInfo=..., faceDataReqs=..., elemDataReqs=..., sel=..., algorithm=...) at ../include/ngp_utils/NgpLoopUtils.h:531
#8  0x0000000011268b7c in sierra::nalu::WallFuncGeometryAlg<sierra::nalu::AlgTraitsFaceElem<sierra::nalu::AlgTraitsQuad4, sierra::nalu::AlgTraitsHex8> >::execute (this=0x5ab78100) at ../src/ngp_algorithms/WallFuncGeometryAlg.C:98
#9  0x00000000113cf3fc in sierra::nalu::NgpAlgDriver::execute (this=0x5ac09d20) at ../src/ngp_algorithms/NgpAlgDriver.C:48
#10 0x000000001071bd6c in compute_geometry (this=0x6c634850) at ../src/Realm.C:2453
#11 sierra::nalu::Realm::initialize_prolog (this=0x6c634850) at ../src/Realm.C:527
#12 0x000000001072f33c in sierra::nalu::Realms::initialize_prolog (this=<optimized out>) at ../src/Realms.C:77
#13 0x000000001074c574 in sierra::nalu::Simulation::initialize (this=0x7fffdd0ecfe8) at ../src/Simulation.C:148
#14 0x00000000100ffa18 in main (argc=<optimized out>, argv=<optimized out>) at ../nalu.C:177

@PaulMullowney
Contributor Author

PaulMullowney commented Mar 5, 2021

  1. With Hypre: oversetRotCylNGPHypre crashes
#0  0x0000000000000000 in ?? ()
#1  0x0000000011165fac in __nv_hdl_wrapper_t (in=..., this=<optimized out>) at nvcc_internal_extended_lambda_implementation:237
#2  CudaFunctorAdapter (f_=..., this=<optimized out>) at /ccs/proj/cfd116/shreyas/summit/exawind-2020-08/install/gcc-cuda10/trilinos-2021-03-03/include/Cuda/Kokkos_Cuda_Parallel.hpp:2699
#3  functor (functor_in=...) at /ccs/proj/cfd116/shreyas/summit/exawind-2020-08/install/gcc-cuda10/trilinos-2021-03-03/include/Cuda/Kokkos_Cuda_Parallel.hpp:2838
#4  Kokkos::Impl::ParallelReduceAdaptor<Kokkos::RangePolicy<Kokkos::Cuda>, __nv_hdl_wrapper_t<false, false, __nv_dl_tag<int (*)(sierra::nalu::ElemDataRequestsGPU const&, sierra::nalu::METype), &(int sierra::nalu::nodes_per_entity<sierra::nalu::ElemDataRequestsGPU>(sierra::nalu::ElemDataRequestsGPU const&, sierra::nalu::METype)), 1u>, void (int, int&), sierra::nalu::MasterElement*>, int>::execute(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, Kokkos::RangePolicy<Kokkos::Cuda> const&, __nv_hdl_wrapper_t<false, false, __nv_dl_tag<int (*)(sierra::nalu::ElemDataRequestsGPU const&, sierra::nalu::METype), &(int sierra::nalu::nodes_per_entity<sierra::nalu::ElemDataRequestsGPU>(sierra::nalu::ElemDataRequestsGPU const&, sierra::nalu::METype)), 1u>, void (int, int&), sierra::nalu::MasterElement*> const&, int&) (label=..., policy=..., functor=..., return_value=@0x7fffc069b600: 0)
    at /ccs/proj/cfd116/shreyas/summit/exawind-2020-08/install/gcc-cuda10/trilinos-2021-03-03/include/Kokkos_Parallel_Reduce.hpp:868
#5  0x00000000113f9a70 in parallel_reduce<__nv_hdl_wrapper_t<false, false, __nv_dl_tag<int (*)(const sierra::nalu::ElemDataRequestsGPU&, sierra::nalu::METype), sierra::nalu::nodes_per_entity<sierra::nalu::ElemDataRequestsGPU>, 1>, void(int, int&), sierra::nalu::MasterElement*>, int> (return_value=<optimized out>, functor=..., policy=<optimized out>) at /ccs/proj/cfd116/shreyas/summit/exawind-2020-08/install/gcc-cuda10/trilinos-2021-03-03/include/Kokkos_Parallel_Reduce.hpp:1030
#6  sierra::nalu::nodes_per_entity<sierra::nalu::ElemDataRequestsGPU> (dataReq=..., meType=sierra::nalu::FACE) at ../include/ngp_utils/NgpMEUtils.h:77
#7  0x000000001140a6c4 in sierra::nalu::nalu_ngp::run_face_elem_algorithm<stk::mesh::DeviceMesh, sierra::nalu::nalu_ngp::FieldManager, sierra::nalu::ElemDataRequests, __nv_hdl_wrapper_t<false, false, __nv_dl_tag<void (sierra::nalu::NodalGradPOpenBoundary<sierra::nalu::AlgTraitsFaceElem<sierra::nalu::AlgTraitsQuad4, sierra::nalu::AlgTraitsHex8> >::*)(), &sierra::nalu::NodalGradPOpenBoundary<sierra::nalu::AlgTraitsFaceElem<sierra::nalu::AlgTraitsQuad4, sierra::nalu::AlgTraitsHex8> >::execute, 1u>, void (sierra::nalu::nalu_ngp::FaceElemSimdData<stk::mesh::DeviceMesh>&), sierra::nalu::MasterElement*, unsigned int const, unsigned int const, unsigned int const, unsigned int const, unsigned int const, unsigned int const, bool const, bool const, sierra::nalu::MasterElement*, double const, sierra::nalu::nalu_ngp::impl::NodeFieldOp<stk::mesh::DeviceMesh, stk::mesh::DeviceField<double, stk::mesh::EmptyNgpFieldSyncDebugger>, sierra::nalu::nalu_ngp::FaceElemSimdData<stk::mesh::DeviceMesh> > const> >(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, sierra::nalu::nalu_ngp::MeshInfo<stk::mesh::DeviceMesh, sierra::nalu::nalu_ngp::FieldManager> const&, sierra::nalu::ElemDataRequests const&, sierra::nalu::ElemDataRequests const&, stk::mesh::Selector const&, __nv_hdl_wrapper_t<false, false, __nv_dl_tag<void (sierra::nalu::NodalGradPOpenBoundary<sierra::nalu::AlgTraitsFaceElem<sierra::nalu::AlgTraitsQuad4, sierra::nalu::AlgTraitsHex8> >::*)(), &sierra::nalu::NodalGradPOpenBoundary<sierra::nalu::AlgTraitsFaceElem<sierra::nalu::AlgTraitsQuad4, sierra::nalu::AlgTraitsHex8> >::execute, 1u>, void (sierra::nalu::nalu_ngp::FaceElemSimdData<stk::mesh::DeviceMesh>&), sierra::nalu::MasterElement*, unsigned int const, unsigned int const, unsigned int const, unsigned int const, unsigned int const, unsigned int const, bool const, bool const, sierra::nalu::MasterElement*, double const, sierra::nalu::nalu_ngp::impl::NodeFieldOp<stk::mesh::DeviceMesh, 
stk::mesh::DeviceField<double, stk::mesh::EmptyNgpFieldSyncDebugger>, sierra::nalu::nalu_ngp::FaceElemSimdData<stk::mesh::DeviceMesh> > const>) (algName=..., meshInfo=..., faceDataReqs=..., elemDataReqs=..., sel=..., algorithm=...) at ../include/ngp_utils/NgpLoopUtils.h:531
#8  0x00000000114193ac in sierra::nalu::NodalGradPOpenBoundary<sierra::nalu::AlgTraitsFaceElem<sierra::nalu::AlgTraitsQuad4, sierra::nalu::AlgTraitsHex8> >::execute (this=0x4b343180) at ../src/ngp_algorithms/NodalGradPOpenBoundaryAlg.C:119
#9  0x00000000114401dc in sierra::nalu::NgpAlgDriver::execute (this=0x4b9ecec8) at ../src/ngp_algorithms/NgpAlgDriver.C:48
#10 0x00000000105f54dc in compute_projected_nodal_gradient (this=0x4b9eccd0) at ../src/LowMachEquationSystem.C:3801
#11 sierra::nalu::LowMachEquationSystem::solve_and_update (this=0x4b9ec040) at ../src/LowMachEquationSystem.C:704
#12 0x000000001053d920 in sierra::nalu::EquationSystems::solve_and_update (this=0x54f5b2b8) at ../src/EquationSystems.C:771
#13 0x0000000010727d24 in sierra::nalu::Realm::advance_time_step (this=0x54f5b060) at ../src/Realm.C:1865
#14 0x00000000107aae40 in sierra::nalu::TimeIntegrator::integrate_realm (this=0x4c0a6880) at ../src/TimeIntegrator.C:342
#15 0x00000000107572b0 in sierra::nalu::Simulation::run (this=0x7fffc06c16c8) at ../src/Simulation.C:173
#16 0x0000000010104964 in main (argc=<optimized out>, argv=<optimized out>) at ../nalu.C:178

@PaulMullowney
Contributor Author

Each of these tests passes when Trilinos and Nalu are built in Debug.

@PaulMullowney
Contributor Author

PaulMullowney commented Mar 5, 2021

A little more output from cuda-memcheck, with the right environment variables set:

Kokkos::Cuda::initialize ERROR: likely mismatch of architecture
[a27n02:31735] *** Process received signal ***
[a27n02:31735] Signal: Aborted (6)
[a27n02:31735] Signal code:  (-6)
[a27n02:31735] [ 0] [0x2000000504d8]
[a27n02:31735] [ 1] /lib64/libc.so.6(abort+0x2b4)[0x200027a42094]
[a27n02:31735] [ 2] /gpfs/alpine/cfd116/scratch/mullowne/nalu-wind/build_gpu_master/naluX[0x16691e18]
[a27n02:31735] [ 3] /gpfs/alpine/cfd116/scratch/mullowne/nalu-wind/build_gpu_master/naluX[0x166b23a8]
[a27n02:31735] [ 4] /gpfs/alpine/cfd116/scratch/mullowne/nalu-wind/build_gpu_master/naluX[0x1668d4b0]
[a27n02:31735] [ 5] /gpfs/alpine/cfd116/scratch/mullowne/nalu-wind/build_gpu_master/naluX[0x10103fc0]
[a27n02:31735] [ 6] /lib64/libc.so.6(+0x25200)[0x200027a25200]
[a27n02:31735] [ 7] /lib64/libc.so.6(__libc_start_main+0xc4)[0x200027a253f4]
[a27n02:31735] *** End of error message ***
========= CUDA-MEMCHECK
========= Error: process didn't terminate successfully
========= Fatal UVM GPU fault of type invalid pte due to invalid address
=========     during read access to address 0x201e6b840000
=========
========= Fatal UVM GPU fault of type invalid pte due to invalid address
=========     during read access to address 0x201e6b840000
=========
========= No CUDA-MEMCHECK results found

@PaulMullowney
Contributor Author

PaulMullowney commented May 3, 2021

Just built with Trilinos master (abfd14fbe0d) and develop (c00ff3bb339). The crashes no longer happen. Not sure what to make of this, as the Nalu code hasn't changed. Oh well.
