Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Crash in TableFunction on GPU #1394

Closed
francoishamon opened this issue Apr 15, 2021 · 4 comments · Fixed by #1397
Closed

Crash in TableFunction on GPU #1394

francoishamon opened this issue Apr 15, 2021 · 4 comments · Fixed by #1397
Assignees
Labels
type: bug Something isn't working

Comments

@francoishamon
Copy link
Contributor

francoishamon commented Apr 15, 2021

Describe the bug
On GPU platforms other than Lassen, the integrated tests that use constitutive models based on TableFunction (dead-oil, relperms) fail at initialization with the following error:

** StackTrace of 24 frames **
Frame 0: LvArray::system::stackTraceHandler(int, bool) 
Frame 1: /home/jfranc/code/cxx/GEOSX/build-gcc8-cuda-debug/lib/libgeosx_core.so 
Frame 2: /home/jfranc/code/cxx/GEOSX/build-gcc8-cuda-debug/lib/libgeosx_core.so 
Frame 3: /lib/x86_64-linux-gnu/libc.so.6 
Frame 4: gsignal 
Frame 5: abort 
Frame 6: /lib/x86_64-linux-gnu/libstdc++.so.6 
Frame 7: /lib/x86_64-linux-gnu/libstdc++.so.6 
Frame 8: /lib/x86_64-linux-gnu/libstdc++.so.6 
Frame 9: __gxx_personality_v0 
Frame 10: /lib/x86_64-linux-gnu/libgcc_s.so.1 
Frame 11: _Unwind_Resume 
Frame 12: LvArray::ChaiBuffer<geosx::TableFunction::KernelWrapper>::free() 
Frame 13: void LvArray::bufferManipulation::free<LvArray::ChaiBuffer<geosx::TableFunction::KernelWrapper> >(LvArray::ChaiBuffer<geosx::TableFunction::KernelWrapper>&, long) 
Frame 14: LvArray::Array<geosx::TableFunction::KernelWrapper, 1, camp::int_seq<long, 0l>, long, LvArray::ChaiBuffer>::~Array() 
Frame 15: geosx::constitutive::DeadOilFluid::~DeadOilFluid() 
Frame 16: geosx::constitutive::DeadOilFluid::~DeadOilFluid() 
Frame 17: std::enable_if<std::is_same<geosx::dataRepository::Group*, geosx::dataRepository::Group*>::value, void>::type geosx::MappedVector<geosx::dataRepository::Group, geosx::dataRepository::Group*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, long>::deleteValue<geosx::dataRepository::Group*>(long) 
Frame 18: geosx::MappedVector<geosx::dataRepository::Group, geosx::dataRepository::Group*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, long>::clear() 
Frame 19: geosx::MappedVector<geosx::dataRepository::Group, geosx::dataRepository::Group*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, long>::~MappedVector() 
Frame 20: geosx::dataRepository::Group::~Group() 
Frame 21: geosx::ElementSubRegionBase::~ElementSubRegionBase() 
Frame 22: geosx::CellBlock::~CellBlock() 
Frame 23: geosx::CellElementSubRegion::~CellElementSubRegion() 
Frame 24: geosx::CellElementSubRegion::~CellElementSubRegion()

so I must have made a mistake somewhere, but did not notice it on Lassen (everything works there).

To Reproduce
Steps to reproduce the behavior:

  1. Compile GEOSX (gcc) on Stanford GPU's cluster, or Pecan (Total), or Pangea 3 (Total)
  2. Run an integrated test that uses TableFunction
    I did not manage to reproduce the crash on Lassen.

Expected behavior
Should work fine on every platform.

Platform (please complete the following information):

  • Stanford GPU's cluster, or Pecan (Total), or Pangea 3 (Total)
  • gcc
  • develop
@francoishamon francoishamon added type: bug Something isn't working priority: critical labels Apr 15, 2021
@francoishamon francoishamon self-assigned this Apr 15, 2021
@klevzoff
Copy link
Contributor

klevzoff commented Apr 15, 2021

This stack trace is actually just the aftermath of geosx trying to quit, the error first happens earlier in the FluidUpdateKernel.

From cuda-gdb:

CUDA Exception: Warp Illegal Address
The exception was triggered at PC 0x55557849c4b0 (ArrayOfArraysView.hpp:409)

Thread 1 "geosx" received signal CUDA_EXCEPTION_14, Warp Illegal Address.
[Switching focus to CUDA kernel 0, grid 37, block (0,0,0), thread (32,0,0), device 0, sm 0, warp 1, lane 0]
0x000055557849c4c0 in LvArray::ArrayOfArraysView<double const, int const, true, LvArray::ChaiBuffer>::operator[] (this=0x7fff3e615c08, i=0)
    at /home/klevtsov/GEOSX/src/coreComponents/LvArray/src/ArrayOfArraysView.hpp:409
409	    return ArraySlice< T, 1, 0, INDEX_TYPE_NC >( m_values.data() + m_offsets[ i ], &m_sizes[ i ], nullptr );
(cuda-gdb) bt
#0  0x000055557849c4c0 in LvArray::ArrayOfArraysView<double const, int const, true, LvArray::ChaiBuffer>::operator[] (this=0x7fff3e615c08, i=0)
    at /home/klevtsov/GEOSX/src/coreComponents/LvArray/src/ArrayOfArraysView.hpp:409
#1  0x00005555787cb440 in geosx::TableFunction::KernelWrapper::compute<double*, double*>(double* const&, double&, double*&&) const (this=0x7fff3e615c00, input=<optimized out>, value=<optimized out>, 
    derivatives=<optimized out>) at /home/klevtsov/GEOSX/src/coreComponents/functions/TableFunction.hpp:338
#2  0x0000555578c48bf0 in geosx::constitutive::DeadOilFluidUpdate::computeViscosities (this=0x7fffe2fff518, pressure=5000000, phaseVisc=0x7fffe2ffea70, dPhaseVisc_dPres=0x7fffe2ffeaa0, 
    dPhaseVisc_dGlobalCompFrac=0x7fffe2ffeb00) at /home/klevtsov/GEOSX/src/coreComponents/constitutive/fluid/DeadOilFluid.hpp:571
#3  geosx::constitutive::DeadOilFluidUpdate::compute (this=0x7fffe2fff518, pressure=<optimized out>, temperature=<optimized out>, composition=0x7fffe2ffec68, phaseFraction=0x7fffe2ffe830, 
    dPhaseFraction_dPressure=0x7fffe2ffe860, dPhaseFraction_dTemperature=0x7fffe2ffe890, dPhaseFraction_dGlobalCompFraction=0x7fffe2ffe8c0, phaseDensity=0x7fffe2ffe8f0, 
    dPhaseDensity_dPressure=0x7fffe2ffe920, dPhaseDensity_dTemperature=0x7fffe2ffe950, dPhaseDensity_dGlobalCompFraction=0x7fffe2ffe980, phaseMassDensity=0x7fffe2ffe9b0, 
    dPhaseMassDensity_dPressure=0x7fffe2ffe9e0, dPhaseMassDensity_dTemperature=0x7fffe2ffea10, dPhaseMassDensity_dGlobalCompFraction=0x7fffe2ffea40, phaseViscosity=0x7fffe2ffea70, 
    dPhaseViscosity_dPressure=0x7fffe2ffeaa0, dPhaseViscosity_dTemperature=0x7fffe2ffead0, dPhaseViscosity_dGlobalCompFraction=0x7fffe2ffeb00, phaseCompFraction=0x7fffe2ffeb30, 
    dPhaseCompFraction_dPressure=0x7fffe2ffeb60, dPhaseCompFraction_dTemperature=0x7fffe2ffeb90, dPhaseCompFraction_dGlobalCompFraction=0x7fffe2ffebc0, totalDensity=<optimized out>, 
    dTotalDensity_dPressure=<optimized out>, dTotalDensity_dTemperature=<optimized out>, dTotalDensity_dGlobalCompFraction=0x7fffe2ffec38, pressure=<optimized out>, temperature=<optimized out>, 
    totalDensity=<optimized out>, dTotalDensity_dPressure=<optimized out>, dTotalDensity_dTemperature=<optimized out>) at /home/klevtsov/GEOSX/src/coreComponents/constitutive/fluid/DeadOilFluid.hpp:679
#4  geosx::constitutive::DeadOilFluidUpdate::update (this=<optimized out>, k=<optimized out>, q=<optimized out>, pressure=<optimized out>, temperature=<optimized out>, composition=0x7fffe2ffec68, 
    this=<optimized out>, k=<optimized out>, q=<optimized out>, pressure=<optimized out>, temperature=<optimized out>) at /home/klevtsov/GEOSX/src/coreComponents/constitutive/fluid/DeadOilFluid.hpp:175
#5  void geosx::CompositionalMultiphaseBaseKernels::FluidUpdateKernel::launch<RAJA::policy::cuda::cuda_exec<256ul, false>, geosx::constitutive::DeadOilFluidUpdate>(int, geosx::constitutive::DeadOilFluidUpdate const&, LvArray::ArrayView<double const, 1, 0, int, LvArray::ChaiBuffer> const&, LvArray::ArrayView<double const, 1, 0, int, LvArray::ChaiBuffer> const&, double, LvArray::ArrayView<double const, 2, 1, int, LvArray::ChaiBuffer> const&)::{lambda(int)#1}::operator()(int) const (this=<optimized out>, k=<optimized out>)
    at /home/klevtsov/GEOSX/src/coreComponents/physicsSolvers/fluidFlow/CompositionalMultiphaseBaseKernels.hpp:159
#6  0x000055557896f120 in RAJA::policy::cuda::impl::forall_cuda_kernel<256ul, RAJA::Iterators::numeric_iterator<int, int, int*>, void geosx::CompositionalMultiphaseBaseKernels::FluidUpdateKernel::launch<RAJA::policy::cuda::cuda_exec<256ul, false>, geosx::constitutive::DeadOilFluidUpdate>(int, geosx::constitutive::DeadOilFluidUpdate const&, LvArray::ArrayView<double const, 1, 0, int, LvArray::ChaiBuffer> const&, LvArray::ArrayView<double const, 1, 0, int, LvArray::ChaiBuffer> const&, double, LvArray::ArrayView<double const, 2, 1, int, LvArray::ChaiBuffer> const&)::{lambda(int)#1}, int>(void geosx::CompositionalMultiphaseBaseKernels::FluidUpdateKernel::launch<RAJA::policy::cuda::cuda_exec<256ul, false>, geosx::constitutive::DeadOilFluidUpdate>(int, geosx::constitutive::DeadOilFluidUpdate const&, LvArray::ArrayView<double const, 1, 0, int, LvArray::ChaiBuffer> const&, LvArray::ArrayView<double const, 1, 0, int, LvArray::ChaiBuffer> const&, double, LvArray::ArrayView<double const, 2, 1, int, LvArray::ChaiBuffer> const&)::{lambda(int)#1}, RAJA::Iterators::numeric_iterator<int, int, int*>, int)<<<(1,1,1),(256,1,1)>>> (loop_body=..., idx=..., length=75)
    at /home/geosx/thirdPartyLibs/install-gcc8-cuda-release/raja/include/RAJA/policy/cuda/forall.hpp:146

Basically, the TableFunction::KernelWrapper object is trying to get a slice of m_coordinates which is an ArrayOfArraysView. However, its buffers have not been properly moved onto device, for example trying to print m_coordinates.m_offsets.m_pointer[0] in the debugger throws a memory access error. Same for m_values.m_pointer[0].

All in all, it looks as if the TableFunction::KernelWrapper object has been bitwise-copied onto device, instead of properly copy-constructed. It just used to work on Lassen because of its magic access to host memory.

@klevzoff
Copy link
Contributor

The root of the problem is the same issue @corbett5 had to work around for nested arrays. Whenever ChaiBuffer moves its data between memory spaces, it requests chai::ArrayManager to do the job, which in turn calls to umpire::ResourceManager, which does the memory transfer, but without invoking copy constructors on the objects being copied (it's a void * API, so no type information is available). ChaiBuffer then tries to call move( space, touch ) on its contained objects to deal with nested arrays correctly. I plugged the whole by adding a similar move( space, touch ) function to TableFunction::KernelWrapper (see #1397), but I don't know if that's a viable strategy going forward. We should probably still consider bringing something like chai::managed_ptr over.

@corbett5
Copy link
Contributor

corbett5 commented Apr 16, 2021

@francoishamon when you get errors like this in the future try grabbing an allocation with lalloc --atsdisable which turns off the magic access to host memory that @klevzoff mentioned.

@francoishamon
Copy link
Contributor Author

Ok I see. Thanks a lot for fixing the problem @klevzoff !

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
type: bug Something isn't working
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants