Bugfixes for CUDA build (multiphase model kernel wrappers, HypreVector::extract) #1397

klevzoff · 2021-04-16T03:33:45Z

Resolves #1394

corbett5

Nice! I imagine this is the result of a bug that hopefully wasn't too hard to figure out.

francoishamon · 2021-04-16T04:29:32Z

@klevzoff Do we also need to add the move functions to the PVTFunctions because of the code below:
https://github.com/GEOSX/GEOSX/blob/b9df14663d1d188e3ea554086d65bea64164b9c2/src/coreComponents/constitutive/fluid/MultiPhaseMultiComponentFluid.hpp#L213-L226
if we want the moves to be propagated to the TableFunctions that are inside? I am checking the integrated tests with your fix on my side but I am a little bit behind.

klevzoff · 2021-04-16T04:33:55Z

@francoishamon yes, I would imagine that we do. I will do it in this PR.

(I actually got stuck for a bit too, because the fix gets me past the original crash, but then the run fails on linear solve - direct solvers don't work when building with hypre-GPU enabled, which is what I had in my setup - so now I'm taking some time to rebuild everything without hypre-GPU and make sure the entire thing runs without fail)

francoishamon · 2021-04-16T05:39:17Z

Ok for the integrated tests, here is what I see on my side (which is a Total machine that I started using today, so I don't know if what I am writing can be trusted). Considering for instance: deadoil_3ph_baker_1d.xml

Without the move fix: the code crashes
This branch: the code does not crash, but the integrated test does not converge. The true compositional tests (which do not use TableFunction) do not converge either.
develop after PR WIP - circular dependencies / object libraries build #1352, with the move fix: the code does not crash, the integrated test works.

Sorry this is a little bit confusing :) It was the source of some headache this afternoon, maybe I should I have tried Lassen with disabled unified memory instead.

klevzoff · 2021-04-16T23:50:34Z

@francoishamon The last commit should fix the non-convergence issue. At least it gets deadoil_3ph_baker_1d.xml to complete and match a non-CUDA build.

rrsettgast · 2021-04-16T23:56:51Z

src/coreComponents/functions/TableFunction.hpp

+     * @note This function exists to enable holding KernelWrapper objects in an ArrayView
+     *       and have their contents properly moved between memory spaces.
+     */
+    void move( LvArray::MemorySpace const space, bool const touch = false )


I have a preference to stop using default parameters in most cases. I would consider just dropping the default parameter here if it makes sense in the use cases.

I copied the signature from LvArray::ArrayView just to be safe. I changed the default from true to false though, because for these objects it doesn't make sense to touch them, they're basically meant to be const in the kernel.

In this case we can just drop the default, because it is only meant to be called by ArrayView, which always does it with both arguments.

rrsettgast · 2021-04-17T00:00:09Z

src/coreComponents/linearAlgebra/interfaces/hypre/HypreVector.cpp

+void HypreVector::extract( arrayView1d< real64 > const & localVector ) const
+{
+  GEOSX_LAI_ASSERT_EQ( localSize(), localVector.size() );
+  real64 const * const data = extractLocalVector();
+  forAll< execPolicy >( localSize(), [=] GEOSX_HOST_DEVICE ( HYPRE_Int const i )
+  {
+    localVector[i] = data[i];
+  } );
+}


I don't see the use case in this PR, but I would have a preference to name this something like extractCopy to be clear about what is happening...not that it shouldn't be clear.

This method is used in SolverBase::nonlinearImplicitStep to copy the solution out of LAI vector into an LvArray local vector. It has existed for a long time, but previously it was using the default implementation provided in VectorBase, which simply does a std::copy on the host... but it doesn't mark the data as touched on host, and things go wrong from there. I probably agree with you on the name, I was trying to make a minimal bugfix, but can also use this PR for extra cleanup.

francoishamon · 2021-04-17T01:05:19Z

@francoishamon The last commit should fix the non-convergence issue. At least it gets deadoil_3ph_baker_1d.xml to complete and match a non-CUDA build.

Thanks @klevzoff I am re-running the tests (I do not have your latest two commits about uncrustify and removing default parameters). The problem seems fixed for the DeadOilFluid and TableRelativePermeability models. But I still get an error for CO2 models. For instance, for co2_flux_3d.xml:

CUDAassert: an illegal memory access was encountered /data/gpfs/Users/GEOSX/thirdPartyLibs/install-pecan-release/raja/include/RAJA/policy/cuda/MemUtils_CUDA.hpp 183            
terminate called after throwing an instance of 'umpire::util::Exception'                                                                                                                 
  what():  ! Umpire Exception [/data/gpfs/Users/GEOSX/thirdPartyLibs/build-pecan-release/chai/src/chai/src/tpl/umpire/src/umpire/alloc/CudaMallocAllocator.hpp:56]:  deallocate
 cudaFree( ptr = 0x7f4dea094800 ) failed with error: an illegal memory access was encountered

that looks very much like the crash that we had before for the other models. I am looking at the CO2 models to see if I understand what is going there.

francoishamon · 2021-04-17T01:32:15Z

Ah, we may have to move the m_componentMolarWeight arrays that live in the base classes, like PVTFunctionBase and FlashModelBase. When I move them, the code seems to run, but I need to confirm on the other tests

francoishamon · 2021-04-17T02:46:14Z

Ok I confirm that we also need to move manually the m_componentMolarWeight by hand, and then everything is properly moved. @klevzoff should I commit my changes to your branch?

With this change I get the following behavior for the integrated tests:

Pecan (Total machine): I can run the integrated tests. I did not set up the geosxats script but I think everything is working well
Lassen with --atsdisable: I don't get a crash in DeadOilFluid, TableRelativePermeability, CO2-brine models, but the integrated tests are not converging, as if some data was still not moved properly. @klevzoff @corbett5 should we care (I assume that yes)? Is there another function in the LAI that I have to modify for Lassen with --atsdisable specifically?

klevzoff · 2021-04-17T03:23:40Z

@klevzoff should I commit my changes to your branch?

Yes, please

corbett5 · 2021-04-17T03:48:23Z

@francoishamon One thing to note is that when run through geosxats with --atsdisable the integrated tests should fail (as in segfault). This is because the integrated tests don't use pinned memory (it's a bug) and because when we don't use pinned memory we pack into non-pinned host buffers which is only supported with ATS (another bug). #1285

klevzoff · 2021-04-17T07:45:09Z

@klevzoff should I commit my changes to your branch?

Yes, please

Actually, sorry, I just pushed my own version of this change (+ a little more cleanup) because I needed to sync from laptop to GPU machine to investigate convergence further.

klevzoff · 2021-04-17T08:02:28Z

Lassen with --atsdisable: I don't get a crash in DeadOilFluid, TableRelativePermeability, CO2-brine models, but the integrated tests are not converging, as if some data was still not moved properly. @klevzoff @corbett5 should we care (I assume that yes)? Is there another function in the LAI that I have to modify for Lassen with --atsdisable specifically?

I checked all GPU-enabled (i.e. non-PVTPackage) multiphase tests, and they're all working properly on our V100 machine as well. We should probably be a little worried and try to get to the bottom of it (maybe not right now, but eventually). The easiest thing to start with is set logLevel="3" on the solver for some small-ish test (10 cells), have it dump all the linear systems to files for both a CUDA and non-CUDA build, and compare and see where it starts diffing.

(Btw, I'm assuming you're testing a build that has ENABLE_HYPRE_CUDA=OFF)

rrsettgast · 2021-04-19T05:57:02Z

@klevzoff You can add the lambda execution context to the uncrustify config file to get rid of these indentations. I think the reason is that uncrustify doesn't recognize the lambda, so I had to add the qualifiers so that it wouldn't drop out of the lambda context.

https://github.com/GEOSX/GEOSX/blob/b9df14663d1d188e3ea554086d65bea64164b9c2/src/uncrustify.cfg#L2153

rrsettgast · 2021-04-19T17:21:43Z

src/coreComponents/linearAlgebra/interfaces/hypre/HypreMatrix.cpp

@@ -787,7 +787,7 @@ void HypreMatrix::scale( real64 const scalingFactor )
  HYPRE_Int const diag_nnz = hypre_CSRMatrixNumNonzeros( prt_diag_CSR );
  HYPRE_Real * const ptr_diag_data = hypre_CSRMatrixData( prt_diag_CSR );

-  forAll< execPolicy >( diag_nnz, [=] GEOSX_HOST_DEVICE ( HYPRE_Int const i )
+  forAll< hypre::execPolicy >( diag_nnz, [=] GEOSX_HYPRE_HOST_DEVICE ( HYPRE_Int const i )


where is hypre::execPolicy defined and how is it set?

Here: https://github.com/GEOSX/GEOSX/blob/8cd1d4c98912630f0b418c3086cfab48106032c3/src/coreComponents/linearAlgebra/interfaces/hypre/HypreUtils.hpp#L67-L73

Add move() function to TableFunction::KernelWrapper

74eb0d7

klevzoff requested review from francoishamon and corbett5 April 16, 2021 03:33

klevzoff self-assigned this Apr 16, 2021

corbett5 approved these changes Apr 16, 2021

View reviewed changes

klevzoff mentioned this pull request Apr 16, 2021

Crash in TableFunction on GPU #1394

Closed

francoishamon approved these changes Apr 16, 2021

View reviewed changes

klevzoff added 2 commits April 16, 2021 15:51

Add move function to PVTFunctions kernel wrappers

43e9a81

Fix HypreVector::extract for CUDA builds

869c127

klevzoff marked this pull request as ready for review April 16, 2021 23:48

rrsettgast approved these changes Apr 17, 2021

View reviewed changes

klevzoff changed the title ~~Add move capability to TableFunction::KernelWrapper~~ Bugfixes for CUDA build (multiphase model kernel wrappers, HypreVector::extract) Apr 17, 2021

klevzoff added effort: 1 day type: bug Something isn't working ci: run CUDA builds Allows to triggers (costly) CUDA jobs labels Apr 17, 2021

klevzoff added 2 commits April 16, 2021 17:45

Uncrustify

25a9241

Remove default parameter value in PVT kernel wrapper move() methods

ecdf070

Move missing data members + make virtual

f177b3f

Fix CUDA HD errors in Hypre interface

3543d18

klevzoff added 2 commits April 17, 2021 01:32

Uncrustify has gone mad

3a17005

Return to sanity

8cd1d4c

klevzoff mentioned this pull request Apr 19, 2021

Cleanup hypre build GEOS-DEV/thirdPartyLibs#156

Merged

rrsettgast reviewed Apr 19, 2021

View reviewed changes

klevzoff added 2 commits April 19, 2021 23:13

Update TPL tag

2fad048

Update TPL tag again

ae3cddc

klevzoff merged commit 4efc0a6 into develop Apr 20, 2021

klevzoff deleted the bugfix/table-function-device-move branch April 20, 2021 19:29

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Bugfixes for CUDA build (multiphase model kernel wrappers, HypreVector::extract) #1397

Bugfixes for CUDA build (multiphase model kernel wrappers, HypreVector::extract) #1397

klevzoff commented Apr 16, 2021

corbett5 left a comment

francoishamon commented Apr 16, 2021

klevzoff commented Apr 16, 2021 •

edited

Loading

francoishamon commented Apr 16, 2021 •

edited

Loading

klevzoff commented Apr 16, 2021

rrsettgast Apr 16, 2021

klevzoff Apr 17, 2021

rrsettgast Apr 17, 2021

klevzoff Apr 17, 2021

francoishamon commented Apr 17, 2021 •

edited

Loading

francoishamon commented Apr 17, 2021

francoishamon commented Apr 17, 2021 •

edited

Loading

klevzoff commented Apr 17, 2021

corbett5 commented Apr 17, 2021

klevzoff commented Apr 17, 2021

klevzoff commented Apr 17, 2021 •

edited

Loading

rrsettgast commented Apr 19, 2021

rrsettgast Apr 19, 2021

klevzoff Apr 19, 2021

rrsettgast Apr 19, 2021

Bugfixes for CUDA build (multiphase model kernel wrappers, HypreVector::extract) #1397

Bugfixes for CUDA build (multiphase model kernel wrappers, HypreVector::extract) #1397

Conversation

klevzoff commented Apr 16, 2021

corbett5 left a comment

Choose a reason for hiding this comment

francoishamon commented Apr 16, 2021

klevzoff commented Apr 16, 2021 • edited Loading

francoishamon commented Apr 16, 2021 • edited Loading

klevzoff commented Apr 16, 2021

rrsettgast Apr 16, 2021

Choose a reason for hiding this comment

klevzoff Apr 17, 2021

Choose a reason for hiding this comment

rrsettgast Apr 17, 2021

Choose a reason for hiding this comment

klevzoff Apr 17, 2021

Choose a reason for hiding this comment

francoishamon commented Apr 17, 2021 • edited Loading

francoishamon commented Apr 17, 2021

francoishamon commented Apr 17, 2021 • edited Loading

klevzoff commented Apr 17, 2021

corbett5 commented Apr 17, 2021

klevzoff commented Apr 17, 2021

klevzoff commented Apr 17, 2021 • edited Loading

rrsettgast commented Apr 19, 2021

rrsettgast Apr 19, 2021

Choose a reason for hiding this comment

klevzoff Apr 19, 2021

Choose a reason for hiding this comment

rrsettgast Apr 19, 2021

Choose a reason for hiding this comment

klevzoff commented Apr 16, 2021 •

edited

Loading

francoishamon commented Apr 16, 2021 •

edited

Loading

francoishamon commented Apr 17, 2021 •

edited

Loading

francoishamon commented Apr 17, 2021 •

edited

Loading

klevzoff commented Apr 17, 2021 •

edited

Loading