Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bugfixes for CUDA build (multiphase model kernel wrappers, HypreVector::extract) #1397

Merged
merged 11 commits into from
Apr 20, 2021

Conversation

klevzoff
Copy link
Contributor

Resolves #1394

@klevzoff klevzoff self-assigned this Apr 16, 2021
Copy link
Contributor

@corbett5 corbett5 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice! I imagine this is the result of a bug that hopefully wasn't too hard to figure out.

@francoishamon
Copy link
Contributor

@klevzoff Do we also need to add the move functions to the PVTFunctions because of the code below:
https://github.com/GEOSX/GEOSX/blob/b9df14663d1d188e3ea554086d65bea64164b9c2/src/coreComponents/constitutive/fluid/MultiPhaseMultiComponentFluid.hpp#L213-L226
if we want the moves to be propagated to the TableFunctions that are inside? I am checking the integrated tests with your fix on my side but I am a little bit behind.

@klevzoff
Copy link
Contributor Author

klevzoff commented Apr 16, 2021

@francoishamon yes, I would imagine that we do. I will do it in this PR.

(I actually got stuck for a bit too, because the fix gets me past the original crash, but then the run fails on linear solve - direct solvers don't work when building with hypre-GPU enabled, which is what I had in my setup - so now I'm taking some time to rebuild everything without hypre-GPU and make sure the entire thing runs without fail)

@francoishamon
Copy link
Contributor

francoishamon commented Apr 16, 2021

Ok for the integrated tests, here is what I see on my side (which is a Total machine that I started using today, so I don't know if what I am writing can be trusted). Considering for instance: deadoil_3ph_baker_1d.xml

  1. Without the move fix: the code crashes

  2. This branch: the code does not crash, but the integrated test does not converge. The true compositional tests (which do not use TableFunction) do not converge either.

  3. develop after PR WIP - circular dependencies / object libraries build #1352, with the move fix: the code does not crash, the integrated test works.

Sorry this is a little bit confusing :) It was the source of some headache this afternoon, maybe I should I have tried Lassen with disabled unified memory instead.

@klevzoff klevzoff marked this pull request as ready for review April 16, 2021 23:48
@klevzoff
Copy link
Contributor Author

@francoishamon The last commit should fix the non-convergence issue. At least it gets deadoil_3ph_baker_1d.xml to complete and match a non-CUDA build.

* @note This function exists to enable holding KernelWrapper objects in an ArrayView
* and have their contents properly moved between memory spaces.
*/
void move( LvArray::MemorySpace const space, bool const touch = false )
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have a preference to stop using default parameters in most cases. I would consider just dropping the default parameter here if it makes sense in the use cases.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I copied the signature from LvArray::ArrayView just to be safe. I changed the default from true to false though, because for these objects it doesn't make sense to touch them, they're basically meant to be const in the kernel.

In this case we can just drop the default, because it is only meant to be called by ArrayView, which always does it with both arguments.

Comment on lines 591 to 599
void HypreVector::extract( arrayView1d< real64 > const & localVector ) const
{
GEOSX_LAI_ASSERT_EQ( localSize(), localVector.size() );
real64 const * const data = extractLocalVector();
forAll< execPolicy >( localSize(), [=] GEOSX_HOST_DEVICE ( HYPRE_Int const i )
{
localVector[i] = data[i];
} );
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't see the use case in this PR, but I would have a preference to name this something like extractCopy to be clear about what is happening...not that it shouldn't be clear.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This method is used in SolverBase::nonlinearImplicitStep to copy the solution out of LAI vector into an LvArray local vector. It has existed for a long time, but previously it was using the default implementation provided in VectorBase, which simply does a std::copy on the host... but it doesn't mark the data as touched on host, and things go wrong from there. I probably agree with you on the name, I was trying to make a minimal bugfix, but can also use this PR for extra cleanup.

@klevzoff klevzoff changed the title Add move capability to TableFunction::KernelWrapper Bugfixes for CUDA build (multiphase model kernel wrappers, HypreVector::extract) Apr 17, 2021
@klevzoff klevzoff added effort: 1 day type: bug Something isn't working ci: run CUDA builds Allows to triggers (costly) CUDA jobs labels Apr 17, 2021
@francoishamon
Copy link
Contributor

francoishamon commented Apr 17, 2021

@francoishamon The last commit should fix the non-convergence issue. At least it gets deadoil_3ph_baker_1d.xml to complete and match a non-CUDA build.

Thanks @klevzoff I am re-running the tests (I do not have your latest two commits about uncrustify and removing default parameters). The problem seems fixed for the DeadOilFluid and TableRelativePermeability models. But I still get an error for CO2 models. For instance, for co2_flux_3d.xml:

CUDAassert: an illegal memory access was encountered /data/gpfs/Users/GEOSX/thirdPartyLibs/install-pecan-release/raja/include/RAJA/policy/cuda/MemUtils_CUDA.hpp 183            
terminate called after throwing an instance of 'umpire::util::Exception'                                                                                                                 
  what():  ! Umpire Exception [/data/gpfs/Users/GEOSX/thirdPartyLibs/build-pecan-release/chai/src/chai/src/tpl/umpire/src/umpire/alloc/CudaMallocAllocator.hpp:56]:  deallocate
 cudaFree( ptr = 0x7f4dea094800 ) failed with error: an illegal memory access was encountered                                                                                            

that looks very much like the crash that we had before for the other models. I am looking at the CO2 models to see if I understand what is going there.

@francoishamon
Copy link
Contributor

Ah, we may have to move the m_componentMolarWeight arrays that live in the base classes, like PVTFunctionBase and FlashModelBase. When I move them, the code seems to run, but I need to confirm on the other tests

@francoishamon
Copy link
Contributor

francoishamon commented Apr 17, 2021

Ok I confirm that we also need to move manually the m_componentMolarWeight by hand, and then everything is properly moved. @klevzoff should I commit my changes to your branch?

With this change I get the following behavior for the integrated tests:

  • Pecan (Total machine): I can run the integrated tests. I did not set up the geosxats script but I think everything is working well
  • Lassen with --atsdisable: I don't get a crash in DeadOilFluid, TableRelativePermeability, CO2-brine models, but the integrated tests are not converging, as if some data was still not moved properly. @klevzoff @corbett5 should we care (I assume that yes)? Is there another function in the LAI that I have to modify for Lassen with --atsdisable specifically?

@klevzoff
Copy link
Contributor Author

@klevzoff should I commit my changes to your branch?

Yes, please

@corbett5
Copy link
Contributor

@francoishamon One thing to note is that when run through geosxats with --atsdisable the integrated tests should fail (as in segfault). This is because the integrated tests don't use pinned memory (it's a bug) and because when we don't use pinned memory we pack into non-pinned host buffers which is only supported with ATS (another bug). #1285

@klevzoff
Copy link
Contributor Author

@klevzoff should I commit my changes to your branch?

Yes, please

Actually, sorry, I just pushed my own version of this change (+ a little more cleanup) because I needed to sync from laptop to GPU machine to investigate convergence further.

@klevzoff
Copy link
Contributor Author

klevzoff commented Apr 17, 2021

  • Lassen with --atsdisable: I don't get a crash in DeadOilFluid, TableRelativePermeability, CO2-brine models, but the integrated tests are not converging, as if some data was still not moved properly. @klevzoff @corbett5 should we care (I assume that yes)? Is there another function in the LAI that I have to modify for Lassen with --atsdisable specifically?

I checked all GPU-enabled (i.e. non-PVTPackage) multiphase tests, and they're all working properly on our V100 machine as well. We should probably be a little worried and try to get to the bottom of it (maybe not right now, but eventually). The easiest thing to start with is set logLevel="3" on the solver for some small-ish test (10 cells), have it dump all the linear systems to files for both a CUDA and non-CUDA build, and compare and see where it starts diffing.

(Btw, I'm assuming you're testing a build that has ENABLE_HYPRE_CUDA=OFF)

@rrsettgast
Copy link
Member

@klevzoff You can add the lambda execution context to the uncrustify config file to get rid of these indentations. I think the reason is that uncrustify doesn't recognize the lambda, so I had to add the qualifiers so that it wouldn't drop out of the lambda context.

https://github.com/GEOSX/GEOSX/blob/b9df14663d1d188e3ea554086d65bea64164b9c2/src/uncrustify.cfg#L2153

@@ -787,7 +787,7 @@ void HypreMatrix::scale( real64 const scalingFactor )
HYPRE_Int const diag_nnz = hypre_CSRMatrixNumNonzeros( prt_diag_CSR );
HYPRE_Real * const ptr_diag_data = hypre_CSRMatrixData( prt_diag_CSR );

forAll< execPolicy >( diag_nnz, [=] GEOSX_HOST_DEVICE ( HYPRE_Int const i )
forAll< hypre::execPolicy >( diag_nnz, [=] GEOSX_HYPRE_HOST_DEVICE ( HYPRE_Int const i )
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

where is hypre::execPolicy defined and how is it set?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh...duh.

@klevzoff klevzoff merged commit 4efc0a6 into develop Apr 20, 2021
@klevzoff klevzoff deleted the bugfix/table-function-device-move branch April 20, 2021 19:29
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
ci: run CUDA builds Allows to triggers (costly) CUDA jobs type: bug Something isn't working
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Crash in TableFunction on GPU
4 participants