Bugfixes for CUDA build (multiphase model kernel wrappers, HypreVector::extract) #1397
Nice! I imagine this is the result of a bug that hopefully wasn't too hard to figure out.
@klevzoff Do we also need to add the |
@francoishamon yes, I would imagine that we do. I will do it in this PR. (I actually got stuck for a bit too: the fix gets me past the original crash, but then the run fails on the linear solve. Direct solvers don't work when building with hypre-GPU enabled, which is what I had in my setup, so now I'm taking some time to rebuild everything without hypre-GPU and make sure the entire thing runs without fail.)
Ok for the integrated tests, here is what I see on my side (which is a Total machine that I started using today, so I don't know if what I am writing can be trusted). Considering for instance:
Sorry, this is a little bit confusing :) It was the source of some headache this afternoon; maybe I should have tried Lassen with unified memory disabled instead.
@francoishamon The last commit should fix the non-convergence issue. At least it gets |
 * @note This function exists to enable holding KernelWrapper objects in an ArrayView
 *       and have their contents properly moved between memory spaces.
 */
void move( LvArray::MemorySpace const space, bool const touch = false )
I have a preference to stop using default parameters in most cases. I would consider just dropping the default parameter here if it makes sense in the use cases.
I copied the signature from LvArray::ArrayView just to be safe. I changed the default from true to false though, because for these objects it doesn't make sense to touch them; they're basically meant to be const in the kernel.
In this case we can just drop the default, because it is only meant to be called by ArrayView, which always does it with both arguments.
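To illustrate the point about dropping the default, here is a minimal sketch of a wrapper whose move() requires both arguments. The names MemorySpace, KernelWrapperSketch and moveAll are hypothetical stand-ins made up for this example; they are not the LvArray or GEOSX types, and only the shape of the move() signature follows the snippet under review.

```cpp
#include <cassert>
#include <vector>

// Hypothetical stand-in for LvArray::MemorySpace (not the real enum).
enum class MemorySpace { host, device };

// Hypothetical wrapper; only the move() signature mirrors the reviewed code.
class KernelWrapperSketch
{
public:
  // No default for `touch`: every caller states explicitly whether the
  // contents will be modified in the target space.
  void move( MemorySpace const space, bool const touch )
  {
    m_space = space;
    if( touch )
    {
      ++m_touchCount;
    }
  }

  MemorySpace space() const { return m_space; }
  int touchCount() const { return m_touchCount; }

private:
  MemorySpace m_space = MemorySpace::host;
  int m_touchCount = 0;
};

// Mimics a container forwarding the move to each element,
// always passing both arguments.
inline void moveAll( std::vector< KernelWrapperSketch > & wrappers,
                     MemorySpace const space,
                     bool const touch )
{
  for( KernelWrapperSketch & w : wrappers )
  {
    w.move( space, touch );
  }
}
```

With this shape, a call such as moveAll( wrappers, MemorySpace::device, false ) moves read-only wrappers without marking them as modified, and a call that omits the touch argument fails to compile instead of silently picking a default.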
void HypreVector::extract( arrayView1d< real64 > const & localVector ) const
{
  GEOSX_LAI_ASSERT_EQ( localSize(), localVector.size() );
  real64 const * const data = extractLocalVector();
  forAll< execPolicy >( localSize(), [=] GEOSX_HOST_DEVICE ( HYPRE_Int const i )
  {
    localVector[i] = data[i];
  } );
}
I don't see the use case in this PR, but I would have a preference to name this something like extractCopy to be clear about what is happening...not that it shouldn't be clear.
This method is used in SolverBase::nonlinearImplicitStep to copy the solution out of the LAI vector into an LvArray local vector. It has existed for a long time, but previously it was using the default implementation provided in VectorBase, which simply does a std::copy on the host... but it doesn't mark the data as touched on host, and things go wrong from there. I probably agree with you on the name; I was trying to make a minimal bugfix, but can also use this PR for extra cleanup.
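The failure mode described above (a host-side copy that never marks the data as touched) can be reproduced with a toy model. The sketch below is an assumption-laden illustration, not LvArray code: TrackedBuffer, Space, extractUntracked and extractTracked are hypothetical names, and the "device" copy is just a second std::vector. Note also that the PR itself takes a different route, performing the copy inside a forAll kernel rather than touching on the host.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical memory-space tag for this toy model.
enum class Space { host, device };

// Toy buffer with one copy per space. m_valid names the space that
// holds the most recently modified ("touched") data.
class TrackedBuffer
{
public:
  explicit TrackedBuffer( std::size_t const n )
    : m_host( n, 0.0 ), m_device( n, 0.0 ) {}

  // Bring `space` up to date if another space was touched last,
  // then optionally mark `space` as the one holding the latest data.
  std::vector< double > & move( Space const space, bool const touch )
  {
    if( space != m_valid )
    {
      data( space ) = data( m_valid );
    }
    if( touch )
    {
      m_valid = space;
    }
    return data( space );
  }

  // Bypasses tracking entirely, like writing through a raw host pointer.
  std::vector< double > & rawHost() { return m_host; }

private:
  std::vector< double > & data( Space const s )
  {
    return s == Space::host ? m_host : m_device;
  }

  std::vector< double > m_host;
  std::vector< double > m_device;
  Space m_valid = Space::host;
};

// Buggy pattern: plain host copy, the buffer never learns the host changed.
inline void extractUntracked( std::vector< double > const & src, TrackedBuffer & dst )
{
  std::copy( src.begin(), src.end(), dst.rawHost().begin() );
}

// Tracked pattern: move to host with touch = true before writing.
inline void extractTracked( std::vector< double > const & src, TrackedBuffer & dst )
{
  std::vector< double > & host = dst.move( Space::host, true );
  std::copy( src.begin(), src.end(), host.begin() );
}
```

In this model, if the buffer was last touched on the device, extractUntracked leaves the device copy stale on the next move( Space::device, false ), whereas extractTracked causes the new values to be copied over.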
Thanks @klevzoff I am re-running the tests (I do not have your latest two commits about uncrustify and removing default parameters). The problem seems fixed for the
that looks very much like the crash that we had before for the other models. I am looking at the CO2 models to see if I understand what is going there. |
Ah, we may have to move the |
Ok I confirm that we also need to move manually the With this change I get the following behavior for the integrated tests:
Yes, please.
@francoishamon One thing to note is that when run through |
Actually, sorry, I just pushed my own version of this change (+ a little more cleanup) because I needed to sync from laptop to GPU machine to investigate convergence further.
I checked all GPU-enabled (i.e. non-PVTPackage) multiphase tests, and they're all working properly on our V100 machine as well. We should probably be a little worried and try to get to the bottom of it (maybe not right now, but eventually). The easiest thing to start with is set (Btw, I'm assuming you're testing a build that has |
@klevzoff You can add the lambda execution context to the uncrustify config file to get rid of these indentations. I think the reason is that uncrustify doesn't recognize the lambda, so I had to add the qualifiers so that it wouldn't drop out of the lambda context.
@@ -787,7 +787,7 @@ void HypreMatrix::scale( real64 const scalingFactor )
   HYPRE_Int const diag_nnz = hypre_CSRMatrixNumNonzeros( prt_diag_CSR );
   HYPRE_Real * const ptr_diag_data = hypre_CSRMatrixData( prt_diag_CSR );

-  forAll< execPolicy >( diag_nnz, [=] GEOSX_HOST_DEVICE ( HYPRE_Int const i )
+  forAll< hypre::execPolicy >( diag_nnz, [=] GEOSX_HYPRE_HOST_DEVICE ( HYPRE_Int const i )
Where is hypre::execPolicy defined and how is it set?
oh...duh.
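The definition of hypre::execPolicy is not reproduced in this thread. As a rough sketch of how such a policy alias could be chosen at compile time, the example below switches between two tag types on a build macro. Everything here is an assumption for illustration: SKETCH_HYPRE_ON_DEVICE, SerialPolicy, DevicePolicy, hypre_sketch, forAllSketch and scaleSketch are made-up names, and the loop runs serially regardless of the selected tag.

```cpp
#include <cassert>
#include <type_traits>

// Hypothetical policy tags (not RAJA or GEOSX types).
struct SerialPolicy {};
struct DevicePolicy {};

namespace hypre_sketch
{
// SKETCH_HYPRE_ON_DEVICE is a made-up macro standing in for whatever
// build option says the linear algebra library keeps its data on the GPU.
#if defined( SKETCH_HYPRE_ON_DEVICE )
using execPolicy = DevicePolicy;
#else
using execPolicy = SerialPolicy;
#endif
}

// Toy loop launcher: the policy is a template tag only; a real launcher
// would dispatch to a host or device backend based on it.
template< typename POLICY, typename LAMBDA >
void forAllSketch( int const n, LAMBDA && body )
{
  for( int i = 0; i < n; ++i )
  {
    body( i );
  }
}

// Scales raw data in whichever space the library-specific policy selects.
inline void scaleSketch( double * const data, int const n, double const factor )
{
  forAllSketch< hypre_sketch::execPolicy >( n, [=]( int const i )
  {
    data[i] *= factor;
  } );
}
```

The design point is that kernels operating on the library's raw pointers follow the library's own build configuration rather than the application-wide policy, so a GPU-enabled application can still run these loops on the host when the library was built without device support.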
Resolves #1394