MPhys wrapper and boundary condition fail flag fixes #224
Conversation
Codecov Report
@@            Coverage Diff             @@
##             main     #224      +/-   ##
==========================================
- Coverage   41.35%   41.30%   -0.06%
==========================================
  Files          13       13
  Lines        3794     3806      +12
==========================================
+ Hits         1569     1572       +3
- Misses       2225     2234       +9
…uring bc checks in the fortran layer
…ecause we catch this error in the python layer now
I fixed a bug in the error checking for inflow boundary conditions. Previously, if a mesh failure near an inflow boundary condition caused the normal vector on the BC face to flip direction, the proc that contained that BC cell would hang due to an MPI barrier call. With this fix, the flipped normal vector throws a routineFailed error that we catch in the Python layer after we update and check the BCs. I plan to add testing for this error case, as well as for the MPhys wrapper updates, in a future PR.
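To make the mechanism concrete, here is a minimal sketch of the fail-flag pattern (mpi4py, with made-up function and variable names rather than the actual ADflow/pyADflow calls): the Fortran layer only records a local failure via returnFail, and the Python layer combines the flags with a collective that every proc calls, so nobody is left waiting at a barrier that only the BC-owning procs would reach.

from mpi4py import MPI

comm = MPI.COMM_WORLD


def update_and_check_bcs():
    # Stand-in for the Fortran BC update: return 1 if this proc detected a
    # flipped inflow-BC normal (the routineFailed case), otherwise 0.
    local_fail = 0
    # ... update BC data here and set local_fail = 1 on failure ...
    return local_fail


# Every proc participates in the reduction, including procs with no BC faces,
# so the collective cannot deadlock the way a barrier on a subset of procs can.
any_fail = comm.allreduce(update_and_check_bcs(), op=MPI.MAX)

if any_fail:
    raise RuntimeError("BC update failed on at least one proc (flipped normal)")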
…ser surface integrations
Looks alright; I just have a few clarification questions, and the answers might be worth adding to the code for clarity.
src/bcdata/BCData.F90
Outdated
@@ -917,7 +917,7 @@ subroutine totalSubsonicInlet
             &some faces.")

             call returnFail("totalSubsonicInlet", errorMessage)
-            call mpi_barrier(ADflow_comm_world, ierr)
+            ! call mpi_barrier(ADflow_comm_world, ierr)
Can we just delete this? Otherwise I would add a comment explaining why this was commented out - even something like "see PR #224".
I will delete it; it is not needed anymore. I can't see a reason to want a barrier instead of a fail flag here: in the optimization loop on HPC, some procs will have the BC dataset and others won't, so without the returnFail the code will just hang on those procs when the BC normals are flipped.
Have you verified that this works as intended now? Would the code get stuck at any other barriers before it reaches that MPI allreduce?
I haven't seen any issues since making this change on my fork, but I can look for the barriers associated with the boundary conditions and see if any of them should be calling returnFail in the same fashion.
adflow/mphys/mphys_adflow.py
Outdated
@@ -1147,11 +1149,17 @@ def __init__(
        restart_failed_analysis=False,  # retry after failed analysis
        err_on_convergence_fail=False,  # raise an analysis error if the solver stalls
        balance_group=None,
+       my_surfs=None,
can you add a comment here or below to explain this arg and where it gets passed?
As an explanation: the order in which MPhys was initializing ADflow made it impossible to set custom surface family groups. This is a bit of a hack to allow users to add a custom surface family, but I think it might need to be expanded to use a dictionary so that multiple custom surfaces can be added. I will make these changes and add documentation to explain what's happening.
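As a rough illustration of the dictionary idea mentioned above, a builder option could map custom family-group names to lists of surface families and register each one before the mesh family group is assembled. This is a hedged sketch only: the helper, the option shape, and the base group name are hypothetical (the actual builder argument in this PR is my_surfs), and the one ADflow call assumed is pyADflow's addFamilyGroup.

def register_custom_surfaces(solver, custom_surfaces, base_group="allWalls"):
    # custom_surfaces maps a new family-group name to the surface families it
    # should contain, e.g. {"my_surf": ["wing", "tail"]}.
    group_names = [base_group]
    for group_name, families in custom_surfaces.items():
        solver.addFamilyGroup(group_name, families)  # assumed pyADflow API
        group_names.append(group_name)
    # The returned names could then feed into the meshFamilyGroup that the
    # MPhys wrapper passes on to the pyADflow function calls.
    return group_names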
Also, I think other BC routines have similar issues; it's not just the one you fixed. In fact, the others just terminate the code instead of hanging, e.g. https://github.com/mdolab/adflow/blob/main/src/bcdata/BCData.F90#L969. Can we apply the same fix to the others? Are you sure the code does not hang and the processors end up with the correct fail flags when you do the allreduce in Python?
I can confirm that the procs end up with the correct fail flags when I do the allreduce in the Python layer. I've tested this functionality with my application on HPC, but also with smaller toy problems locally. I can look into adding tests for this behavior, but it might be involved to manufacture a small-scale toy problem where the BC data is only on some procs and not others.
for f in self.curAP.evalFuncs:
# We need to use a list to have the same ordering on every proc, but
# we need 'set like' behavior so we don't include duplicate group names.
groupNames = []
This makes sense, it looks like the bwd version was already using a similar approach
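For anyone skimming the thread, the pattern under discussion looks roughly like the standalone sketch below (the function and input names are made up): a plain Python set has no guaranteed, proc-consistent iteration order, so a list plus an explicit membership check keeps first-seen order while still rejecting duplicates.

def collect_group_names(eval_funcs, group_for_func):
    # Gather group names in first-seen order with no duplicates, so every proc
    # iterates over the same families in the same order.
    group_names = []
    for f in eval_funcs:
        name = group_for_func[f]
        if name not in group_names:  # 'set like' behavior, but order-preserving
            group_names.append(name)
    return group_names


# Example:
# collect_group_names(["cl", "cd", "cmz"], {"cl": "wing", "cd": "wing", "cmz": "tail"})
# returns ["wing", "tail"] on every proc.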
@@ -210,6 +210,7 @@ subroutine integrateUserSurfaces_d(localValues, localValuesd, famList, sps)
             allocate(recvBuffer1d(6*nPts), recvBuffer2d(3*nPts))
         else
             allocate(recvBuffer1(0), recvBuffer2(0))
+            allocate(recvBuffer1d(0), recvBuffer2d(0))
what's this taking care of?
This is a change from @anilyil that I included, but I believe this is allocating memory for the array used in the derivative.
this is a change needed to get the fwd AD working for user surfaces, which was broken before this
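As a loose analogy (mpi4py and numpy, not the actual ADflow routine): in a gather-style collective, a proc that owns no user-surface points still has to hand over valid, zero-length buffers for both the primal and the derivative arrays, which is what the added allocate(recvBuffer1d(0), recvBuffer2d(0)) line provides on the Fortran side.

import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD

# Pretend only even-numbered procs own user-surface points.
n_local = 3 if comm.rank % 2 == 0 else 0

send_vals = np.arange(n_local, dtype="d")   # primal values
send_valsd = np.zeros(n_local, dtype="d")   # derivative seeds
# With n_local == 0 these are still valid zero-length arrays, so the
# collectives below are safe on every proc.

counts = comm.allgather(n_local)
recv_vals = np.empty(sum(counts), dtype="d")
recv_valsd = np.empty(sum(counts), dtype="d")

comm.Allgatherv(send_vals, (recv_vals, counts))
comm.Allgatherv(send_valsd, (recv_valsd, counts))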
Just a few minor detail comments. One general comment: Have you tried the MPhys examples with this change? Is the behavior the same when we don't specify a custom surface?
I looked over Line 1119 in bbfaa70. Does the terminate call at this line, https://github.com/mdolab/adflow/blob/main/src/bcdata/BCData.F90#L969, require us to throw a fail flag? I would think that if one of the BC variables is not specified, we want to terminate instead of throwing a fail flag.
Just ran the ADflow MPhys aero-only examples and they work without custom surface families using this fix. Basically, if this option isn't set in the builder, the parameter just falls back to its default and nothing changes.
New changes look great. I agree with terminating when there's missing data, and the two cases of BC normals flipping get caught with the returnFail. Good to go for me.
Thanks for the additional comments! LGTM
Actually, it looks like you did not rerun Tapenade after changing BCData. Can you fix the issue? See here.
Doing this now and re-running tests. Should be ready soon.
Purpose
This PR fixes multiple bugs with properly passing the meshFamilyGroup from the MPhys wrapper to pyADflow function calls. Additionally, this PR adds a way to add a custom design surface family called my_surf using the options in the ADflow Builder. Previously, there was no way to add a custom family group before the mesh is set in the ADflow Builder's initialize method. Now, the user can specify a custom set of design surfaces as a family group and add them to the meshFamilyGroup.

This PR also addresses issue #148, where the totalSubsonicInlet function was hanging on some procs when the normal vector was flipped for some cells along the boundary.

Finally, this PR changes the way family groups are stored in the forward-mode matrix-free Jacobian-vector products. Previously, this function used sets, which produced a different ordering on each proc. Now, we use lists with if/else logic that mimics a set. This keeps the order of family groups the same on all procs while maintaining "set like" behavior with no duplicate entries.
Expected time until merged
This PR should take a week or less. This PR will take much longer than I originally thought.
Type of change
Testing
I tested this fix by wrapping the simple actuator zone test test_actuator.py with MPhys and adding a custom surface family. I included the FFD file, CGNS mesh, and run script in the zip archive below.

mphys_fix.zip
Checklist
I ran flake8 and black to make sure the code adheres to PEP-8 and is consistently formatted.