MPhys wrapper and boundary condition fail flag fixes #224

Merged
merged 25 commits into from
Nov 20, 2022

Conversation

Contributor

@lamkina lamkina commented Jun 1, 2022

Purpose

This PR fixes multiple bugs in how the meshFamilyGroup is passed from the MPhys wrapper to pyADflow function calls. Additionally, this PR adds a way to add a custom design surface family, called my_surf, using the options in the ADflow Builder. Previously, there was no way to add a custom family group before the mesh is set in the ADflow Builder's initialize method.

Now, the user can specify a custom set of design surfaces as a family group and add them to the meshFamilyGroup.
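As a rough illustration, usage in an MPhys run script might look like the sketch below (hedged: the exact ADflowBuilder import path and signature may differ, and the mesh file and family names are hypothetical):

from adflow.mphys import ADflowBuilder

aero_options = {
    "gridFile": "wing_vol.cgns",  # hypothetical mesh file
    # ... the usual ADflow options ...
}

# my_surfs (added in this PR) lists the wall-surface families that should be
# grouped into the custom design-surface family before the mesh is set.
aero_builder = ADflowBuilder(
    aero_options,
    my_surfs=["wing_upper", "wing_lower"],  # hypothetical family names
)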

This PR also addresses issue #148, where the totalSubsonicInlet routine was hanging on some procs when the normal vector was flipped for some cells along the boundary.

Finally, this PR changes the way family groups are stored in the forward-mode matrix-free Jacobian-vector products. Previously, this function used sets, which could end up with a different ordering on each proc. Now we use lists with if/else logic that mimics a set. This keeps the order of family groups the same on all procs while maintaining "set like" behavior with no duplicate entries.
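A minimal sketch of the pattern (the function and variable names below are placeholders, not the exact pyADflow code):

def orderedUniqueGroups(evalFuncs, costFunctionGroup):
    # A list with "set like" behavior: no duplicate entries, but a
    # deterministic, proc-independent ordering (unlike a Python set).
    groupNames = []
    for f in evalFuncs:
        groupName = costFunctionGroup(f)  # hypothetical lookup of the function's family group
        if groupName not in groupNames:
            groupNames.append(groupName)
    return groupNames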

Expected time until merged

This PR should take a week or less.
This PR will take much longer than I originally thought.

Type of change

  • Bugfix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (non-backwards-compatible fix or feature)
  • Code style update (formatting, renaming)
  • Refactoring (no functional changes, no API changes)
  • Documentation update
  • Maintenance update
  • Other (please describe)

Testing

I tested this fix by wrapping the simple actuator zone test test_actuator.py with MPhys and adding a custom surface family. I included the FFD file, CGNS mesh, and run script in the zip archive below.
mphys_fix.zip

Checklist

  • I have run flake8 and black to make sure the code adheres to PEP-8 and is consistently formatted
  • I have run unit and regression tests which pass locally with my changes
  • I have added new tests that prove my fix is effective or that my feature works
  • I have added necessary documentation

@lamkina lamkina requested a review from a team as a code owner June 1, 2022 18:07
@lamkina lamkina requested review from marcomangano and anilyil June 1, 2022 18:07

codecov bot commented Jun 1, 2022

Codecov Report

Merging #224 (147a53a) into main (ee39337) will decrease coverage by 0.05%.
The diff coverage is 33.33%.

@@            Coverage Diff             @@
##             main     #224      +/-   ##
==========================================
- Coverage   41.35%   41.30%   -0.06%     
==========================================
  Files          13       13              
  Lines        3794     3806      +12     
==========================================
+ Hits         1569     1572       +3     
- Misses       2225     2234       +9     
Impacted Files Coverage Δ
adflow/mphys/mphys_adflow.py 0.00% <0.00%> (ø)
adflow/pyADflow.py 68.43% <80.00%> (+0.01%) ⬆️


Andrew Harold Robert Lamkin added 2 commits June 9, 2022 14:49
…ecause we catch this error in the python layer now
Contributor Author

lamkina commented Jun 9, 2022

I fixed a bug that involves error checking for inflow boundary conditions. Previously, if a mesh failure near an inflow boundary condition caused the normal vector on the BC face to flip direction, the proc that contained that BC cell would hang due to an MPI barrier call. With this fix, the flipped normal vector throws a routineFailed error that we catch in the Python layer after we update and check the BCs.

I plan to add testing for this error case, as well as the MPhys wrapper updates, in a future PR.
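For reference, a minimal sketch of the Python-layer pattern described above (updateBCData and routineFailed are hypothetical placeholder names, not ADflow's actual API; only the mpi4py calls are real):

from mpi4py import MPI

def checkBCUpdate(solver, comm=MPI.COMM_WORLD):
    # Update the BC data, then combine the per-proc fail flags so every proc
    # agrees on the outcome and none of them waits alone at a barrier when
    # only some procs own the failing BC faces.
    solver.updateBCData()               # placeholder for the real BC update call
    localFail = solver.routineFailed()  # set by returnFail on the Fortran side
    globalFail = comm.allreduce(localFail, op=MPI.LOR)
    if globalFail:
        raise RuntimeError("BC update failed: flipped normal detected on an inflow face")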

@lamkina lamkina marked this pull request as draft June 19, 2022 15:14
@lamkina lamkina mentioned this pull request Aug 6, 2022
@lamkina lamkina changed the title MPhys wrapper mesh family group fix MPhys wrapper and boundary condition fail flag fixes Aug 6, 2022
@lamkina lamkina marked this pull request as ready for review August 6, 2022 19:37
Contributor

@marcomangano marcomangano left a comment

Looks alright; I just need a few clarifications that might be added to the code for clarity.

@@ -917,7 +917,7 @@ subroutine totalSubsonicInlet
&some faces.")

call returnFail("totalSubsonicInlet", errorMessage)
-call mpi_barrier(ADflow_comm_world, ierr)
+! call mpi_barrier(ADflow_comm_world, ierr)
Contributor

Can we just delete this? Otherwise I would add a comment explaining why this was commented out - even something like "see PR #224".

Contributor Author

I will delete this; it is not needed anymore. I can't see a reason we would want a barrier instead of a fail flag here: in the optimization loop on HPC, some procs will have the BC dataset and others will not, so without the returnFail the code would just hang on those procs when the BC normals are flipped.

Contributor

Have you verified that this works as intended now? Would the code get stuck at any other barriers before it reaches that MPI allreduce?

Contributor Author

I haven't seen any issues since making this change on my fork, but I can look for the barriers associated with the boundary conditions and see if any of them should be calling returnFail in the same fashion.

@@ -1147,11 +1149,17 @@ def __init__(
restart_failed_analysis=False, # retry after failed analysis
err_on_convergence_fail=False, # raise an analysis error if the solver stalls
balance_group=None,
my_surfs=None,
Contributor

can you add a comment here or below to explain this arg and where it gets passed?

Contributor Author

As an explanation: the order in which MPhys initializes ADflow made it impossible to set custom surface family groups. This is a bit of a hack to allow users to add a custom surface family, but I think it might need to be expanded to a dictionary so multiple custom surfaces can be added. I will make these changes and add documentation to explain what's happening.

Contributor Author

> Also, I think other BC routines have similar issues; it's not just the one you fixed. In fact, the others just terminate the code instead of hanging. E.g. https://github.com/mdolab/adflow/blob/main/src/bcdata/BCData.F90#L969 Can we apply the same fix to the others? Are you sure the code does not hang and the processors end up with the correct fail flags when you do the allreduce in python?

I can confirm that the procs end up with the correct fail flags when I do the allreduce in the Python layer. I've tested this functionality with my application on HPC, as well as with smaller toy problems locally. I can look into adding tests for this behavior, but it might be involved to construct a toy problem at a small scale where the BC data is only on some procs and not others.

for f in self.curAP.evalFuncs:
# We need to use a list to have the same ordering on every proc, but
# we need 'set like' behavior so we don't include duplicate group names.
groupNames = []
Contributor

This makes sense; it looks like the bwd version was already using a similar approach.

@@ -210,6 +210,7 @@ subroutine integrateUserSurfaces_d(localValues, localValuesd, famList, sps)
allocate(recvBuffer1d(6*nPts), recvBuffer2d(3*nPts))
else
allocate(recvBuffer1(0), recvBuffer2(0))
+allocate(recvBuffer1d(0), recvBuffer2d(0))
Contributor

what's this taking care of?

Contributor Author

This is a change from @anilyil that I included, but I believe this is allocating memory for the array used in the derivative.

Contributor

this is a change needed to get the fwd AD working for user surfaces, which was broken before this

Contributor

@anilyil anilyil left a comment

Just a few minor detail comments. One general comment: have you tried the MPhys examples with this change? Is the behavior the same when we don't specify a custom surface?


Contributor

anilyil commented Nov 6, 2022

Also, I think other BC routines have similar issues; it's not just the one you fixed. In fact, the others just terminate the code instead of hanging. E.g. https://github.com/mdolab/adflow/blob/main/src/bcdata/BCData.F90#L969 Can we apply the same fix to the others? Are you sure the code does not hang and the processors end up with the correct fail flags when you do the allreduce in python?

Contributor Author

lamkina commented Nov 6, 2022

I looked over BCData and I think the only spot where we might want to add a returnFail is here:

call terminate("BCDataSupersonicInflow", errorMessage)

It seems that for a supersonic inflow BC we just terminate if the velocity ends up pointing in the wrong direction; however, that seems like a place to throw a fail flag. @anilyil, per your previous comment, does the terminate call at this line: https://github.com/mdolab/adflow/blob/main/src/bcdata/BCData.F90#L969 require us to throw a fail flag? I would think that if one of the BC variables is not specified, we want to terminate instead of throwing a fail flag.

Contributor Author

lamkina commented Nov 6, 2022

> Just a few minor detail comments. One general comment: have you tried the MPhys examples with this change? Is the behavior the same when we don't specify a custom surface?

I just ran the ADflow MPhys aero-only examples, and they work without custom surface families using this fix. Basically, if this option isn't set in the builder, the parameter defaults to None, and there is logic in the ADflowBuilder initialize method that checks whether the parameter is None to determine if it should add custom families or not.
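A hedged sketch of that check (the helper and family-group names below are illustrative, not the exact mphys_adflow.py implementation):

def setupMeshFamilyGroup(solver, my_surfs):
    # Mirror of the None check described above: only build a custom family
    # group when the user actually passed one to the builder.
    if my_surfs is not None:
        # pyADflow's addFamilyGroup combines existing surface families into a
        # new named group, which can then be used as the mesh family group.
        solver.addFamilyGroup("my_surf", my_surfs)
        return "my_surf"
    # default behavior is unchanged when no custom surfaces are given
    return "allWalls"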

anilyil previously approved these changes Nov 19, 2022
Contributor

@anilyil anilyil left a comment

New changes look great. I agree with terminating when there's missing data, and the two cases of BC normals flipping get caught with the returnFail. Good to go for me.

Contributor

@marcomangano marcomangano left a comment

Thanks for the additional comments! LGTM

Contributor

@marcomangano marcomangano left a comment

Actually, it looks like you did not rerun tapenade after changing BCdata. Can you fix the issue? See here

@marcomangano marcomangano self-requested a review November 19, 2022 19:21
Contributor Author

lamkina commented Nov 20, 2022

> Actually, it looks like you did not rerun tapenade after changing BCdata. Can you fix the issue? See here

Doing this now and re-running tests. Should be ready soon.

@anilyil anilyil self-requested a review November 20, 2022 17:53
@anilyil anilyil merged commit cf1b4b8 into mdolab:main Nov 20, 2022
@lamkina lamkina deleted the mphys_fix branch November 20, 2022 18:24