Incorrect node ghostRank values #525

Closed · klevzoff opened this issue Aug 9, 2019 · 10 comments

Labels: type: bug (Something isn't working)

Comments

@klevzoff (Contributor) commented Aug 9, 2019

Describe the bug
If a mesh with 5 layers (cells) in the z direction is run with -z3, the nodal ghost ranks come out incorrect.
Consider the following mesh: 1x1x5_LaplaceFEM.xml.txt
[Figure: VisIt plot of the partitioned 1x1x5 mesh with lettered node labels]

Here, the bottom 2 elements are owned by rank 0, the middle one by rank 1, and the top 2 by rank 2. This is correct. However, nodes C, I, O and U (global numbers 2, 8, 14 and 20; I don't know how to display them in VisIt) end up with a ghostRank value of -1 on both rank 0 and rank 1. Here is some debug output:

Rank 0: localToGlobal: { 0, 1, 2, 6, 7, 8, 12, 13, 14, 18, 19, 20, 3, 9, 15, 21 }
            ghostRank: { -2, -1, -1, -2, -1, -1, -2, -1, -1, -2, -1, -1, 1, 1, 1, 1 }
Rank 1: localToGlobal: { 2, 3, 8, 9, 14, 15, 20, 21, 1, 7, 13, 19, 4, 10, 16, 22 }
            ghostRank: { -1, -1, -1, -1, -1, -1, -1, -1, 0, 0, 0, 0, 2, 2, 2, 2 }
Rank 2: localToGlobal: { 3, 4, 5, 9, 10, 11, 15, 16, 17, 21, 22, 23, 2, 8, 14, 20 }
            ghostRank: { 1, -1, -2, 1, -1, -2, 1, -1, -2, 1, -1, -2, 1, 1, 1, 1 }

To Reproduce
Steps to reproduce the behavior:

  1. In LaplaceFEM::AssembleSystem, add debug output (requires #include <sstream>):

     // Print the node manager's local-to-global map and ghost ranks on each rank
     std::stringstream ss;
     ss << "localToGlobal: " << mesh->getNodeManager()->m_localToGlobalMap << std::endl
        << "            ghostRank: " << mesh->getNodeManager()->m_ghostRank;
     GEOS_LOG_RANK( ss.str() );

  2. Run the above input file with mpirun -n3 ./bin/geosx -i 1x1x5_LaplaceFEM.xml -x1 -y1 -z3
  3. See the result

Expected behavior
In the above, rank 1 thinks it owns 8 nodes, but it should only own 4 (D, J, P, V on the image, or global indices 3, 9, 15, 21), according to assigned element ownership.

Additional context
I should probably add that all communication seems to be set up more or less correctly, aside from these ghost rank values. In particular, data on these 4 nodes on rank 1 gets overwritten by data from rank 0 during sync, as it should. However, I need ghostRank to be correct, since I changed the DofManager algorithm to rely on this value when assigning DoF ownership (so the 50x10x5_LaplaceFEM test with 27 ranks does not work in #516).

klevzoff added the type: bug and priority: high labels on Aug 9, 2019
@klevzoff (Contributor, author) commented Aug 9, 2019

The reason seems to be the way RebuildSyncLists and SetGhostRankForSenders work. They reset the ghost rank of an object to -1 if that object is requested elsewhere, regardless of whether it is also ghosted locally and not owned. In this case, the 4 nodes in question are requested by rank 2 and so are flagged as -1. I guess that's the reason for the double sync flag in DofManager. It doesn't help me though :(

Instead of a workaround, can we fix ghostRank to be correct? E.g., just keep the original value in SetGhostRankForSenders if it is nonnegative? The only reason it would be nonnegative is that the object came from a lower rank that considers itself the owner. This fixes the problem in my case.
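
A minimal, self-contained sketch of that guard (not the actual SetGhostRankForSenders code; the convention that -2 means owned and not ghosted anywhere, -1 means owned and sent to at least one neighbor, and >= 0 names the owning rank is inferred from the debug output above):

#include <cstddef>
#include <vector>

// Illustrative only: keep a nonnegative ghostRank (it identifies the true
// owner) instead of unconditionally resetting it to -1 when the object is
// requested by a neighbor.
void setGhostRankForSendersSketch( std::vector< int > & ghostRank,
                                   std::vector< std::size_t > const & ghostsToSend )
{
  for( std::size_t const sendIndex : ghostsToSend )
  {
    if( ghostRank[ sendIndex ] < 0 )
    {
      ghostRank[ sendIndex ] = -1;   // locally owned and sent to a neighbor
    }
    // else: this rank only forwards the object, so the owner's rank is kept.
  }
}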

Another question is how to detect these cases in general and correctly do field synchronization. I don't like the double sync flag in DofManager (and have actually removed it), because it's not the right place for it. A solver may be running explicit schemes and not using DofManager at all (and thus not checking the flag), but still suffer from this issue. It needs to be something in CommunicationTools itself.

@rrsettgast (Member) commented

I'll take a look.

@rrsettgast (Member) commented

A couple of notes. If you set the plotLevel in the output block like:

<Outputs>
  <Silo name="siloOutput" plotLevel="5" parallelThreads="32" plotFileRoot="plot"/>
  <Restart name="sidreRestart"/>
</Outputs>

you get all the fields. Then you can plot the localToGlobal map.

In the figure below, we have the problem that you have specified. The left is rank0, the middle is rank1, and the right is rank2. The labels are:

  • localIndex is BLUE
  • ghostRank is RED
  • globalIndex is GREEN

[Screenshot: the three rank subdomains with localIndex (blue), ghostRank (red), and globalIndex (green) node labels]

So the thing that I notice is that on rank1, globalIndex (2, 8, 14, 20) have a ghostRank of -1. However, they are obviously not locally owned, since these nodes are owned by rank0. The fundamental problem is that rank2 has these nodes as ghosts... and it expects to get them from a neighbor... and rank0 isn't a neighbor. So the "solution" would be to throw an error when this sort of thing happens (i.e. rank2 wants to ghost a node from rank1, but rank1 doesn't own the node), or to add another layer of local elements to rank1. For instance, if I make this 6 elements, we have:

[Screenshot: the same plot for a 1x1x6 mesh]

...which appears to be correct.
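
For the error-throwing variant, a rough sketch of what the sender-side check could look like (hypothetical function and variable names; the real code would use GEOSX's error macros rather than throwing directly):

#include <cstddef>
#include <sstream>
#include <stdexcept>
#include <vector>

// Hypothetical check on the sending rank: refuse to ghost an object to a
// neighbor if this rank does not own it (ghostRank >= 0 names the owner).
void checkGhostRequests( std::vector< int > const & ghostRank,
                         std::vector< long long > const & localToGlobal,
                         std::vector< std::size_t > const & ghostsToSend,
                         int const thisRank,
                         int const neighborRank )
{
  for( std::size_t const sendIndex : ghostsToSend )
  {
    if( ghostRank[ sendIndex ] >= 0 )
    {
      std::ostringstream msg;
      msg << "Rank " << thisRank << " was asked to ghost global index "
          << localToGlobal[ sendIndex ] << " to rank " << neighborRank
          << ", but the object is owned by rank " << ghostRank[ sendIndex ]
          << ", which is not a neighbor of the requesting rank.";
      throw std::runtime_error( msg.str() );
    }
  }
}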

@klevzoff (Contributor, author) commented Aug 9, 2019

Yeah. So this is a bit of an edge case (having only a single layer of elements between ranks that are not neighbors), but we do currently have an integrated test where this happens. On my branch, I've added a check in SetGhostRankForSenders so as to only set ghostRank=-1 when it is already negative. I also had to add a second SyncronizeFields call in LaplaceFEM to make it produce correct results. This combo passes all tests, but always calling synchronization twice is overkill. There used to be an error thrown, but it's commented out, probably because of said test.

So we can:

  • Change the test to 6 layers, and pretend this will never happen again
  • Re-enable the assertion and at least have a user-readable error
  • Instead of throwing an error, have a guard if statement, and always call SyncronizeFields twice, as I did
  • Add the guard statement plus some mechanism to detect when a second sync call is needed, so it can be done automatically as required. Maybe ghostRank being reset from nonnegative to -1 is a good indicator that we should enable the double sync (see the sketch after this list). However, it's not clear whether it needs to be done globally, or only selectively for the rank that detects it and its neighbors.
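
For the last option, a rough sketch of how the detection could work (hypothetical; it assumes we can flag, while building the send lists, any object a neighbor requests that this rank does not own, and then make the flag globally consistent with a reduction; the real code would use its own communicator rather than MPI_COMM_WORLD):

#include <cstddef>
#include <vector>
#include <mpi.h>

// Hypothetical detection of the "second sync needed" condition: returns true
// on every rank if any rank was asked to send an object it does not own, i.e.
// a nonnegative ghostRank would otherwise have been reset to -1.
bool secondSyncNeeded( std::vector< int > const & ghostRank,
                       std::vector< std::size_t > const & ghostsToSend,
                       MPI_Comm const comm )
{
  int localFlag = 0;
  for( std::size_t const sendIndex : ghostsToSend )
  {
    if( ghostRank[ sendIndex ] >= 0 )
    {
      localFlag = 1;   // this rank forwards data it does not own
    }
  }

  // Reduce so all ranks agree on the number of sync passes; a neighbor-wise
  // exchange would be the selective alternative mentioned above.
  int globalFlag = 0;
  MPI_Allreduce( &localFlag, &globalFlag, 1, MPI_INT, MPI_MAX, comm );
  return globalFlag != 0;
}

CommunicationTools could then trigger the second field synchronization only when this returns true, instead of always.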

@rrsettgast (Member) commented

I am inclined to not use a redundant sync. If we want to properly get around this issue, then we need to expand the collection of neighbors for a rank when the situation arises. I am not sure what the best time to do this is. I think it may be enough to do this after the call to SetGhostRankForSenders since there is no data packing/unpacking to do at that point. This assumes that the object does in fact exist on rank2 in the 5 element example. I am not sure this will always be the case.

  1. We are on the sender when we get this information, so we can append a neighbor to the sender rank.
  2. Send that information to the neighborRank so that it may create a new neighbor symmetric to the new neighbor on the sender rank.
  3. Fill the ghostsToSend and ghostsToReceive for the new neighbors... this information probably gets communicated in 2.

This will solve the issue for the coarse aggregates that will likely expose this issue (a rough sketch of step 1 follows below).
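
A rough sketch of step 1 on the sender, with steps 2 and 3 left as comments (all names hypothetical; the real neighbor setup in CommunicationTools is more involved):

#include <cstddef>
#include <map>
#include <set>
#include <vector>

// Hypothetical step 1: after SetGhostRankForSenders, collect the owning rank
// of every object this rank was asked to send but does not own. Each
// (owner, requester) pair is a neighbor relationship that is currently missing.
std::map< int, std::set< int > >   // owner rank -> requesting ranks that need it as a neighbor
collectMissingNeighbors( std::vector< int > const & ghostRank,
                         std::vector< std::size_t > const & ghostsToSend,
                         int const requestingRank )
{
  std::map< int, std::set< int > > missing;
  for( std::size_t const sendIndex : ghostsToSend )
  {
    if( ghostRank[ sendIndex ] >= 0 )   // this rank only forwards; ghostRank names the owner
    {
      missing[ ghostRank[ sendIndex ] ].insert( requestingRank );
    }
  }
  // Step 2 (not shown): communicate each entry to the owner rank so that the
  // owner and the requester can create symmetric neighbors.
  // Step 3 (not shown): fill ghostsToSend / ghostsToReceive for the new
  // neighbor pair, probably as part of the step-2 exchange.
  return missing;
}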

@rrsettgast (Member) commented

@AntoineMazuyer any comments on this?
@francoishamon If we add non-adjacent neighbors then your issues with the coarse well should be resolved.

@francoishamon (Contributor) commented

@rrsettgast Yes indeed, if it becomes possible to add non-adjacent neighbors, then I would not have to impose a very fine well mesh (this is the workaround I am currently using). That would simplify the well implementation a little and make the well solvers more robust, I think.

@AntoineMazuyer (Contributor) commented

> Change the test to 6 layers, and pretend this will never happen again

In my experience, problems that are not expected to happen will happen ^^. If I understand the issue, the communications seem to work fine, but problems arise with the DofManager?

> This will solve the issue for the coarse aggregates that will likely expose this issue.

I will see once I finish the transfer from AggregateElementSubRegion to AggregateElementRegion in PR #426, but I don't immediately see why the coarse aggregates would expose this issue?

@rrsettgast (Member) commented

> > Change the test to 6 layers, and pretend this will never happen again
>
> In my experience, problems that are not expected to happen will happen ^^. If I understand the issue, the communications seem to work fine, but problems arise with the DofManager?

The issue is that the ghosting is wrong. In the example, rank1 contains nodes that it treats as local, in that it sends updated values to rank2... but it doesn't own those nodes, so those values aren't correct. Thus, when doing a sync, the values are sent from rank1 to rank2, but they are incorrect, as the values on rank1 are going to be updated by the values on rank0 (the actual owner of the nodes). So after a single sync, the values are incorrect on rank2 and correct on rank1, because rank1 was just updated from rank0. There now needs to be a second sync to propagate the correct values from rank1 to rank2. So the communications do not work and require a double sync... which is about the worst thing we could do in terms of performance.
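
A toy trace of that ordering (plain C++, not GEOSX; one nodal value, with the three ranks modeled as plain doubles):

#include <cstdio>

int main()
{
  // One nodal value from the 5-element example: rank0 owns the node,
  // rank1 and rank2 hold copies, and rank2 receives from rank1 only.
  double rank0 = 1.0;   // freshly computed, correct value
  double rank1 = 0.0;   // stale copy that wrongly believes it is the owner
  double rank2 = 0.0;   // ghost copy fed by rank1

  // First sync pass: rank1 sends its stale value to rank2 while it is being
  // corrected by rank0.
  double const sentToRank2 = rank1;
  rank1 = rank0;
  rank2 = sentToRank2;
  std::printf( "after 1st sync: rank1=%g rank2=%g\n", rank1, rank2 );   // rank2 still stale

  // Only a second pass propagates the corrected value from rank1 to rank2.
  rank2 = rank1;
  std::printf( "after 2nd sync: rank1=%g rank2=%g\n", rank1, rank2 );
  return 0;
}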

> > This will solve the issue for the coarse aggregates that will likely expose this issue.
>
> I will see once I finish the transfer from AggregateElementSubRegion to AggregateElementRegion in PR #426, but I don't immediately see why the coarse aggregates would expose this issue?

It depends on how you are doing the aggregate ghosting. I forgot what you did, but if you have your coarse aggregate with a single layer of elements on a rank, then depending on how you do the ghosting you can get hit with this issue.

@rrsettgast (Member) commented

Should have been resolved by #776.
