Incorrect node ghostRank values #525

Closed · klevzoff opened this issue Aug 9, 2019 · 10 comments

Labels: type: bug (Something isn't working)

Comments

@klevzoff (Contributor) commented Aug 9, 2019

Describe the bug
If a mesh with 5 layers (cells) in the z direction is run with -z3, the nodal ghost ranks come out incorrect.
Consider the following mesh: 1x1x5_LaplaceFEM.xml.txt
[Figure: VisIt plot of the partitioned 1x1x5 mesh with lettered node labels]

Here, the bottom 2 elements are owned by rank 0, the middle one by rank 1, and the top 2 by rank 2. This is correct. However, nodes C, I, O and U (global numbers 2, 8, 14 and 20; I don't know how to display them in VisIt) end up with a ghostRank value of -1 on both rank 0 and rank 1. Here is some debug output:

Rank 0: localToGlobal: { 0, 1, 2, 6, 7, 8, 12, 13, 14, 18, 19, 20, 3, 9, 15, 21 }
            ghostRank: { -2, -1, -1, -2, -1, -1, -2, -1, -1, -2, -1, -1, 1, 1, 1, 1 }
Rank 1: localToGlobal: { 2, 3, 8, 9, 14, 15, 20, 21, 1, 7, 13, 19, 4, 10, 16, 22 }
            ghostRank: { -1, -1, -1, -1, -1, -1, -1, -1, 0, 0, 0, 0, 2, 2, 2, 2 }
Rank 2: localToGlobal: { 3, 4, 5, 9, 10, 11, 15, 16, 17, 21, 22, 23, 2, 8, 14, 20 }
            ghostRank: { 1, -1, -2, 1, -1, -2, 1, -1, -2, 1, -1, -2, 1, 1, 1, 1 }

To Reproduce
Steps to reproduce the behavior:

  1. In LaplaceFEM::AssembleSystem, add debug output (requires #include <sstream>):

     // Print the node manager's local-to-global map and ghost ranks on each rank
     std::stringstream ss;
     ss << "localToGlobal: " << mesh->getNodeManager()->m_localToGlobalMap << std::endl
        << "            ghostRank: " << mesh->getNodeManager()->m_ghostRank;
     GEOS_LOG_RANK( ss.str() );

  2. Run the above input file with mpirun -n3 ./bin/geosx -i 1x1x5_LaplaceFEM.xml -x1 -y1 -z3
  3. See the result

Expected behavior
In the above, rank 1 thinks it owns 8 nodes, but it should only own 4 (D, J, P, V on the image, or global indices 3, 9, 15, 21), according to assigned element ownership.

Additional context
I should probably add that all communication seems to be set up more or less correctly, aside from these ghost rank values. In particular, data on these 4 nodes on rank 1 gets overwritten by data from rank 0 during sync, as it should. However, I need ghostRank to be correct, since I changed the DofManager algorithm to rely on this value when assigning DoF ownership (so the 50x10x5_LaplaceFEM test with 27 ranks does not work in #516).

klevzoff added the type: bug and priority: high labels on Aug 9, 2019
@klevzoff (Contributor, author) commented Aug 9, 2019

The reason seems to be the way RebuildSyncLists and SetGhostRankForSenders work. They reset the ghost rank of an object to -1 if that object is requested elsewhere, regardless of whether it is also ghosted locally and not owned. In this case, the 4 nodes in question are requested by rank 2 and so are flagged as -1. I guess that's the reason for the double sync flag in DofManager. It doesn't help me though :(

Instead of a workaround, can we fix ghostRank to be correct? E.g., just keep the original value in SetGhostRankForSenders if it is nonnegative? The only reason it would be nonnegative is that the object came from a lower rank that considers itself the owner. This fixes the problem in my case.
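
A minimal, self-contained sketch of that guard (not the actual SetGhostRankForSenders code; the convention that -2 means owned and not ghosted anywhere, -1 means owned and sent to at least one neighbor, and >= 0 names the owning rank is inferred from the debug output above):

#include <cstddef>
#include <vector>

// Illustrative only: keep a nonnegative ghostRank (it identifies the true
// owner) instead of unconditionally resetting it to -1 when the object is
// requested by a neighbor.
void setGhostRankForSendersSketch( std::vector< int > & ghostRank,
                                   std::vector< std::size_t > const & ghostsToSend )
{
  for( std::size_t const sendIndex : ghostsToSend )
  {
    if( ghostRank[ sendIndex ] < 0 )
    {
      ghostRank[ sendIndex ] = -1;   // locally owned and sent to a neighbor
    }
    // else: this rank only forwards the object, so the owner's rank is kept.
  }
}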

Another question is how to detect these cases in general and correctly do field synchronization. I don't like the double sync flag in DofManager (and have actually removed it), because it's not the right place for it. A solver may be running explicit schemes and not using DofManager at all (and thus not checking the flag), but still suffer from this issue. It needs to be something in CommunicationTools itself.

@rrsettgast (Member) commented

I'll take a look.

@rrsettgast (Member) commented

A couple of notes. If you set the plotLevel in the output block like:

<Outputs>
  <Silo name="siloOutput" plotLevel="5" parallelThreads="32" plotFileRoot="plot"/>
  <Restart name="sidreRestart"/>
</Outputs>

you get all the fields. Then you can plot the localToGlobal map.

In the figure below, we have the problem that you have specified. The left is rank0, the middle is rank1, and the right is rank2. The labels are:

  • localIndex is BLUE
  • ghostRank is RED
  • globalIndex is GREEN

[Screenshot: the three rank subdomains with localIndex (blue), ghostRank (red), and globalIndex (green) node labels]

So the thing that I notice is that on rank1, globalIndex (2, 8, 14, 20) have a ghostRank of -1. However, they are obviously not locally owned, since these nodes are owned by rank0. The fundamental problem is that rank2 has these nodes as ghosts... and it expects to get them from a neighbor... and rank0 isn't a neighbor. So the "solution" would be to throw an error when this sort of thing happens (i.e. rank2 wants to ghost a node from rank1, but rank1 doesn't own the node), or to add another layer of local elements to rank1. For instance, if I make this 6 elements, we have:

[Screenshot: the same plot for a 1x1x6 mesh]

...which appears to be correct.
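
For the error-throwing variant, a rough sketch of what the sender-side check could look like (hypothetical function and variable names; the real code would use GEOSX's error macros rather than throwing directly):

#include <cstddef>
#include <sstream>
#include <stdexcept>
#include <vector>

// Hypothetical check on the sending rank: refuse to ghost an object to a
// neighbor if this rank does not own it (ghostRank >= 0 names the owner).
void checkGhostRequests( std::vector< int > const & ghostRank,
                         std::vector< long long > const & localToGlobal,
                         std::vector< std::size_t > const & ghostsToSend,
                         int const thisRank,
                         int const neighborRank )
{
  for( std::size_t const sendIndex : ghostsToSend )
  {
    if( ghostRank[ sendIndex ] >= 0 )
    {
      std::ostringstream msg;
      msg << "Rank " << thisRank << " was asked to ghost global index "
          << localToGlobal[ sendIndex ] << " to rank " << neighborRank
          << ", but the object is owned by rank " << ghostRank[ sendIndex ]
          << ", which is not a neighbor of the requesting rank.";
      throw std::runtime_error( msg.str() );
    }
  }
}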

@klevzoff (Contributor, author) commented Aug 9, 2019

Yeah. So this is a bit of an edge case (having only a single layer of elements between ranks that are not neighbors), but we do currently have an integrated test where this happens. On my branch, I've added a check in SetGhostRankForSenders so as to only set ghostRank=-1 when it is already negative. I also had to add a second SyncronizeFields call in LaplaceFEM to make it produce correct results. This combo passes all tests, but always calling synchronization twice is overkill. There used to be an error thrown, but it's commented out, probably because of said test.

So we can:

  • Change the test to 6 layers, and pretend this will never happen again
  • Re-enable the assertion and at least have a user-readable error
  • Instead of throwing an error, have a guard if statement, and always call SyncronizeFields twice, as I did
  • Add the guard statement plus some mechanism to detect when a second sync call is needed, so it can be done automatically as required. Maybe ghostRank being reset from nonnegative to -1 is a good indicator that we should enable the double sync (see the sketch after this list). However, it's not clear whether it needs to be done globally, or only selectively for the rank that detects it and its neighbors.
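
For the last option, a rough sketch of how the detection could work (hypothetical; it assumes we can flag, while building the send lists, any object a neighbor requests that this rank does not own, and then make the flag globally consistent with a reduction; the real code would use its own communicator rather than MPI_COMM_WORLD):

#include <cstddef>
#include <vector>
#include <mpi.h>

// Hypothetical detection of the "second sync needed" condition: returns true
// on every rank if any rank was asked to send an object it does not own, i.e.
// a nonnegative ghostRank would otherwise have been reset to -1.
bool secondSyncNeeded( std::vector< int > const & ghostRank,
                       std::vector< std::size_t > const & ghostsToSend,
                       MPI_Comm const comm )
{
  int localFlag = 0;
  for( std::size_t const sendIndex : ghostsToSend )
  {
    if( ghostRank[ sendIndex ] >= 0 )
    {
      localFlag = 1;   // this rank forwards data it does not own
    }
  }

  // Reduce so all ranks agree on the number of sync passes; a neighbor-wise
  // exchange would be the selective alternative mentioned above.
  int globalFlag = 0;
  MPI_Allreduce( &localFlag, &globalFlag, 1, MPI_INT, MPI_MAX, comm );
  return globalFlag != 0;
}

CommunicationTools could then trigger the second field synchronization only when this returns true, instead of always.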

@rrsettgast (Member) commented

I am inclined to not use a redundant sync. If we want to properly get around this issue, then we need to expand the collection of neighbors for a rank when the situation arises. I am not sure what the best time to do this is. I think it may be enough to do this after the call to SetGhostRankForSenders since there is no data packing/unpacking to do at that point. This assumes that the object does in fact exist on rank2 in the 5 element example. I am not sure this will always be the case.

  1. We are on the sender when we get this information, so we can append a neighbor to the sender rank.
  2. Send that information to the neighborRank so that it may create a new neighbor symmetric to the new neighbor on the sender rank.
  3. Fill the ghostsToSend and ghostsToReceive for the new neighbors... this information probably gets communicated in 2.

This will solve the issue for the coarse aggregates that will likely expose this issue (a rough sketch of step 1 follows below).
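
A rough sketch of step 1 on the sender, with steps 2 and 3 left as comments (all names hypothetical; the real neighbor setup in CommunicationTools is more involved):

#include <cstddef>
#include <map>
#include <set>
#include <vector>

// Hypothetical step 1: after SetGhostRankForSenders, collect the owning rank
// of every object this rank was asked to send but does not own. Each
// (owner, requester) pair is a neighbor relationship that is currently missing.
std::map< int, std::set< int > >   // owner rank -> requesting ranks that need it as a neighbor
collectMissingNeighbors( std::vector< int > const & ghostRank,
                         std::vector< std::size_t > const & ghostsToSend,
                         int const requestingRank )
{
  std::map< int, std::set< int > > missing;
  for( std::size_t const sendIndex : ghostsToSend )
  {
    if( ghostRank[ sendIndex ] >= 0 )   // this rank only forwards; ghostRank names the owner
    {
      missing[ ghostRank[ sendIndex ] ].insert( requestingRank );
    }
  }
  // Step 2 (not shown): communicate each entry to the owner rank so that the
  // owner and the requester can create symmetric neighbors.
  // Step 3 (not shown): fill ghostsToSend / ghostsToReceive for the new
  // neighbor pair, probably as part of the step-2 exchange.
  return missing;
}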

@rrsettgast (Member) commented

@AntoineMazuyer any comments on this?
@francoishamon If we add non-adjacent neighbors then your issues with the coarse well should be resolved.

@francoishamon (Contributor) commented

@rrsettgast Yes indeed, if it becomes possible to add non-adjacent neighbors, then I would not have to impose a very fine well mesh (this is the workaround I am currently using). That would simplify the well implementation a little and make the well solvers more robust, I think.

@AntoineMazuyer (Contributor) commented

> Change the test to 6 layers, and pretend this will never happen again

In my experience, problems that are not expected to happen will happen ^^. If I understand the issue, the communications seem to work fine, but problems arise with the DofManager?

> This will solve the issue for the coarse aggregates that will likely expose this issue.

I will see once I finish the transfer from AggregateElementSubRegion to AggregateElementRegion in PR #426, but I don't immediately see why the coarse aggregates would expose this issue?

@rrsettgast (Member) commented

> > Change the test to 6 layers, and pretend this will never happen again
>
> In my experience, problems that are not expected to happen will happen ^^. If I understand the issue, the communications seem to work fine, but problems arise with the DofManager?

The issue is that the ghosting is wrong. In the example, rank1 contains nodes that it treats as local, in that it sends updated values to rank2... but it doesn't own those nodes, so those values aren't correct. Thus, when doing a sync, the values are sent from rank1 to rank2, but they are incorrect, as the values on rank1 are going to be updated by the values on rank0 (the actual owner of the nodes). So after a single sync, the values are incorrect on rank2 and correct on rank1, because rank1 was just updated from rank0. There now needs to be a second sync to propagate the correct values from rank1 to rank2. So the communications do not work and require a double sync... which is about the worst thing we could do in terms of performance.
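
A toy trace of that ordering (plain C++, not GEOSX; one nodal value, with the three ranks modeled as plain doubles):

#include <cstdio>

int main()
{
  // One nodal value from the 5-element example: rank0 owns the node,
  // rank1 and rank2 hold copies, and rank2 receives from rank1 only.
  double rank0 = 1.0;   // freshly computed, correct value
  double rank1 = 0.0;   // stale copy that wrongly believes it is the owner
  double rank2 = 0.0;   // ghost copy fed by rank1

  // First sync pass: rank1 sends its stale value to rank2 while it is being
  // corrected by rank0.
  double const sentToRank2 = rank1;
  rank1 = rank0;
  rank2 = sentToRank2;
  std::printf( "after 1st sync: rank1=%g rank2=%g\n", rank1, rank2 );   // rank2 still stale

  // Only a second pass propagates the corrected value from rank1 to rank2.
  rank2 = rank1;
  std::printf( "after 2nd sync: rank1=%g rank2=%g\n", rank1, rank2 );
  return 0;
}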

> > This will solve the issue for the coarse aggregates that will likely expose this issue.
>
> I will see once I finish the transfer from AggregateElementSubRegion to AggregateElementRegion in PR #426, but I don't immediately see why the coarse aggregates would expose this issue?

It depends on how you are doing the aggregate ghosting. I forgot what you did, but if you have your coarse aggregate with a single layer of elements on a rank, then depending on how you do the ghosting you can get hit with this issue.

@rrsettgast (Member) commented

Should have been resolved by #776.
