
trying to set ghostRank of non-locally owned index #663

Closed
AntoineMazuyer opened this issue Nov 19, 2019 · 16 comments · Fixed by #776
Labels: type: bug (Something isn't working)

@AntoineMazuyer (Contributor) commented Nov 19, 2019

Describe the bug
I reproduced the staircase_3d test with a full tet mesh and am trying to add it to the integrated tests. It runs with 1, 2, and 3 MPI cores without issue, but with 4 MPI cores I get this error message:

[ERROR in line 212 of file /home/amazuyer/devel/geosx/GEOSX/src/coreComponents/managers/ObjectManagerBase.hpp]
trying to set ghostRank of non-locally owned index: m_ghostRank[229]=0

To Reproduce
Steps to reproduce the behavior:

  1. Checkout feature/mazuyer/integratedPAMELAtest
  2. Go to src/coreComponents/physicsSolvers/fluidFlow/integratedTests/singlePhaseFlow
  3. Launch mpirun -np 4 geosx -i staircase_3d_tet.xml
  4. See error

Expected behavior
It should run!

Screenshots
[image: expected result on the staircase3d_tet mesh]

@AntoineMazuyer added the flag: help wanted and type: bug labels on Nov 19, 2019
@AntoineMazuyer self-assigned this on Nov 19, 2019
@klevzoff (Contributor) commented Nov 19, 2019

Possibly related to #525
I re-enabled this error message in #516; previously it was commented out.

Can you visualize what the MPI partitions look like prior to the failure? Are they read from PAMELA?

This error is supposed to come up only when you have very thin layers of cells in partitions, but maybe there is another pathological case we haven't thought of. The problem arises when a rank is acting as a sender for a ghosted node, because the rank that needs the node is not directly connected to the rank that owns it.
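For readers hitting this for the first time, here is a minimal sketch of what the failing check amounts to. This is illustrative only, not the actual ObjectManagerBase code; the convention that locally owned objects carry a negative ghost rank while ghosts store the owning rank is inferred from the error message and the controlling expression shown in the logs further down.

```cpp
// Illustrative sketch (not GEOSX code). Inferred convention:
//   m_ghostRank[i] <  0  -> object i is owned by this rank
//   m_ghostRank[i] >= 0  -> object i is a ghost; the value is the owning rank
// The error fires when a ghost-exchange list asks this rank to mark an object
// as a ghost, but the object is already flagged as not locally owned.
#include <stdexcept>
#include <string>
#include <vector>

struct GhostRankSketch
{
  std::vector<int> m_ghostRank;  // one entry per local node/face/element

  void setGhostRank(std::vector<int> const & receiveList, int sendingRank)
  {
    for (int index : receiveList)
    {
      if (m_ghostRank[index] >= 0)  // the controlling expression from the error
      {
        throw std::runtime_error(
          "trying to set ghostRank of non-locally owned index: m_ghostRank[" +
          std::to_string(index) + "]=" + std::to_string(m_ghostRank[index]));
      }
      m_ghostRank[index] = sendingRank;
    }
  }
};
```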

@rrsettgast (Member):

That error message needs to indicate what rank it is.
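A minimal sketch of that suggestion, using a hypothetical free helper function (the later logs in this thread show the message does end up carrying the rank):

```cpp
// Hypothetical helper: prepend the MPI rank so the failing partition can be
// identified directly from the error text.
#include <mpi.h>
#include <sstream>
#include <string>

std::string ghostRankErrorMessage(int index, int ghostRank)
{
  int rank = 0;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  std::ostringstream msg;
  msg << "trying to set ghostRank of non-locally owned index: "
      << "Rank " << rank << ", m_ghostRank[" << index << "]=" << ghostRank;
  return msg.str();
}
```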

@AntoineMazuyer (Contributor, Author) commented Nov 20, 2019

I think I have the answer. Here are the partitions:

[image: partition visualization]
(rank 0: yellow, rank 1: green, rank 2: pink, rank 3: blue)

The partitioning is not contiguous. I don't know how METIS was able to generate it.

@AntoineMazuyer (Contributor, Author):

[image: partition visualization with 5 ranks]
With 5 ranks, the domains are contiguous, but the problem still occurs.

@MichaelSekachev commented Nov 20, 2019 via email

@rrsettgast (Member):

@AntoineMazuyer None of these partitions should cause a problem. Are you able to locate the problem and determine which objects are causing the error? i.e. where in the mesh are they, and what is the offending object?

@rrsettgast (Member):

@AntoineMazuyer I cannot find feature/mazuyer/testUnstructuredGrid on GitHub. Are you sure you pushed it?

@andrea-franceschini (Contributor):

Running with 4 ranks and the error check commented out, the simulation runs to the end with no issues, but plotting the results ... there is a hole!
The image below shows the ghost-rank numbering for the element field; on the surface, one triangle is missing.

[image: hole in the plotted element field]

@AntoineMazuyer (Contributor, Author):

Following the discussion this morning with @joshua-white:

Everything needed to reproduce the issue is in the original post, including the mesh files and the XML file. I have merged with develop; the problem is still there.

@rrsettgast (Member):

@AntoineMazuyer

with this output:

Rank1 has Rank0 as neighbor.
Rank1 has Rank2 as neighbor.
Rank0 has Rank1 as neighbor.
Rank0 has Rank3 as neighbor.
Rank2 has Rank1 as neighbor.
Rank2 has Rank3 as neighbor.
Rank3 has Rank0 as neighbor.
Rank3 has Rank2 as neighbor.
***** ERROR
***** LOCATION: /usr/WS2/settgast/Codes/geosx/GEOSX_bugfix/src/coreComponents/managers/ObjectManagerBase.hpp:219
***** Controlling expression (should be false): m_ghostRank[index] >= 0
trying to set ghostRank of non-locally owned index: Rank 1, m_ghostRank[229]=0
***** ERROR
***** LOCATION: /usr/WS2/settgast/Codes/geosx/GEOSX_bugfix/src/coreComponents/managers/ObjectManagerBase.hpp:219
***** Controlling expression (should be false): m_ghostRank[index] >= 0
trying to set ghostRank of non-locally owned index: Rank 2, m_ghostRank[237]=1

and the following image:
[image: Screen Shot 2020-02-08 at 12 05 13 AM]
where:
rank0 -> red
rank1 -> green
rank2 -> dark blue
rank3 -> teal

So with a little bit of debugging info, the error is that rank1 is trying to send node 229, which is owned by rank0, over to rank2. The problem is that rank2 is not a "neighbor" of rank0. This happens because Metis doesn't recognize the relationship between rank0 and rank2, and thus there is no neighbor object generated in GEOSX. Is there a way to fix this in Metis?
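One way to detect such node-only neighbors is sketched below. It assumes each rank can enumerate the global node IDs touched by its local elements and simply declares two ranks neighbors whenever their node sets intersect; the all-gather is for illustration only, not a scalable or actual GEOSX implementation.

```cpp
// Illustrative sketch (not GEOSX code): two ranks are neighbors whenever their
// local element patches share at least one global node ID.
#include <mpi.h>
#include <algorithm>
#include <set>
#include <vector>

std::vector<int> findNodeSharingNeighbors(std::set<long long> const & myNodes, MPI_Comm comm)
{
  int rank, size;
  MPI_Comm_rank(comm, &rank);
  MPI_Comm_size(comm, &size);

  // Exchange how many node IDs each rank contributes.
  int myCount = static_cast<int>(myNodes.size());
  std::vector<int> counts(size);
  MPI_Allgather(&myCount, 1, MPI_INT, counts.data(), 1, MPI_INT, comm);

  std::vector<int> displs(size, 0);
  for (int r = 1; r < size; ++r) displs[r] = displs[r - 1] + counts[r - 1];
  int const total = displs[size - 1] + counts[size - 1];

  // Gather every rank's node IDs (a real implementation would avoid
  // replicating the full node list on every rank).
  std::vector<long long> mine(myNodes.begin(), myNodes.end());
  std::vector<long long> all(total);
  MPI_Allgatherv(mine.data(), myCount, MPI_LONG_LONG,
                 all.data(), counts.data(), displs.data(), MPI_LONG_LONG, comm);

  // Mark r as a neighbor if any of its node IDs is also one of mine.
  std::vector<int> neighbors;
  for (int r = 0; r < size; ++r)
  {
    if (r == rank) continue;
    bool const shares = std::any_of(all.begin() + displs[r],
                                    all.begin() + displs[r] + counts[r],
                                    [&](long long id) { return myNodes.count(id) > 0; });
    if (shares) neighbors.push_back(r);  // catches pairs like rank0/rank2 above
  }
  return neighbors;
}
```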

@rrsettgast (Member):

@AntoineMazuyer I put in some code to add the missing neighbors. However, there is another problem with the decomposition.

Rank3 has Rank0 as neighbor.
Rank0 has Rank1 as neighbor.
Rank1 has Rank0 as neighbor.
Rank0 has Rank3 as neighbor.
Rank1 has Rank2 as neighbor.
Rank0 has Rank2 as extended neighbor.
Rank3 has Rank2 as neighbor.
Rank2 has Rank1 as neighbor.
Rank1 has Rank3 as extended neighbor.
Rank2 has Rank3 as neighbor.
Rank3 has Rank1 as extended neighbor.
Rank2 has Rank0 as extended neighbor.
***** ERROR
***** LOCATION: /usr/WS2/settgast/Codes/geosx/GEOSX_bugfix/src/coreComponents/managers/ObjectManagerBase.hpp:221
***** Controlling expression (should be false): m_ghostRank[index] >= 0
trying to set ghostRank of non-locally owned index: Rank 2, m_ghostRank[237]=1, m_localToGlobalMap[237]=22
srun: Job step aborted: Waiting up to 62 seconds for job step to finish.
slurmstepd: error: *** STEP 4490048.0 ON quartz3 CANCELLED AT 2020-02-08T19:36:08 ***
srun: error: quartz3: tasks 0-3: Killed

[image: Screen Shot 2020-02-08 at 7 37 13 PM]

This is the problem we have been dodging for a while: two ranks separated from a third rank by a single layer of elements. It will be a little more intrusive to fix, since it pretty much breaks the assumptions used to decompose the mesh: there are no longer any common local nodes between the two ranks. I think we aren't going to be able to avoid this sort of thing with Metis... can we?

@AntoineMazuyer (Contributor, Author):

@rrsettgast thanks for taking the time to debug this! To be sure I understand problem number 1 correctly, if I sum up, we are in this situation:

[image: sketch of the partition configuration]

where, by construction, rank 2 never shares an edge with rank 0. So the neighbor list provided by Metis doesn't include this relation? And in GEOSX we want a stricter neighbor relation, defined by "node connections" and not "edge connections"?

@rrsettgast (Member) commented Feb 10, 2020

@AntoineMazuyer
Yes and no.
This was the original problem: Metis was not listing rank 0 and rank 2 as neighbors. I added some code such that rank 0 would look at the neighbors of its neighbors (rank 1 in this case), and that "fixed" this problem. I view this as a hack, and I would prefer that METIS recognize that rank0 and rank2 are neighbors if at all possible. Are we in agreement that rank0 and rank2 should be considered neighbors? Or perhaps we really should invest time into handling this case properly.

The current problem is shown in the last image I posted. Node237 is owned by the "green" rank. It is not at all part of the "light blue" rank... except that the "dark blue" rank sends it to the "light blue" rank as a ghost. So "light blue" and "green" share nothing (not even the node in question) in their original discretization, and the ghosting algorithm fails. This is because the ghosting algorithm works based on shared nodes; if there are no shared nodes, then no ghosting occurs. To fix this we would have to add a section after the current ghosting algorithm to alter the send/receive lists such that the ownership of something like Node237 is set correctly. As it stands (with the hack mentioned above), if we disabled the check, the problem would run, but any synchronization of something like Node237 would require 2 calls to sync: one to get the correct value from "green" to "dark blue", then one from "dark blue" to "light blue". Does this make sense??
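To illustrate the two-sync argument, here is a self-contained toy model (not GEOSX code): ranks form a chain green(0) - dark blue(1) - light blue(2), and each "sync" only moves data between direct neighbors, so the correct value for Node237 reaches "light blue" only on the second pass.

```cpp
// Toy model of the two-pass sync described above.
#include <array>
#include <iostream>

int main()
{
  // value[r] = rank r's copy of Node237's field; only rank 0 ("green") owns it.
  std::array<double, 3> value = {42.0, 0.0, 0.0};

  auto sync = [&]()  // one synchronization pass: only direct neighbors exchange
  {
    std::array<double, 3> next = value;
    next[1] = value[0];  // green -> dark blue
    next[2] = value[1];  // dark blue -> light blue
    value = next;
  };

  sync();
  std::cout << "after 1 sync, light blue sees " << value[2] << "\n";  // still 0
  sync();
  std::cout << "after 2 syncs, light blue sees " << value[2] << "\n"; // now 42
  return 0;
}
```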

@AntoineMazuyer (Contributor, Author):

> Are we in agreement that rank0 and rank2 should be considered neighbors? Or perhaps we really should invest time into handling this case properly.

If we decide rank0 and rank2 are neighbors because our simulation routines work that way, then yes. I don't know if it is possible to enforce that in METIS, but I can have a look.

> [...] Does this make sense??

Yes, I understand... Is it a big problem to do 2 calls to sync?

@rrsettgast (Member):

> Are we in agreement that rank0 and rank2 should be considered neighbors? Or perhaps we really should invest time into handling this case properly.
>
> If we decide rank0 and rank2 are neighbors because our simulation routines work that way, then yes. I don't know if it is possible to enforce that in METIS, but I can have a look.

I would pose the question like this: if two partitions share an object (i.e. a node), are they neighbors?

> [...] Does this make sense??
>
> Yes, I understand... Is it a big problem to do 2 calls to sync?

It is not good to do such a thing. However, there are no variables kept at the nodes in flow calculations, so I think you wouldn't have to do anything... only when you need up-to-date variables at the nodes.
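For reference on the METIS side of the question: in METIS's mesh-partitioning API, element adjacency is controlled by the ncommon parameter (the number of nodes two elements must share to count as adjacent in the dual graph). The sketch below assumes the serial METIS_PartMeshDual entry point; GEOSX/PAMELA may drive (Par)METIS differently, so this is only a pointer to where the node-vs-face adjacency choice lives, not a drop-in fix.

```cpp
// Sketch: ncommon = 1 connects elements that share even a single node, so
// partition adjacency derived from this dual graph would include node-only
// neighbors like rank0/rank2 above. ncommon = 3 gives face adjacency for tets.
#include <metis.h>
#include <vector>

void partitionTetMesh(idx_t numElems, idx_t numNodes,
                      std::vector<idx_t> & eptr,   // CSR offsets into eind
                      std::vector<idx_t> & eind,   // node IDs of each element
                      idx_t numParts,
                      std::vector<idx_t> & elemPart,
                      std::vector<idx_t> & nodePart)
{
  idx_t ncommon = 1;   // node-based adjacency instead of face-based
  idx_t objval = 0;
  elemPart.resize(numElems);
  nodePart.resize(numNodes);

  METIS_PartMeshDual(&numElems, &numNodes, eptr.data(), eind.data(),
                     /*vwgt=*/nullptr, /*vsize=*/nullptr,
                     &ncommon, &numParts,
                     /*tpwgts=*/nullptr, /*options=*/nullptr,
                     &objval, elemPart.data(), nodePart.data());
}
```

Whether this knob is reachable from the PAMELA/ParMETIS path used here is exactly the open question above.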
