
MPI issue using Surface Generator #841

Open
andrea-franceschini opened this issue Mar 18, 2020 · 18 comments · May be fixed by #3419
Labels
type: bug Something isn't working

Comments

@andrea-franceschini
Contributor

Bug description
Running this test case, which needs this mesh file, with more than one process (e.g. `mpirun -np 2 geosx -x 2`), the code waits forever. If I interrupt it with Ctrl+C, I get:

Frame 1: cxx_utilities::handler(int, int, int)
Frame 2: 
Frame 3: opal_progress
Frame 4: ompi_request_default_wait
Frame 5: ompi_coll_base_sendrecv_actual
Frame 6: ompi_coll_base_allgather_intra_two_procs
Frame 7: MPI_Allgather
Frame 8: void geosx::MpiWrapper::allGather<long>(long, LvArray::Array<long, 1, camp::int_seq<long, 0l>, long, LvArray::NewChaiBuffer>&, ompi_communicator_t*)
Frame 9: geosx::CommunicationTools::AssignNewGlobalIndices(geosx::ObjectManagerBase&, std::set<long, std::less<long>, std::allocator<long> > const&)
Frame 10: geosx::SurfaceGenerator::SeparationDriver(geosx::DomainPartition*, geosx::MeshLevel*, std::vector<geosx::NeighborCommunicator, std::allocator<geosx::NeighborCommunicator> >&, int, int, bool, double)
Frame 11: geosx::SurfaceGenerator::SolverStep(double const&, double const&, int, geosx::DomainPartition*)
Frame 12: geosx::SurfaceGenerator::Execute(double, double, int, int, double, geosx::dataRepository::Group*)
Frame 13: geosx::EventBase::Execute(double, double, int, int, double, geosx::dataRepository::Group*)
Frame 14: geosx::EventManager::Run(geosx::dataRepository::Group*)
Frame 15: geosx::ProblemManager::RunSimulation()
Frame 16: main
Frame 17: __libc_start_main
Frame 18: _start

As of now, I'm using branch #799, but the SurfaceGenerator kernel is the same as in develop.
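A hang inside `MPI_Allgather` like the one in the trace above is characteristic of a collective being entered by only a subset of the ranks in the communicator: the ranks that call it block until every other rank arrives. The toy check below (not GEOSX code; the helper name is hypothetical) illustrates the rule that makes the run "wait forever" when, say, one rank takes a branch that skips `AssignNewGlobalIndices`:

```cpp
#include <cassert>
#include <vector>

// Sketch only: an MPI collective such as MPI_Allgather completes on a rank
// only if every rank in the communicator eventually calls it. If any rank
// never enters the collective (e.g. it took a different code path), the
// ranks that did call it block indefinitely -- the observed hang.
bool collectiveWouldComplete( std::vector< bool > const & rankEntersCollective )
{
  for( bool entered : rankEntersCollective )
  {
    if( !entered )
    {
      return false; // at least one rank never arrives -> deadlock
    }
  }
  return true;
}
```

Under this assumption, the fix is not in MPI itself but in making every rank take the same path to the collective.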

Platform:

  • Machine: Ubuntu 18.04
  • Compiler: gcc 7.4.0
  • CMake: 3.10.2

Note
The extensions are: xml for the main input and msh for the mesh. GitHub forced me to use txt.

@andrea-franceschini
Contributor Author

I created a simpler version that should just duplicate the nodes along a fracture of an unstructured mesh. Because the SurfaceGenerator calls TwoPointFluxApproximation, which requires the pressure field to be defined:
https://github.com/GEOSX/GEOSX/blob/886e9107d0e2a34b9616bfabeee85a59cc95634d/src/coreComponents/finiteVolume/TwoPointFluxApproximation.cpp#L295
any run of geosx with this input fails with this error:

** StackTrace of 13 frames **
Frame 1: cxx_utilities::handler(int, int, int)
Frame 2: cxx_utilities::handler1(int)
Frame 3: LvArray::Array<double, 1, camp::int_seq<long, 0l>, long, LvArray::NewChaiBuffer>& geosx::dataRepository::Group::getReference<LvArray::Array<double, 1, camp::int_seq<long, 0l>, long, LvArray::NewChaiBuffer>, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)
Frame 4: LvArray::Array<double, 1, camp::int_seq<long, 0l>, long, LvArray::NewChaiBuffer>& geosx::dataRepository::Group::getReference<LvArray::Array<double, 1, camp::int_seq<long, 0l>, long, LvArray::NewChaiBuffer> >(char const*)
Frame 5: geosx::TwoPointFluxApproximation::addToFractureStencil(geosx::DomainPartition&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, bool)
Frame 6: geosx::SurfaceGenerator::SolverStep(double const&, double const&, int, geosx::DomainPartition*)
Frame 7: geosx::SurfaceGenerator::Execute(double, double, int, int, double, geosx::dataRepository::Group*)
Frame 8: geosx::EventBase::Execute(double, double, int, int, double, geosx::dataRepository::Group*)
Frame 9: geosx::EventManager::Run(geosx::dataRepository::Group*)
Frame 10: geosx::ProblemManager::RunSimulation()
Frame 11: main
Frame 12: __libc_start_main
Frame 13: _start

However, with `mpirun -np 3 geosx -x 3 -i file` I get:

** StackTrace of 11 frames **
Frame 1: cxx_utilities::handler(int, int, int)
Frame 2: cxx_utilities::handler1(int)
Frame 3: geosx::verifyGhostingConsistency(geosx::ObjectManagerBase const&, std::vector<geosx::NeighborCommunicator, std::allocator<geosx::NeighborCommunicator> > const&)
Frame 4: geosx::CommunicationTools::FindGhosts(geosx::MeshLevel&, std::vector<geosx::NeighborCommunicator, std::allocator<geosx::NeighborCommunicator> >&, bool)
Frame 5: geosx::DomainPartition::SetupCommunications(bool)
Frame 6: geosx::ProblemManager::InitializePostSubGroups(geosx::dataRepository::Group*)
Frame 7: geosx::dataRepository::Group::Initialize(geosx::dataRepository::Group*)
Frame 8: geosx::ProblemManager::ProblemSetup()
Frame 9: main
Frame 10: __libc_start_main
Frame 11: _start

@andrea-franceschini
Contributor Author

To be more precise, I prepared this test case, which defines the pressure field, so that the simulation can reach the end.
Running with the configuration `mpirun -np 3 geosx -x 3 -i file.xml`, I get:

Rank 1: Expected to send 0 non local ghosts to rank 2 but sending 8
***** ERROR
***** LOCATION: /home/franc90/code/geosx/GEOSX/src/coreComponents/mpiCommunications/CommunicationTools.cpp:526
***** Controlling expression (should be false): error
***** Rank 1: Encountered a ghosting inconsistency in nodeManager
Rank 2: Expected to send 0 non local ghosts to rank 0 but sending 4
Rank 2: Expected to send 0 non local ghosts to rank 1 but sending 8
***** ERROR
***** LOCATION: /home/franc90/code/geosx/GEOSX/src/coreComponents/mpiCommunications/CommunicationTools.cpp:526
***** Controlling expression (should be false): error
***** Rank 2: Encountered a ghosting inconsistency in nodeManager
Received signal 1: Hangup
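The "Expected to send 0 non local ghosts to rank 2 but sending 8" messages are violations of a pairwise invariant: for every (sender, receiver) pair, the number of ghosts the receiver expects from the sender must match what the sender is actually going to ship. A minimal sketch of that check (hypothetical names and data layout, not the actual `verifyGhostingConsistency` implementation):

```cpp
#include <cassert>
#include <map>
#include <utility>
#include <vector>

// Sketch only: counts of non-local ghosts per (senderRank, receiverRank)
// pair, as recorded by the receiver ("expected") and by the sender
// ("actual"). Any pair where the two disagree is a ghosting inconsistency
// of the kind reported in the error output above.
using PairCount = std::map< std::pair< int, int >, int >;

std::vector< std::pair< int, int > >
findGhostMismatches( PairCount const & expected, PairCount const & actual )
{
  std::vector< std::pair< int, int > > mismatched;
  for( auto const & [ link, numSent ] : actual )
  {
    auto const it = expected.find( link );
    int const numExpected = ( it == expected.end() ) ? 0 : it->second;
    if( numExpected != numSent )
    {
      mismatched.push_back( link ); // e.g. expected 0 but sending 8
    }
  }
  return mismatched;
}
```

Under this reading, the error means the sender and receiver built their ghost lists from inconsistent views of the partition boundary.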

@andrea-franceschini
Contributor Author

I really don't know what the cause could be, but ... could it be something similar to #663?

@andrea-franceschini
Contributor Author

andrea-franceschini commented Mar 23, 2020

I realized that the issue is not related to the Surface Generator. The problem can be reproduced even without the SurfaceGenerator step. Running `mpirun -np 3 geosx -i file` with this pair of xml and msh files, I get:

***** ERROR
***** LOCATION: /home/franc90/code/geosx/GEOSX/src/coreComponents/mpiCommunications/CommunicationTools.cpp:526
***** Controlling expression (should be false): error
***** Rank 2: Encountered a ghosting inconsistency in nodeManager
Rank 1: Expected to send 0 non local ghosts to rank 2 but sending 7
***** ERROR
***** LOCATION: /home/franc90/code/geosx/GEOSX/src/coreComponents/mpiCommunications/CommunicationTools.cpp:526
***** Controlling expression (should be false): error
***** Rank 1: Encountered a ghosting inconsistency in nodeManager
Received signal 1: Hangup

** StackTrace of 10 frames **
Frame 1: cxx_utilities::handler(int, int, int)
Frame 2: geosx::verifyGhostingConsistency(geosx::ObjectManagerBase const&, std::vector<geosx::NeighborCommunicator, std::allocator<geosx::NeighborCommunicator> > const&)
Frame 3: geosx::CommunicationTools::FindGhosts(geosx::MeshLevel&, std::vector<geosx::NeighborCommunicator, std::allocator<geosx::NeighborCommunicator> >&, bool)
Frame 4: geosx::DomainPartition::SetupCommunications(bool)
Frame 5: geosx::ProblemManager::InitializePostSubGroups(geosx::dataRepository::Group*)
Frame 6: geosx::dataRepository::Group::Initialize(geosx::dataRepository::Group*)
Frame 7: geosx::ProblemManager::ProblemSetup()
Frame 8: main
Frame 9: __libc_start_main
Frame 10: _start

The partition is this:
partition.png

Any idea on the possible problem? It seems to be related to how GEOSX handles the partitioning of an unstructured mesh.

@corbett5
Contributor

@AF1990 I have a PR I hope to finish up today or tomorrow but then I'll look into it. This seems very similar to #633.

@corbett5
Contributor

@AF1990 I have some good news and some bad news. The good news is that I fixed the error you were getting related to #633. The bad news is that this error was a false positive and I almost certainly didn't fix the issue you are having with the Surface Generator.

@andrea-franceschini
Contributor Author

Working with this mesh and 2 processes, I realized that the fracture nodes are properly split only on one process (rank 0), while rank 1 seems to see the nodes on the interface between the two processes as not doubled.

region
In this figure, the ghost rank is shown for all the elements,
fracture
while here the ghost rank is shown for the fracture. The highlighted nodes are the problem: they are on the interface between rank 0 and rank 1, but only rank 0 sees them as doubled (so the fracture is open), while rank 1 still sees them as not doubled (so the fracture is closed).

This creates an inconsistency between the ranks and is not physically correct (the whole fracture is open, except for the two right-most and left-most edges).

The fracture is created as a pre-step before the simulation and never changes.

@rrsettgast, am I using the SurfaceGenerator in the wrong way? Have you ever observed something similar?

@rrsettgast
Member

@AF1990 I will have to take a look. This looks like a pretty substantial bug, but I have seen this case work previously. Perhaps we made a change and did not have coverage for this case. Can you send me your input file?

@andrea-franceschini
Contributor Author

andrea-franceschini commented Mar 31, 2020

Yes, that's the file. It's a simple flow simulation that, with the develop branch and `mpirun -np 2`, produces:

Rank 0: 0 3 37 465 81 1639 1676 1711 1654 
Rank 0: 1 38 3 81 467 1655 1713 1676 1639 
Rank 0: 2 82 466 80 7 1677 1640 1675 1712 
Rank 0: 3 468 82 7 83 1714 1678 1640 1677 
Rank 0: 4 25 24 439 441 1642 1687 1685 1641 
Rank 0: 6 26 25 441 443 1643 1689 1687 1642 
Rank 0: 7 27 26 443 445 1644 1691 1689 1643 
Rank 0: 8 28 27 445 447 1645 1693 1691 1644 
Rank 0: 9 29 28 447 449 1646 1695 1693 1645 
Rank 0: 10 30 29 449 451 1647 1697 1695 1646 
Rank 0: 11 31 30 451 453 1648 1699 1697 1647 
Rank 0: 12 32 31 453 455 1649 1701 1699 1648 
Rank 0: 13 33 32 455 457 1650 1703 1701 1649 
Rank 0: 14 34 33 457 459 1651 1705 1703 1650 
Rank 0: 15 35 34 459 461 1652 1707 1705 1651 
Rank 0: 16 36 35 461 463 1653 1709 1707 1652 
Rank 0: 17 37 36 463 465 1654 1711 1709 1653 
Rank 0: 18 39 38 467 469 1656 1715 1713 1655 
Rank 0: 19 40 39 469 471 1657 1717 1715 1656 
Rank 0: 20 41 40 471 473 1658 1719 1717 1657 
Rank 0: 21 42 41 473 475 1659 1721 1719 1658 
Rank 0: 22 43 42 475 477 1660 1723 1721 1659 
Rank 0: 23 44 43 477 479 1661 1725 1723 1660 
Rank 0: 24 4 44 479 65 4 65 1725 1661 
Rank 0: 25 442 440 67 68 1688 1663 1662 1686 
Rank 0: 27 444 442 68 69 1690 1664 1663 1688 
Rank 0: 28 446 444 69 70 1692 1665 1664 1690 
Rank 0: 29 448 446 70 71 1694 1666 1665 1692 
Rank 0: 30 450 448 71 72 1696 1667 1666 1694 
Rank 0: 31 452 450 72 73 1698 1668 1667 1696 
Rank 0: 32 454 452 73 74 1700 1669 1668 1698 
Rank 0: 33 456 454 74 75 1702 1670 1669 1700 
Rank 0: 34 458 456 75 76 1704 1671 1670 1702 
Rank 0: 35 460 458 76 77 1706 1672 1671 1704 
Rank 0: 36 462 460 77 78 1708 1673 1672 1706 
Rank 0: 37 464 462 78 79 1710 1674 1673 1708 
Rank 0: 38 466 464 79 80 1712 1675 1674 1710 
Rank 0: 39 81 465 466 82 1676 1677 1712 1711 
Rank 0: 40 467 81 82 468 1713 1714 1677 1676 
Rank 0: 41 470 468 83 84 1716 1679 1678 1714 
Rank 0: 42 472 470 84 85 1718 1680 1679 1716 
Rank 0: 43 474 472 85 86 1720 1681 1680 1718 
Rank 0: 44 476 474 86 87 1722 1682 1681 1720 
Rank 0: 45 478 476 87 88 1724 1683 1682 1722 
Rank 0: 46 480 478 88 89 1726 1684 1683 1724 
Rank 0: 47 66 480 89 6 66 6 1684 1726 
Rank 0: 48 441 439 440 442 1687 1688 1686 1685 
Rank 0: 50 443 441 442 444 1689 1690 1688 1687 
Rank 0: 51 445 443 444 446 1691 1692 1690 1689 
Rank 0: 52 447 445 446 448 1693 1694 1692 1691 
Rank 0: 53 449 447 448 450 1695 1696 1694 1693 
Rank 0: 54 451 449 450 452 1697 1698 1696 1695 
Rank 0: 55 453 451 452 454 1699 1700 1698 1697 
Rank 0: 56 455 453 454 456 1701 1702 1700 1699 
Rank 0: 57 457 455 456 458 1703 1704 1702 1701 
Rank 0: 58 459 457 458 460 1705 1706 1704 1703 
Rank 0: 59 461 459 460 462 1707 1708 1706 1705 
Rank 0: 60 463 461 462 464 1709 1710 1708 1707 
Rank 0: 61 465 463 464 466 1711 1712 1710 1709 
Rank 0: 62 469 467 468 470 1715 1716 1714 1713 
Rank 0: 63 471 469 470 472 1717 1718 1716 1715 
Rank 0: 64 473 471 472 474 1719 1720 1718 1717 
Rank 0: 65 475 473 474 476 1721 1722 1720 1719 
Rank 0: 66 477 475 476 478 1723 1724 1722 1721 
Rank 0: 67 479 477 478 480 1725 1726 1724 1723 
Rank 0: 68 65 479 480 66 65 66 1726 1725 

and

Rank 1: 0 4 37 405 67 1654 1679 1702 1662 
Rank 1: 1 38 4 67 407 1663 1704 1679 1654 
Rank 1: 2 68 406 66 7 1680 1655 1678 1703 
Rank 1: 3 408 68 7 69 1705 1681 1655 1680 
Rank 1: 4 31 3 58 393 1656 1690 58 3
Rank 1: 5 32 31 393 395 1657 1692 1690 1656 
Rank 1: 6 33 32 395 397 1658 1694 1692 1657 
Rank 1: 7 34 33 397 399 1659 1696 1694 1658 
Rank 1: 8 35 34 399 401 1660 1698 1696 1659 
Rank 1: 9 36 35 401 403 1661 1700 1698 1660 
Rank 1: 10 37 36 403 405 1662 1702 1700 1661 
Rank 1: 11 39 38 407 409 1664 1706 1704 1663 
Rank 1: 12 40 39 409 411 1665 1708 1706 1664 
Rank 1: 13 41 40 411 413 1666 1710 1708 1665 
Rank 1: 14 42 41 413 415 1667 1712 1710 1666 
Rank 1: 15 43 42 415 417 1668 1714 1712 1667 
Rank 1: 16 44 43 417 419 1669 1716 1714 1668 
Rank 1: 17 45 44 419 421 1670 1718 1716 1669 
Rank 1: 18 46 45 421 423 1671 1720 1718 1670 
Rank 1: 19 1320 46 423 1358 1722 1728 423 46     <----
Rank 1: 20 394 59 6 60 1691 1672 6 59 
Rank 1: 21 396 394 60 61 1693 1673 1672 1691 
Rank 1: 22 398 396 61 62 1695 1674 1673 1693 
Rank 1: 23 400 398 62 63 1697 1675 1674 1695 
Rank 1: 24 402 400 63 64 1699 1676 1675 1697 
Rank 1: 25 404 402 64 65 1701 1677 1676 1699 
Rank 1: 26 406 404 65 66 1703 1678 1677 1701 
Rank 1: 27 67 405 406 68 1679 1680 1703 1702 
Rank 1: 28 407 67 68 408 1704 1705 1680 1679 
Rank 1: 29 410 408 69 70 1707 1682 1681 1705 
Rank 1: 30 412 410 70 71 1709 1683 1682 1707 
Rank 1: 31 414 412 71 72 1711 1684 1683 1709 
Rank 1: 32 416 414 72 73 1713 1685 1684 1711 
Rank 1: 33 418 416 73 74 1715 1686 1685 1713 
Rank 1: 34 420 418 74 75 1717 1687 1686 1715 
Rank 1: 35 422 420 75 76 1719 1688 1687 1717 
Rank 1: 36 424 422 76 77 1721 1689 1688 1719 
Rank 1: 37 1359 424 77 1321 1729 1725 77 424     <----
Rank 1: 38 393 58 59 394 1690 1691 59 58 
Rank 1: 39 395 393 394 396 1692 1693 1691 1690 
Rank 1: 40 397 395 396 398 1694 1695 1693 1692 
Rank 1: 41 399 397 398 400 1696 1697 1695 1694 
Rank 1: 42 401 399 400 402 1698 1699 1697 1696 
Rank 1: 43 403 401 402 404 1700 1701 1699 1698 
Rank 1: 44 405 403 404 406 1702 1703 1701 1700 
Rank 1: 45 409 407 408 410 1706 1707 1705 1704 
Rank 1: 46 411 409 410 412 1708 1709 1707 1706 
Rank 1: 47 413 411 412 414 1710 1711 1709 1708 
Rank 1: 48 415 413 414 416 1712 1713 1711 1710 
Rank 1: 49 417 415 416 418 1714 1715 1713 1712 
Rank 1: 50 419 417 418 420 1716 1717 1715 1714 
Rank 1: 51 421 419 420 422 1718 1719 1717 1716 
Rank 1: 52 423 421 422 424 1720 1721 1719 1718 
Rank 1: 53 1358 423 424 1359 1728 1729 424 423     <----

where the first number is the element index, while the following 8 are the nodes of the top/bottom faces.
You can see that elements 24, 47 and 68 on rank 0, and elements 4, 20 and 38 on rank 1, have the same nodes on both the top and bottom faces, and that's right!

The problem is that elements 19, 37 and 53 on rank 1 (marked with arrows) also have unsplit nodes, and they should not!
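Reading the rows above (element index followed by the 8 node indices of its top and bottom faces), an element is fully split only if its two faces share no node index. A small diagnostic sketch of that check (not GEOSX code), using the data printed above:

```cpp
#include <array>
#include <cassert>

// Sketch only: given the 4 node indices of the top face and the 4 of the
// bottom face of a fracture element, count how many indices appear on both.
// A fully split element shares 0 nodes; any shared index is a node that was
// never duplicated -- e.g. nodes 46 and 423 of element 19 on rank 1.
int countSharedNodes( std::array< long, 4 > const & topFace,
                      std::array< long, 4 > const & bottomFace )
{
  int shared = 0;
  for( long const t : topFace )
  {
    for( long const b : bottomFace )
    {
      if( t == b )
      {
        ++shared;
      }
    }
  }
  return shared;
}
```

Applied to the output above, the flagged rank-1 elements come out with nonzero shared counts, while an interior element like rank 1's element 39 shares none.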

@andrea-franceschini
Contributor Author

> I realized that the issue is not related to the Surface Generator. [...] Any idea on the possible problem? It seems to be related to how GEOSX handles the partitioning of an unstructured mesh.

This is solved by #864. Nevertheless ... there's another problem (with the same settings):

***** ERROR
***** LOCATION: /home/franc90/code/geosx/GEOSX/src/coreComponents/managers/ObjectManagerBase.hpp:492
***** Controlling expression (should be false): !allValuesMapped
***** Rank 2: some values of unmappedIndices were not used
Received signal 1: Hangup

** StackTrace of 12 frames **
Frame 1: cxx_utilities::handler(int, int, int)
Frame 2: void geosx::ObjectManagerBase::FixUpDownMaps<geosx::InterObjectRelation<LvArray::ArrayOfArrays<long, long> > >(geosx::InterObjectRelation<LvArray::ArrayOfArrays<long, long> >&, geosx::mapBase<long, LvArray::Array<long long, 1, camp::int_seq<long, 0l>, long, LvArray::NewChaiBuffer>, std::integral_constant<bool, true> >&, bool)
Frame 3: geosx::FaceManager::FixUpDownMaps(bool)
Frame 4: geosx::ParallelTopologyChange::SynchronizeTopologyChange(geosx::MeshLevel*, std::vector<geosx::NeighborCommunicator, std::allocator<geosx::NeighborCommunicator> >&, geosx::ModifiedObjectLists&, geosx::ModifiedObjectLists&, int)
Frame 5: geosx::SurfaceGenerator::SeparationDriver(geosx::DomainPartition*, geosx::MeshLevel*, std::vector<geosx::NeighborCommunicator, std::allocator<geosx::NeighborCommunicator> >&, int, int, bool, double)
Frame 6: geosx::SurfaceGenerator::SolverStep(double const&, double const&, int, geosx::DomainPartition*)
Frame 7: geosx::SurfaceGenerator::Execute(double, double, int, int, double, geosx::dataRepository::Group*)
Frame 8: geosx::EventBase::Execute(double, double, int, int, double, geosx::dataRepository::Group*)
Frame 9: geosx::EventManager::Run(geosx::dataRepository::Group*)
Frame 10: main
Frame 11: __libc_start_main
Frame 12: _start
====

@joshua-white
Contributor

@AF1990 What is the status of this issue?

@andrea-franceschini
Contributor Author

For unstructured grids, the parallel surface generator still has problems, such as the splitting being inconsistent across ranks.

@cssherman
Contributor

@rrsettgast @joshua-white - I think I'm running into this issue as well. I have a number of external meshes that conform to one or more fault surfaces. I'm getting the same message as @andrea-franceschini when I try to split the mesh in parallel runs. This is one of the simple meshes I'm testing, which has a 45-degree fault cutting through it:

image

Any thoughts on how to address this? I've attached the example xml file and mesh here. I get the error with the arguments `-x 2 -y 2 -z 2`.
test.zip

@cssherman
Contributor

@rrsettgast - I've created a 2x2x2 mesh with a vertical fault that shows the same behavior (see attached). The mesh appears to be split correctly if there is a partition aligned with the surface (`-x 2`, `-y 2`, or `-z 2`). However, if a partition corner is on the surface (`-x 2 -y 2`, `-x 2 -z 2`, or `-y 2 -z 2`), then we get the error:

***** ERROR
***** LOCATION: /usr/WS2/sherman/GEOSX/src/coreComponents/mesh/ObjectManagerBase.hpp:1024
***** Controlling expression (should be false): !allValuesMapped
***** Rank 2: some values of unmappedIndices were not used

** StackTrace of 10 frames **
Frame 0: geosx::EdgeManager::fixUpDownMaps(bool) 
Frame 1: geosx::ParallelTopologyChange::synchronizeTopologyChange(geosx::MeshLevel*, std::vector<geosx::NeighborCommunicator, std::allocator<geosx::NeighborCommunicator> >&, geosx::ModifiedObjectLists&, geosx::ModifiedObjectLists&, int) 
Frame 2: geosx::SurfaceGenerator::separationDriver(geosx::DomainPartition&, geosx::MeshLevel&, std::vector<geosx::NeighborCommunicator, std::allocator<geosx::NeighborCommunicator> >&, int, int, bool, double) 
Frame 3: geosx::SurfaceGenerator::solverStep(double const&, double const&, int, geosx::DomainPartition&) 
Frame 4: geosx::SurfaceGenerator::execute(double, double, int, int, double, geosx::DomainPartition&) 
Frame 5: geosx::EventBase::execute(double, double, int, int, double, geosx::DomainPartition&) 
Frame 6: geosx::EventManager::run(geosx::DomainPartition&) 
Frame 7: geosx::GeosxState::run() 
Frame 8: main 
Frame 9: __libc_start_main 
Frame 10: /g/g17/sherman/GEOS/geosx/GEOSX/build-quartz-gcc@8.1.0-release/bin/geosx 
=====

Note: the problem doesn't run with 8 partitions (`-x 2 -y 2 -z 2`) due to an error that occurs in PAMELA for such a small mesh.

small_test.zip

cssherman mentioned this issue Jul 30, 2021
@jhuang2601
Contributor

jhuang2601 commented Aug 25, 2021

@joshua-white @andrea-franceschini @cssherman @CusiniM @herve-gross This old issue has never been fixed. Recently, I hit the same roadblock when running the single-fracture compression problem with the Lagrangian Contact Solver. In this case, an external mesh is used, and it seems that the SurfaceGenerator is incompatible with unstructured meshes (PAMELAMeshGenerator) when running with multiple ranks.

By plotting both the silo and vtk outputs of shear displacement and comparing with the analytical solution, the same issue is observed for the case run with 2 ranks, which confirms that it is not related to the output format. Moreover, this anomaly happens at the partition boundary, which suggests that the parallel surface generator does not work properly with unstructured meshes in parallel.
image

SingleFracCompression.zip

@cssherman
Contributor

@jhuang2601 - Agreed. I tried to look at this with the example I included above, and I suspect it is an issue with the METIS partitioning (I couldn't nail anything down, though). I'm curious to see whether the VTK mesh generator will be subject to the same issue...

@rrsettgast
Member

> I've created a 2x2x2 mesh with a vertical fault that shows the same behavior (see attached). [...] Note: the problem doesn't run with 8 partitions (`-x 2 -y 2 -z 2`) due to an error that occurs in PAMELA for such a small mesh.

Since this mesh is read from PAMELA, the requested partition layout is irrelevant, so the `-x 2 -y 2` arguments are ignored. METIS partitions the problem however it sees fit. In this case, the METIS partitions are:
Screen Shot 2022-10-27 at 11 09 18 PM

Screen Shot 2022-10-27 at 11 09 31 PM

So it is a wonky partition...but we should still be able to handle it. I suspect the ghosting is incorrect.
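If the suspicion that the ghosting (rather than the partition shape itself) is at fault is right, the invariant being broken is that every ghost a rank stores must point at a rank that really owns that object. A minimal sketch of such a check (hypothetical data layout, not GEOSX internals):

```cpp
#include <cassert>
#include <map>
#include <set>
#include <utility>
#include <vector>

// Sketch only: ownedByRank maps each rank to the global indices it owns
// locally; ghosts lists (ownerRank, globalIndex) pairs stored on the
// current rank. However wonky the METIS partition is, ghosting is
// consistent only if each recorded owner actually owns the index.
bool ghostsAreConsistent( std::map< int, std::set< long > > const & ownedByRank,
                          std::vector< std::pair< int, long > > const & ghosts )
{
  for( auto const & [ ownerRank, globalIndex ] : ghosts )
  {
    auto const it = ownedByRank.find( ownerRank );
    if( it == ownedByRank.end() || it->second.count( globalIndex ) == 0 )
    {
      return false; // ghost points at an owner that does not own it
    }
  }
  return true;
}
```

A check along these lines, run right after FindGhosts, would localize whether the bad ghost lists come from the partition import or from the ghost exchange itself.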

@TotoGaz
Contributor

TotoGaz commented Oct 28, 2022

It's great that you got a tiny reproducer.
It's always been clear that there were issues with the ghost cells, but with this mesh it will surely be easier to debug 🤞 🎉 🌮
If you want, I can try to help you with the debugging.
