
hwloc/base: fix opal proc locality wrt to NUMA nodes on hwloc 2.0 #7201


Merged
hppritcha merged 1 commit into open-mpi:master on Feb 3, 2020

Conversation

bgoglin
Contributor

bgoglin commented Nov 27, 2019

This should be backported to 4.0.x and 3.1.x (I am not sure how you guys handle this)

Both opal_hwloc_base_get_relative_locality() and _get_locality_string()
iterate over hwloc levels to build the proc locality information.
Unfortunately, NUMA nodes are not in those normal levels anymore since 2.0.
We have to explicitly look at the special NUMA level to get that locality info.

I factored the core of the iterations into dedicated "_by_depth" functions,
which are called again for the NUMA level at the end of the loops.

Thanks to Hatem Elshazly for reporting the NUMA communicator split failure
at https://www.mail-archive.com/users@lists.open-mpi.org/msg33589.html

It looks like only the opal_hwloc_base_get_locality_string() part is needed
to fix that split, but there's no reason not to fix get_relative_locality()
as well.

Signed-off-by: Brice Goglin <Brice.Goglin@inria.fr>
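
To make the hwloc 2.0 change concrete, here is a minimal standalone sketch (editorial illustration, not the OPAL code from this PR; it needs hwloc >= 2.0 headers to build): a plain loop over the normal topology levels never reports NUMA nodes, which have to be queried through their special virtual depth instead.

#include <hwloc.h>
#include <stdio.h>

int main(void)
{
    hwloc_topology_t topo;
    hwloc_topology_init(&topo);
    hwloc_topology_load(topo);

    /* Normal levels (Machine, Package, L3, ..., Core, PU): since hwloc 2.0
     * NUMANODE never shows up here. */
    int depth, topodepth = hwloc_topology_get_depth(topo);
    for (depth = 0; depth < topodepth; depth++)
        printf("depth %d: %u x %s\n", depth,
               hwloc_get_nbobjs_by_depth(topo, depth),
               hwloc_obj_type_string(hwloc_get_depth_type(topo, depth)));

    /* NUMA nodes live at a special (negative) virtual depth and must be
     * queried explicitly, which is what this patch adds at the end of the
     * iteration loops. */
    printf("NUMA nodes: %u\n",
           hwloc_get_nbobjs_by_depth(topo, HWLOC_TYPE_DEPTH_NUMANODE));

    hwloc_topology_destroy(topo);
    return 0;
}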

@hppritcha
Member

@bgoglin this should be backported to 4.0.x?

@hppritcha
Member

Sorry, I didn't read your commit message.

@hppritcha
Member

@bgoglin when I try to test this PR on an x86_64 dual-core Haswell system, it doesn't seem to be working:

hpp@cn165:~>mpirun -np 48 ./test_numa_bug
---- WORLD Communicator size: 48
**** SHARED Communicator size: 48
#### NUMA Communicator size: 1
#### NUMA Communicator size: 1
#### NUMA Communicator size: 1
#### NUMA Communicator size: 1
[... the same "NUMA Communicator size: 1" line repeats 45 more times, once per rank ...]

@bgoglin
Contributor Author

bgoglin commented Jan 27, 2020

@hppritcha Strange, I just tested again: git master (last nightly snapshot) fails as expected, and the patch from this PR works fine, using the internal hwloc, on a dual-socket 12-core Haswell with Cluster-on-Die (2 sockets, 2 NUMA nodes each, 6 cores each).

@gpaulsen
Member

@bgoglin what is your mpirun command line, specifically the hostfile details? I'm trying on ppc64le as well.

@bgoglin
Contributor Author

bgoglin commented Jan 29, 2020

There's nothing interesting in my command line; I am running on a single node with: mpiexec --mca mtl ^psm,psm2 -np 24 ~/split

By the way, I added more tests to the test program to confirm that only the NUMA level was broken:

#include <stdio.h>
#include "mpi.h"

int main(void){
    MPI_Init(NULL, NULL);

    int world_size, shared_size, numa_size, sock_size, l3_size;
    int world_rank, shared_rank, numa_rank, sock_rank, l3_rank;
    int key = 0;
    MPI_Comm shared_comm, numa_comm, sock_comm, l3_comm;

    MPI_Comm_size(MPI_COMM_WORLD, &world_size);
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);

    /* ranks that share a node (shared memory) */
    MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, key, MPI_INFO_NULL, &shared_comm);
    MPI_Comm_size(shared_comm, &shared_size);
    MPI_Comm_rank(shared_comm, &shared_rank);

    /* ranks on the same socket (OMPI_COMM_TYPE_* are Open MPI extensions) */
    MPI_Comm_split_type(MPI_COMM_WORLD, OMPI_COMM_TYPE_SOCKET, key, MPI_INFO_NULL, &sock_comm);
    MPI_Comm_size(sock_comm, &sock_size);
    MPI_Comm_rank(sock_comm, &sock_rank);

    /* ranks on the same NUMA node (the broken case before this fix) */
    MPI_Comm_split_type(MPI_COMM_WORLD, OMPI_COMM_TYPE_NUMA, key, MPI_INFO_NULL, &numa_comm);
    MPI_Comm_size(numa_comm, &numa_size);
    MPI_Comm_rank(numa_comm, &numa_rank);

    /* ranks sharing the same L3 cache */
    MPI_Comm_split_type(MPI_COMM_WORLD, OMPI_COMM_TYPE_L3CACHE, key, MPI_INFO_NULL, &l3_comm);
    MPI_Comm_size(l3_comm, &l3_size);
    MPI_Comm_rank(l3_comm, &l3_rank);

    if(world_rank == 0){
       printf("---- WORLD Communicator size: %d\n", world_size);
    }
    if(shared_rank == 0){
      printf("**** SHARED Communicator size: %d\n", shared_size);
    }
    if(sock_rank == 0){
      printf("#### SOCK Communicator size: %d\n", sock_size);
    }
    if(numa_rank == 0){
      printf("#### NUMA Communicator size: %d\n", numa_size);
    }
    if(l3_rank == 0){
      printf("#### L3 Communicator size: %d\n", l3_size);
    }
    
    MPI_Finalize();
    return 0;
}
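
For reference, the test builds and runs with the usual Open MPI wrappers; the source file name below is just an example:

mpicc test_numa.c -o test_numa
mpiexec --mca mtl ^psm,psm2 -np 24 ./test_numa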

@bgoglin
Copy link
Contributor Author

bgoglin commented Jan 29, 2020

I also tried with 4 of those Haswell nodes in a single job with 96 ranks; it works fine.
And then with 144 ranks on 4 Skylake nodes (2 sockets x 1 NUMA node x 18 cores), no problem either.

@hppritcha
Copy link
Member

Okay, not sure what was going on, but it now works for me, at least on ARM TX2 nodes.

@hppritcha hppritcha merged commit d2b68e6 into open-mpi:master Feb 3, 2020
hppritcha added a commit to hppritcha/ompi that referenced this pull request Feb 3, 2020
not being defined.

related to open-mpi#7201

Signed-off-by: Howard Pritchard <howardp@lanl.gov>
hppritcha added a commit to hppritcha/ompi that referenced this pull request Feb 4, 2020
PR open-mpi#7201 broke use of the hwloc 1.x series. This patch gets hwloc 1.x working again with OMPI.

fixes open-mpi#7362

Signed-off-by: Howard Pritchard <howardp@lanl.gov>
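
(The hwloc 1.x breakage comes from the special NUMA depth only existing in hwloc >= 2.0. A hypothetical guard of the kind needed to keep both series working is sketched below; this is illustrative only, not the code from the commit above.)

#include <hwloc.h>

/* Illustrative helper: hwloc >= 2.0 exposes NUMA nodes at a special virtual
 * depth, while hwloc 1.x keeps them in the regular level list as
 * HWLOC_OBJ_NODE. */
static unsigned count_numa_nodes(hwloc_topology_t topo)
{
#if HWLOC_API_VERSION >= 0x00020000
    return hwloc_get_nbobjs_by_depth(topo, HWLOC_TYPE_DEPTH_NUMANODE);
#else
    int n = hwloc_get_nbobjs_by_type(topo, HWLOC_OBJ_NODE);
    return n < 0 ? 0 : (unsigned)n;
#endif
}
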
cniethammer pushed a commit to cniethammer/ompi that referenced this pull request May 10, 2020
not being defined.

related to open-mpi#7201

Signed-off-by: Howard Pritchard <howardp@lanl.gov>