
hwloc/base: fix opal proc locality wrt to NUMA nodes on hwloc 2.0 #7201


Merged
hppritcha merged 1 commit into open-mpi:master on Feb 3, 2020

Conversation

bgoglin
Contributor

bgoglin commented Nov 27, 2019

This should be backported to 4.0.x and 3.1.x (I am not sure how you guys handle this)

Both opal_hwloc_base_get_relative_locality() and _get_locality_string()
iterate over hwloc levels to build the proc locality information.
Unfortunately, NUMA nodes are not in those normal levels anymore since 2.0.
We have to explicitly look at the special NUMA level to get that locality info.

I factored the core of the iterations into dedicated "_by_depth" functions,
which are called again for the NUMA level at the end of the loops.

Thanks to Hatem Elshazly for reporting the NUMA communicator split failure
at https://www.mail-archive.com/users@lists.open-mpi.org/msg33589.html

It looks like only the opal_hwloc_base_get_locality_string() part is needed
to fix that split, but there's no reason not to fix get_relative_locality()
as well.

Signed-off-by: Brice Goglin <Brice.Goglin@inria.fr>
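
To make the hwloc 2.0 change concrete, here is a minimal standalone sketch (editorial illustration, not the OPAL code from this PR; it needs hwloc >= 2.0 headers to build): a plain loop over the normal topology levels never reports NUMA nodes, which have to be queried through their special virtual depth instead.

#include <hwloc.h>
#include <stdio.h>

int main(void)
{
    hwloc_topology_t topo;
    hwloc_topology_init(&topo);
    hwloc_topology_load(topo);

    /* Normal levels (Machine, Package, L3, ..., Core, PU): since hwloc 2.0
     * NUMANODE never shows up here. */
    int depth, topodepth = hwloc_topology_get_depth(topo);
    for (depth = 0; depth < topodepth; depth++)
        printf("depth %d: %u x %s\n", depth,
               hwloc_get_nbobjs_by_depth(topo, depth),
               hwloc_obj_type_string(hwloc_get_depth_type(topo, depth)));

    /* NUMA nodes live at a special (negative) virtual depth and must be
     * queried explicitly, which is what this patch adds at the end of the
     * iteration loops. */
    printf("NUMA nodes: %u\n",
           hwloc_get_nbobjs_by_depth(topo, HWLOC_TYPE_DEPTH_NUMANODE));

    hwloc_topology_destroy(topo);
    return 0;
}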

@hppritcha
Member

@bgoglin this should be backported to 4.0.x?

@hppritcha
Member

Sorry, I didn't read your commit message.

@hppritcha
Member

@bgoglin when I try to test this PR on an x86_64 dual-core Haswell system, it doesn't seem to be working:

hpp@cn165:~>mpirun -np 48 ./test_numa_bug
---- WORLD Communicator size: 48
**** SHARED Communicator size: 48
#### NUMA Communicator size: 1
#### NUMA Communicator size: 1
#### NUMA Communicator size: 1
#### NUMA Communicator size: 1
[... the same "NUMA Communicator size: 1" line repeats 45 more times, once per rank ...]

@bgoglin
Contributor Author

bgoglin commented Jan 27, 2020

@hppritcha Strange, I just tested again: git master (last nightly snapshot) fails as expected, and the patch from this PR works fine, using the internal hwloc, on a dual-socket 12-core Haswell with Cluster-on-Die (2 sockets, 2 NUMA nodes each, 6 cores each).

@gpaulsen
Member

@bgoglin what is your mpirun command line, specifically the hostfile details? I'm trying on ppc64le as well.

@bgoglin
Contributor Author

bgoglin commented Jan 29, 2020

There's nothing interesting in my command line; I am running on a single node with: mpiexec --mca mtl ^psm,psm2 -np 24 ~/split

By the way, I added more tests to the test program to confirm that only the NUMA level was broken:

#include <stdio.h>
#include "mpi.h"

int main(void){
    MPI_Init(NULL, NULL);

    int world_size, shared_size, numa_size, sock_size, l3_size;
    int world_rank, shared_rank, numa_rank, sock_rank, l3_rank;
    int key = 0;
    MPI_Comm shared_comm, numa_comm, sock_comm, l3_comm;

    MPI_Comm_size(MPI_COMM_WORLD, &world_size);
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);

    /* ranks that share a node (shared memory) */
    MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, key, MPI_INFO_NULL, &shared_comm);
    MPI_Comm_size(shared_comm, &shared_size);
    MPI_Comm_rank(shared_comm, &shared_rank);

    /* ranks on the same socket (OMPI_COMM_TYPE_* are Open MPI extensions) */
    MPI_Comm_split_type(MPI_COMM_WORLD, OMPI_COMM_TYPE_SOCKET, key, MPI_INFO_NULL, &sock_comm);
    MPI_Comm_size(sock_comm, &sock_size);
    MPI_Comm_rank(sock_comm, &sock_rank);

    /* ranks on the same NUMA node (the broken case before this fix) */
    MPI_Comm_split_type(MPI_COMM_WORLD, OMPI_COMM_TYPE_NUMA, key, MPI_INFO_NULL, &numa_comm);
    MPI_Comm_size(numa_comm, &numa_size);
    MPI_Comm_rank(numa_comm, &numa_rank);

    /* ranks sharing the same L3 cache */
    MPI_Comm_split_type(MPI_COMM_WORLD, OMPI_COMM_TYPE_L3CACHE, key, MPI_INFO_NULL, &l3_comm);
    MPI_Comm_size(l3_comm, &l3_size);
    MPI_Comm_rank(l3_comm, &l3_rank);

    if(world_rank == 0){
       printf("---- WORLD Communicator size: %d\n", world_size);
    }
    if(shared_rank == 0){
      printf("**** SHARED Communicator size: %d\n", shared_size);
    }
    if(sock_rank == 0){
      printf("#### SOCK Communicator size: %d\n", sock_size);
    }
    if(numa_rank == 0){
      printf("#### NUMA Communicator size: %d\n", numa_size);
    }
    if(l3_rank == 0){
      printf("#### L3 Communicator size: %d\n", l3_size);
    }
    
    MPI_Finalize();
    return 0;
}
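
For reference, the test builds and runs with the usual Open MPI wrappers; the source file name below is just an example:

mpicc test_numa.c -o test_numa
mpiexec --mca mtl ^psm,psm2 -np 24 ./test_numa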

@bgoglin
Copy link
Contributor Author

bgoglin commented Jan 29, 2020

I also tried with 4 of those Haswell nodes in a single job with 96 ranks; it works fine.
And then with 144 ranks on 4 Skylake nodes (2 sockets x 1 NUMA node x 18 cores), no problem either.

@hppritcha
Copy link
Member

Okay, not sure what was going on, but it now works for me, at least on ARM TX2 nodes.

@hppritcha hppritcha merged commit d2b68e6 into open-mpi:master Feb 3, 2020
hppritcha added a commit to hppritcha/ompi that referenced this pull request Feb 3, 2020
not being defined.

related to open-mpi#7201

Signed-off-by: Howard Pritchard <howardp@lanl.gov>
hppritcha added a commit to hppritcha/ompi that referenced this pull request Feb 4, 2020
PR open-mpi#7201 broke use of the hwloc 1.x series. This patch gets hwloc 1.x working again with OMPI.

fixes open-mpi#7362

Signed-off-by: Howard Pritchard <howardp@lanl.gov>
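
(The hwloc 1.x breakage comes from the special NUMA depth only existing in hwloc >= 2.0. A hypothetical guard of the kind needed to keep both series working is sketched below; this is illustrative only, not the code from the commit above.)

#include <hwloc.h>

/* Illustrative helper: hwloc >= 2.0 exposes NUMA nodes at a special virtual
 * depth, while hwloc 1.x keeps them in the regular level list as
 * HWLOC_OBJ_NODE. */
static unsigned count_numa_nodes(hwloc_topology_t topo)
{
#if HWLOC_API_VERSION >= 0x00020000
    return hwloc_get_nbobjs_by_depth(topo, HWLOC_TYPE_DEPTH_NUMANODE);
#else
    int n = hwloc_get_nbobjs_by_type(topo, HWLOC_OBJ_NODE);
    return n < 0 ? 0 : (unsigned)n;
#endif
}
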
cniethammer pushed a commit to cniethammer/ompi that referenced this pull request May 10, 2020
not being defined.

related to open-mpi#7201

Signed-off-by: Howard Pritchard <howardp@lanl.gov>