Improve NUMA affinity and startup times for configs that use a fixed heap #17405

Merged (1 commit) Mar 17, 2021

Conversation

@ronawho (Contributor) commented Mar 15, 2021

Improve the startup time and NUMA affinity for configurations that use a
fixed heap by interleaving and parallelizing the heap fault-in. High
performance networks require that memory is registered with the NIC/HCA
in order to do RDMA. We can either register all communicable memory at
startup using a fixed heap or we can register memory dynamically at some
point after it's been allocated in the user program.

Static registration can offer better communication performance since
there's just one registration call at startup and no lookups or
registration at communication time. However, static registration causes
slow startup because all memory is faulted in at program startup, and
prior to this effort that was done serially as a side effect of
registering memory with the NIC. Serial fault-in also resulted in poor
NUMA affinity and ignored user first-touch. Effectively, this meant that
most operations were just using memory out of NUMA domain 0, which
created a bandwidth bottleneck. Because of slow startup and poor
affinity, we have historically preferred dynamic registration when
available (for gasnet-ibv we default to segment large instead of fast;
for ugni we prefer dynamic registration).

This PR improves the situation for static registration by touching the
heap in parallel prior to registration, which improves fault-in speed.
We also interleave the memory faults so that pages are spread
round-robin or cyclically across the NUMA domains. This results in
better NUMA behavior since we're not just using NUMA domain 0. Half our
memory references will still be remote, so NUMA affinity isn't really
"better"; we're just spreading the load across the memory controllers.

Here are some performance results for stream on a couple different
platforms. Stream has no communication and is NUMA affinity sensitive.
The tables below show the reported benchmark rate and the total
execution time to show startup costs. Results for dynamic registration
are shown as a best-case comparison. Results have been rounded to make
them easier to parse (nearest 5 GB/s and 1 second). Generally speaking,
we see better (but not perfect) performance and significant improvements
in startup time.

```sh
export CHPL_LAUNCHER_REAL_WRAPPER=$CHPL_HOME/util/test/timers/highPrecisionTimer
chpl examples/benchmarks/hpcc/stream.chpl --fast
./stream -nl 8 --m=2861913600
```

Cray XC:
---

16M hugepages for ugni. Static configs use `CHPL_RT_MAX_HEAP_SIZE=106G`

| config              | stream   | runtime |
| ------------------- | --------:| ------: |
| ugni dynamic        | 735 GB/s |  3s     |
| ugni static         | 325 GB/s | 33s     |
| ugni static opt     | 325 GB/s | 12s     |
| gn-aries static     | 320 GB/s | 33s     |
| gn-aries static opt | 565 GB/s |  8s     |

ugni static registration is faster with this change, but NUMA affinity
doesn't change because the system default of `HUGETLB_NO_RESERVE=no`
means pages are pre-reserved before being faulted in.

For gasnet-aries we can see this change improves both startup time and
NUMA affinity. As expected it's not as good as user first-touch, but
it's better than before.

Cray CS (Intel):
---

2M Transparent Huge Pages (THP). Static configs use
`GASNET_PHYSMEM_MAX='106 GB' CHPL_RT_MAX_HEAP_SIZE=106G`

| config                 | stream   | runtime |
| ---------------------- | --------:| ------: |
| gn-ibv-large dynamic   | 760 GB/s |  2s     |
| gn-ibv-fast static     | 325 GB/s | 53s     |
| gn-ibv-fast static opt | 575 GB/s | 11s     |

Here we see the expected improvements to NUMA affinity and startup time
for static registration under gasnet.

Results for ofi on the same CS. These results are a little less obvious
because tcp and verbs suffer from dynamic connection costs that hurt
stream performance. The trends are the same, though; it's just that raw
stream performance is lower.

| config                 | stream   | runtime |
| ---------------------- | --------:| ------: |
| ofi-sockets no-reg     | 750 GB/s |  2s     |
| ofi-tcp no-reg         | 605 GB/s |  5s     |
| ofi-verbs static       | 300 GB/s | 54s     |
| ofi-verbs static opt   | 505 GB/s | 14s     |

Cray CS (AMD):
---

2M Transparent Huge Pages (THP) on Rome CPUs. Static configs use
`GASNET_PHYSMEM_MAX='427 GB' CHPL_RT_MAX_HEAP_SIZE=427G`

| config                 | stream    | runtime |
| ---------------------- | ---------:| ------: |
| gn-ibv-large dynamic   | 1725 GB/s |   1s    |
| gn-ibv-fast static     |  155 GB/s | 100s    |
| gn-ibv-fast static opt |  820 GB/s |  16s    |

Here the trends are the same as above, but we can see that the impact of
getting NUMA affinity wrong is much worse on Rome chips than we've seen
on Intel chips in the past. The startup time improvement is also
slightly better, which is good since these nodes have a lot of memory.

Other:
---

Below are some runs on Power, Arm, and AWS that show similar trends; I
wanted to check these since Arm/Power have different page sizes and AWS
is another interesting place to exercise ofi.

IB Power9:
---

PowerPC with an IB network. Power has 64K system pages. Static configs use
`GASNET_PHYSMEM_MAX='212 GB' CHPL_RT_MAX_HEAP_SIZE=212G`

| config                 | stream    | runtime |
| ---------------------- | ---------:| ------: |
| gn-ibv-large dynamic   | 1060 GB/s |  2s     |
| gn-ibv-fast static     |  335 GB/s | 21s     |
| gn-ibv-fast static opt |  410 GB/s |  9s     |

IB ARM:
---

Arm with an IB network. Arm also has 64K system pages. Static configs use
`GASNET_PHYSMEM_MAX='24 GB' CHPL_RT_MAX_HEAP_SIZE=24G`

| config                 | stream    | runtime |
| ---------------------- | ---------:| ------: |
| gn-ibv-large dynamic   | 2355 GB/s | 3s      |
| gn-ibv-fast static     |  720 GB/s | 6s      |
| gn-ibv-fast static opt | 1350 GB/s | 4s      |

AWS:
---

AWS instances with ofi. Static configs use `CHPL_RT_MAX_HEAP_SIZE=64G`

| config               | stream    | runtime |
| -------------------- | ---------:| ------: |
| ofi-sockets no-reg   | 1150 GB/s |  3s     |
| ofi-tcp no-reg       |  890 GB/s |  4s     |
| ofi-efa static       |  510 GB/s | 30s     |
| ofi-efa static opt   |  860 GB/s | 10s     |

In terms of the actual implementation, we create a thread per core, pin
each to a specific NUMA domain, and touch pages of memory in round-robin
fashion. This happens very early in program startup, so we have to
manually create and pin pthreads instead of using our tasking layer.
This approach requires an accurate page size, but we don't have that for
Transparent Huge Pages (THP), so we use a minimum of 2M, which is the
most common THP size. Longer term I'd like to use hwloc to set an
interleave memory policy and then just touch large chunks of memory, but
that requires libnuma and I didn't want to bring it in as a dependency
for the initial implementation. That's captured as future work in
https://github.com/Cray/chapel-private/issues/1816

Resolves https://github.com/Cray/chapel-private/issues/1088
Resolves https://github.com/Cray/chapel-private/issues/1798
Helps #9166

@ronawho requested a review from @gbtitus on March 15, 2021 at 22:53.
@gbtitus (Member) commented:

Nice! Looks good to me.

@ronawho ronawho merged commit c5e9e86 into chapel-lang:master Mar 17, 2021
@ronawho ronawho deleted the par-interleave-heap-init branch March 17, 2021 17:21
Maxrimus pushed a commit to Maxrimus/chapel that referenced this pull request Apr 5, 2021
ronawho added a commit that referenced this pull request Jun 17, 2022
Improve our gasnet support for the ofi conduit

GASNet-EX 2022.3.0 significantly improved the ofi conduit in order to
support Slingshot 10/11 networks and restored support for Omni-Path
networks. Now that ofi support is back, improve our usage of it.

For the chplenv, change the default conduit to be ofi when gasnet is
used on hpe-cray-ex systems and ensure segment fast is used in those
cases. Note that we don't always default to segment fast since that
isn't helpful on omni-path systems.

In the gasnet shim, serialize active-message polling for ofi (similar to
#14912 and #17419) and parallelize the heap fault-in for ofi segment
fast (similar to #17405). This results in significant performance
improvements for gasnet-ofi. For instance, parallelizing the heap
fault-in takes 16-node indexgather on SS-11 from 620 MB/s/node to 730
MB/s/node, and serializing polling takes it up to 3875 MB/s/node. On an
Omni-Path system, serializing polling takes us from 415 MB/s/node to 490
MB/s/node.

There are a few places where we expect gasnet-ofi might be our default.
This is definitely true for omni-path systems, where ofi with the psm2
provider is recommended. Note that our native `CHPL_COMM=ofi` layer does
not work with psm2 and we don't expect to put in the effort to get it
working (beyond the comm layer, we would also need spawning and
out-of-band support that I don't think is worth adding currently). On
Slingshot-10 systems it's still up in the air whether gasnet-ofi,
gasnet-ucx, or our native ofi comm layer will be best in the long term.
Currently, our native ofi layer is not working there, but that is a bug
we need to address. Lastly, it's possible to use gasnet-ofi on Slingshot-11
systems, but we expect our native ofi comm layer to be the preferred
option since that's what we've mostly been developing it for. This is
much like using our native ugni layer on Aries systems instead of
gasnet-aries because it gives us more control on flagship HPE/Cray
systems.

In order to evaluate the current state of gasnet-ofi and what we might
recommend to users I gathered performance figures on a few systems for 5
benchmarks that expose different patterns/idioms we care about:
 - Stream (no communication, numa affinity sensitive)
 - PRK-stencil (little comm, numa affinity sensitive)
 - ISx (concurrent bulk comm, numa affinity sensitive)
 - Indexgather (concurrent bulk/aggregated comm)
 - RA (concurrent fine-grained comm -- RDMA (rmo) and AM (on) variants)

```
chpl --fast test/release/examples/benchmarks/hpcc/stream.chpl
chpl --fast test/studies/prk/Stencil/optimized/stencil-opt.chpl -sorder="sqrt(16e9*numLocales / 8):int"
chpl --fast test/studies/isx/isx-hand-optimized.chpl -smode=scaling.weakISO
chpl --fast test/studies/bale/indexgather/ig.chpl -sN=10000000 -sprintStats -smode=Mode.aggregated
chpl --fast test/release/examples/benchmarks/hpcc/ra.chpl -sverify=false -suseOn=false -sN_U="2**(n-12)" -o ra-rmo
chpl --fast test/release/examples/benchmarks/hpcc/ra.chpl -sverify=false -suseOn=true -sN_U="2**(n-12)" -o ra-on

./stream -nl 16
./stencil-opt -nl 16
./isx-hand-optimized -nl 16
./ig -nl 16
./ra-rmo -nl 16
./ra-on -nl 16
```

Omni-path:
----------

16-nodes of an OPA cluster with 32 cores and 192 GB of ram per node:

| Config    | Stream    | PRK-Stencil  | ISx   | Indexgather   | RA-rmo       | RA-on        |
| --------- | --------: | -----------: | ----: | ------------: | -----------: | -----------: |
| Gasnet-1  | 2120 GB/s | 940 GFlops/s | 13.1s | 400 MB/s/node | 0.00082 GUPS | 0.00059 GUPS |
| 2021.9.0  | 2120 GB/s | 940 GFlops/s | 14.6s | 290 MB/s/node | 0.00045 GUPS | 0.00059 GUPS |
| 2022.3.0  | 2120 GB/s | 940 GFlops/s | 13.3s | 415 MB/s/node | 0.00086 GUPS | 0.00058 GUPS |
| ser-poll  | 2120 GB/s | 945 GFlops/s | 12.3s | 495 MB/s/node | 0.00086 GUPS | 0.00174 GUPS |

Previously, omni-path users had to revert to gasnet-1 and the initial
ofi support added in 2021.9.0 hurt performance. What we see now is that
2022.3.0 restores performance to gasnet-1 levels and serializing the
poller further improves performance. This makes me comfortable telling
users to stop falling back to gasnet-1 and just use the current support
for omni-path.

Slingshot-10:
-------------

16-nodes of a SS-10 system with 128 cores and 512 GB of ram per node:

| Config     | Stream    | PRK-Stencil   | ISx   | Indexgather    | RA-rmo      | RA-on       |
| ---------- | --------: | ------------: | ----: | -------------: | ----------: | ----------: |
| gasnet-ofi | 2335 GB/s | 1435 GFlops/s | 31.7s | 1830 MB/s/node | 0.0033 GUPS | 0.0030 GUPS |
| gasnet-ucx | 2355 GB/s | 1420 GFlops/s | 16.4s | 2290 MB/s/node | 0.0021 GUPS | 0.0015 GUPS |

Generally speaking, gasnet-ucx seems to perform the best on SS-10. I
should also note that our native `CHPL_COMM=ofi` layer does not appear
to be working correctly on SS-10 so I don't have that to compare to,
though we'll want to get that working in the near term. Also note that
serializing polling was important for performance for ofi on SS-10.

Slingshot-11:
-------------

16-nodes of a SS-11 system with 128 cores and 512 GB of ram per node:

| Config     | Stream    | PRK-Stencil   | ISx   | Indexgather    | RA-rmo     | RA-on      |
| ---------- | --------: | ------------: | ----: | -------------: | ---------: | ---------: |
| gasnet-ofi | 2460 GB/s | 1470 GFlops/s | 15.2s | 3875 MB/s/node | 0.003 GUPS | 0.004 GUPS |
| comm=ofi   | 2565 GB/s | 1540 GFlops/s |  7.3s | 5030 MB/s/node | 0.132 GUPS | 0.018 GUPS |

Our native `CHPL_COMM=ofi` layer generally outperforms gasnet-ofi. This
is especially true for concurrent fine-grained operations like RA where
injection is serialized under gasnet currently. Note that much like ugni
on XC, we expect our ofi comm layer to be the native/default layer on
SS-11, but the comparison is still interesting and I expect we can
improve gasnet-ofi SS-11 performance with some tuning.

Summary:
--------

Overall, GASNet-EX 2022.3.0 significantly improved ofi performance, and
bringing in optimizations we applied to other conduits further improved
performance. I recommend gasnet-ex for omni-path in the short and long
term, gasnet-ucx for ss-10 in the short term and TBD for the long term,
and our native ofi layer for ss-11. The other potential targets for ofi
in the future are AWS EFA and Cornelis Omni-Path, but these will require
more exploration.