Improve gasnet-aries startup time #9166
This will be an issue so long as static registration is used. Ugni is our preferred/recommended configuration for Crays, so I'm going to close this, but I've added it to #5703 (comment).
This will be improved by #17405, which parallelizes the heap fault-in. Running a no-op program on an XC with 128 GB of memory before that PR:

```sh
export CHPL_LAUNCHER_REAL_WRAPPER=$CHPL_HOME/util/test/timers/highPrecisionTimer
chpl test/performance/elliot/no-op.chpl --fast
./no-op -nl 1
> 32.28
```

And with that PR:

```sh
export CHPL_LAUNCHER_REAL_WRAPPER=$CHPL_HOME/util/test/timers/highPrecisionTimer
chpl test/performance/elliot/no-op.chpl --fast
./no-op -nl 1
> real 5.911
```
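For context, `CHPL_LAUNCHER_REAL_WRAPPER` wraps the launched program with `util/test/timers/highPrecisionTimer`, which reports the wall-clock ("real") time of the run. Below is a minimal sketch of such a wrapper in C, assuming it simply forks the wrapped command and prints the elapsed monotonic time; the actual timer utility may be implemented differently.

```c
/* Minimal wall-clock timer wrapper: runs a command and prints "real <seconds>".
 * Illustrative only; the real highPrecisionTimer utility may differ. */
#include <stdio.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

int main(int argc, char** argv) {
  if (argc < 2) {
    fprintf(stderr, "usage: %s <command> [args...]\n", argv[0]);
    return 1;
  }

  struct timespec start, end;
  clock_gettime(CLOCK_MONOTONIC, &start);

  pid_t pid = fork();
  if (pid == 0) {                    /* child: run the wrapped command */
    execvp(argv[1], &argv[1]);
    perror("execvp");
    _exit(127);
  }
  int status = 0;
  waitpid(pid, &status, 0);          /* parent: wait for it to finish */

  clock_gettime(CLOCK_MONOTONIC, &end);
  double elapsed = (end.tv_sec - start.tv_sec) +
                   (end.tv_nsec - start.tv_nsec) / 1e9;
  fprintf(stderr, "real %.3f\n", elapsed);
  return WEXITSTATUS(status);
}
```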
ronawho added a commit that referenced this issue on Mar 17, 2021:
Improve NUMA affinity and startup times for configs that use a fixed heap [reviewed by @gbtitus]

Improve the startup time and NUMA affinity for configurations that use a fixed heap by interleaving and parallelizing the heap fault-in.

High performance networks require that memory is registered with the NIC/HCA in order to do RDMA. We can either register all communicable memory at startup using a fixed heap, or we can register memory dynamically at some point after it's been allocated in the user program. Static registration can offer better communication performance since there's just one registration call at startup and no lookups or registration at communication time. However, static registration causes slow startup because all memory is faulted in at program startup, and prior to this effort that was done serially as a side effect of registering memory with the NIC. Serial fault-in also resulted in poor NUMA affinity and ignored user first-touch. Effectively, this meant that most operations were just using memory out of NUMA domain 0, which created a bandwidth bottleneck. Because of slow startup and poor affinity we have historically preferred dynamic registration when available (for gasnet-ibv we default to segment large instead of fast; for ugni we prefer dynamic registration).

This PR improves the situation for static registration by touching the heap in parallel prior to registration, which improves fault-in speed. We also interleave the memory faults so that pages are spread round-robin or cyclically across the NUMA domains. This results in better NUMA behavior since we're not just using NUMA domain 0. Half our memory references will still be wrong, so NUMA affinity isn't really "better"; we're just spreading load between the memory controllers.

Here are some performance results for stream on a couple of different platforms. Stream has no communication and is NUMA affinity sensitive. The tables below show the reported benchmark rate and the total execution time to show startup costs. Results for dynamic registration are shown as a best-case comparison. Results have been rounded to make them easier to parse (nearest 5 GB/s and 1 second). Generally speaking we see better, but not perfect, performance and significant improvements in startup time.

```sh
export CHPL_LAUNCHER_REAL_WRAPPER=$CHPL_HOME/util/test/timers/highPrecisionTimer
chpl examples/benchmarks/hpcc/stream.chpl --fast
./stream -nl 8 --m=2861913600
```

Cray XC:
---
16M hugepages for ugni. Static configs use `CHPL_RT_MAX_HEAP_SIZE=106G`

| config              | stream   | runtime |
| ------------------- | --------:| -------:|
| ugni dynamic        | 735 GB/s | 3s      |
| ugni static         | 325 GB/s | 33s     |
| ugni static opt     | 325 GB/s | 12s     |
| gn-aries static     | 320 GB/s | 33s     |
| gn-aries static opt | 565 GB/s | 8s      |

ugni static registration is faster with this change, but NUMA affinity doesn't change because the system default of `HUGETLB_NO_RESERVE=no` means pages are pre-reserved before being faulted in. For gasnet-aries we can see this improves startup time and improves NUMA affinity. As expected it's not as good as user first-touch, but it's better than before.

Cray CS (Intel):
---
2M Transparent Huge Pages (THP). Static configs use `GASNET_PHYSMEM_MAX='106 GB' CHPL_RT_MAX_HEAP_SIZE=106G`

| config                 | stream   | runtime |
| ---------------------- | --------:| -------:|
| gn-ibv-large dynamic   | 760 GB/s | 2s      |
| gn-ibv-fast static     | 325 GB/s | 53s     |
| gn-ibv-fast static opt | 575 GB/s | 11s     |

Here we see the expected improvements to NUMA affinity and startup time for static registration under gasnet.

Results for ofi on the same CS. These results are a little less obvious because tcp and verbs suffer from dynamic connection costs that hurt stream performance. The trends are the same, though; it's just that raw stream performance is lower.

| config               | stream   | runtime |
| -------------------- | --------:| -------:|
| ofi-sockets no-reg   | 750 GB/s | 2s      |
| ofi-tcp no-reg       | 605 GB/s | 5s      |
| ofi-verbs static     | 300 GB/s | 54s     |
| ofi-verbs static opt | 505 GB/s | 14s     |

Cray CS (AMD):
---
2M Transparent Huge Pages (THP) on Rome CPUs. Static configs use `GASNET_PHYSMEM_MAX='427 GB' CHPL_RT_MAX_HEAP_SIZE=427G`

| config                 | stream    | runtime |
| ---------------------- | ---------:| -------:|
| gn-ibv-large dynamic   | 1725 GB/s | 1s      |
| gn-ibv-fast static     | 155 GB/s  | 100s    |
| gn-ibv-fast static opt | 820 GB/s  | 16s     |

Here the trends are the same as above, but we can see the impact of getting NUMA affinity wrong on Rome chips is much worse than we've seen on Intel chips in the past. The startup time improvement is also slightly better, which is good since these nodes have a lot of memory.

Other:
---
And some runs on Power, Arm, and AWS that have similar trends, but I wanted to check since Arm/Power have different page sizes and AWS is another interesting place to check ofi.

<details>

IB Power9:
---
PowerPC with IB network. Power has 64K system pages. Static configs use `GASNET_PHYSMEM_MAX='212 GB' CHPL_RT_MAX_HEAP_SIZE=212G`

| config                 | stream    | runtime |
| ---------------------- | ---------:| -------:|
| gn-ibv-large dynamic   | 1060 GB/s | 2s      |
| gn-ibv-fast static     | 335 GB/s  | 21s     |
| gn-ibv-fast static opt | 410 GB/s  | 9s      |

IB ARM:
---
ARM with IB network. Arm also has 64K system pages. Static configs use `GASNET_PHYSMEM_MAX='24 GB' CHPL_RT_MAX_HEAP_SIZE=24G`

| config                 | stream    | runtime |
| ---------------------- | ---------:| -------:|
| gn-ibv-large dynamic   | 2355 GB/s | 3s      |
| gn-ibv-fast static     | 720 GB/s  | 6s      |
| gn-ibv-fast static opt | 1350 GB/s | 4s      |

AWS:
---
AWS instances with ofi. Static configs use `CHPL_RT_MAX_HEAP_SIZE=64G`

| config             | stream    | runtime |
| ------------------ | ---------:| -------:|
| ofi-sockets no reg | 1150 GB/s | 3s      |
| ofi-tcp no reg     | 890 GB/s  | 4s      |
| ofi-efa static     | 510 GB/s  | 30s     |
| ofi-efa static opt | 860 GB/s  | 10s     |

</details>

In terms of the actual implementation, we basically create a thread per core, pin it to a specific NUMA domain, and touch a page of memory in a round-robin fashion. This happens very early in program startup, so we have to manually create and pin pthreads instead of using our tasking layer. This approach requires an accurate page size, but we don't have that for Transparent Huge Pages (THP), so we just use a minimum of 2M, which is the most common THP size. Longer term I'd like to use hwloc to set an interleave memory policy and then just touch large chunks of memory, but that requires libnuma and I didn't want to bring that in as a dependency for the initial implementation. That's captured as future work in Cray/chapel-private#1816

Resolves Cray/chapel-private#1088
Resolves Cray/chapel-private#1798
Helps #9166
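For readers interested in the fault-in approach described in the commit above, here is a rough sketch using plain pthreads. The names, the one-worker-per-NUMA-domain structure (the actual change uses a thread per core), the fixed 2M page-size floor, and the way per-domain CPU sets are obtained are all illustrative assumptions, not the real runtime code.

```c
/* Sketch of interleaved, parallel heap fault-in (illustrative, not the actual
 * Chapel runtime code). Each worker thread is pinned to one NUMA domain and
 * touches every numDomains-th page, so first-touch places pages round-robin
 * across the domains while the faults happen in parallel. */
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stddef.h>

#define HEAP_PAGE_SIZE (2 * 1024 * 1024)  /* assume 2M THP when unknown */

typedef struct {
  char*     heap;        /* start of the fixed heap */
  size_t    heapSize;
  int       domain;      /* NUMA domain this worker is bound to */
  int       numDomains;
  cpu_set_t cpus;        /* CPUs of that domain (discovered elsewhere) */
} faultinArg_t;

static void* faultinWorker(void* argP) {
  faultinArg_t* arg = (faultinArg_t*)argP;

  /* Pin the thread so first-touch allocates from this worker's NUMA domain. */
  pthread_setaffinity_np(pthread_self(), sizeof(arg->cpus), &arg->cpus);

  size_t numPages = (arg->heapSize + HEAP_PAGE_SIZE - 1) / HEAP_PAGE_SIZE;
  for (size_t pg = arg->domain; pg < numPages; pg += arg->numDomains) {
    /* Writing one byte per page is enough to fault the page in. */
    ((volatile char*)arg->heap)[pg * HEAP_PAGE_SIZE] = 0;
  }
  return NULL;
}

/* Launch one worker per NUMA domain and wait for the faults to complete.
 * cpusByDomain[] would come from hwloc or /sys in real code. */
static void faultInHeap(char* heap, size_t heapSize,
                        const cpu_set_t* cpusByDomain, int numDomains) {
  pthread_t    tids[numDomains];
  faultinArg_t args[numDomains];
  for (int d = 0; d < numDomains; d++) {
    args[d] = (faultinArg_t){ heap, heapSize, d, numDomains, cpusByDomain[d] };
    pthread_create(&tids[d], NULL, faultinWorker, &args[d]);
  }
  for (int d = 0; d < numDomains; d++)
    pthread_join(tids[d], NULL);
}
```

With first-touch placement, each page ends up on the NUMA domain of the worker that touched it, which is what spreads pages cyclically across domains; the longer-term hwloc-based interleave policy mentioned in the commit would achieve a similar layout without manual per-page touching.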
Maxrimus pushed a commit to Maxrimus/chapel that referenced this issue on Apr 5, 2021.
Gasnet-aries startup time (really the time for a no-op program, so startup and teardown) currently lags behind ugni (30s vs 5s): https://chapel-lang.org/perf/16-node-xc/?startdate=2018/01/05&enddate=2018/04/09&graphs=noopnonusercodestartuptime
On 28-core (56 HT) Broadwell nodes with 128 GB of RAM, a no-op program takes:
This is almost certainly because we set a huge segment size (90% of available memory), so a lot of time is spent allocating, faulting in, and registering memory. With a lower heap size, timings are much faster (and under ugni we dynamically allocate/register most memory, so in a no-op program very little time will be spent allocating/registering):
We also see some increased exit times with GASNET_DOMAIN_COUNT set (#7251). Lowering that also speeds things up slightly: