Improve gasnet-aries startup time #9166

Closed
ronawho opened this issue Apr 10, 2018 · 2 comments

ronawho commented Apr 10, 2018

Gasnet-aries startup time (really the time for a no-op program, so startup and teardown) currently lags behind ugni (30s vs 5s): https://chapel-lang.org/perf/16-node-xc/?startdate=2018/01/05&enddate=2018/04/09&graphs=noopnonusercodestartuptime

On 28-core (56 HT) Broadwell nodes with 128 GB of RAM, a no-op program takes:

```sh
export CHPL_LAUNCHER_REAL_WRAPPER=$CHPL_HOME/util/test/timers/highPrecisionTimer
./no-op -nl1

real 32.02
```

This is almost certainly because we set a huge segment size (90% of available memory), so a lot of time is spent allocating, faulting in, and registering memory. With a lower heap size, timings are much faster (and under ugni we dynamically allocate/register most memory, so in a no-op program very little time is spent allocating/registering):

```sh
export CHPL_LAUNCHER_REAL_WRAPPER=$CHPL_HOME/util/test/timers/highPrecisionTimer
export CHPL_RT_MAX_HEAP_SIZE=4G
export CHPL_RT_CALL_STACK_SIZE=1M
./no-op -nl1

real 3.811
```

We also see some increased exit times with `GASNET_DOMAIN_COUNT` set (#7251). Lowering it also speeds things up slightly:

```sh
export CHPL_LAUNCHER_REAL_WRAPPER=$CHPL_HOME/util/test/timers/highPrecisionTimer
export CHPL_RT_MAX_HEAP_SIZE=4G
export CHPL_RT_CALL_STACK_SIZE=1M
export GASNET_DOMAIN_COUNT=1
./no-op -nl1

real 2.579
```

ronawho commented Nov 14, 2019

This will be an issue so long as static registration is used. Ugni is our preferred/recommended configuration for Crays, so I'm going to close this, but I've added it to #5703 (comment).


ronawho commented Mar 15, 2021

This will be improved by #17405, which parallelizes the heap fault-in. Running a no-op program on an XC with 128 GB of memory before that PR:

```sh
export CHPL_LAUNCHER_REAL_WRAPPER=$CHPL_HOME/util/test/timers/highPrecisionTimer
chpl test/performance/elliot/no-op.chpl --fast
./no-op -nl 1
> 32.28
```

And with that PR:

```sh
export CHPL_LAUNCHER_REAL_WRAPPER=$CHPL_HOME/util/test/timers/highPrecisionTimer
chpl test/performance/elliot/no-op.chpl --fast
./no-op -nl 1
> real 5.911
```

ronawho added a commit that referenced this issue Mar 17, 2021
Improve NUMA affinity and startup times for configs that use a fixed heap

[reviewed by @gbtitus]

Improve the startup time and NUMA affinity for configurations that use a
fixed heap by interleaving and parallelizing the heap fault-in.
High-performance networks require that memory be registered with the
NIC/HCA in order to do RDMA. We can either register all communicable
memory at startup using a fixed heap, or we can register memory
dynamically at some point after it's been allocated in the user program.

Static registration can offer better communication performance since
there's just one registration call at startup and no lookups or
registration at communication time. However, static registration causes
slow startup because all memory is faulted in at program startup, and
prior to this effort that was done serially as a side effect of
registering memory with the NIC. Serial fault-in also resulted in poor
NUMA affinity and ignored user first-touch. Effectively, this meant that
most operations were just using memory out of NUMA domain 0, which
created a bandwidth bottleneck. Because of slow startup and poor
affinity we have historically preferred dynamic registration when
available (for gasnet-ibv we default to segment large instead of fast;
for ugni we default to dynamic registration).
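
For a concrete picture of the trade-off, here is a rough sketch of what static registration amounts to at the ibverbs level. This is not the Chapel runtime's or GASNet's actual code; the device choice, the 4 GiB heap size, and the access flags are placeholders, and error handling is omitted.

```c
/* Rough sketch (NOT Chapel/GASNet code): with static registration, one
 * huge fixed heap is registered with the NIC up front, which forces the
 * kernel to pin and fault in every page of the heap at startup. */
#include <infiniband/verbs.h>
#include <stdlib.h>

int main(void) {
  int numDevs = 0;
  struct ibv_device **devs = ibv_get_device_list(&numDevs);
  struct ibv_context *ctx  = ibv_open_device(devs[0]);
  struct ibv_pd      *pd   = ibv_alloc_pd(ctx);

  /* "Static registration": one big registration call at startup. */
  size_t heapSize = 4UL << 30;            /* placeholder fixed-heap size */
  void  *heap     = malloc(heapSize);
  struct ibv_mr *mr = ibv_reg_mr(pd, heap, heapSize,
                                 IBV_ACCESS_LOCAL_WRITE |
                                 IBV_ACCESS_REMOTE_READ |
                                 IBV_ACCESS_REMOTE_WRITE);

  /* All communicable memory would now come out of `heap`, so no lookups
   * or registration calls are needed at communication time.  The cost is
   * the slow, serial fault-in of the whole heap during this one call. */

  ibv_dereg_mr(mr);
  ibv_dealloc_pd(pd);
  ibv_close_device(ctx);
  ibv_free_device_list(devs);
  free(heap);
  return 0;
}
```

Dynamic registration flips the trade-off: registration calls like the one above would instead happen lazily, for smaller regions, after allocation, so startup is fast but there is registration/lookup overhead at communication time.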

This PR improves the situation for static registration by touching the
heap in parallel prior to registration, which improves fault-in speed.
We also interleave the memory faults so that pages are spread
round-robin (cyclically) across the NUMA domains. This results in
better NUMA behavior since we're not just using NUMA domain 0. Half of
our memory references will still be wrong, so NUMA affinity isn't really
"better"; we're just spreading the load between the memory controllers.

Here are some performance results for stream on a couple of different
platforms. Stream has no communication and is sensitive to NUMA
affinity. The tables below show the reported benchmark rate and the
total execution time, to capture startup costs. Results for dynamic
registration are shown as a best-case comparison. Results have been
rounded to make them easier to parse (nearest 5 GB/s and 1 second).
Generally speaking, we see better, but not perfect, performance and
significant improvements in startup time.

```sh
export CHPL_LAUNCHER_REAL_WRAPPER=$CHPL_HOME/util/test/timers/highPrecisionTimer
chpl examples/benchmarks/hpcc/stream.chpl --fast
./stream -nl 8 --m=2861913600
```

Cray XC:
---

16M hugepages for ugni. Static configs use `CHPL_RT_MAX_HEAP_SIZE=106G`

| config              | stream   | runtime |
| ------------------- | --------:| ------: |
| ugni dynamic        | 735 GB/s |  3s     |
| ugni static         | 325 GB/s | 33s     |
| ugni static opt     | 325 GB/s | 12s     |
| gn-aries static     | 320 GB/s | 33s     |
| gn-aries static opt | 565 GB/s |  8s     |

ugni static registration is faster with this change, but NUMA affinity
doesn't change because the system default of `HUGETLB_NO_RESERVE=no`
means pages are pre-reserved before being faulted in.

For gasnet-aries we can see this improves startup time and improves NUMA
affinity. As expected it's not as good as user first-touch but it's
better than before.

Cray CS (Intel):
---

2M Transparent Huge Pages (THP). Static configs use
`GASNET_PHYSMEM_MAX='106 GB' CHPL_RT_MAX_HEAP_SIZE=106G`

| config                 | stream   | runtime |
| ---------------------- | --------:| ------: |
| gn-ibv-large dynamic   | 760 GB/s |  2s     |
| gn-ibv-fast static     | 325 GB/s | 53s     |
| gn-ibv-fast static opt | 575 GB/s | 11s     |

Here we see the expected improvements to NUMA affinity and startup time
for static registration under gasnet.

Results for ofi on the same CS are below. These results are a little
less obvious because tcp and verbs suffer from dynamic connection costs
that hurt stream performance. The trends are the same, though; it's just
that raw stream performance is lower.

| config                 | stream   | runtime |
| ---------------------- | --------:| ------: |
| ofi-sockets no-reg     | 750 GB/s |  2s     |
| ofi-tcp no-reg         | 605 GB/s |  5s     |
| ofi-verbs static       | 300 GB/s | 54s     |
| ofi-verbs static opt   | 505 GB/s | 14s     |

Cray CS (AMD):
---

2M Transparent Huge Pages (THP) on Rome CPUs. Static configs use
`GASNET_PHYSMEM_MAX='427 GB' CHPL_RT_MAX_HEAP_SIZE=427G`

| config                 | stream    | runtime |
| ---------------------- | ---------:| ------: |
| gn-ibv-large dynamic   | 1725 GB/s |   1s    |
| gn-ibv-fast static     |  155 GB/s | 100s    |
| gn-ibv-fast static opt |  820 GB/s |  16s    |

Here the trends are the same as above, but we can see that the impact of
getting NUMA affinity wrong on Rome chips is much worse than what we've
seen on Intel chips in the past. The startup time improvement is also
slightly better, which is good since these nodes have a lot of memory.

Other:
---

Here are some runs on Power, Arm, and AWS that show similar trends. I
wanted to check these since Arm/Power have different page sizes and AWS
is another interesting place to check ofi.

<details>

IB Power9:
---

PowerPC with an IB network. Power has 64K system pages. Static configs use
`GASNET_PHYSMEM_MAX='212 GB' CHPL_RT_MAX_HEAP_SIZE=212G`

| config                 | stream    | runtime |
| ---------------------- | ---------:| ------: |
| gn-ibv-large dynamic   | 1060 GB/s |  2s     |
| gn-ibv-fast static     |  335 GB/s | 21s     |
| gn-ibv-fast static opt |  410 GB/s |  9s     |

IB ARM:
---

ARM with an IB network. Arm also has 64K system pages. Static configs use
`GASNET_PHYSMEM_MAX='24 GB' CHPL_RT_MAX_HEAP_SIZE=24G`

| config                 | stream    | runtime |
| ---------------------- | ---------:| ------: |
| gn-ibv-large dynamic   | 2355 GB/s | 3s      |
| gn-ibv-fast static     |  720 GB/s | 6s      |
| gn-ibv-fast static opt | 1350 GB/s | 4s      |

AWS:
---

AWS instances with ofi. Static configs use `CHPL_RT_MAX_HEAP_SIZE=64G`

| config               | stream    | runtime |
| -------------------- | ---------:| ------: |
| ofi-sockets no-reg   | 1150 GB/s |  3s     |
| ofi-tcp no-reg       |  890 GB/s |  4s     |
| ofi-efa static       |  510 GB/s | 30s     |
| ofi-efa static opt   |  860 GB/s | 10s     |

</details>

In terms of the actual implementation, we basically create a thread per
core, pin it to a specific NUMA domain, and touch pages of memory in a
round-robin fashion. This happens very early in program startup, so we
have to manually create and pin pthreads instead of using our tasking
layer. This approach requires an accurate page size, but we don't have
that for Transparent Huge Pages (THP), so we just use a minimum of 2M,
which is the most common THP size. Longer term I'd like to use hwloc to
set an interleave memory policy and then just touch large chunks of
memory, but that requires libnuma and I didn't want to bring that in as
a dependency for the initial implementation. That's captured as future
work in Cray/chapel-private#1816
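
As a rough illustration of that approach (this is not the actual runtime code; the thread count, 2M page size, and heap size below are assumed placeholders, and error handling is omitted):

```c
/* Sketch of a parallel, interleaved heap fault-in: one pinned pthread per
 * core, each touching every NUM_THREADS-th page so consecutive pages are
 * first-touched by different threads (and hence spread across NUMA
 * domains).  NUM_THREADS, PAGE_SIZE, and the heap size are placeholders. */
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdlib.h>

#define PAGE_SIZE   (2UL * 1024 * 1024)  /* assume 2M THP */
#define NUM_THREADS 8                    /* e.g. one per core */

typedef struct { char *heap; size_t heapSize; int id; } touchArg_t;

static void *touchPages(void *argV) {
  touchArg_t *arg = (touchArg_t *)argV;

  /* Pin this thread to a core (a real implementation would spread the
   * threads across the NUMA domains explicitly). */
  cpu_set_t cpus;
  CPU_ZERO(&cpus);
  CPU_SET(arg->id, &cpus);
  pthread_setaffinity_np(pthread_self(), sizeof(cpus), &cpus);

  /* Touch every NUM_THREADS-th page, starting at this thread's offset. */
  for (size_t off = (size_t)arg->id * PAGE_SIZE; off < arg->heapSize;
       off += (size_t)NUM_THREADS * PAGE_SIZE)
    arg->heap[off] = 0;
  return NULL;
}

int main(void) {
  size_t heapSize = 4UL << 30;                  /* placeholder fixed heap */
  char  *heap     = aligned_alloc(PAGE_SIZE, heapSize);

  pthread_t  tids[NUM_THREADS];
  touchArg_t args[NUM_THREADS];
  for (int i = 0; i < NUM_THREADS; i++) {
    args[i] = (touchArg_t){ .heap = heap, .heapSize = heapSize, .id = i };
    pthread_create(&tids[i], NULL, touchPages, &args[i]);
  }
  for (int i = 0; i < NUM_THREADS; i++)
    pthread_join(tids[i], NULL);

  /* ...the heap would now be registered with the NIC and used as the
   * runtime's fixed heap... */
  free(heap);
  return 0;
}
```

Because each pinned thread only touches every NUM_THREADS-th page, consecutive pages are first-touched on different cores, which is what spreads them round-robin across the memory controllers as described above.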

Resolves Cray/chapel-private#1088
Resolves Cray/chapel-private#1798
Helps #9166
Maxrimus pushed a commit to Maxrimus/chapel that referenced this issue Apr 5, 2021
…-init

Improve NUMA affinity and startup times for configs that use a fixed heap
