Serialize calls to gasnet_AMPoll for IBV/Aries #14912

Merged: 2 commits into chapel-lang:master from serialize-gasnet-polling, Feb 18, 2020

Conversation

@ronawho ronawho commented Feb 15, 2020

In some configurations (notably gasnet-ibv, especially with EPYC processors)
there can be significant contention from concurrent AM polls. Serialize our
calls under ibv to reduce that contention. This significantly improves
performance of concurrent blocking on-stmts/active-messages.

For ra-on with 48-core Intel Cascade Lake CPUs we see a ~2x speedup and with
48-core AMD Rome CPUs we see a 55x speedup. This change was motivated by
seeing a large performance difference on AMD EPYC processors, but happily it
has helped both Intel and AMD chips.

For 36-core Broadwell nodes (where our nightly ibv performance testing runs)
we see the following improvements:

  • 85% speedup for SSCA
  • 75% speedup for ra-on
  • 35% speedup for ra-atomics
  • 30% speedup for bale histogram
  • 20% speedup for fft
  • 15% speedup for lulesh

Resolves #14893
Resolves https://github.com/Cray/chapel-private/issues/747

This also serializes calls for gasnet-aries; there are similar, but more modest,
improvements in that configuration as well. The ibv and aries substrates use
RDMA for PUTs/GETs, but udp (amudp) and mpi (ammpi) use active messages for
most operations. Serializing for udp hurt performance, and mpi already
serializes AMs internally, so there's no need for the extra serialization for
those substrates.
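
For illustration, here is a minimal sketch of the idea behind serializing the
poll calls: a single try-lock around the poll so that extra callers skip
rather than contend. This is not the actual runtime shim code; the names are
made up, and the real change only applies this on the substrates noted above.

```c
/* Illustrative sketch only -- not the Chapel runtime's actual shim code.
 * One atomic flag acts as a try-lock: a single thread calls into
 * gasnet_AMPoll() while concurrent callers return immediately instead of
 * piling onto the network progress engine. */
#include <stdatomic.h>
#include <gasnet.h>

static atomic_flag am_poll_lock = ATOMIC_FLAG_INIT;

static inline void serialized_am_poll(void) {
  /* If another thread is already polling, skip: it will drain the
   * completion queue for us, and retrying here only adds contention. */
  if (atomic_flag_test_and_set_explicit(&am_poll_lock, memory_order_acquire))
    return;
  (void) gasnet_AMPoll();
  atomic_flag_clear_explicit(&am_poll_lock, memory_order_release);
}
```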

@ronawho changed the title from "Serialize calls to gasnet_AMPoll" to "Serialize calls to gasnet_AMPoll for IBV/Aries" on Feb 18, 2020
@gbtitus gbtitus (Member) left a comment

Nice! 👍

@gbtitus (Member) commented Feb 18, 2020

Presumably gasnet_AMPoll() itself must serialize internally somehow. So does the large improvement this change makes indicate that internal serialization could be improved?

@ronawho (Contributor, Author) commented Feb 18, 2020

#14893 has a lot more info.

gasnet_AMPoll() is only serialized by gasnet for the mpi substrate. For ibv and aries there is no serialization at the gasnet level, so it results in concurrent calls to things like ibv_poll_cq() and GNI_CqGetEvent(). Those calls are presumably serialized in some fashion under the covers, but without a lot more digging I'm not sure how.

#14893 (comment) is my best guess as to why we only see this behavior with qthreads (but that's just speculation).

@ronawho ronawho merged commit 0fe9827 into chapel-lang:master Feb 18, 2020
@ronawho ronawho deleted the serialize-gasnet-polling branch February 18, 2020 22:12
ronawho added a commit that referenced this pull request Feb 20, 2020
Poll for AMs on every call to gasnet_AMPoll for Aries

[reviewed by @gbtitus]

With the gasnet-aries multi-domain feature, `GASNET_AM_DOMAIN_POLL_MASK`
controls how often threads calling `gasnet_AMPoll()` will actually poll.
This is a mechanism to limit contention from concurrent polling. Now that
we're serializing calls to `gasnet_AMPoll()` (#14912), there is no need for
this other contention-mitigation strategy, so disable it.
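
As a conceptual illustration (not GASNet's actual multi-domain code), a poll
mask works roughly like this: each call bumps a counter and only every
(mask+1)-th call actually polls, so a mask of zero means every call polls.

```c
/* Conceptual illustration of mask-based poll throttling; not GASNet's
 * actual implementation. With poll_mask = 2^k - 1 only every 2^k-th call
 * actually polls; with poll_mask = 0 no poll is ever skipped. */
#include <gasnet.h>

static __thread unsigned poll_calls = 0;
static unsigned poll_mask = 0;   /* 0 => poll on every call */

static inline void throttled_am_poll(void) {
  if ((poll_calls++ & poll_mask) == 0)
    (void) gasnet_AMPoll();
}
```

Once the calls themselves are serialized, skipped polls just delay AM handling
without avoiding any contention, which is why disabling the mask helps here.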

This significantly improves the performance of coforall+ons, since
remote nodes aren't wasting time doing no-op polls before picking up a
new task. Performance improvements:
 - 2x improvement for coforall+on microbenchmark
 - 15% improvement for fft
 - 10% improvement for bale histogram and ra-atomic
 - 10% improvement for hpl
 - 10% improvement for lulesh
e-kayrakli added a commit that referenced this pull request Feb 21, 2020
Add performance annotations

Adds annotations for the following performance changes and the responsible PRs:

- Gasnet Performance improvements on cs xc
    #14912
- Memory leak
    #14907
- Compiler performance
    #14887
- String temporary copy performance
    #14903 
- Unordered GET improvements on ofi
    #14810

[Reviewed by @ronawho]
ronawho added a commit that referenced this pull request Mar 17, 2021
Serialize gasnet polling calls for UCX too

[reviewed by @gbtitus]

Similar to #14912, but for UCX. Concurrent polling can cause contention
and performance regressions so serialize the calls. This improves
performance for some of the same benchmarks we saw in #14912 but the
difference is not as dramatic as it was for IBV.

Some performance results on 48-core Cascade Lake nodes with HDR IB:

| config | RA-on        | Indexgather    |
| ------ | -----------: | -------------: |
| before | 0.00245 GUPS | 1010 MB/s/node |
| after  | 0.00275 GUPS | 1120 MB/s/node |

And 128-core Rome nodes with HDR IB:

| config | RA-on        | Indexgather    |
| ------ | -----------: | -------------: |
| before | 0.00150 GUPS |  670 MB/s/node |
| after  | 0.00175 GUPS | 1060 MB/s/node |
Maxrimus pushed a commit to Maxrimus/chapel that referenced this pull request Apr 5, 2021
Serialize gasnet polling calls for UCX too

ronawho added a commit that referenced this pull request Jun 17, 2022
Improve our gasnet support for the ofi conduit

GASNet-EX 2022.3.0 significantly improved the ofi conduit in order to
support Slingshot 10/11 networks and restored support for Omni-Path
networks. Now that ofi support is back, improve our usage of it.

For the chplenv, change the default conduit to be ofi when gasnet is
used on hpe-cray-ex systems and ensure segment fast is used in those
cases. Note that we don't always default to segment fast since that
isn't helpful on omni-path systems.

In the gasnet shim, serialize active-message polling for ofi (similar to
#14912 and #17419) and parallelize the heap fault-in for ofi segment fast
(similar to #17405). This results in significant performance improvements for
gasnet-ofi. For instance, parallelizing the heap fault-in takes 16-node
indexgather on SS-11 from 620 MB/s/node to 730 MB/s/node, and serializing
polling takes it up to 3875 MB/s/node. On an omni-path system, serializing
polling takes us from 415 MB/s/node to 490 MB/s/node.
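
As a rough sketch of the fault-in idea (illustrative only; the runtime uses
its own threads rather than OpenMP, and the function here is made up),
touching one byte per page from many threads lets the kernel service the
first-touch page faults in parallel instead of on a single thread:

```c
/* Illustrative sketch of parallel heap fault-in; not the runtime's actual
 * code. Touching one byte per page before the heap is handed out forces
 * each page to be faulted in, and spreading the touches across threads
 * parallelizes that work. */
#include <stddef.h>
#include <omp.h>

static void fault_in_parallel(char* base, size_t len, size_t page_size) {
  #pragma omp parallel for schedule(static)
  for (size_t off = 0; off < len; off += page_size) {
    ((volatile char*) base)[off] = 0;   /* first touch faults the page in */
  }
}
```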

There are a few places where we expect gasnet-ofi might be our default.
This is definitely true for omni-path systems, where ofi with the psm2
provider is recommended. Note that our native `CHPL_COMM=ofi` layer does
not work with psm2 and we don't expect to put in the effort to get it
working (beyond the comm layer, we would also need spawning and
out-of-band support that I don't think is worth adding currently.) On
Slingshot-10 systems it's still up in the air if gasnet-ofi, gasnet-ucx,
or our native ofi comm layer will be best in the long term. Currently,
our native ofi layer is not working there, but this is a bug we need to
address. And lastly it's possible to use gasnet-ofi on Slingshot-11
systems, but we expect our native ofi comm layer to be the preferred
option since that's what we've mostly been developing it for. This is
much like using our native ugni layer on Aries systems instead of
gasnet-aries because it gives us more control on flagship HPE/Cray
systems.

In order to evaluate the current state of gasnet-ofi and what we might
recommend to users I gathered performance figures on a few systems for 5
benchmarks that expose different patterns/idioms we care about:
 - Stream (no communication, numa affinity sensitive)
 - PRK-stencil (little comm, numa affinity sensitive)
 - ISx (concurrent bulk comm, numa affinity sensitive)
 - Indexgather (concurrent bulk/aggregated comm)
 - RA (concurrent fine-grained comm -- RDMA (rmo) and AM (on) variants)

```
chpl --fast test/release/examples/benchmarks/hpcc/stream.chpl
chpl --fast test/studies/prk/Stencil/optimized/stencil-opt.chpl -sorder="sqrt(16e9*numLocales / 8):int"
chpl --fast test/studies/isx/isx-hand-optimized.chpl -smode=scaling.weakISO
chpl --fast test/studies/bale/indexgather/ig.chpl -sN=10000000 -sprintStats -smode=Mode.aggregated
chpl --fast test/release/examples/benchmarks/hpcc/ra.chpl -sverify=false -suseOn=false -sN_U="2**(n-12)" -o ra-rmo
chpl --fast test/release/examples/benchmarks/hpcc/ra.chpl -sverify=false -suseOn=true -sN_U="2**(n-12)" -o ra-on

./stream -nl 16
./stencil-opt -nl 16
./isx-hand-optimized -nl 16
./ig -nl 16
./ra-rmo -nl 16
./ra-on -nl 16
```

Omni-path:
----------

16-nodes of an OPA cluster with 32 cores and 192 GB of ram per node:

| Config    | Stream    | PRK-Stencil  | ISx   | Indexgather   | RA-rmo       | RA-on        |
| --------- | --------: | -----------: | ----: | ------------: | -----------: | -----------: |
| Gasnet-1  | 2120 GB/s | 940 GFlops/s | 13.1s | 400 MB/s/node | 0.00082 GUPS | 0.00059 GUPS |
| 2021.9.0  | 2120 GB/s | 940 GFlops/s | 14.6s | 290 MB/s/node | 0.00045 GUPS | 0.00059 GUPS |
| 2022.3.0  | 2120 GB/s | 940 GFlops/s | 13.3s | 415 MB/s/node | 0.00086 GUPS | 0.00058 GUPS |
| ser-poll  | 2120 GB/s | 945 GFlops/s | 12.3s | 495 MB/s/node | 0.00086 GUPS | 0.00174 GUPS |

Previously, omni-path users had to revert to gasnet-1 and the initial
ofi support added in 2021.9.0 hurt performance. What we see now is that
2022.3.0 restores performance to gasnet-1 levels and serializing the
poller further improves performance. This makes me comfortable telling
users to stop falling back to gasnet-1 and just use the current support
for omni-path.

Slingshot-10:
-------------

16-nodes of a SS-10 system with 128 cores and 512 GB of ram per node:

| Config     | Stream    | PRK-Stencil   | ISx   | Indexgather    | RA-rmo      | RA-on       |
| ---------- | --------: | ------------: | ----: | -------------: | ----------: | ----------: |
| gasnet-ofi | 2335 GB/s | 1435 GFlops/s | 31.7s | 1830 MB/s/node | 0.0033 GUPS | 0.0030 GUPS |
| gasnet-ucx | 2355 GB/s | 1420 GFlops/s | 16.4s | 2290 MB/s/node | 0.0021 GUPS | 0.0015 GUPS |

Generally speaking, gasnet-ucx seems to perform the best on SS-10. I
should also note that our native `CHPL_COMM=ofi` layer does not appear
to be working correctly on SS-10 so I don't have that to compare to,
though we'll want to get that working in the near term. Also note that
serializing polling was important for performance for ofi on SS-10.

Slingshot-11:
-------------

16-nodes of a SS-11 system with 128 cores and 512 GB of ram per node:

| Config     | Stream    | PRK-Stencil   | ISx   | Indexgather    | RA-rmo     | RA-on      |
| ---------- | --------: | ------------: | ----: | -------------: | ---------: | ---------: |
| gasnet-ofi | 2460 GB/s | 1470 MFlops/s | 15.2s | 3875 MB/s/node | 0.003 GUPS | 0.004 GUPS |
| comm=ofi   | 2565 GB/s | 1540 MFlops/s |  7.3s | 5030 MB/s/node | 0.132 GUPS | 0.018 GUPS |

Our native `CHPL_COMM=ofi` layer generally outperforms gasnet-ofi. This
is especially true for concurrent fine-grained operations like RA where
injection is serialized under gasnet currently. Note that much like ugni
on XC, we expect our ofi comm layer to be the native/default layer on
SS-11, but the comparison is still interesting and I expect we can
improve gasnet-ofi SS-11 performance with some tuning.

Summary:
--------

Overall, GASNet-EX 2022.3.0 significantly improved ofi performance, and
bringing in optimizations we applied to other conduits further improved
performance. I recommend gasnet-ex for omni-path in the short and long
term, gasnet-ucx for ss-10 in the short term and TBD for the long term,
and our native ofi layer for ss-11. The other potential targets for ofi
in the future are AWS EFA and Cornelis Omni-Path, but these will require
more exploration.
Merging this pull request closes: Slow AMs with qthreads+gasnet-ibv on EPYC processors (#14893)