Serialize gasnet polling calls for UCX too #17419

ronawho · 2021-03-17T19:07:40Z

Similar to #14912, but for UCX. Concurrent polling can cause contention
and performance regressions, so serialize the calls. This improves
performance for some of the same benchmarks that improved in #14912, but
the difference is not as dramatic as it was for IBV.
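
To illustrate the idea of serialized polling, here is a minimal Python sketch. It is not the code from this PR (the actual change lives in Chapel's gasnet shim and is not shown on this page); the `SerializedPoller` class and the `poll_fn` callback are hypothetical names, and the use of a non-blocking try-lock rather than a blocking lock is an assumption about how one might serialize the calls.

```python
import threading


class SerializedPoller:
    """Allow at most one thread at a time to run the poll function.

    `poll_fn` is a hypothetical stand-in for the network polling call;
    it is not a real GASNet or Chapel API.
    """

    def __init__(self, poll_fn):
        self._poll_fn = poll_fn
        self._lock = threading.Lock()

    def try_poll(self):
        # Non-blocking acquire: if another thread is already polling,
        # skip this poll instead of contending for the network.
        if not self._lock.acquire(blocking=False):
            return False
        try:
            self._poll_fn()
            return True
        finally:
            self._lock.release()


if __name__ == "__main__":
    polls = []
    poller = SerializedPoller(lambda: polls.append(1))
    workers = [threading.Thread(target=poller.try_poll) for _ in range(8)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    # Some attempts may have been skipped, but none ran concurrently.
    print(f"{len(polls)} of {len(workers)} poll attempts ran")
```

With a try-lock, a thread that finds the poller busy returns immediately and goes back to its own work, which is one plausible way to avoid the contention described above.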

Some performance results on 48-core Cascade Lake nodes with HDR IB:

| config | RA-on | Indexgather |
| ------ | -----------: | -------------: |
| before | 0.00245 GUPS | 1010 MB/s/node |
| after | 0.00275 GUPS | 1120 MB/s/node |

And 128-core Rome nodes with HDR IB:

| config | RA-on | Indexgather |
| ------ | -----------: | -------------: |
| before | 0.00150 GUPS | 670 MB/s/node |
| after | 0.00175 GUPS | 1060 MB/s/node |
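
For a quick sense of scale, the relative gains can be computed from the before/after figures in the two tables above. The snippet below is only a convenience calculation; the `results` dictionary and the `percent_gain` helper are illustrative names, not part of the PR.

```python
# Before/after figures copied from the two tables above.
results = {
    "Cascade Lake (48-core)": {
        "RA-on (GUPS)": (0.00245, 0.00275),
        "Indexgather (MB/s/node)": (1010, 1120),
    },
    "Rome (128-core)": {
        "RA-on (GUPS)": (0.00150, 0.00175),
        "Indexgather (MB/s/node)": (670, 1060),
    },
}


def percent_gain(before, after):
    """Relative improvement of `after` over `before`, in percent."""
    return (after - before) / before * 100.0


for system, benchmarks in results.items():
    for name, (before, after) in benchmarks.items():
        gain = percent_gain(before, after)
        print(f"{system:24s} {name:24s} {gain:5.1f}%")
```

This works out to roughly 12% (RA-on) and 11% (Indexgather) on the Cascade Lake nodes, and roughly 17% and 58% on the Rome nodes.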

Signed-off-by: Elliot Ronaghan <ronawho@gmail.com>

gbtitus

👍

@gbtitus

Serialize gasnet polling calls for UCX too [reviewed by @gbtitus]

Improve our gasnet support for the ofi conduit

GASNet-EX 2022.3.0 significantly improved the ofi conduit in order to support Slingshot 10/11 networks and restored support for Omni-Path networks. Now that ofi support is back, improve our usage of it.

For the chplenv, change the default conduit to be ofi when gasnet is used on hpe-cray-ex systems and ensure segment fast is used in those cases. Note that we don't always default to segment fast since that isn't helpful on omni-path systems. In the gasnet shim, serialize active-message polling for ofi (this is similar to #14912 and #17419) and parallelize the heap fault-in for ofi segment fast (similar to #17405).

This results in significant performance improvements for gasnet-ofi. For instance, parallelizing the heap fault-in takes 16 node indexgather on SS-11 from 620 MB/s/node to 730 MB/s/node and serializing polling takes it up to 3875 MB/s/node. On an omnipath system we see serialization takes us from 415 MB/s/node to 490 MB/s/node.

There are a few places where we expect gasnet-ofi might be our default. This is definitely true for omni-path systems, where ofi with the psm2 provider is recommended. Note that our native `CHPL_COMM=ofi` layer does not work with psm2 and we don't expect to put in the effort to get it working (beyond the comm layer, we would also need spawning and out-of-band support that I don't think is worth adding currently.) On Slingshot-10 systems it's still up in the air if gasnet-ofi, gasnet-ucx, or our native ofi comm layer will be best in the long term. Currently, our native ofi layer is not working there, but this is a bug we need to address. And lastly it's possible to use gasnet-ofi on Slingshot-11 systems, but we expect our native ofi comm layer to be the preferred option since that's what we've mostly been developing it for. This is much like using our native ugni layer on Aries systems instead of gasnet-aries because it gives us more control on flagship HPE/Cray systems.

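
The chplenv defaulting rule described in the #20033 message above (default to the ofi conduit with segment fast when gasnet is used on hpe-cray-ex, and do not force segment fast elsewhere) might be sketched as follows. This is a hypothetical illustration, not the actual chplenv code: the function name, its parameters, and the returned dictionary keys are all assumptions, and `None` simply means "leave the pre-existing default alone".

```python
def default_gasnet_config(platform, comm):
    """Sketch of the defaulting rule from the commit message.

    `platform` and `comm` are plain strings here; the real chplenv
    scripts obtain these values differently. Returns a dict whose
    entries are None when this rule does not apply.
    """
    config = {"conduit": None, "segment": None}
    if comm == "gasnet" and platform == "hpe-cray-ex":
        # On HPE Cray EX, gasnet defaults to the ofi conduit,
        # and segment fast is ensured for that case.
        config["conduit"] = "ofi"
        config["segment"] = "fast"
    # Elsewhere (e.g. omni-path clusters) segment fast is not forced,
    # since the message notes it is not helpful there.
    return config


if __name__ == "__main__":
    print(default_gasnet_config("hpe-cray-ex", "gasnet"))
    print(default_gasnet_config("linux64", "gasnet"))
```

Keeping the rule in one small function makes the special case explicit: only the combination of gasnet and hpe-cray-ex changes the defaults.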
In order to evaluate the current state of gasnet-ofi and what we might recommend to users I gathered performance figures on a few systems for 5 benchmarks that expose different patterns/idioms we care about:

- Stream (no communication, numa affinity sensitive)
- PRK-stencil (little comm, numa affinity sensitive)
- ISx (concurrent bulk comm, numa affinity sensitive)
- Indexgather (concurrent bulk/aggregated comm)
- RA (concurrent fine-grained comm -- RDMA (rmo) and AM (on) variants)

```
chpl --fast test/release/examples/benchmarks/hpcc/stream.chpl
chpl --fast test/studies/prk/Stencil/optimized/stencil-opt.chpl -sorder="sqrt(16e9*numLocales / 8):int"
chpl --fast test/studies/isx/isx-hand-optimized.chpl -smode=scaling.weakISO
chpl --fast test/studies/bale/indexgather/ig.chpl -sN=10000000 -sprintStats -smode=Mode.aggregated
chpl --fast test/release/examples/benchmarks/hpcc/ra.chpl -sverify=false -suseOn=false -sN_U="2**(n-12)" -o ra-rmo
chpl --fast test/release/examples/benchmarks/hpcc/ra.chpl -sverify=false -suseOn=true -sN_U="2**(n-12)" -o ra-on

./stream -nl 16
./stencil-opt -nl 16
./isx-hand-optimized -nl 16
./ig -nl 16
./ra-rmo -nl 16
./ra-on -nl 16
```

Omni-path:
----------

16-nodes of an OPA cluster with 32 cores and 192 GB of ram per node:

| Config | Stream | PRK-Stencil | ISx | Indexgather | RA-rmo | RA-on |
| --------- | --------: | -----------: | ----: | ------------: | -----------: | -----------: |
| Gasnet-1 | 2120 GB/s | 940 GFlops/s | 13.1s | 400 MB/s/node | 0.00082 GUPS | 0.00059 GUPS |
| 2021.9.0 | 2120 GB/s | 940 GFlops/s | 14.6s | 290 MB/s/node | 0.00045 GUPS | 0.00059 GUPS |
| 2022.3.0 | 2120 GB/s | 940 GFlops/s | 13.3s | 415 MB/s/node | 0.00086 GUPS | 0.00058 GUPS |
| ser-poll | 2120 GB/s | 945 GFlops/s | 12.3s | 495 MB/s/node | 0.00086 GUPS | 0.00174 GUPS |

Previously, omni-path users had to revert to gasnet-1 and the initial ofi support added in 2021.9.0 hurt performance. What we see now is that 2022.3.0 restores performance to gasnet-1 levels and serializing the poller further improves performance. This makes me comfortable telling users to stop falling back to gasnet-1 and just use the current support for omni-path.

Slingshot-10:
-------------

16-nodes of a SS-10 system with 128 cores and 512 GB of ram per node:

| Config | Stream | PRK-Stencil | ISx | Indexgather | RA-rmo | RA-on |
| ---------- | --------: | ------------: | ----: | -------------: | ----------: | ----------: |
| gasnet-ofi | 2335 GB/s | 1435 GFlops/s | 31.7s | 1830 MB/s/node | 0.0033 GUPS | 0.0030 GUPS |
| gasnet-ucx | 2355 GB/s | 1420 GFlops/s | 16.4s | 2290 MB/s/node | 0.0021 GUPS | 0.0015 GUPS |

Generally speaking, gasnet-ucx seems to perform the best on SS-10. I should also note that our native `CHPL_COMM=ofi` layer does not appear to be working correctly on SS-10 so I don't have that to compare to, though we'll want to get that working in the near term. Also note that serializing polling was important for performance for ofi on SS-10.

Slingshot-11:
-------------

16-nodes of a SS-11 system with 128 cores and 512 GB of ram per node:

| Config | Stream | PRK-Stencil | ISx | Indexgather | RA-rmo | RA-on |
| ---------- | --------: | ------------: | ----: | -------------: | ---------: | ---------: |
| gasnet-ofi | 2460 GB/s | 1470 MFlops/s | 15.2s | 3875 MB/s/node | 0.003 GUPS | 0.004 GUPS |
| comm=ofi | 2565 GB/s | 1540 MFlops/s | 7.3s | 5030 MB/s/node | 0.132 GUPS | 0.018 GUPS |

Our native `CHPL_COMM=ofi` layer generally outperforms gasnet-ofi. This is especially true for concurrent fine-grained operations like RA where injection is serialized under gasnet currently. Note that much like ugni on XC, we expect our ofi comm layer to be the native/default layer on SS-11, but the comparison is still interesting and I expect we can improve gasnet-ofi SS-11 performance with some tuning.

Summary:
--------

Overall, GASNet-EX 2022.3.0 significantly improved ofi performance, and bringing in optimizations we applied to other conduits further improved performance. I recommend gasnet-ex for omni-path in the short and long term, gasnet-ucx for ss-10 in the short term and TBD for the long term, and our native ofi layer for ss-11. The other potential targets for ofi in the future are AWS EFA and Cornelis Omni-Path, but these will require more exploration.

ronawho requested a review from gbtitus March 17, 2021 19:07

ronawho force-pushed the serialize-ucx-polling branch from 554e3eb to dcb5451 Compare March 17, 2021 19:09

gbtitus approved these changes Mar 17, 2021

ronawho merged commit 51a10b4 into chapel-lang:master Mar 17, 2021

ronawho deleted the serialize-ucx-polling branch March 17, 2021 20:01

ronawho mentioned this pull request Jun 17, 2022

Improve our gasnet support for the ofi conduit #20033

Merged
