This guide describes how to leverage UCC to accelerate collectives within a parallel programming model, e.g. MPI or OpenSHMEM, provided the particular implementation of the programming model supports UCC. For simplicity, this guide uses Open MPI as one such example implementation. However, the described concepts are sufficiently general that they should transfer to other MPI implementations or programming models as well.
Note that this is not a guide on how to use the UCC API or how to contribute to UCC; for that, see the UCC API documentation available here and consider the technical and legal guidelines in the contributing file.
Build Open MPI with UCC as described here.
To check if your Open MPI build supports UCC accelerated collectives, you can look for the MCA coll ucc component:
$ ompi_info | grep ucc
MCA coll: ucc (MCA v2.1.0, API v2.0.0, Component v4.1.4)
To execute your MPI program with UCC accelerated collectives, the ucc MCA component needs to be enabled:
export OMPI_MCA_coll_ucc_enable=1
Currently, it is also required to set
export OMPI_MCA_coll_ucc_priority=100
to work around open-mpi/ompi#9885.
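Both MCA parameters can also be passed directly on the mpirun command line instead of via the environment; a minimal sketch, where ./my_mpi_app is a hypothetical application binary:
$ mpirun -np 4 --mca coll_ucc_enable 1 --mca coll_ucc_priority 100 ./my_mpi_app
With srun, as used in the examples later in this guide, the exported environment variables are typically inherited by the application processes.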
In most situations, this is all that is needed to leverage UCC accelerated collectives from your MPI program. UCC heuristics aim to always select the highest performing implementation for a given collective, and UCC aims to support execution at all scales, from a single node to a full supercomputer.
However, because there are many different system setups, collectives, and message sizes, these heuristics can't be perfect in all cases. The remainder of this User Guide therefore describes the parts of UCC which are necessary for basic UCC tuning. If manual tuning is necessary, an issue report at the GitHub tracker is appreciated so that this can be considered for future tuning of UCC heuristics.
Also, an MPI or other programming model implementation might need to execute a collective not supported by UCC, e.g. because datatype or reduction operator support is lacking. In these cases the implementation can't call into UCC but needs to leverage another collective implementation. See Logging for an example of how to detect this in the case of Open MPI. If UCC support is missing, an issue report describing the use case is appreciated at the GitHub tracker so that this can be considered for future UCC development.
UCC collective implementations are compositions of one or more Team Layers (TLs). TLs are designed as thin composable abstraction layers with no dependencies between different TLs. To fulfill semantic requirements of programming models like MPI and because not all TLs cover the full functionality required by a given collective (e.g. the SHARP TL does not support intra-node collectives), TLs are composed by Collective Layers (CLs). The list of CLs and TLs supported by the available UCC installation can be queried with:
$ ucc_info -s
Default CLs scores: basic=10 hier=50
Default TLs scores: cuda=40 nccl=20 self=50 ucp=10
This UCC installation supports two CLs:
- basic: Basic CL, available for all supported algorithms and good for most use cases.
- hier: Hierarchical CL exploiting the hierarchy on a system, e.g. NVLINK within a node and SHARP for the network. The hier CL exposes two hierarchy levels: NODE, containing all ranks running on the same node, and NET, containing one rank from each node. In addition to that, there is the FULL subgroup with all ranks. A concrete example of a hierarchical CL is a pipeline of shared memory UCP reduce with inter-node SHARP and UCP broadcast. The basic CL can leverage the same TLs but would execute in a non-pipelined, less efficient fashion.
and four TLs:
- cuda: TL supporting CUDA device memory, exploiting NVLINK connections between GPUs.
- nccl: TL leveraging NCCL for collectives on CUDA device memory. In many cases, UCC collectives are directly mapped to NCCL collectives. If that is not possible, a combination of NCCL collectives might be used.
- self: TL to support collectives with only 1 participant.
- ucp: TL building on UCP point-to-point communication routines from UCX. This is the most general TL and supports all memory types. If computation is required, it happens local to the memory, e.g. for CUDA device memory CUDA kernels are used for computation.
In addition to those TLs supported by the example Open MPI implementation used in this guide, UCC also supports the following TLs:
- sharp: TL leveraging the NVIDIA Scalable Hierarchical Aggregation and Reduction Protocol (SHARP)™ in-network computing features to accelerate inter-node collectives.
- rccl: TL leveraging RCCL for collectives on ROCm device memory.
UCC is extensible, so vendors can provide additional TLs. For example, the UCC binaries shipped with HPC-X add the shm TL with optimized CPU shared memory collectives.
UCC exposes environment variables to tune CL and TL selection and behavior. The list of all environment variables with a description is available from ucc_info:
$ ucc_info -caf | head -15
# UCX library configuration file
# Uncomment to modify values
#
# UCC configuration
#
#
# Comma separated list of CL components to be used
#
# syntax: comma-separated list of: [basic|hier|all]
#
UCC_CLS=basic
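For example, to make both of the CLs described above available for selection instead of only the default basic CL (an illustrative choice, not a recommendation):
export UCC_CLS=basic,hier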
In this guide we will focus on how TLs are selected based on a score. Every time UCC needs to select a TL, the TL with the highest score is selected, considering:
- The collective type
- The message size
- The memory type
- The team size (number of ranks participating in the collective)
A user can set the UCC_TL_<NAME>_TUNE environment variables to override the default scores, following this syntax:
UCC_TL_<NAME>_TUNE=token1#token2#...#tokenN
i.e. a #-separated list of tokens is passed to the environment variable. Each token is a :-separated list of qualifiers:
token=coll_type:msg_range:mem_type:team_size:score:alg
Each qualifier is optional. The only requirement is that either score or alg is provided. The qualifiers are:
- coll_type = coll_type_1,coll_type_2,...,coll_type_n - a comma-separated list of collective types.
- msg_range = m_start_1-m_end_1,m_start_2-m_end_2,..,m_start_n-m_end_n - a comma-separated list of message ranges in bytes, where each range is represented by start and end values separated by "-". Values can be integers using optional binary prefixes. Supported prefixes are K=1<<10, M=1<<20, G=1<<30 and T=1<<40. Parsing is case independent and a "b" can optionally be appended. The special value inf means the maximum message size. E.g. 128, 256b, 4K and 1M are valid sizes.
- mem_type = m1,m2,..,mN - a comma-separated list of memory types.
- team_size = [t_start_1-t_end_1,t_start_2-t_end_2,...,t_start_N-t_end_N] - a comma-separated list of team size ranges enclosed in [].
- score - an int value from 0 to inf.
- alg = @<value|str> - the character @ followed by either the int or the string representing the collective algorithm.
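As an illustration, the following setting (also used in the configuration file example further below) forces the NCCL TL for allreduce on cuda buffers by giving it the maximum score inf and disables the NCCL TL for alltoall with score 0:
export UCC_TL_NCCL_TUNE=allreduce:cuda:inf#alltoall:0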
Supported memory types are:
- cpu: for CPU memory.
- cuda: for pinned CUDA device memory (cudaMalloc).
- cuda_managed: for CUDA managed memory (cudaMallocManaged).
- rocm: for pinned ROCm device memory.
The supported collective types and algorithms can be queried with
$ ucc_info -A
cl/hier algorithms:
Allreduce
0 : rab : intra-node reduce, followed by inter-node allreduce, followed by innode broadcast
1 : split_rail : intra-node reduce_scatter, followed by PPN concurrent inter-node allreduces, followed by intra-node allgather
Alltoall
0 : node_split : splitting alltoall into two concurrent a2av calls withing the node and outside of it
Alltoallv
0 : node_split : splitting alltoallv into two concurrent a2av calls withing the node and outside of it
[...] snip
See the FAQ in the UCC Wiki for more information and concrete examples. If, for a given combination, multiple TLs have the same highest score, it is implementation-defined which of them is selected.
Tuning UCC heuristics is also possible with the UCC configuration file (ucc.conf). This file provides a unified way of tailoring the behavior of UCC components - CLs, TLs, and ECs. It can contain any UCC variables in the format VAR = VALUE, e.g. UCC_TL_NCCL_TUNE=allreduce:cuda:inf#alltoall:0 to force NCCL allreduce for "cuda" buffers and disable NCCL for alltoall. See contrib/ucc.conf for an example and the FAQ for further details.
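A minimal sketch of such a configuration file, combining variables already shown above (the values are illustrative, not a tuning recommendation):
# ucc.conf - illustrative example
# Restrict UCC to the basic CL and tune the NCCL TL
UCC_CLS=basic
UCC_TL_NCCL_TUNE=allreduce:cuda:inf#alltoall:0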
To detect if Open MPI leverages UCC for a given collective, one can set OMPI_MCA_coll_ucc_verbose=3 and check for output like
coll_ucc_alltoall.c:70 - mca_coll_ucc_alltoall() running ucc alltoall
For example, Open MPI leverages UCC for MPI_Alltoall used by osu_alltoall from the OSU Microbenchmarks, as can be seen by the log message
coll_ucc_alltoall.c:70 - mca_coll_ucc_alltoall() running ucc alltoall
in the output of
$ OMPI_MCA_coll_ucc_verbose=3 srun ./c/mpi/collective/osu_alltoall -i 1 -x 0 -d cuda -m 1048576:1048576
[...] snip
# OSU MPI-CUDA All-to-All Personalized Exchange Latency Test v7.0
# Size Avg Latency(us)
coll_ucc_barrier.c:31 - mca_coll_ucc_barrier() running ucc barrier
coll_ucc_alltoall.c:70 - mca_coll_ucc_alltoall() running ucc alltoall
coll_ucc_alltoall.c:70 - mca_coll_ucc_alltoall() running ucc alltoall
coll_ucc_alltoall.c:70 - mca_coll_ucc_alltoall() running ucc alltoall
coll_ucc_alltoall.c:70 - mca_coll_ucc_alltoall() running ucc alltoall
coll_ucc_barrier.c:31 - mca_coll_ucc_barrier() running ucc barrier
coll_ucc_barrier.c:31 - mca_coll_ucc_barrier() running ucc barrier
coll_ucc_barrier.c:31 - mca_coll_ucc_barrier() running ucc barrier
coll_ucc_reduce.c:70 - mca_coll_ucc_reduce() running ucc reduce
coll_ucc_reduce.c:70 - mca_coll_ucc_reduce() running ucc reduce
coll_ucc_reduce.c:70 - mca_coll_ucc_reduce() running ucc reduce
coll_ucc_reduce.c:70 - mca_coll_ucc_reduce() running ucc reduce
coll_ucc_reduce.c:70 - mca_coll_ucc_reduce() running ucc reduce
coll_ucc_barrier.c:31 - mca_coll_ucc_barrier() running ucc barrier
coll_ucc_barrier.c:31 - mca_coll_ucc_barrier() running ucc barrier
coll_ucc_reduce.c:70 - mca_coll_ucc_reduce() running ucc reduce
coll_ucc_reduce.c:70 - mca_coll_ucc_reduce() running ucc reduce
coll_ucc_reduce.c:70 - mca_coll_ucc_reduce() running ucc reduce
coll_ucc_barrier.c:31 - mca_coll_ucc_barrier() running ucc barrier
coll_ucc_reduce.c:70 - mca_coll_ucc_reduce() running ucc reduce
coll_ucc_reduce.c:70 - mca_coll_ucc_reduce() running ucc reduce
coll_ucc_barrier.c:31 - mca_coll_ucc_barrier() running ucc barrier
coll_ucc_reduce.c:70 - mca_coll_ucc_reduce() running ucc reduce
coll_ucc_reduce.c:70 - mca_coll_ucc_reduce() running ucc reduce
1048576 2586333.78
[...] snip
For MPI_Alltoallw, Open MPI can't leverage UCC, so the output of osu_alltoallw looks different. It only contains the log messages for the barriers needed for correct timing and the reduces needed to calculate timing statistics:
$ OMPI_MCA_coll_ucc_verbose=3 srun ./c/mpi/collective/osu_alltoallw -i 1 -x 0 -d cuda -m 1048576:1048576
[...] snip
# OSU MPI-CUDA All-to-Allw Personalized Exchange Latency Test v7.0
# Size Avg Latency(us)
coll_ucc_barrier.c:31 - mca_coll_ucc_barrier() running ucc barrier
coll_ucc_barrier.c:31 - mca_coll_ucc_barrier() running ucc barrier
coll_ucc_barrier.c:31 - mca_coll_ucc_barrier() running ucc barrier
coll_ucc_barrier.c:31 - mca_coll_ucc_barrier() running ucc barrier
coll_ucc_barrier.c:31 - mca_coll_ucc_barrier() running ucc barrier
coll_ucc_reduce.c:70 - mca_coll_ucc_reduce() running ucc reduce
coll_ucc_reduce.c:70 - mca_coll_ucc_reduce() running ucc reduce
coll_ucc_reduce.c:70 - mca_coll_ucc_reduce() running ucc reduce
coll_ucc_reduce.c:70 - mca_coll_ucc_reduce() running ucc reduce
coll_ucc_reduce.c:70 - mca_coll_ucc_reduce() running ucc reduce
coll_ucc_reduce.c:70 - mca_coll_ucc_reduce() running ucc reduce
coll_ucc_barrier.c:31 - mca_coll_ucc_barrier() running ucc barrier
coll_ucc_reduce.c:70 - mca_coll_ucc_reduce() running ucc reduce
coll_ucc_reduce.c:70 - mca_coll_ucc_reduce() running ucc reduce
coll_ucc_barrier.c:31 - mca_coll_ucc_barrier() running ucc barrier
coll_ucc_reduce.c:70 - mca_coll_ucc_reduce() running ucc reduce
coll_ucc_reduce.c:70 - mca_coll_ucc_reduce() running ucc reduce
coll_ucc_barrier.c:31 - mca_coll_ucc_barrier() running ucc barrier
coll_ucc_reduce.c:70 - mca_coll_ucc_reduce() running ucc reduce
coll_ucc_reduce.c:70 - mca_coll_ucc_reduce() running ucc reduce
1048576 11434.97
[...] snip
To debug the choices made by UCC heuristics, setting UCC_LOG_LEVEL=INFO provides valuable information. E.g. it prints the score map with all collectives, TLs, and memory types supported:
[...] snip
ucc_team.c:452 UCC INFO ===== COLL_SCORE_MAP (team_id 32768) =====
ucc_coll_score_map.c:185 UCC INFO Allgather:
ucc_coll_score_map.c:185 UCC INFO Host: {0..inf}:TL_UCP:10
ucc_coll_score_map.c:185 UCC INFO Cuda: {0..inf}:TL_NCCL:10
ucc_coll_score_map.c:185 UCC INFO CudaManaged: {0..inf}:TL_UCP:10
ucc_coll_score_map.c:185 UCC INFO Allgatherv:
ucc_coll_score_map.c:185 UCC INFO Host: {0..inf}:TL_UCP:10
ucc_coll_score_map.c:185 UCC INFO Cuda: {0..16383}:TL_NCCL:10 {16K..1048575}:TL_NCCL:10 {1M..inf}:TL_NCCL:10
ucc_coll_score_map.c:185 UCC INFO CudaManaged: {0..inf}:TL_UCP:10
ucc_coll_score_map.c:185 UCC INFO Allreduce:
ucc_coll_score_map.c:185 UCC INFO Host: {0..4095}:TL_UCP:10 {4K..inf}:TL_UCP:10
ucc_coll_score_map.c:185 UCC INFO Cuda: {0..4095}:TL_NCCL:10 {4K..inf}:TL_NCCL:10
ucc_coll_score_map.c:185 UCC INFO CudaManaged: {0..4095}:TL_UCP:10 {4K..inf}:TL_UCP:10
ucc_coll_score_map.c:185 UCC INFO Rocm: {0..4095}:TL_UCP:10 {4K..inf}:TL_UCP:10
ucc_coll_score_map.c:185 UCC INFO RocmManaged: {0..4095}:TL_UCP:10 {4K..inf}:TL_UCP:10
[...] snip
UCC 1.2.0 or newer supports the UCC_COLL_TRACE environment variable:
$ ucc_info -caf | grep -B6 UCC_COLL_TRACE
#
# UCC collective logging level. Higher level will result in more verbose collective info.
# Possible values are: fatal, error, warn, info, debug, trace, data, func, poll.
#
# syntax: [FATAL|ERROR|WARN|DIAG|INFO|DEBUG|TRACE|REQ|DATA|ASYNC|FUNC|POLL]
#
UCC_COLL_TRACE=WARN
With UCC_COLL_TRACE=INFO, UCC reports for every collective which CL and TL have been selected:
$ UCC_COLL_TRACE=INFO srun ./c/mpi/collective/osu_allreduce -i 1 -x 0 -d cuda -m 1048576:1048576
# OSU MPI-CUDA Allreduce Latency Test v7.0
# Size Avg Latency(us)
[1678205653.808236] [node_name:903 :0] ucc_coll.c:255 UCC_COLL INFO coll_init: Barrier; CL_BASIC {TL_UCP}, team_id 32768
[1678205653.809882] [node_name:903 :0] ucc_coll.c:255 UCC_COLL INFO coll_init: Allreduce sum: src={0x7fc1f3a03800, 262144, float32, Cuda}, dst={0x7fc195800000, 262144, float32, Cuda}; CL_BASIC {TL_NCCL}, team_id 32768
[1678205653.810344] [node_name:903 :0] ucc_coll.c:255 UCC_COLL INFO coll_init: Barrier; CL_BASIC {TL_UCP}, team_id 32768
[1678205653.810582] [node_name:903 :0] ucc_coll.c:255 UCC_COLL INFO coll_init: Reduce min root 0: src={0x7ffef34d5898, 1, float64, Host}, dst={0x7ffef34d58b0, 1, float64, Host}; CL_BASIC {TL_UCP}, team_id 32768
[1678205653.810641] [node_name:903 :0] ucc_coll.c:255 UCC_COLL INFO coll_init: Reduce max root 0: src={0x7ffef34d5898, 1, float64, Host}, dst={0x7ffef34d58a8, 1, float64, Host}; CL_BASIC {TL_UCP}, team_id 32768
[1678205653.810651] [node_name:903 :0] ucc_coll.c:255 UCC_COLL INFO coll_init: Reduce sum root 0: src={0x7ffef34d5898, 1, float64, Host}, dst={0x7ffef34d58a0, 1, float64, Host}; CL_BASIC {TL_UCP}, team_id 32768
1048576 582.07
[1678205653.810705] [node_name:903 :0] ucc_coll.c:255 UCC_COLL INFO coll_init: Barrier; CL_BASIC {TL_UCP}, team_id 32768
- For the CUDA and NCCL TLs, CUDA device dependent data structures are created when UCC is initialized, which usually happens during MPI_Init. For these TLs it is therefore important that the GPU used by an MPI rank does not change after MPI_Init is called (see the sketch after this list).
- UCC does not support CUDA managed memory for all TLs and collectives.
- Logging of collective tasks as described above, using NCCL as an example, is not unified. E.g. some TLs do not log when a collective is started and finalized.
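One way to keep the GPU fixed is to select it per rank before the application starts, so that the device is already set when MPI_Init initializes UCC. A minimal sketch as a launcher wrapper, assuming a hypothetical script name bind_gpu.sh, at least one GPU per local rank, and Open MPI's OMPI_COMM_WORLD_LOCAL_RANK environment variable:
#!/bin/bash
# bind_gpu.sh - expose exactly one GPU to this rank before MPI_Init,
# so the device used by the UCC cuda/nccl TLs cannot change later.
export CUDA_VISIBLE_DEVICES=${OMPI_COMM_WORLD_LOCAL_RANK}
exec "$@"
It would be launched e.g. as mpirun -np 4 ./bind_gpu.sh ./my_mpi_app; whether this simple mapping is appropriate depends on the number of GPUs per node and the desired rank-to-GPU assignment.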
- UCC FAQ: https://github.com/openucx/ucc/wiki/FAQ
- Output of ucc_info -caf