diff --git a/.github/actions/spelling/allow.txt b/.github/actions/spelling/allow.txt index 750049ed..516a474f 100644 --- a/.github/actions/spelling/allow.txt +++ b/.github/actions/spelling/allow.txt @@ -16,9 +16,11 @@ CWP CXI Ceph Containerfile +Containerfiles DNS Dockerfiles Dufourspitze +EFA EMPA ETHZ Ehrenfest @@ -75,6 +77,7 @@ MeteoSwiss NAMD NICs NVMe +NVSHMEM Nordend OpenFabrics OAuth @@ -101,6 +104,7 @@ ROCm RPA Roboto Roothaan +SHMEM SSHService STMV Scopi diff --git a/docs/software/container-engine/guidelines-images/image-comm-fwk.md b/docs/software/container-engine/guidelines-images/image-comm-fwk.md new file mode 100644 index 00000000..1ca39ab5 --- /dev/null +++ b/docs/software/container-engine/guidelines-images/image-comm-fwk.md @@ -0,0 +1,105 @@ +[](){#ref-ce-guidelines-images-commfwk} +# Communication frameworks image + +This page describes a container image providing foundational software components for achieving efficient execution on Alps nodes with NVIDIA GPUs. + +The most important aspect to consider for performance of containerized applications is related to use of high-speed networks, +therefore this image mainly installs communication frameworks and libraries, besides general utility tools. +In particular, the [libfabric](https://ofiwg.github.io/libfabric/) framework (also known as Open Fabrics Interfaces - OFI) is required to interface applications with the Slingshot high-speed network. + +At runtime, the container engine [CXI hook][ref-ce-cxi-hook] will replace the libfabric libraries inside the container with the corresponding libraries on the host system. +This will ensure access to the Slingshot interconnect. + +This image is not intended to be used on its own, but to serve as a base to build higher-level software (e.g. MPI implementations) and application stacks. +For this reason, no performance results are provided in this page. + +A build of this image is currently hosted on the [Quay.io](https://quay.io/) registry at the following reference: +`quay.io/ethcscs/comm-fwk:ofi1.22-ucx1.19-cuda12.8`. +The image name `comm-fwk` is a shortened form of "communication frameworks". 
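+
+As an example of extending this base (a minimal sketch: the repository URL and build steps are placeholders, not part of the published images), a derived Containerfile only needs to start from the published reference; the communication frameworks are then available under the default system prefixes:
+```Dockerfile
+FROM quay.io/ethcscs/comm-fwk:ofi1.22-ucx1.19-cuda12.8
+
+# Hypothetical higher-level component built on top of the communication frameworks.
+# Libfabric and UCX are installed under /usr, GDRCopy under /usr/local,
+# and CUDA under /usr/local/cuda.
+RUN git clone --depth 1 https://example.org/my-hpc-library.git \
+    && cd my-hpc-library \
+    && ./configure --prefix=/usr --with-libfabric=/usr --with-cuda=/usr/local/cuda \
+    && make -j$(nproc) \
+    && make install \
+    && cd .. \
+    && rm -rf my-hpc-library
+```
+The [MPICH image][ref-ce-guidelines-images-mpich] and the [OpenMPI image][ref-ce-guidelines-images-ompi] described in this section follow exactly this pattern.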
+ +## Contents + +- Ubuntu 24.04 +- CUDA 12.8.1 +- GDRCopy 2.5.1 +- Libfabric 1.22.0 +- UCX 1.19.0 + +## Containerfile +```Dockerfile +ARG ubuntu_version=24.04 +ARG cuda_version=12.8.1 +FROM docker.io/nvidia/cuda:${cuda_version}-cudnn-devel-ubuntu${ubuntu_version} + +RUN apt-get update \ + && DEBIAN_FRONTEND=noninteractive \ + apt-get install -y \ + build-essential \ + ca-certificates \ + pkg-config \ + automake \ + autoconf \ + libtool \ + cmake \ + gdb \ + strace \ + wget \ + git \ + bzip2 \ + python3 \ + gfortran \ + rdma-core \ + numactl \ + libconfig-dev \ + libuv1-dev \ + libfuse-dev \ + libfuse3-dev \ + libyaml-dev \ + libnl-3-dev \ + libnuma-dev \ + libsensors-dev \ + libcurl4-openssl-dev \ + libjson-c-dev \ + libibverbs-dev \ + --no-install-recommends \ + && rm -rf /var/lib/apt/lists/* + +ARG gdrcopy_version=2.5.1 +RUN git clone --depth 1 --branch v${gdrcopy_version} https://github.com/NVIDIA/gdrcopy.git \ + && cd gdrcopy \ + && export CUDA_PATH=${CUDA_HOME:-$(echo $(which nvcc) | grep -o '.*cuda')} \ + && make CC=gcc CUDA=$CUDA_PATH lib \ + && make lib_install \ + && cd ../ && rm -rf gdrcopy + +# Install libfabric +ARG libfabric_version=1.22.0 +RUN git clone --branch v${libfabric_version} --depth 1 https://github.com/ofiwg/libfabric.git \ + && cd libfabric \ + && ./autogen.sh \ + && ./configure --prefix=/usr --with-cuda=/usr/local/cuda --enable-cuda-dlopen --enable-gdrcopy-dlopen --enable-efa \ + && make -j$(nproc) \ + && make install \ + && ldconfig \ + && cd .. \ + && rm -rf libfabric + +# Install UCX +ARG UCX_VERSION=1.19.0 +RUN wget https://github.com/openucx/ucx/releases/download/v${UCX_VERSION}/ucx-${UCX_VERSION}.tar.gz \ + && tar xzf ucx-${UCX_VERSION}.tar.gz \ + && cd ucx-${UCX_VERSION} \ + && mkdir build \ + && cd build \ + && ../configure --prefix=/usr --with-cuda=/usr/local/cuda --with-gdrcopy=/usr/local --enable-mt --enable-devel-headers \ + && make -j$(nproc) \ + && make install \ + && cd ../.. \ + && rm -rf ucx-${UCX_VERSION}.tar.gz ucx-${UCX_VERSION} +``` + +## Notes +- The image is based on an official NVIDIA CUDA image, and therefore already provides the NCCL library, alongside a complete CUDA installation. +- Communication frameworks are built with explicit support for CUDA and GDRCopy. +- The libfabric [EFA](https://aws.amazon.com/hpc/efa/) provider is included to leave open the possibility to experiment with derived images on AWS infrastructure as well. +- Although only the libfabric framework is required to support Alps' Slingshot network, this image also packages the UCX communication framework to allow building a broader set of software (e.g. some OpenSHMEM implementations) and supporting optimized Infiniband communication as well. diff --git a/docs/software/container-engine/guidelines-images/image-mpich.md b/docs/software/container-engine/guidelines-images/image-mpich.md new file mode 100644 index 00000000..79fadecf --- /dev/null +++ b/docs/software/container-engine/guidelines-images/image-mpich.md @@ -0,0 +1,578 @@ +[](){#ref-ce-guidelines-images-mpich} +# MPICH image + +This page describes a container image featuring the MPICH library as MPI (Message Passing Interface) implementation, with support for CUDA and Libfabric. + +This image is based on the [communication frameworks image][ref-ce-guidelines-images-commfwk], and thus it is suited for hosts with NVIDIA GPUs, like Alps GH200 nodes. + +A build of this image is currently hosted on the [Quay.io](https://quay.io/) registry at the following reference: +`quay.io/ethcscs/mpich:4.3.1-ofi1.22-cuda12.8`. 
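+
+For a quick sanity check of the MPI installation before building application images on top of it (a minimal sketch: the EDF file name is arbitrary, and `mpichversion` is the version-reporting utility installed by MPICH), the image can be referenced directly from an Environment Definition File and run through the Container Engine:
+```console
+$ cat ~/.edf/mpich.toml
+image = "quay.io#ethcscs/mpich:4.3.1-ofi1.22-cuda12.8"
+
+$ srun -N1 --environment=mpich mpichversion
+```
+The command should report version 4.3.1 together with the configure options used in the Containerfile below.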
+ +## Contents + +- Ubuntu 24.04 +- CUDA 12.8.1 +- GDRCopy 2.5.1 +- Libfabric 1.22.0 +- UCX 1.19.0 +- MPICH 4.3.1 + +## Containerfile +```Dockerfile +FROM quay.io/ethcscs/comm-fwk:ofi1.22-ucx1.19-cuda12.8 + +ARG MPI_VER=4.3.1 +RUN wget -q https://www.mpich.org/static/downloads/${MPI_VER}/mpich-${MPI_VER}.tar.gz \ + && tar xf mpich-${MPI_VER}.tar.gz \ + && cd mpich-${MPI_VER} \ + && ./autogen.sh \ + && ./configure --prefix=/usr --enable-fast=O3,ndebug \ + --disable-fortran --disable-cxx \ + --with-device=ch4:ofi --with-libfabric=/usr \ + --with-cuda=/usr/local/cuda \ + CFLAGS="-L/usr/local/cuda/targets/sbsa-linux/lib/stubs/ -lcuda" \ + CXXFLAGS="-L/usr/local/cuda/targets/sbsa-linux/lib/stubs/ -lcuda" \ + && make -j$(nproc) \ + && make install \ + && ldconfig \ + && cd .. \ + && rm -rf mpich-${MPI_VER}.tar.gz mpich-${MPI_VER} +``` + +!!! tip + This image builds MPICH without Fortran and C++ bindings. In general, C++ bindings are deprecated by the MPI standard. If you require the Fortran bindings, remove the `--disable-fortran` option in the MPICH `configure` command above. + + +## Performance examples + +In this section we demonstrate the performance of the previously created MPICH image using it to build the OSU Micro-Benchmarks 7.5.1, and deploying the resulting image on Alps through the Container Engine to run a variety of benchmarks. + +A build of the image with the OSU benchmarks is available on the [Quay.io](https://quay.io/) registry at the following reference: +`quay.io/ethcscs/osu-mb:7.5-mpich4.3.1-ofi1.22-cuda12.8`. + +### OSU-MB Containerfile +```Dockerfile +FROM quay.io/ethcscs/mpich:4.3.1-ofi1.22-cuda12.8 + +ARG omb_version=7.5.1 +RUN wget -q http://mvapich.cse.ohio-state.edu/download/mvapich/osu-micro-benchmarks-${omb_version}.tar.gz \ + && tar xf osu-micro-benchmarks-${omb_version}.tar.gz \ + && cd osu-micro-benchmarks-${omb_version} \ + && ldconfig /usr/local/cuda/targets/sbsa-linux/lib/stubs \ + && ./configure --prefix=/usr/local CC=$(which mpicc) CFLAGS="-O3 -lcuda -lnvidia-ml" \ + --enable-cuda --with-cuda-include=/usr/local/cuda/include \ + --with-cuda-libpath=/usr/local/cuda/lib64 \ + CXXFLAGS="-lmpi -lcuda" \ + && make -j$(nproc) \ + && make install \ + && ldconfig \ + && cd .. \ + && rm -rf osu-micro-benchmarks-${omb_version} osu-micro-benchmarks-${omb_version}.tar.gz + +WORKDIR /usr/local/libexec/osu-micro-benchmarks/mpi +``` + +### Environment Definition File +```toml +image = "quay.io#ethcscs/osu-mb:7.5-mpich4.3.1-ofi1.22-cuda12.8" +``` + +### Notes + +- **Important:** To make sure that GPU-to-GPU performance is good for inter-node communication one must set the variable `MPIR_CVAR_CH4_OFI_ENABLE_HMEM=1`. + This setting can negatively impact performance for other types of communication (e.g. intra-node CPU-to-CPU transfers). +- Since by default MPICH uses PMI-1 or PMI-2 for wire-up and communication between ranks, when using this image the `srun` option `--mpi=pmi2` must be used to run successful multi-rank jobs. 
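+
+If prefixing every `srun` invocation with `MPIR_CVAR_CH4_OFI_ENABLE_HMEM=1` is inconvenient, the variable can also be set in the EDF's `[env]` table, so it is applied automatically whenever the environment is used (a variant of the EDF shown above; keep in mind the caveat about non-GPU communication mentioned in the first note):
+```toml
+image = "quay.io#ethcscs/osu-mb:7.5-mpich4.3.1-ofi1.22-cuda12.8"
+
+[env]
+MPIR_CVAR_CH4_OFI_ENABLE_HMEM="1"
+```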
+ +### Results + +=== "Point-to-point bandwidth, CPU-to-CPU memory, inter-node communication" + ```console + $ srun -N2 --mpi=pmi2 --environment=omb-mpich ./pt2pt/osu_bw --validation + /usr/local/libexec/osu-micro-benchmarks/mpi/./pt2pt/osu_bw: /usr/lib/aarch64-linux-gnu/libnl-3.so.200: no version information available (required by /usr/lib64/libcxi.so.1) + /usr/local/libexec/osu-micro-benchmarks/mpi/./pt2pt/osu_bw: /usr/lib/aarch64-linux-gnu/libnl-3.so.200: no version information available (required by /usr/lib64/libcxi.so.1) + + # OSU MPI Bandwidth Test v7.5 + # Datatype: MPI_CHAR. + # Size Bandwidth (MB/s) Validation + 1 0.88 Pass + 2 1.76 Pass + 4 3.53 Pass + 8 7.07 Pass + 16 14.16 Pass + 32 27.76 Pass + 64 56.80 Pass + 128 113.27 Pass + 256 225.42 Pass + 512 445.70 Pass + 1024 883.96 Pass + 2048 1733.54 Pass + 4096 3309.75 Pass + 8192 6188.29 Pass + 16384 12415.59 Pass + 32768 19526.60 Pass + 65536 22624.33 Pass + 131072 23346.67 Pass + 262144 23671.41 Pass + 524288 23847.29 Pass + 1048576 23940.59 Pass + 2097152 23980.12 Pass + 4194304 24007.69 Pass + ``` + +=== "Point-to-point bandwidth, GPU-to-GPU memory, inter-node communication" + ```console + $ MPIR_CVAR_CH4_OFI_ENABLE_HMEM=1 srun -N2 --mpi=pmi2 --environment=omb-mpich ./pt2pt/osu_bw --validation D D + /usr/local/libexec/osu-micro-benchmarks/mpi/./pt2pt/osu_bw: /usr/lib/aarch64-linux-gnu/libnl-3.so.200: no version information available (required by /usr/lib64/libcxi.so.1) + /usr/local/libexec/osu-micro-benchmarks/mpi/./pt2pt/osu_bw: /usr/lib/aarch64-linux-gnu/libnl-3.so.200: no version information available (required by /usr/lib64/libcxi.so.1) + + # OSU MPI-CUDA Bandwidth Test v7.5 + # Datatype: MPI_CHAR. + # Size Bandwidth (MB/s) Validation + 1 0.92 Pass + 2 1.80 Pass + 4 3.72 Pass + 8 7.45 Pass + 16 14.91 Pass + 32 29.66 Pass + 64 59.65 Pass + 128 119.08 Pass + 256 236.90 Pass + 512 467.70 Pass + 1024 930.74 Pass + 2048 1808.56 Pass + 4096 3461.06 Pass + 8192 6385.63 Pass + 16384 12768.18 Pass + 32768 19332.39 Pass + 65536 22547.35 Pass + 131072 23297.26 Pass + 262144 23652.07 Pass + 524288 23812.58 Pass + 1048576 23913.85 Pass + 2097152 23971.55 Pass + 4194304 23998.79 Pass + ``` + + +=== "Point-to-point bandwidth, CPU-to-CPU memory, intra-node communication" + ```console + $ srun -N1 -n2 --mpi=pmi2 --environment=omb-mpich ./pt2pt/osu_bw --validation + /usr/local/libexec/osu-micro-benchmarks/mpi/./pt2pt/osu_bw: /usr/lib/aarch64-linux-gnu/libnl-3.so.200: no version information available (required by /usr/lib64/libcxi.so.1) + /usr/local/libexec/osu-micro-benchmarks/mpi/./pt2pt/osu_bw: /usr/lib/aarch64-linux-gnu/libnl-3.so.200: no version information available (required by /usr/lib64/libcxi.so.1) + + # OSU MPI Bandwidth Test v7.5 + # Datatype: MPI_CHAR. 
+ # Size Bandwidth (MB/s) Validation + 1 1.28 Pass + 2 2.60 Pass + 4 5.20 Pass + 8 10.39 Pass + 16 20.85 Pass + 32 41.56 Pass + 64 83.23 Pass + 128 164.73 Pass + 256 326.92 Pass + 512 632.98 Pass + 1024 1209.82 Pass + 2048 2352.68 Pass + 4096 4613.67 Pass + 8192 8881.00 Pass + 16384 7435.51 Pass + 32768 9369.82 Pass + 65536 11644.51 Pass + 131072 13198.71 Pass + 262144 14058.41 Pass + 524288 12958.24 Pass + 1048576 12836.55 Pass + 2097152 13117.14 Pass + 4194304 13187.01 Pass + ``` + + +=== "Point-to-point bandwidth, GPU-to-GPU memory, intra-node communication" + ```console + $ srun -N1 -n2 --mpi=pmi2 --environment=omb-mpich ./pt2pt/osu_bw --validation D D + /usr/local/libexec/osu-micro-benchmarks/mpi/./pt2pt/osu_bw: /usr/lib/aarch64-linux-gnu/libnl-3.so.200: no version information available (required by /usr/lib64/libcxi.so.1) + /usr/local/libexec/osu-micro-benchmarks/mpi/./pt2pt/osu_bw: /usr/lib/aarch64-linux-gnu/libnl-3.so.200: no version information available (required by /usr/lib64/libcxi.so.1) + + # OSU MPI-CUDA Bandwidth Test v7.5 + # Datatype: MPI_CHAR. + # Size Bandwidth (MB/s) Validation + 1 0.13 Pass + 2 0.27 Pass + 4 0.55 Pass + 8 1.10 Pass + 16 2.20 Pass + 32 4.40 Pass + 64 8.77 Pass + 128 17.50 Pass + 256 35.01 Pass + 512 70.14 Pass + 1024 140.35 Pass + 2048 278.91 Pass + 4096 555.96 Pass + 8192 1104.97 Pass + 16384 2214.87 Pass + 32768 4422.67 Pass + 65536 8833.18 Pass + 131072 17765.30 Pass + 262144 33834.24 Pass + 524288 59704.15 Pass + 1048576 84566.94 Pass + 2097152 102221.49 Pass + 4194304 113955.83 Pass + ``` + + +=== "Point-to-point bi-directional bandwidth, CPU-to-CPU memory, inter-node communication" + ```console + $ srun -N2 --mpi=pmi2 --environment=omb-mpich ./pt2pt/osu_bibw --validation + /usr/local/libexec/osu-micro-benchmarks/mpi/./pt2pt/osu_bibw: /usr/lib/aarch64-linux-gnu/libnl-3.so.200: no version information available (required by /usr/lib64/libcxi.so.1) + /usr/local/libexec/osu-micro-benchmarks/mpi/./pt2pt/osu_bibw: /usr/lib/aarch64-linux-gnu/libnl-3.so.200: no version information available (required by /usr/lib64/libcxi.so.1) + + # OSU MPI Bi-Directional Bandwidth Test v7.5 + # Datatype: MPI_CHAR. + # Size Bandwidth (MB/s) Validation + 1 1.03 Pass + 2 2.07 Pass + 4 4.14 Pass + 8 8.28 Pass + 16 16.54 Pass + 32 33.07 Pass + 64 66.08 Pass + 128 131.65 Pass + 256 258.60 Pass + 512 518.60 Pass + 1024 1036.09 Pass + 2048 2072.16 Pass + 4096 4142.18 Pass + 8192 7551.70 Pass + 16384 14953.49 Pass + 32768 23871.35 Pass + 65536 33767.12 Pass + 131072 39284.40 Pass + 262144 42638.43 Pass + 524288 44602.52 Pass + 1048576 45621.16 Pass + 2097152 46159.65 Pass + 4194304 46433.80 Pass + ``` + + +=== "Point-to-point bi-directional bandwidth, GPU-to-GPU memory, inter-node communication" + ```console + $ MPIR_CVAR_CH4_OFI_ENABLE_HMEM=1 srun -N2 --mpi=pmi2 --environment=omb-mpich ./pt2pt/osu_bibw --validation D D + /usr/local/libexec/osu-micro-benchmarks/mpi/./pt2pt/osu_bibw: /usr/lib/aarch64-linux-gnu/libnl-3.so.200: no version information available (required by /usr/lib64/libcxi.so.1) + /usr/local/libexec/osu-micro-benchmarks/mpi/./pt2pt/osu_bibw: /usr/lib/aarch64-linux-gnu/libnl-3.so.200: no version information available (required by /usr/lib64/libcxi.so.1) + + # OSU MPI-CUDA Bi-Directional Bandwidth Test v7.5 + # Datatype: MPI_CHAR. 
+ # Size Bandwidth (MB/s) Validation + 1 1.05 Pass + 2 2.10 Pass + 4 4.20 Pass + 8 8.40 Pass + 16 16.84 Pass + 32 33.63 Pass + 64 67.01 Pass + 128 132.11 Pass + 256 258.74 Pass + 512 515.52 Pass + 1024 1025.44 Pass + 2048 2019.51 Pass + 4096 3844.87 Pass + 8192 6123.96 Pass + 16384 13244.25 Pass + 32768 22521.76 Pass + 65536 34040.97 Pass + 131072 39503.52 Pass + 262144 42827.91 Pass + 524288 44663.44 Pass + 1048576 45629.24 Pass + 2097152 46167.41 Pass + 4194304 46437.18 Pass + ``` + + +=== "Point-to-point latency, CPU-to-CPU memory, inter-node communication" + ```console + $ srun -N2 --mpi=pmi2 --environment=omb-mpich ./pt2pt/osu_latency --validation + /usr/local/libexec/osu-micro-benchmarks/mpi/./pt2pt/osu_latency: /usr/lib/aarch64-linux-gnu/libnl-3.so.200: no version information available (required by /usr/lib64/libcxi.so.1) + /usr/local/libexec/osu-micro-benchmarks/mpi/./pt2pt/osu_latency: /usr/lib/aarch64-linux-gnu/libnl-3.so.200: no version information available (required by /usr/lib64/libcxi.so.1) + + # OSU MPI Latency Test v7.5 + # Datatype: MPI_CHAR. + # Size Avg Latency(us) Validation + 1 3.00 Pass + 2 2.99 Pass + 4 2.99 Pass + 8 3.07 Pass + 16 2.99 Pass + 32 3.08 Pass + 64 3.01 Pass + 128 3.88 Pass + 256 4.43 Pass + 512 4.62 Pass + 1024 4.47 Pass + 2048 4.57 Pass + 4096 4.79 Pass + 8192 7.92 Pass + 16384 8.53 Pass + 32768 9.48 Pass + 65536 10.92 Pass + 131072 13.84 Pass + 262144 19.19 Pass + 524288 30.05 Pass + 1048576 51.73 Pass + 2097152 94.94 Pass + 4194304 181.46 Pass + ``` + + +=== "All-to-all collective latency, CPU-to-CPU memory, multiple nodes" + ```console + $ srun -N2 --ntasks-per-node=4 --mpi=pmi2 --environment=omb-mpich ./collective/osu_alltoall --validation + /usr/local/libexec/osu-micro-benchmarks/mpi/./collective/osu_alltoall: /usr/lib/aarch64-linux-gnu/libnl-3.so.200: no version information available (required by /usr/lib64/libcxi.so.1) + /usr/local/libexec/osu-micro-benchmarks/mpi/./collective/osu_alltoall: /usr/lib/aarch64-linux-gnu/libnl-3.so.200: no version information available (required by /usr/lib64/libcxi.so.1) + /usr/local/libexec/osu-micro-benchmarks/mpi/./collective/osu_alltoall: /usr/lib/aarch64-linux-gnu/libnl-3.so.200: no version information available (required by /usr/lib64/libcxi.so.1) + /usr/local/libexec/osu-micro-benchmarks/mpi/./collective/osu_alltoall: /usr/lib/aarch64-linux-gnu/libnl-3.so.200: no version information available (required by /usr/lib64/libcxi.so.1) + /usr/local/libexec/osu-micro-benchmarks/mpi/./collective/osu_alltoall: /usr/lib/aarch64-linux-gnu/libnl-3.so.200: no version information available (required by /usr/lib64/libcxi.so.1) + /usr/local/libexec/osu-micro-benchmarks/mpi/./collective/osu_alltoall: /usr/lib/aarch64-linux-gnu/libnl-3.so.200: no version information available (required by /usr/lib64/libcxi.so.1) + /usr/local/libexec/osu-micro-benchmarks/mpi/./collective/osu_alltoall: /usr/lib/aarch64-linux-gnu/libnl-3.so.200: no version information available (required by /usr/lib64/libcxi.so.1) + /usr/local/libexec/osu-micro-benchmarks/mpi/./collective/osu_alltoall: /usr/lib/aarch64-linux-gnu/libnl-3.so.200: no version information available (required by /usr/lib64/libcxi.so.1) + + # OSU MPI All-to-All Personalized Exchange Latency Test v7.5 + # Datatype: MPI_CHAR. 
+ # Size Avg Latency(us) Validation + 1 22.25 Pass + 2 22.34 Pass + 4 21.83 Pass + 8 21.72 Pass + 16 21.74 Pass + 32 21.71 Pass + 64 22.02 Pass + 128 22.35 Pass + 256 22.84 Pass + 512 23.42 Pass + 1024 24.61 Pass + 2048 24.99 Pass + 4096 26.02 Pass + 8192 29.17 Pass + 16384 68.81 Pass + 32768 95.63 Pass + 65536 181.42 Pass + 131072 306.83 Pass + 262144 526.50 Pass + 524288 960.52 Pass + 1048576 1823.52 Pass + ``` + + +=== "All-to-all collective latency, GPU-to-GPU memory, multiple nodes" + ```console + $ MPIR_CVAR_CH4_OFI_ENABLE_HMEM=1 srun -N2 --ntasks-per-node=4 --mpi=pmi2 --environment=omb-mpich ./collective/osu_alltoall --validation -d cuda + /usr/local/libexec/osu-micro-benchmarks/mpi/./collective/osu_alltoall: /usr/lib/aarch64-linux-gnu/libnl-3.so.200: no version information available (required by /usr/lib64/libcxi.so.1) + /usr/local/libexec/osu-micro-benchmarks/mpi/./collective/osu_alltoall: /usr/lib/aarch64-linux-gnu/libnl-3.so.200: no version information available (required by /usr/lib64/libcxi.so.1) + /usr/local/libexec/osu-micro-benchmarks/mpi/./collective/osu_alltoall: /usr/lib/aarch64-linux-gnu/libnl-3.so.200: no version information available (required by /usr/lib64/libcxi.so.1) + /usr/local/libexec/osu-micro-benchmarks/mpi/./collective/osu_alltoall: /usr/lib/aarch64-linux-gnu/libnl-3.so.200: no version information available (required by /usr/lib64/libcxi.so.1) + /usr/local/libexec/osu-micro-benchmarks/mpi/./collective/osu_alltoall: /usr/lib/aarch64-linux-gnu/libnl-3.so.200: no version information available (required by /usr/lib64/libcxi.so.1) + /usr/local/libexec/osu-micro-benchmarks/mpi/./collective/osu_alltoall: /usr/lib/aarch64-linux-gnu/libnl-3.so.200: no version information available (required by /usr/lib64/libcxi.so.1) + /usr/local/libexec/osu-micro-benchmarks/mpi/./collective/osu_alltoall: /usr/lib/aarch64-linux-gnu/libnl-3.so.200: no version information available (required by /usr/lib64/libcxi.so.1) + /usr/local/libexec/osu-micro-benchmarks/mpi/./collective/osu_alltoall: /usr/lib/aarch64-linux-gnu/libnl-3.so.200: no version information available (required by /usr/lib64/libcxi.so.1) + + # OSU MPI-CUDA All-to-All Personalized Exchange Latency Test v7.5 + # Datatype: MPI_CHAR. + # Size Avg Latency(us) Validation + 1 65.62 Pass + 2 65.51 Pass + 4 65.46 Pass + 8 65.40 Pass + 16 65.58 Pass + 32 64.97 Pass + 64 65.01 Pass + 128 65.31 Pass + 256 65.03 Pass + 512 65.14 Pass + 1024 65.67 Pass + 2048 66.23 Pass + 4096 66.69 Pass + 8192 67.47 Pass + 16384 85.99 Pass + 32768 103.15 Pass + 65536 120.40 Pass + 131072 135.64 Pass + 262144 162.24 Pass + 524288 213.84 Pass + 1048576 317.07 Pass + ``` + + +### Results without the CXI hook +On many Alps vClusters, the Container Engine is configured with the CXI hook enabled by default, enabling transparent access to the Slingshot interconnect. + +This section demonstrates the performance benefit of the CXI hook by explicitly disabling it through the EDF: +```console +$ cat .edf/omb-mpich-no-cxi.toml +image = "quay.io#ethcscs/osu-mb:7.5-mpich4.3.1-ofi1.22-cuda12.8" + +[annotations] +com.hooks.cxi.enabled="false" +``` + +=== "Point-to-point bandwidth, CPU-to-CPU memory, inter-node communication" + ```console + $ srun -N2 --mpi=pmi2 --environment=omb-mpich-no-cxi ./pt2pt/osu_bw --validation + + # OSU MPI Bandwidth Test v7.5 + # Datatype: MPI_CHAR. 
+ # Size Bandwidth (MB/s) Validation + 1 0.14 Pass + 2 0.28 Pass + 4 0.56 Pass + 8 1.15 Pass + 16 2.32 Pass + 32 4.55 Pass + 64 9.36 Pass + 128 18.20 Pass + 256 20.26 Pass + 512 39.11 Pass + 1024 55.88 Pass + 2048 108.19 Pass + 4096 142.91 Pass + 8192 393.95 Pass + 16384 307.93 Pass + 32768 1205.61 Pass + 65536 1723.86 Pass + 131072 2376.59 Pass + 262144 2847.85 Pass + 524288 3277.75 Pass + 1048576 3580.23 Pass + 2097152 3697.47 Pass + 4194304 3764.11 Pass + ``` + +=== "Point-to-point bandwidth, GPU-to-GPU memory, inter-node communication" + ```console + $ srun -N2 --mpi=pmi2 --environment=omb-mpich-no-cxi ./pt2pt/osu_bw --validation D D + + # OSU MPI-CUDA Bandwidth Test v7.5 + # Datatype: MPI_CHAR. + # Size Bandwidth (MB/s) Validation + 1 0.04 Pass + 2 0.08 Pass + 4 0.16 Pass + 8 0.31 Pass + 16 0.62 Pass + 32 1.24 Pass + 64 2.46 Pass + 128 4.80 Pass + 256 7.33 Pass + 512 14.40 Pass + 1024 24.43 Pass + 2048 47.68 Pass + 4096 85.40 Pass + 8192 161.68 Pass + 16384 306.15 Pass + 32768 520.57 Pass + 65536 818.99 Pass + 131072 1160.48 Pass + 262144 1436.44 Pass + 524288 1676.61 Pass + 1048576 2003.55 Pass + 2097152 2104.65 Pass + 4194304 2271.56 Pass + ``` + +=== "Point-to-point latency, CPU-to-CPU memory, inter-node communication" + ```console + $ srun -N2 --mpi=pmi2 --environment=omb-mpich-no-cxi ./pt2pt/osu_latency --validation + + # OSU MPI Latency Test v7.5 + # Datatype: MPI_CHAR. + # Size Avg Latency(us) Validation + 1 38.25 Pass + 2 38.58 Pass + 4 38.49 Pass + 8 38.43 Pass + 16 38.40 Pass + 32 38.49 Pass + 64 39.18 Pass + 128 39.23 Pass + 256 45.17 Pass + 512 53.49 Pass + 1024 59.60 Pass + 2048 48.83 Pass + 4096 50.84 Pass + 8192 51.45 Pass + 16384 52.35 Pass + 32768 58.92 Pass + 65536 74.88 Pass + 131072 100.32 Pass + 262144 135.35 Pass + 524288 219.52 Pass + 1048576 384.61 Pass + 2097152 706.79 Pass + 4194304 1341.79 Pass + ``` + + +=== "All-to-all collective latency, CPU-to-CPU memory, multiple nodes" + ```console + $ srun -N2 --ntasks-per-node=4 --mpi=pmi2 --environment=omb-mpich-no-cxi ./collective/osu_alltoall --validation + + # OSU MPI All-to-All Personalized Exchange Latency Test v7.5 + # Datatype: MPI_CHAR. + # Size Avg Latency(us) Validation + 1 169.19 Pass + 2 169.50 Pass + 4 170.35 Pass + 8 168.81 Pass + 16 169.71 Pass + 32 169.60 Pass + 64 169.47 Pass + 128 171.48 Pass + 256 334.47 Pass + 512 343.06 Pass + 1024 703.55 Pass + 2048 449.30 Pass + 4096 454.68 Pass + 8192 468.90 Pass + 16384 532.46 Pass + 32768 578.95 Pass + 65536 1164.92 Pass + 131072 1511.04 Pass + 262144 2287.48 Pass + 524288 3668.35 Pass + 1048576 6498.36 Pass + ``` + + +=== "All-to-all collective latency, GPU-to-GPU memory, multiple nodes" + ```console + $ srun -N2 --ntasks-per-node=4 --mpi=pmi2 --environment=omb-mpich-no-cxi ./collective/osu_alltoall --validation -d cuda + + # OSU MPI-CUDA All-to-All Personalized Exchange Latency Test v7.5 + # Datatype: MPI_CHAR. 
+ # Size Avg Latency(us) Validation + 1 276.29 Pass + 2 273.94 Pass + 4 273.53 Pass + 8 273.88 Pass + 16 274.83 Pass + 32 274.90 Pass + 64 276.85 Pass + 128 278.17 Pass + 256 413.21 Pass + 512 442.62 Pass + 1024 793.14 Pass + 2048 547.57 Pass + 4096 561.82 Pass + 8192 570.71 Pass + 16384 624.20 Pass + 32768 657.30 Pass + 65536 1168.43 Pass + 131072 1451.91 Pass + 262144 2049.24 Pass + 524288 3061.54 Pass + 1048576 5238.24 Pass + ``` diff --git a/docs/software/container-engine/guidelines-images/image-nccl-tests.md b/docs/software/container-engine/guidelines-images/image-nccl-tests.md new file mode 100644 index 00000000..3f0801df --- /dev/null +++ b/docs/software/container-engine/guidelines-images/image-nccl-tests.md @@ -0,0 +1,185 @@ +[](){#ref-ce-guidelines-images-nccl-tests} +# NCCL Tests image + +This page describes a container image featuring the [NCCL Tests](https://github.com/NVIDIA/nccl-tests) to demonstrate how to efficiently execute NCCL-based containerized software on Alps. + +This image is based on the [OpenMPI image][ref-ce-guidelines-images-ompi], and thus it is suited for hosts with NVIDIA GPUs, like Alps GH200 nodes. + +A build of this image is currently hosted on the [Quay.io](https://quay.io/) registry at the following reference: +`quay.io/ethcscs/nccl-tests:2.17.1-ompi5.0.8-ofi1.22-cuda12.8`. + +## Contents + +- Ubuntu 24.04 +- CUDA 12.8.1 (includes NCCL) +- GDRCopy 2.5.1 +- Libfabric 1.22.0 +- UCX 1.19.0 +- OpenMPI 5.0.8 +- NCCL Tests 2.17.1 + +## Containerfile +```Dockerfile +FROM quay.io/ethcscs/ompi:5.0.8-ofi1.22-cuda12.8 + +ARG nccl_tests_version=2.17.1 +RUN wget -O nccl-tests-${nccl_tests_version}.tar.gz https://github.com/NVIDIA/nccl-tests/archive/refs/tags/v${nccl_tests_version}.tar.gz \ + && tar xf nccl-tests-${nccl_tests_version}.tar.gz \ + && cd nccl-tests-${nccl_tests_version} \ + && MPI=1 make -j$(nproc) \ + && cd .. \ + && rm -rf nccl-tests-${nccl_tests_version}.tar.gz +``` + +!!! note + This image builds NCCL tests with MPI support enabled. + +## Performance examples + +### Environment Definition File +```toml +image = "quay.io#ethcscs/nccl-tests:2.17.1-ompi5.0.8-ofi1.22-cuda12.8" + +[env] +PMIX_MCA_psec="native" # (1)! + +[annotations] +com.hooks.aws_ofi_nccl.enabled = "true" +com.hooks.aws_ofi_nccl.variant = "cuda12" +``` + +1. Ensures PMIx uses the same security domain as Slurm. Otherwise PMIx will print warnings at startup. + +### Notes + +- Since OpenMPI uses PMIx for wire-up and communication between ranks, when using this image the `srun` option `--mpi=pmix` must be used to run successful multi-rank jobs. +- NCCL requires the presence of the [AWS OFI NCCL plugin](https://github.com/aws/aws-ofi-nccl) in order to correctly interface with Libfabric and (through the latter) the Slingshot interconnect. Therefore, for optimal performance the [related CE hook][ref-ce-aws-ofi-hook] must be enabled and set to match the CUDA version in the container. +- Libfabric itself is usually injected by the [CXI hook][ref-ce-cxi-hook], which is enabled by default on several Alps vClusters. 
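+
+To verify at runtime that NCCL actually loads the plugin injected by the AWS OFI NCCL hook (a diagnostic sketch: `NCCL_DEBUG=INFO` is NCCL's standard logging switch, and the rest of the command mirrors the benchmark invocation used in the results below), enable NCCL's informational logging for a short run and inspect the network transport reported during initialization:
+```console
+$ NCCL_DEBUG=INFO srun -N2 -t5 --mpi=pmix --ntasks-per-node=4 --environment=nccl-test-ompi \
+    /nccl-tests-2.17.1/build/all_reduce_perf -b 8 -e 1M -f 2 2>&1 | grep 'NET/'
+```
+The initialization messages should indicate that the libfabric-based network plugin, rather than plain sockets, is being used.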
+ +### Results + +=== "All-reduce latency test on 2 nodes, 8 GPUs" + ```console + $ srun -N2 -t5 --mpi=pmix --ntasks-per-node=4 --environment=nccl-test-ompi /nccl-tests-2.17.1/build/all_reduce_perf -b 8 -e 128M -f 2 + /nccl-tests-2.17.1/build/all_reduce_perf: /usr/lib/aarch64-linux-gnu/libnl-3.so.200: no version information available (required by /usr/lib64/libcxi.so.1) + /nccl-tests-2.17.1/build/all_reduce_perf: /usr/lib/aarch64-linux-gnu/libnl-3.so.200: no version information available (required by /usr/lib64/libcxi.so.1) + /nccl-tests-2.17.1/build/all_reduce_perf: /usr/lib/aarch64-linux-gnu/libnl-3.so.200: no version information available (required by /usr/lib64/libcxi.so.1) + /nccl-tests-2.17.1/build/all_reduce_perf: /usr/lib/aarch64-linux-gnu/libnl-3.so.200: no version information available (required by /usr/lib64/libcxi.so.1) + /nccl-tests-2.17.1/build/all_reduce_perf: /usr/lib/aarch64-linux-gnu/libnl-3.so.200: no version information available (required by /usr/lib64/libcxi.so.1) + /nccl-tests-2.17.1/build/all_reduce_perf: /usr/lib/aarch64-linux-gnu/libnl-3.so.200: no version information available (required by /usr/lib64/libcxi.so.1) + /nccl-tests-2.17.1/build/all_reduce_perf: /usr/lib/aarch64-linux-gnu/libnl-3.so.200: no version information available (required by /usr/lib64/libcxi.so.1) + /nccl-tests-2.17.1/build/all_reduce_perf: /usr/lib/aarch64-linux-gnu/libnl-3.so.200: no version information available (required by /usr/lib64/libcxi.so.1) + # Collective test starting: all_reduce_perf + # nThread 1 nGpus 1 minBytes 8 maxBytes 134217728 step: 2(factor) warmup iters: 1 iters: 20 agg iters: 1 validation: 1 graph: 0 + # + # Using devices + # Rank 0 Group 0 Pid 204199 on nid005471 device 0 [0009:01:00] NVIDIA GH200 120GB + # Rank 1 Group 0 Pid 204200 on nid005471 device 1 [0019:01:00] NVIDIA GH200 120GB + # Rank 2 Group 0 Pid 204201 on nid005471 device 2 [0029:01:00] NVIDIA GH200 120GB + # Rank 3 Group 0 Pid 204202 on nid005471 device 3 [0039:01:00] NVIDIA GH200 120GB + # Rank 4 Group 0 Pid 155254 on nid005487 device 0 [0009:01:00] NVIDIA GH200 120GB + # Rank 5 Group 0 Pid 155255 on nid005487 device 1 [0019:01:00] NVIDIA GH200 120GB + # Rank 6 Group 0 Pid 155256 on nid005487 device 2 [0029:01:00] NVIDIA GH200 120GB + # Rank 7 Group 0 Pid 155257 on nid005487 device 3 [0039:01:00] NVIDIA GH200 120GB + # + # out-of-place in-place + # size count type redop root time algbw busbw #wrong time algbw busbw #wrong + # (B) (elements) (us) (GB/s) (GB/s) (us) (GB/s) (GB/s) + 8 2 float sum -1 17.93 0.00 0.00 0 17.72 0.00 0.00 0 + 16 4 float sum -1 17.65 0.00 0.00 0 17.63 0.00 0.00 0 + 32 8 float sum -1 17.54 0.00 0.00 0 17.43 0.00 0.00 0 + 64 16 float sum -1 19.27 0.00 0.01 0 19.21 0.00 0.01 0 + 128 32 float sum -1 18.86 0.01 0.01 0 18.67 0.01 0.01 0 + 256 64 float sum -1 18.83 0.01 0.02 0 19.02 0.01 0.02 0 + 512 128 float sum -1 19.72 0.03 0.05 0 19.40 0.03 0.05 0 + 1024 256 float sum -1 20.35 0.05 0.09 0 20.32 0.05 0.09 0 + 2048 512 float sum -1 22.07 0.09 0.16 0 21.72 0.09 0.17 0 + 4096 1024 float sum -1 31.97 0.13 0.22 0 31.58 0.13 0.23 0 + 8192 2048 float sum -1 37.21 0.22 0.39 0 35.84 0.23 0.40 0 + 16384 4096 float sum -1 37.29 0.44 0.77 0 36.53 0.45 0.78 0 + 32768 8192 float sum -1 39.61 0.83 1.45 0 37.09 0.88 1.55 0 + 65536 16384 float sum -1 61.03 1.07 1.88 0 68.45 0.96 1.68 0 + 131072 32768 float sum -1 81.41 1.61 2.82 0 72.94 1.80 3.14 0 + 262144 65536 float sum -1 127.0 2.06 3.61 0 108.9 2.41 4.21 0 + 524288 131072 float sum -1 170.3 3.08 5.39 0 349.6 1.50 2.62 0 + 1048576 262144 float 
sum -1 164.3 6.38 11.17 0 187.7 5.59 9.77 0 + 2097152 524288 float sum -1 182.1 11.51 20.15 0 180.6 11.61 20.32 0 + 4194304 1048576 float sum -1 292.7 14.33 25.08 0 295.4 14.20 24.85 0 + 8388608 2097152 float sum -1 344.5 24.35 42.61 0 345.7 24.27 42.47 0 + 16777216 4194304 float sum -1 461.7 36.34 63.59 0 454.0 36.95 64.67 0 + 33554432 8388608 float sum -1 686.5 48.88 85.54 0 686.6 48.87 85.52 0 + 67108864 16777216 float sum -1 1090.5 61.54 107.69 0 1083.5 61.94 108.39 0 + 134217728 33554432 float sum -1 1916.4 70.04 122.57 0 1907.8 70.35 123.11 0 + # Out of bounds values : 0 OK + # Avg bus bandwidth : 19.7866 + # + # Collective test concluded: all_reduce_perf + ``` + +### Results without the AWS OFI NCCL hook +This section demonstrates the performance benefit of the AWS OFI NCCL hook by not enabling it through the EDF: +```console +$ cat ~/.edf/nccl-test-ompi-no-awsofinccl.toml +image = "quay.io#ethcscs/nccl-tests:2.17.1-ompi5.0.8-ofi1.22-cuda12.8" + +[env] +PMIX_MCA_psec="native" +``` + +=== "All-reduce latency test on 2 nodes, 8 GPUs" + ```console + $ srun -N2 -t5 --mpi=pmix --ntasks-per-node=4 --environment=nccl-test-ompi /nccl-tests-2.17.1/build/all_reduce_perf -b 8 -e 128M -f 2 + /nccl-tests-2.17.1/build/all_reduce_perf: /usr/lib/aarch64-linux-gnu/libnl-3.so.200: no version information available (required by /usr/lib64/libcxi.so.1) + /nccl-tests-2.17.1/build/all_reduce_perf: /usr/lib/aarch64-linux-gnu/libnl-3.so.200: no version information available (required by /usr/lib64/libcxi.so.1) + /nccl-tests-2.17.1/build/all_reduce_perf: /usr/lib/aarch64-linux-gnu/libnl-3.so.200: no version information available (required by /usr/lib64/libcxi.so.1) + /nccl-tests-2.17.1/build/all_reduce_perf: /usr/lib/aarch64-linux-gnu/libnl-3.so.200: no version information available (required by /usr/lib64/libcxi.so.1) + /nccl-tests-2.17.1/build/all_reduce_perf: /usr/lib/aarch64-linux-gnu/libnl-3.so.200: no version information available (required by /usr/lib64/libcxi.so.1) + /nccl-tests-2.17.1/build/all_reduce_perf: /usr/lib/aarch64-linux-gnu/libnl-3.so.200: no version information available (required by /usr/lib64/libcxi.so.1) + /nccl-tests-2.17.1/build/all_reduce_perf: /usr/lib/aarch64-linux-gnu/libnl-3.so.200: no version information available (required by /usr/lib64/libcxi.so.1) + /nccl-tests-2.17.1/build/all_reduce_perf: /usr/lib/aarch64-linux-gnu/libnl-3.so.200: no version information available (required by /usr/lib64/libcxi.so.1) + # Collective test starting: all_reduce_perf + # nThread 1 nGpus 1 minBytes 8 maxBytes 134217728 step: 2(factor) warmup iters: 1 iters: 20 agg iters: 1 validation: 1 graph: 0 + # + # Using devices + # Rank 0 Group 0 Pid 202829 on nid005471 device 0 [0009:01:00] NVIDIA GH200 120GB + # Rank 1 Group 0 Pid 202830 on nid005471 device 1 [0019:01:00] NVIDIA GH200 120GB + # Rank 2 Group 0 Pid 202831 on nid005471 device 2 [0029:01:00] NVIDIA GH200 120GB + # Rank 3 Group 0 Pid 202832 on nid005471 device 3 [0039:01:00] NVIDIA GH200 120GB + # Rank 4 Group 0 Pid 154517 on nid005487 device 0 [0009:01:00] NVIDIA GH200 120GB + # Rank 5 Group 0 Pid 154518 on nid005487 device 1 [0019:01:00] NVIDIA GH200 120GB + # Rank 6 Group 0 Pid 154519 on nid005487 device 2 [0029:01:00] NVIDIA GH200 120GB + # Rank 7 Group 0 Pid 154520 on nid005487 device 3 [0039:01:00] NVIDIA GH200 120GB + # + # out-of-place in-place + # size count type redop root time algbw busbw #wrong time algbw busbw #wrong + # (B) (elements) (us) (GB/s) (GB/s) (us) (GB/s) (GB/s) + 8 2 float sum -1 85.47 0.00 0.00 0 53.44 0.00 0.00 0 + 
16 4 float sum -1 52.41 0.00 0.00 0 51.11 0.00 0.00 0 + 32 8 float sum -1 50.45 0.00 0.00 0 50.40 0.00 0.00 0 + 64 16 float sum -1 62.58 0.00 0.00 0 50.70 0.00 0.00 0 + 128 32 float sum -1 50.94 0.00 0.00 0 50.77 0.00 0.00 0 + 256 64 float sum -1 50.76 0.01 0.01 0 51.77 0.00 0.01 0 + 512 128 float sum -1 163.2 0.00 0.01 0 357.5 0.00 0.00 0 + 1024 256 float sum -1 373.0 0.00 0.00 0 59.31 0.02 0.03 0 + 2048 512 float sum -1 53.22 0.04 0.07 0 52.58 0.04 0.07 0 + 4096 1024 float sum -1 55.95 0.07 0.13 0 56.63 0.07 0.13 0 + 8192 2048 float sum -1 58.52 0.14 0.24 0 58.62 0.14 0.24 0 + 16384 4096 float sum -1 108.7 0.15 0.26 0 107.8 0.15 0.27 0 + 32768 8192 float sum -1 184.1 0.18 0.31 0 183.5 0.18 0.31 0 + 65536 16384 float sum -1 325.0 0.20 0.35 0 325.4 0.20 0.35 0 + 131072 32768 float sum -1 592.7 0.22 0.39 0 591.5 0.22 0.39 0 + 262144 65536 float sum -1 942.0 0.28 0.49 0 941.4 0.28 0.49 0 + 524288 131072 float sum -1 1143.1 0.46 0.80 0 1138.0 0.46 0.81 0 + 1048576 262144 float sum -1 1502.2 0.70 1.22 0 1478.9 0.71 1.24 0 + 2097152 524288 float sum -1 921.8 2.28 3.98 0 899.8 2.33 4.08 0 + 4194304 1048576 float sum -1 1443.1 2.91 5.09 0 1432.7 2.93 5.12 0 + 8388608 2097152 float sum -1 2437.7 3.44 6.02 0 2417.0 3.47 6.07 0 + 16777216 4194304 float sum -1 5036.9 3.33 5.83 0 5003.6 3.35 5.87 0 + 33554432 8388608 float sum -1 17388 1.93 3.38 0 17275 1.94 3.40 0 + 67108864 16777216 float sum -1 21253 3.16 5.53 0 21180 3.17 5.54 0 + 134217728 33554432 float sum -1 43293 3.10 5.43 0 43396 3.09 5.41 0 + # Out of bounds values : 0 OK + # Avg bus bandwidth : 1.58767 + # + # Collective test concluded: all_reduce_perf + ``` diff --git a/docs/software/container-engine/guidelines-images/image-nvshmem.md b/docs/software/container-engine/guidelines-images/image-nvshmem.md new file mode 100644 index 00000000..41406424 --- /dev/null +++ b/docs/software/container-engine/guidelines-images/image-nvshmem.md @@ -0,0 +1,239 @@ +[](){#ref-ce-guidelines-images-nvshmem} +# NVSHMEM image + +This page describes a container image featuring the [NVSHMEM](https://developer.nvidia.com/nvshmem) parallel programming library with support for libfabric, and demonstrates how to efficiently run said image on Alps. + +This image is based on the [OpenMPI image][ref-ce-guidelines-images-ompi], and thus it is suited for hosts with NVIDIA GPUs, like Alps GH200 nodes. + +A build of this image is currently hosted on the [Quay.io](https://quay.io/) registry at the following reference: +`quay.io/ethcscs/nvshmem:3.4.5-ompi5.0.8-ofi1.22-cuda12.8`. 
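+
+As a sketch of how an application image could be derived from it (the source file name is a placeholder, the compile line follows the usual NVSHMEM pattern of relocatable device code plus the host and device libraries, and additional flags may be needed depending on the bootstrap in use), NVSHMEM programs are compiled with `nvcc` against the headers and libraries installed under `/usr/local/nvshmem`, the prefix suggested by the performance-test paths shown below:
+```Dockerfile
+FROM quay.io/ethcscs/nvshmem:3.4.5-ompi5.0.8-ofi1.22-cuda12.8
+
+# Hypothetical NVSHMEM application; sm_90 targets the Hopper GPUs of Alps GH200 nodes.
+COPY hello_nvshmem.cu /src/
+RUN nvcc -rdc=true -arch=sm_90 \
+        -I/usr/local/nvshmem/include /src/hello_nvshmem.cu \
+        -L/usr/local/nvshmem/lib -lnvshmem_host -lnvshmem_device \
+        -o /usr/local/bin/hello_nvshmem
+```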
+ +## Contents + +- Ubuntu 24.04 +- CUDA 12.8.1 (includes NCCL) +- GDRCopy 2.5.1 +- Libfabric 1.22.0 +- UCX 1.19.0 +- OpenMPI 5.0.8 +- NVSHMEM 3.4.5 + +## Containerfile +```Dockerfile +FROM quay.io/ethcscs/ompi:5.0.8-ofi1.22-cuda12.8 + +RUN apt-get update \ + && DEBIAN_FRONTEND=noninteractive \ + apt-get install -y \ + python3-venv \ + python3-dev \ + --no-install-recommends \ + && rm -rf /var/lib/apt/lists/* \ + && rm /usr/lib/python3.12/EXTERNALLY-MANAGED + +# Build NVSHMEM from source +RUN wget -q https://developer.download.nvidia.com/compute/redist/nvshmem/3.4.5/source/nvshmem_src_cuda12-all-all-3.4.5.tar.gz \ + && tar -xvf nvshmem_src_cuda12-all-all-3.4.5.tar.gz \ + && cd nvshmem_src \ + && NVSHMEM_BUILD_EXAMPLES=0 \ + NVSHMEM_BUILD_TESTS=1 \ + NVSHMEM_DEBUG=0 \ + NVSHMEM_DEVEL=0 \ + NVSHMEM_DEFAULT_PMI2=0 \ + NVSHMEM_DEFAULT_PMIX=1 \ + NVSHMEM_DISABLE_COLL_POLL=1 \ + NVSHMEM_ENABLE_ALL_DEVICE_INLINING=0 \ + NVSHMEM_GPU_COLL_USE_LDST=0 \ + NVSHMEM_LIBFABRIC_SUPPORT=1 \ + NVSHMEM_MPI_SUPPORT=1 \ + NVSHMEM_MPI_IS_OMPI=1 \ + NVSHMEM_NVTX=1 \ + NVSHMEM_PMIX_SUPPORT=1 \ + NVSHMEM_SHMEM_SUPPORT=1 \ + NVSHMEM_TEST_STATIC_LIB=0 \ + NVSHMEM_TIMEOUT_DEVICE_POLLING=0 \ + NVSHMEM_TRACE=0 \ + NVSHMEM_USE_DLMALLOC=0 \ + NVSHMEM_USE_NCCL=1 \ + NVSHMEM_USE_GDRCOPY=1 \ + NVSHMEM_VERBOSE=0 \ + NVSHMEM_DEFAULT_UCX=0 \ + NVSHMEM_UCX_SUPPORT=0 \ + NVSHMEM_IBGDA_SUPPORT=0 \ + NVSHMEM_IBGDA_SUPPORT_GPUMEM_ONLY=0 \ + NVSHMEM_IBDEVX_SUPPORT=0 \ + NVSHMEM_IBRC_SUPPORT=0 \ + LIBFABRIC_HOME=/usr \ + NCCL_HOME=/usr \ + GDRCOPY_HOME=/usr/local \ + MPI_HOME=/usr \ + SHMEM_HOME=/usr \ + NVSHMEM_HOME=/usr \ + cmake . \ + && make -j$(nproc) \ + && make install \ + && ldconfig \ + && cd .. \ + && rm -r nvshmem_src nvshmem_src_cuda12-all-all-3.4.5.tar.gz +``` + +!!! note + - This image also builds the performance tests bundled with NVSHMEM (`NVSHMEM_BUILD_TESTS=1`) to demonstrate performance below. The performance tests, in turn, require the installation of Python dependencies. When building images intended solely for production purposes, you may exclude both those elements. + - Notice that NVSHMEM is configured with support for libfabric explicitly enabled (`NVSHMEM_LIBFABRIC_SUPPORT=1`). + - Since this image is meant primarily to run on Alps, NVSHMEM is built without support for UCX and Infiniband components. + - Since this image uses OpenMPI (which provides PMIx) as MPI implementation, NVSHMEM is also configured to default to PMIx for bootstrapping (`NVSHMEM_PMIX_SUPPORT=1`). + +## Performance examples + +### Environment Definition File +```toml +image = "quay.io#ethcscs/nvshmem:3.4.5-ompi5.0.8-ofi1.22-cuda12.8" + +[env] +PMIX_MCA_psec="native" # (1)! +NVSHMEM_REMOTE_TRANSPORT="libfabric" +NVSHMEM_LIBFABRIC_PROVIDER="cxi" +NVSHMEM_DISABLE_CUDA_VMM="1" # (2)! + +[annotations] +com.hooks.aws_ofi_nccl.enabled = "true" +com.hooks.aws_ofi_nccl.variant = "cuda12" +``` + +1. Ensures PMIx uses the same security domain as Slurm. Otherwise PMIx will print warnings at startup. +2. NVSHMEM's `libfabric` transport does not support VMM yet, so VMM must be disabled by setting the environment variable `NVSHMEM_DISABLE_CUDA_VMM=1`. + +### Notes + +- Since NVSHMEM has been configured in the Containerfile to use PMIx for bootstrapping, when using this image the `srun` option `--mpi=pmix` must be used to run successful multi-rank jobs. 
+- Other bootstrapping methods (including different PMI implementations) can be specified for NVSHMEM through the related [environment variables](https://docs.nvidia.com/nvshmem/api/gen/env.html#bootstrap-options). When bootstrapping through PMI or MPI through Slurm, ensure that the PMI implementation used by Slurm (i.e. `srun --mpi` option) matches the one expected by NVSHMEM or the MPI library. +- NCCL requires the presence of the [AWS OFI NCCL plugin](https://github.com/aws/aws-ofi-nccl) in order to correctly interface with Libfabric and (through the latter) the Slingshot interconnect. Therefore, for optimal performance the [related CE hook][ref-ce-aws-ofi-hook] must be enabled and set to match the CUDA version in the container. +- Libfabric itself is usually injected by the [CXI hook][ref-ce-cxi-hook], which is enabled by default on several Alps vClusters. + +### Results + +=== "All-to-all latency test on 2 nodes, 8 GPUs" + ```console + $ srun -N2 --ntasks-per-node=4 --mpi=pmix --environment=nvshmem /usr/local/nvshmem/bin/perftest/device/coll/alltoall_latency + Runtime options after parsing command line arguments + min_size: 4, max_size: 4194304, step_factor: 2, iterations: 10, warmup iterations: 5, number of ctas: 32, threads per cta: 256 stride: 1, datatype: int, reduce_op: sum, threadgroup_scope: all_scopes, atomic_op: inc, dir: write, report_msgrate: 0, bidirectional: 0, putget_issue :on_stream, use_graph: 0, use_mmap: 0, mem_handle_type: 0, use_egm: 0 + Note: Above is full list of options, any given test will use only a subset of these variables. + mype: 6 mype_node: 2 device name: NVIDIA GH200 120GB bus id: 1 + Runtime options after parsing command line arguments + min_size: 4, max_size: 4194304, step_factor: 2, iterations: 10, warmup iterations: 5, number of ctas: 32, threads per cta: 256 stride: 1, datatype: int, reduce_op: sum, threadgroup_scope: all_scopes, atomic_op: inc, dir: write, report_msgrate: 0, bidirectional: 0, putget_issue :on_stream, use_graph: 0, use_mmap: 0, mem_handle_type: 0, use_egm: 0 + Note: Above is full list of options, any given test will use only a subset of these variables. + mype: 5 mype_node: 1 device name: NVIDIA GH200 120GB bus id: 1 + Runtime options after parsing command line arguments + min_size: 4, max_size: 4194304, step_factor: 2, iterations: 10, warmup iterations: 5, number of ctas: 32, threads per cta: 256 stride: 1, datatype: int, reduce_op: sum, threadgroup_scope: all_scopes, atomic_op: inc, dir: write, report_msgrate: 0, bidirectional: 0, putget_issue :on_stream, use_graph: 0, use_mmap: 0, mem_handle_type: 0, use_egm: 0 + Note: Above is full list of options, any given test will use only a subset of these variables. + mype: 7 mype_node: 3 device name: NVIDIA GH200 120GB bus id: 1 + Runtime options after parsing command line arguments + min_size: 4, max_size: 4194304, step_factor: 2, iterations: 10, warmup iterations: 5, number of ctas: 32, threads per cta: 256 stride: 1, datatype: int, reduce_op: sum, threadgroup_scope: all_scopes, atomic_op: inc, dir: write, report_msgrate: 0, bidirectional: 0, putget_issue :on_stream, use_graph: 0, use_mmap: 0, mem_handle_type: 0, use_egm: 0 + Note: Above is full list of options, any given test will use only a subset of these variables. 
+ mype: 4 mype_node: 0 device name: NVIDIA GH200 120GB bus id: 1 + Runtime options after parsing command line arguments + min_size: 4, max_size: 4194304, step_factor: 2, iterations: 10, warmup iterations: 5, number of ctas: 32, threads per cta: 256 stride: 1, datatype: int, reduce_op: sum, threadgroup_scope: all_scopes, atomic_op: inc, dir: write, report_msgrate: 0, bidirectional: 0, putget_issue :on_stream, use_graph: 0, use_mmap: 0, mem_handle_type: 0, use_egm: 0 + Note: Above is full list of options, any given test will use only a subset of these variables. + mype: 0 mype_node: 0 device name: NVIDIA GH200 120GB bus id: 1 + #alltoall_device + size(B) count type scope latency(us) algbw(GB/s) busbw(GB/s) + 32 8 32-bit thread 116.220796 0.000 0.000 + 64 16 32-bit thread 112.700796 0.001 0.000 + 128 32 32-bit thread 113.571203 0.001 0.001 + 256 64 32-bit thread 111.123204 0.002 0.002 + 512 128 32-bit thread 111.075199 0.005 0.004 + 1024 256 32-bit thread 110.131204 0.009 0.008 + 2048 512 32-bit thread 111.030400 0.018 0.016 + 4096 1024 32-bit thread 110.985601 0.037 0.032 + 8192 2048 32-bit thread 111.039996 0.074 0.065 + #alltoall_device + size(B) count type scope latency(us) algbw(GB/s) busbw(GB/s) + 32 8 32-bit warp 89.801598 0.000 0.000 + 64 16 32-bit warp 90.563202 0.001 0.001 + 128 32 32-bit warp 89.830399 0.001 0.001 + 256 64 32-bit warp 88.863999 0.003 0.003 + 512 128 32-bit warp 89.686400 0.006 0.005 + 1024 256 32-bit warp 88.908798 0.012 0.010 + 2048 512 32-bit warp 88.819200 0.023 0.020 + 4096 1024 32-bit warp 89.670402 0.046 0.040 + 8192 2048 32-bit warp 88.889599 0.092 0.081 + 16384 4096 32-bit warp 88.972801 0.184 0.161 + 32768 8192 32-bit warp 89.564800 0.366 0.320 + 65536 16384 32-bit warp 89.888000 0.729 0.638 + #alltoall_device + size(B) count type scope latency(us) algbw(GB/s) busbw(GB/s) + 32 8 32-bit block 89.747202 0.000 0.000 + 64 16 32-bit block 88.086402 0.001 0.001 + 128 32 32-bit block 87.254399 0.001 0.001 + 256 64 32-bit block 87.401599 0.003 0.003 + 512 128 32-bit block 88.095999 0.006 0.005 + 1024 256 32-bit block 87.273598 0.012 0.010 + 2048 512 32-bit block 88.086402 0.023 0.020 + 4096 1024 32-bit block 88.940799 0.046 0.040 + 8192 2048 32-bit block 88.095999 0.093 0.081 + 16384 4096 32-bit block 87.247998 0.188 0.164 + 32768 8192 32-bit block 88.976002 0.368 0.322 + 65536 16384 32-bit block 88.121599 0.744 0.651 + 131072 32768 32-bit block 90.579200 1.447 1.266 + 262144 65536 32-bit block 91.360003 2.869 2.511 + 524288 131072 32-bit block 101.145601 5.183 4.536 + 1048576 262144 32-bit block 111.052799 9.442 8.262 + 2097152 524288 32-bit block 137.164795 15.289 13.378 + 4194304 1048576 32-bit block 183.171201 22.898 20.036 + #alltoall_device + size(B) count type scope latency(us) algbw(GB/s) busbw(GB/s) + 64 8 64-bit thread 111.955202 0.001 0.001 + 128 16 64-bit thread 113.420796 0.001 0.001 + 256 32 64-bit thread 108.508801 0.002 0.002 + 512 64 64-bit thread 110.204804 0.005 0.004 + 1024 128 64-bit thread 109.487998 0.009 0.008 + 2048 256 64-bit thread 109.462404 0.019 0.016 + 4096 512 64-bit thread 110.156798 0.037 0.033 + 8192 1024 64-bit thread 109.401596 0.075 0.066 + 16384 2048 64-bit thread 108.591998 0.151 0.132 + #alltoall_device + size(B) count type scope latency(us) algbw(GB/s) busbw(GB/s) + 64 8 64-bit warp 88.896000 0.001 0.001 + 128 16 64-bit warp 89.679998 0.001 0.001 + 256 32 64-bit warp 88.950402 0.003 0.003 + 512 64 64-bit warp 89.606398 0.006 0.005 + 1024 128 64-bit warp 89.775997 0.011 0.010 + 2048 256 64-bit warp 88.838398 0.023 0.020 + 
4096 512 64-bit warp 90.671998 0.045 0.040 + 8192 1024 64-bit warp 89.699203 0.091 0.080 + 16384 2048 64-bit warp 89.011198 0.184 0.161 + 32768 4096 64-bit warp 89.622402 0.366 0.320 + 65536 8192 64-bit warp 88.905603 0.737 0.645 + 131072 16384 64-bit warp 89.766401 1.460 1.278 + #alltoall_device + size(B) count type scope latency(us) algbw(GB/s) busbw(GB/s) + 64 8 64-bit block 89.788800 0.001 0.001 + 128 16 64-bit block 88.012803 0.001 0.001 + 256 32 64-bit block 87.353599 0.003 0.003 + 512 64 64-bit block 88.000000 0.006 0.005 + 1024 128 64-bit block 87.225598 0.012 0.010 + 2048 256 64-bit block 87.225598 0.023 0.021 + 4096 512 64-bit block 87.168002 0.047 0.041 + 8192 1024 64-bit block 88.067198 0.093 0.081 + 16384 2048 64-bit block 88.863999 0.184 0.161 + 32768 4096 64-bit block 88.723201 0.369 0.323 + 65536 8192 64-bit block 87.993598 0.745 0.652 + 131072 16384 64-bit block 88.783997 1.476 1.292 + 262144 32768 64-bit block 91.366398 2.869 2.511 + 524288 65536 64-bit block 102.060795 5.137 4.495 + 1048576 131072 64-bit block 111.846399 9.375 8.203 + 2097152 262144 64-bit block 137.107205 15.296 13.384 + 4194304 524288 64-bit block 183.100796 22.907 20.044 + Runtime options after parsing command line arguments + min_size: 4, max_size: 4194304, step_factor: 2, iterations: 10, warmup iterations: 5, number of ctas: 32, threads per cta: 256 stride: 1, datatype: int, reduce_op: sum, threadgroup_scope: all_scopes, atomic_op: inc, dir: write, report_msgrate: 0, bidirectional: 0, putget_issue :on_stream, use_graph: 0, use_mmap: 0, mem_handle_type: 0, use_egm: 0 + Note: Above is full list of options, any given test will use only a subset of these variables. + mype: 3 mype_node: 3 device name: NVIDIA GH200 120GB bus id: 1 + Runtime options after parsing command line arguments + min_size: 4, max_size: 4194304, step_factor: 2, iterations: 10, warmup iterations: 5, number of ctas: 32, threads per cta: 256 stride: 1, datatype: int, reduce_op: sum, threadgroup_scope: all_scopes, atomic_op: inc, dir: write, report_msgrate: 0, bidirectional: 0, putget_issue :on_stream, use_graph: 0, use_mmap: 0, mem_handle_type: 0, use_egm: 0 + Note: Above is full list of options, any given test will use only a subset of these variables. + mype: 2 mype_node: 2 device name: NVIDIA GH200 120GB bus id: 1 + Runtime options after parsing command line arguments + min_size: 4, max_size: 4194304, step_factor: 2, iterations: 10, warmup iterations: 5, number of ctas: 32, threads per cta: 256 stride: 1, datatype: int, reduce_op: sum, threadgroup_scope: all_scopes, atomic_op: inc, dir: write, report_msgrate: 0, bidirectional: 0, putget_issue :on_stream, use_graph: 0, use_mmap: 0, mem_handle_type: 0, use_egm: 0 + Note: Above is full list of options, any given test will use only a subset of these variables. + mype: 1 mype_node: 1 device name: NVIDIA GH200 120GB bus id: 1 + ``` diff --git a/docs/software/container-engine/guidelines-images/image-ompi.md b/docs/software/container-engine/guidelines-images/image-ompi.md new file mode 100644 index 00000000..07622b14 --- /dev/null +++ b/docs/software/container-engine/guidelines-images/image-ompi.md @@ -0,0 +1,578 @@ +[](){#ref-ce-guidelines-images-ompi} +# OpenMPI image + +This page describes a container image featuring the OpenMPI library as MPI (Message Passing Interface) implementation, with support for CUDA, Libfabric and UCX. 
+ +This image is based on the [communication frameworks image][ref-ce-guidelines-images-commfwk], and thus it is suited for hosts with NVIDIA GPUs, like Alps GH200 nodes. + +A build of this image is currently hosted on the [Quay.io](https://quay.io/) registry at the following reference: +`quay.io/ethcscs/ompi:5.0.8-ofi1.22-cuda12.8`. + +## Contents + +- Ubuntu 24.04 +- CUDA 12.8.1 +- GDRCopy 2.5.1 +- Libfabric 1.22.0 +- UCX 1.19.0 +- OpenMPI 5.0.8 + +## Containerfile +```Dockerfile +FROM quay.io/ethcscs/comm-fwk:ofi1.22-ucx1.19-cuda12.8 + +ARG OMPI_VER=5.0.8 +RUN wget -q https://download.open-mpi.org/release/open-mpi/v5.0/openmpi-${OMPI_VER}.tar.gz \ + && tar xf openmpi-${OMPI_VER}.tar.gz \ + && cd openmpi-${OMPI_VER} \ + && ./configure --prefix=/usr --with-ofi=/usr --with-ucx=/usr --enable-oshmem \ + --with-cuda=/usr/local/cuda --with-cuda-libdir=/usr/local/cuda/lib64/stubs \ + && make -j$(nproc) \ + && make install \ + && ldconfig \ + && cd .. \ + && rm -rf openmpi-${OMPI_VER}.tar.gz openmpi-${OMPI_VER} +``` + +!!! note + This image builds OpenSHMEM as part of the OpenMPI installation. This can be useful to support other SHMEM implementations like NVSHMEM. + +## Performance examples + +In this section we demonstrate the performance of the previously created OpenMPI image using it to build the OSU Micro-Benchmarks 7.5.1, and deploying the resulting image on Alps through the Container Engine to run a variety of benchmarks. + +A build of the image with the OSU benchmarks is available on the [Quay.io](https://quay.io/) registry at the following reference: +`quay.io/ethcscs/osu-mb:7.5-ompi5.0.8-ofi1.22-cuda12.8`. + +### OSU-MB Containerfile +```Dockerfile +FROM quay.io/ethcscs/ompi:5.0.8-ofi1.22-cuda12.8 + +ARG omb_version=7.5.1 +RUN wget -q http://mvapich.cse.ohio-state.edu/download/mvapich/osu-micro-benchmarks-${omb_version}.tar.gz \ + && tar xf osu-micro-benchmarks-${omb_version}.tar.gz \ + && cd osu-micro-benchmarks-${omb_version} \ + && ldconfig /usr/local/cuda/targets/sbsa-linux/lib/stubs \ + && ./configure --prefix=/usr/local CC=$(which mpicc) CFLAGS="-O3 -lcuda -lnvidia-ml" \ + --enable-cuda --with-cuda-include=/usr/local/cuda/include \ + --with-cuda-libpath=/usr/local/cuda/lib64 \ + CXXFLAGS="-lmpi -lcuda" \ + && make -j$(nproc) \ + && make install \ + && ldconfig \ + && cd .. \ + && rm -rf osu-micro-benchmarks-${omb_version} osu-micro-benchmarks-${omb_version}.tar.gz + +WORKDIR /usr/local/libexec/osu-micro-benchmarks/mpi +``` + +### Environment Definition File +```toml +image = "quay.io#ethcscs/osu-mb:7.5-ompi5.0.8-ofi1.22-cuda12.8" + +[env] +PMIX_MCA_psec="native" # (1)! +``` + +1. Ensures PMIx uses the same security domain as Slurm. Otherwise PMIx will print warnings at startup. + +### Notes + +- Since OpenMPI uses PMIx for wire-up and communication between ranks, when using this image the `srun` option `--mpi=pmix` must be used to run successful multi-rank jobs. + +### Results + +=== "Point-to-point bandwidth, CPU-to-CPU memory, inter-node communication" + ```console + $ srun -N2 --mpi=pmix --environment=omb-ompi ./pt2pt/osu_bw --validation + /usr/local/libexec/osu-micro-benchmarks/mpi/./pt2pt/osu_bw: /usr/lib/aarch64-linux-gnu/libnl-3.so.200: no version information available (required by /usr/lib64/libcxi.so.1) + /usr/local/libexec/osu-micro-benchmarks/mpi/./pt2pt/osu_bw: /usr/lib/aarch64-linux-gnu/libnl-3.so.200: no version information available (required by /usr/lib64/libcxi.so.1) + + # OSU MPI Bandwidth Test v7.5 + # Datatype: MPI_CHAR. 
+ # Size Bandwidth (MB/s) Validation + 1 0.95 Pass + 2 1.90 Pass + 4 3.80 Pass + 8 7.61 Pass + 16 15.21 Pass + 32 30.47 Pass + 64 60.72 Pass + 128 121.56 Pass + 256 242.28 Pass + 512 484.54 Pass + 1024 968.30 Pass + 2048 1943.99 Pass + 4096 3870.29 Pass + 8192 6972.95 Pass + 16384 13922.36 Pass + 32768 18835.52 Pass + 65536 22049.82 Pass + 131072 23136.20 Pass + 262144 23555.35 Pass + 524288 23758.39 Pass + 1048576 23883.95 Pass + 2097152 23949.94 Pass + 4194304 23982.18 Pass + ``` + +=== "Point-to-point bandwidth, GPU-to-GPU memory, inter-node communication" + ```console + $ srun -N2 --mpi=pmix --environment=omb-ompi ./pt2pt/osu_bw --validation D D + /usr/local/libexec/osu-micro-benchmarks/mpi/./pt2pt/osu_bw: /usr/lib/aarch64-linux-gnu/libnl-3.so.200: no version information available (required by /usr/lib64/libcxi.so.1) + /usr/local/libexec/osu-micro-benchmarks/mpi/./pt2pt/osu_bw: /usr/lib/aarch64-linux-gnu/libnl-3.so.200: no version information available (required by /usr/lib64/libcxi.so.1) + + # OSU MPI-CUDA Bandwidth Test v7.5 + # Datatype: MPI_CHAR. + # Size Bandwidth (MB/s) Validation + 1 0.90 Pass + 2 1.82 Pass + 4 3.65 Pass + 8 7.30 Pass + 16 14.56 Pass + 32 29.03 Pass + 64 57.49 Pass + 128 118.30 Pass + 256 227.18 Pass + 512 461.26 Pass + 1024 926.30 Pass + 2048 1820.46 Pass + 4096 3611.70 Pass + 8192 6837.89 Pass + 16384 13361.25 Pass + 32768 18037.71 Pass + 65536 22019.46 Pass + 131072 23104.58 Pass + 262144 23542.71 Pass + 524288 23758.69 Pass + 1048576 23881.02 Pass + 2097152 23955.49 Pass + 4194304 23989.54 Pass + ``` + + +=== "Point-to-point bandwidth, CPU-to-CPU memory, intra-node communication" + ```console + $ srun -N1 -n2 --mpi=pmix --environment=omb-ompi ./pt2pt/osu_bw --validation + /usr/local/libexec/osu-micro-benchmarks/mpi/./pt2pt/osu_bw: /usr/lib/aarch64-linux-gnu/libnl-3.so.200: no version information available (required by /usr/lib64/libcxi.so.1) + /usr/local/libexec/osu-micro-benchmarks/mpi/./pt2pt/osu_bw: /usr/lib/aarch64-linux-gnu/libnl-3.so.200: no version information available (required by /usr/lib64/libcxi.so.1) + + # OSU MPI Bandwidth Test v7.5 + # Datatype: MPI_CHAR. + # Size Bandwidth (MB/s) Validation + 1 0.96 Pass + 2 1.92 Pass + 4 3.85 Pass + 8 7.68 Pass + 16 15.40 Pass + 32 30.78 Pass + 64 61.26 Pass + 128 122.23 Pass + 256 240.96 Pass + 512 483.12 Pass + 1024 966.52 Pass + 2048 1938.09 Pass + 4096 3873.67 Pass + 8192 7100.56 Pass + 16384 14170.44 Pass + 32768 18607.68 Pass + 65536 21993.95 Pass + 131072 23082.11 Pass + 262144 23546.09 Pass + 524288 23745.05 Pass + 1048576 23879.79 Pass + 2097152 23947.23 Pass + 4194304 23980.15 Pass + ``` + + +=== "Point-to-point bandwidth, GPU-to-GPU memory, intra-node communication" + ```console + $ srun -N1 -n2 --mpi=pmix --environment=omb-ompi ./pt2pt/osu_bw --validation D D + /usr/local/libexec/osu-micro-benchmarks/mpi/./pt2pt/osu_bw: /usr/lib/aarch64-linux-gnu/libnl-3.so.200: no version information available (required by /usr/lib64/libcxi.so.1) + /usr/local/libexec/osu-micro-benchmarks/mpi/./pt2pt/osu_bw: /usr/lib/aarch64-linux-gnu/libnl-3.so.200: no version information available (required by /usr/lib64/libcxi.so.1) + + # OSU MPI-CUDA Bandwidth Test v7.5 + # Datatype: MPI_CHAR. 
+ # Size Bandwidth (MB/s) Validation + 1 0.91 Pass + 2 1.83 Pass + 4 3.73 Pass + 8 7.47 Pass + 16 14.99 Pass + 32 29.98 Pass + 64 59.72 Pass + 128 119.13 Pass + 256 241.88 Pass + 512 481.52 Pass + 1024 963.60 Pass + 2048 1917.15 Pass + 4096 3840.96 Pass + 8192 6942.05 Pass + 16384 13911.45 Pass + 32768 18379.14 Pass + 65536 21761.73 Pass + 131072 23069.72 Pass + 262144 23543.98 Pass + 524288 23750.83 Pass + 1048576 23882.44 Pass + 2097152 23951.34 Pass + 4194304 23989.44 Pass + ``` + + +=== "Point-to-point bi-directional bandwidth, CPU-to-CPU memory, inter-node communication" + ```console + $ srun -N2 --mpi=pmix --environment=omb-ompi ./pt2pt/osu_bibw --validation + /usr/local/libexec/osu-micro-benchmarks/mpi/./pt2pt/osu_bibw: /usr/lib/aarch64-linux-gnu/libnl-3.so.200: no version information available (required by /usr/lib64/libcxi.so.1) + /usr/local/libexec/osu-micro-benchmarks/mpi/./pt2pt/osu_bibw: /usr/lib/aarch64-linux-gnu/libnl-3.so.200: no version information available (required by /usr/lib64/libcxi.so.1) + + # OSU MPI Bi-Directional Bandwidth Test v7.5 + # Datatype: MPI_CHAR. + # Size Bandwidth (MB/s) Validation + 1 0.93 Pass + 2 1.94 Pass + 4 3.89 Pass + 8 7.77 Pass + 16 15.61 Pass + 32 30.94 Pass + 64 62.10 Pass + 128 123.73 Pass + 256 247.77 Pass + 512 495.33 Pass + 1024 988.33 Pass + 2048 1977.44 Pass + 4096 3953.82 Pass + 8192 7252.82 Pass + 16384 14434.94 Pass + 32768 23610.53 Pass + 65536 33290.72 Pass + 131072 39024.03 Pass + 262144 42508.16 Pass + 524288 44482.65 Pass + 1048576 45575.40 Pass + 2097152 46124.45 Pass + 4194304 46417.59 Pass + ``` + + +=== "Point-to-point bi-directional bandwidth, GPU-to-GPU memory, inter-node communication" + ```console + $ srun -N2 --mpi=pmix --environment=omb-ompi ./pt2pt/osu_bibw --validation D D + /usr/local/libexec/osu-micro-benchmarks/mpi/./pt2pt/osu_bibw: /usr/lib/aarch64-linux-gnu/libnl-3.so.200: no version information available (required by /usr/lib64/libcxi.so.1) + /usr/local/libexec/osu-micro-benchmarks/mpi/./pt2pt/osu_bibw: /usr/lib/aarch64-linux-gnu/libnl-3.so.200: no version information available (required by /usr/lib64/libcxi.so.1) + + # OSU MPI-CUDA Bi-Directional Bandwidth Test v7.5 + # Datatype: MPI_CHAR. + # Size Bandwidth (MB/s) Validation + 1 0.97 Pass + 2 1.94 Pass + 4 3.89 Pass + 8 7.75 Pass + 16 15.55 Pass + 32 31.11 Pass + 64 61.95 Pass + 128 123.35 Pass + 256 250.91 Pass + 512 500.80 Pass + 1024 1002.29 Pass + 2048 2003.24 Pass + 4096 4014.15 Pass + 8192 7289.11 Pass + 16384 14717.42 Pass + 32768 22467.65 Pass + 65536 33136.69 Pass + 131072 38970.21 Pass + 262144 42501.28 Pass + 524288 44466.34 Pass + 1048576 45554.48 Pass + 2097152 46124.56 Pass + 4194304 46417.53 Pass + ``` + + +=== "Point-to-point latency, CPU-to-CPU memory, inter-node communication" + ```console + $ srun -N2 --mpi=pmix --environment=omb-ompi ./pt2pt/osu_latency --validation + /usr/local/libexec/osu-micro-benchmarks/mpi/./pt2pt/osu_latency: /usr/lib/aarch64-linux-gnu/libnl-3.so.200: no version information available (required by /usr/lib64/libcxi.so.1) + /usr/local/libexec/osu-micro-benchmarks/mpi/./pt2pt/osu_latency: /usr/lib/aarch64-linux-gnu/libnl-3.so.200: no version information available (required by /usr/lib64/libcxi.so.1) + + # OSU MPI Latency Test v7.5 + # Datatype: MPI_CHAR. 
+ # Size Avg Latency(us) Validation + 1 3.34 Pass + 2 3.34 Pass + 4 3.35 Pass + 8 3.34 Pass + 16 3.33 Pass + 32 3.34 Pass + 64 3.33 Pass + 128 4.32 Pass + 256 4.36 Pass + 512 4.40 Pass + 1024 4.46 Pass + 2048 4.61 Pass + 4096 4.89 Pass + 8192 8.31 Pass + 16384 8.95 Pass + 32768 9.76 Pass + 65536 11.16 Pass + 131072 13.98 Pass + 262144 19.41 Pass + 524288 30.21 Pass + 1048576 52.12 Pass + 2097152 95.26 Pass + 4194304 182.39 Pass + ``` + + +=== "All-to-all collective latency, CPU-to-CPU memory, multiple nodes" + ```console + $ srun -N2 --ntasks-per-node=4 --mpi=pmix --environment=omb-ompi ./collective/osu_alltoall --validation + /usr/local/libexec/osu-micro-benchmarks/mpi/./collective/osu_alltoall: /usr/lib/aarch64-linux-gnu/libnl-3.so.200: no version information available (required by /usr/lib64/libcxi.so.1) + /usr/local/libexec/osu-micro-benchmarks/mpi/./collective/osu_alltoall: /usr/lib/aarch64-linux-gnu/libnl-3.so.200: no version information available (required by /usr/lib64/libcxi.so.1) + /usr/local/libexec/osu-micro-benchmarks/mpi/./collective/osu_alltoall: /usr/lib/aarch64-linux-gnu/libnl-3.so.200: no version information available (required by /usr/lib64/libcxi.so.1) + /usr/local/libexec/osu-micro-benchmarks/mpi/./collective/osu_alltoall: /usr/lib/aarch64-linux-gnu/libnl-3.so.200: no version information available (required by /usr/lib64/libcxi.so.1) + /usr/local/libexec/osu-micro-benchmarks/mpi/./collective/osu_alltoall: /usr/lib/aarch64-linux-gnu/libnl-3.so.200: no version information available (required by /usr/lib64/libcxi.so.1) + /usr/local/libexec/osu-micro-benchmarks/mpi/./collective/osu_alltoall: /usr/lib/aarch64-linux-gnu/libnl-3.so.200: no version information available (required by /usr/lib64/libcxi.so.1) + /usr/local/libexec/osu-micro-benchmarks/mpi/./collective/osu_alltoall: /usr/lib/aarch64-linux-gnu/libnl-3.so.200: no version information available (required by /usr/lib64/libcxi.so.1) + /usr/local/libexec/osu-micro-benchmarks/mpi/./collective/osu_alltoall: /usr/lib/aarch64-linux-gnu/libnl-3.so.200: no version information available (required by /usr/lib64/libcxi.so.1) + + # OSU MPI All-to-All Personalized Exchange Latency Test v7.5 + # Datatype: MPI_CHAR. 
+ # Size Avg Latency(us) Validation + 1 12.46 Pass + 2 12.05 Pass + 4 11.99 Pass + 8 11.84 Pass + 16 11.87 Pass + 32 11.84 Pass + 64 11.95 Pass + 128 12.22 Pass + 256 13.21 Pass + 512 13.23 Pass + 1024 13.37 Pass + 2048 13.52 Pass + 4096 13.88 Pass + 8192 17.32 Pass + 16384 18.98 Pass + 32768 23.72 Pass + 65536 36.53 Pass + 131072 62.96 Pass + 262144 119.44 Pass + 524288 236.43 Pass + 1048576 519.85 Pass + ``` + + +=== "All-to-all collective latency, GPU-to-GPU memory, multiple nodes" + ```console + $ srun -N2 --ntasks-per-node=4 --mpi=pmix --environment=omb-ompi ./collective/osu_alltoall --validation -d cuda + /usr/local/libexec/osu-micro-benchmarks/mpi/./collective/osu_alltoall: /usr/lib/aarch64-linux-gnu/libnl-3.so.200: no version information available (required by /usr/lib64/libcxi.so.1) + /usr/local/libexec/osu-micro-benchmarks/mpi/./collective/osu_alltoall: /usr/lib/aarch64-linux-gnu/libnl-3.so.200: no version information available (required by /usr/lib64/libcxi.so.1) + /usr/local/libexec/osu-micro-benchmarks/mpi/./collective/osu_alltoall: /usr/lib/aarch64-linux-gnu/libnl-3.so.200: no version information available (required by /usr/lib64/libcxi.so.1) + /usr/local/libexec/osu-micro-benchmarks/mpi/./collective/osu_alltoall: /usr/lib/aarch64-linux-gnu/libnl-3.so.200: no version information available (required by /usr/lib64/libcxi.so.1) + /usr/local/libexec/osu-micro-benchmarks/mpi/./collective/osu_alltoall: /usr/lib/aarch64-linux-gnu/libnl-3.so.200: no version information available (required by /usr/lib64/libcxi.so.1) + /usr/local/libexec/osu-micro-benchmarks/mpi/./collective/osu_alltoall: /usr/lib/aarch64-linux-gnu/libnl-3.so.200: no version information available (required by /usr/lib64/libcxi.so.1) + /usr/local/libexec/osu-micro-benchmarks/mpi/./collective/osu_alltoall: /usr/lib/aarch64-linux-gnu/libnl-3.so.200: no version information available (required by /usr/lib64/libcxi.so.1) + /usr/local/libexec/osu-micro-benchmarks/mpi/./collective/osu_alltoall: /usr/lib/aarch64-linux-gnu/libnl-3.so.200: no version information available (required by /usr/lib64/libcxi.so.1) + + # OSU MPI-CUDA All-to-All Personalized Exchange Latency Test v7.5 + # Datatype: MPI_CHAR. + # Size Avg Latency(us) Validation + 1 22.26 Pass + 2 22.08 Pass + 4 22.15 Pass + 8 22.19 Pass + 16 22.25 Pass + 32 22.11 Pass + 64 22.22 Pass + 128 21.98 Pass + 256 22.19 Pass + 512 22.20 Pass + 1024 22.37 Pass + 2048 22.58 Pass + 4096 22.99 Pass + 8192 27.22 Pass + 16384 28.55 Pass + 32768 32.60 Pass + 65536 44.88 Pass + 131072 70.15 Pass + 262144 123.30 Pass + 524288 234.89 Pass + 1048576 486.89 Pass + ``` + + +### Results without the CXI hook +On many Alps vClusters, the Container Engine is configured with the CXI hook enabled by default, enabling transparent access to the Slingshot interconnect. + +This section demonstrates the performance benefit of the CXI hook by explicitly disabling it through the EDF: +```console +$ cat .edf/omb-ompi-no-cxi.toml +image = "quay.io#ethcscs/osu-mb:7.5-ompi5.0.8-ofi1.22-cuda12.8" + +[env] +PMIX_MCA_psec="native" + +[annotations] +com.hooks.cxi.enabled="false" +``` + +=== "Point-to-point bandwidth, CPU-to-CPU memory, inter-node communication" + ```console + $ srun -N2 --mpi=pmix --environment=omb-ompi-no-cxi ./pt2pt/osu_bw --validation + + # OSU MPI Bandwidth Test v7.5 + # Datatype: MPI_CHAR. 
+ # Size Bandwidth (MB/s) Validation + 1 0.16 Pass + 2 0.32 Pass + 4 0.65 Pass + 8 1.31 Pass + 16 2.59 Pass + 32 5.26 Pass + 64 10.37 Pass + 128 20.91 Pass + 256 41.49 Pass + 512 74.26 Pass + 1024 123.99 Pass + 2048 213.82 Pass + 4096 356.13 Pass + 8192 468.55 Pass + 16384 505.89 Pass + 32768 549.59 Pass + 65536 2170.64 Pass + 131072 2137.95 Pass + 262144 2469.63 Pass + 524288 2731.85 Pass + 1048576 2919.18 Pass + 2097152 3047.21 Pass + 4194304 3121.42 Pass + ``` + +=== "Point-to-point bandwidth, GPU-to-GPU memory, inter-node communication" + ```console + $ srun -N2 --mpi=pmix --environment=omb-ompi-no-cxi ./pt2pt/osu_bw --validation D D + + # OSU MPI-CUDA Bandwidth Test v7.5 + # Datatype: MPI_CHAR. + # Size Bandwidth (MB/s) Validation + 1 0.06 Pass + 2 0.12 Pass + 4 0.24 Pass + 8 0.48 Pass + 16 0.95 Pass + 32 1.91 Pass + 64 3.85 Pass + 128 7.57 Pass + 256 15.28 Pass + 512 19.87 Pass + 1024 53.06 Pass + 2048 97.29 Pass + 4096 180.73 Pass + 8192 343.75 Pass + 16384 473.72 Pass + 32768 530.81 Pass + 65536 1268.51 Pass + 131072 1080.83 Pass + 262144 1435.36 Pass + 524288 1526.12 Pass + 1048576 1727.31 Pass + 2097152 1755.61 Pass + 4194304 1802.75 Pass + ``` + +=== "Point-to-point latency, CPU-to-CPU memory, inter-node communication" + ```console + $ srun -N2 --mpi=pmix --environment=omb-ompi-no-cxi ./pt2pt/osu_latency --validation + + # OSU MPI Latency Test v7.5 + # Datatype: MPI_CHAR. + # Size Avg Latency(us) Validation + 1 28.92 Pass + 2 28.99 Pass + 4 29.07 Pass + 8 29.13 Pass + 16 29.48 Pass + 32 29.18 Pass + 64 29.39 Pass + 128 30.11 Pass + 256 32.10 Pass + 512 34.07 Pass + 1024 38.36 Pass + 2048 61.00 Pass + 4096 81.04 Pass + 8192 80.11 Pass + 16384 126.99 Pass + 32768 124.97 Pass + 65536 123.84 Pass + 131072 207.48 Pass + 262144 252.43 Pass + 524288 319.47 Pass + 1048576 497.84 Pass + 2097152 956.03 Pass + 4194304 1455.18 Pass + ``` + + +=== "All-to-all collective latency, CPU-to-CPU memory, multiple nodes" + ```console + $ srun -N2 --ntasks-per-node=4 --mpi=pmix --environment=omb-ompi-no-cxi ./collective/osu_alltoall --validation + + # OSU MPI All-to-All Personalized Exchange Latency Test v7.5 + # Datatype: MPI_CHAR. + # Size Avg Latency(us) Validation + 1 137.85 Pass + 2 133.47 Pass + 4 134.03 Pass + 8 131.14 Pass + 16 134.45 Pass + 32 135.35 Pass + 64 137.21 Pass + 128 137.03 Pass + 256 139.90 Pass + 512 140.70 Pass + 1024 165.05 Pass + 2048 197.14 Pass + 4096 255.02 Pass + 8192 335.75 Pass + 16384 543.12 Pass + 32768 928.81 Pass + 65536 782.28 Pass + 131072 1812.95 Pass + 262144 2284.26 Pass + 524288 3213.63 Pass + 1048576 5688.27 Pass + ``` + + +=== "All-to-all collective latency, GPU-to-GPU memory, multiple nodes" + ```console + $ srun -N2 --ntasks-per-node=4 --mpi=pmix --environment=omb-ompi-no-cxi ./collective/osu_alltoall --validation -d cuda + + # OSU MPI-CUDA All-to-All Personalized Exchange Latency Test v7.5 + # Datatype: MPI_CHAR. 
+ # Size Avg Latency(us) Validation + 1 186.92 Pass + 2 180.80 Pass + 4 180.72 Pass + 8 179.45 Pass + 16 209.53 Pass + 32 181.73 Pass + 64 182.20 Pass + 128 182.84 Pass + 256 188.29 Pass + 512 189.35 Pass + 1024 237.31 Pass + 2048 231.73 Pass + 4096 298.73 Pass + 8192 396.10 Pass + 16384 589.72 Pass + 32768 983.72 Pass + 65536 786.48 Pass + 131072 1127.39 Pass + 262144 2144.57 Pass + 524288 3107.62 Pass + 1048576 5545.28 Pass + ``` diff --git a/docs/software/container-engine/guidelines-images/index.md b/docs/software/container-engine/guidelines-images/index.md new file mode 100644 index 00000000..87feed5e --- /dev/null +++ b/docs/software/container-engine/guidelines-images/index.md @@ -0,0 +1,35 @@ +[](){#ref-ce-guidelines-images} +# Guidelines for images on Alps + +This section offers some guidelines about creating and using container images that achieve good performance on the Alps research infrastructure. +The section focuses on foundational components (such as communication libraries) which are essential to enabling performant effective usage of Alps' capabilities, rather than full application use cases. +Synthetic benchmarks are also used to showcase quantitative performance. + +!!! important + The Containerfiles and examples provided in this section are intended to serve as general reference and starting point. + They are not meant to represent all possible combinations and versions of software capable of running efficiently on Alps. + + In the same vein, please note that the content presented here is not intended to represent images officially supported by CSCS staff. + +Below is a summary of the software suggested and demonstrated throughout this section: + +- Base components: + - CUDA 12.8.1 + - GDRCopy 2.5.1 + - Libfabric 1.22.0 + - UCX 1.19.0 +- MPI implementations + - MPICH 4.3.1 + - OpenMPI 5.0.8 +- Other programming libraries + - NVSHMEM 3.4.5 +- Synthetic benchmarks + - OSU Micro-benchmarks 7.5.1 + - NCCL Tests 2.17.1 + +The content is organized in pages which detail container images building incrementally upon each other: + +- a [base image][ref-ce-guidelines-images-commfwk] installing baseline libraries and frameworks (e.g. CUDA, libfabric) +- MPI implementations ([MPICH][ref-ce-guidelines-images-mpich], [OpenMPI][ref-ce-guidelines-images-ompi]) +- [NVSHMEM][ref-ce-guidelines-images-nvshmem] +- [NCCL tests][ref-ce-guidelines-images-nccl-tests] diff --git a/mkdocs.yml b/mkdocs.yml index e02a565f..5511b08e 100644 --- a/mkdocs.yml +++ b/mkdocs.yml @@ -59,6 +59,13 @@ nav: - 'Using the Container Engine': software/container-engine/run.md - 'Hooks and native resources': software/container-engine/resource-hook.md - 'EDF reference': software/container-engine/edf.md + - 'Guidelines for images on Alps': + - software/container-engine/guidelines-images/index.md + - 'Communication frameworks image': software/container-engine/guidelines-images/image-comm-fwk.md + - 'MPICH image': software/container-engine/guidelines-images/image-mpich.md + - 'OpenMPI image': software/container-engine/guidelines-images/image-ompi.md + - 'NCCL Tests image': software/container-engine/guidelines-images/image-nccl-tests.md + - 'NVSHMEM image': software/container-engine/guidelines-images/image-nvshmem.md - 'Known issues': software/container-engine/known-issue.md - 'Building and Installing Software': - build-install/index.md