- Introduction
- Requirements
- Installation
- Build
- Run
- Profiling
- Options
- Troubleshooting
- Future work
- Authors
- License
- Resources
Recent research projects have investigated partitioning, acceleration, and data reduction techniques for improving the performance of Breadth First Search (BFS) and the related HPC benchmark, Graph500. However, few implementations have focused on cloud-based systems like Amazon's Web Services, which differ from HPC systems in several ways, most importantly in terms of network interconnect.
This codebase is evaluated in a related paper, Optimizing Communication for a 2D-Partitioned Scalable BFS, presented at the 2016 High Performance Extreme Computing Conference (HPEC 2016). This implementation supports optimizations to reduce the communication overhead of an accelerated, distributed BFS on an HPC system and a smaller cloud-like system that contains GPUs. In particular, this code implements an efficient 2D partitioning scheme and allreduce implementation, as well as different CPU-based compression schemes for reducing the overall amount of data shared between nodes. Timing and Score-P profiling results detailed in the paper demonstrate a dramatic reduction in row and column frontier queue data (up to 91%) and show that compression can improve performance for a bandwidth-limited cluster.
- C compiler; C++ Compiler with c++11 support. (tested on gcc 4.7)
- MPI implementation: tested on OpenMPI. MPICH2 is not fully supported.
- CUDA 6+ for NVIDIA GPU-based BFS.
- SSE2 support (SSE4+ support recommended) for SIMD compression.
- Scalasca (Score-P) and CUBE4+ for instrumentation and profiling (Optional)
- System packages:
libtool
,automake
$ wget https://github.com/UniHD-CEG/gpugraph500/archive/master.zip
$ unzip master.zip
$ cd gpugraph500-master
$ git clone https://github.com/UniHD-CEG/gpugraph500.git
$ cd gpugraph500
The code to compile is in the folder cpu_2d/
. To build the binary:
First build: (or when editing configure.ac
)
$ cd cpu_2d
$ ./autogen.sh [option1 option2 ...] # See (1) below.
$ make
Consecutive builds:
$ cd cpu_2d
$ ./configure [option1 option2 ...] # See (1) below.
$ make
(1) for further help check the available options or run ./configure --help
$ cd eval/
$ sbatch o16p8n.rsh 22 # (Replace 22 with Scale Factor)
Run a test with 16 proccesses in 8 nodes, using Scale Factor 21
$ cd cpu_2d/
$ mpirun -np 16 ../cpu_2d/g500 -s 22 -C 4 -gpus 1 -qs 2 -be "s4-bp128-d4" -btr 64 -btc 64
See a full description of the options and the available codecs below.
This application allows the code to be instrumented in zones using Score-P (Scalasca) with low overhead.
The names of the instrumented zones are listed below.
Zone (label) | Explanation |
---|---|
BFSRUN_region_vertexBroadcast | Initial vertices broadcast (No compression) |
BFSRUN_region_localExpansion | Predecessor List Reduction (No compression) |
BFSRUN_region_columnCommunication | Column communication phase (Compression) |
BFSRUN_region_rowCommunication | Row communication phase (Compression) |
BFSRUN_region_Compression | Row compression (type convertions + compression/ encoding) |
BFSRUN_region_Decompression | Row decompression (type convertions + decompression/ encoding) |
CPUSIMD_region_encode | compression or decompression/ encoding |
BFSRUN_region_vreduceCompr | Column compression (type convertions + compression/ encoding) |
BFSRUN_region_vreduceDecompr | Column decompression (type convertions + decompression/ encoding) |
The following example asumes an installation of CUBE and scalasca in $HOME/cube
and $HOME/scorep
$ cat >> ~/.bashrc << EOF
export G500_ENABLE_RUNTIME_SCALASCA=yes
export SCOREP_CUDA_BUFFER=48M
export SCOREP_CUDA_ENABLE=no
export SCOREP_ENABLE_PROFILING=true
export SCOREP_ENABLE_TRACING=false
export SCOREP_PROFILING_FORMAT=CUBE4
export SCOREP_TOTAL_MEMORY=12M
export SCOREP_VERBOSE=no
export SCOREP_PROFILING_MAX_CALLPATH_DEPTH=330
export LD_LIBRARY_PATH=$HOME/cube/lib:$LD_LIBRARY_PATH
export PATH=$HOME/cube/bin:$PATH
export LD_LIBRARY_PATH=$HOME/scorep/lib:$LD_LIBRARY_PATH
export PATH=$HOME/scorep/bin:$PATH
export LD_LIBRARY_PATH=$HOME/scorep/lib:$LD_LIBRARY_PATH
export PATH=$HOME/scorep/bin:$PATH
EOF
The variable G500_ENABLE_RUNTIME_SCALASCA
set to yes will enable the required runtime instrumentor of Scalasca.
See TurboPFOR in Resources
Results will be stored in folders with the format scorep-*
.
Possible ways of instrumenting:
-
The provided
scripts/Profiling/Statistics.sh
script. The options must be changed inside the script. Text output -
Using CUBE (text output)
$HOME/cube/bin/cube_stat -p -m time -r BFSRUN_region_Compression,BFSRUN_region_Decompression,CPUSIMD_region_encode,BFSRUN_region_vreduceCompr,BFSRUN_region_vreduceDecompr profile.cubex
Flag -m in cube_stat
may be set to: time or bytes_sent
- Using scorep-score (text ouptut)
$ scorep-score -r [zone1,zone2,zone3....] profile.cubex
See the available zones section, for further information.
- Using CUBE (graphical interface)
$ $HOME/cube/bin/cube profile.cubex
Lemire's SIMDCompression codecs | Notes |
---|---|
varintg8iu | |
fastpfor | |
varint | |
vbyte | |
maskedvbyte | |
streamvbyte | |
frameofreference | |
simdframeofreference | |
varintgb | Based on a talk by Jeff Dean (Google) |
s4-fastpfor-d4 | |
s4-fastpfor-dm | |
s4-fastpfor-d1 | |
s4-fastpfor-d2 | |
bp32 | |
ibp32 | |
s4-bp128-d1-ni | |
s4-bp128-d2-ni | |
s4-bp128-d4-ni | |
s4-bp128-dm | Codec used as default |
s4-bp128-d1 | |
s4-bp128-d2 | |
s4-bp128-d4 | |
for | Original FOR |
`configure' configures gpugraph500 1.0 to adapt to many kinds of systems.
Usage: ./configure [OPTION]... [VAR=VALUE]...
To assign environment variables (e.g., CC, CFLAGS...), specify them as
VAR=VALUE. See below for descriptions of some of the useful variables.
Defaults for the options are specified in brackets.
Configuration:
-h, --help display this help and exit
--help=short display options specific to this package
--help=recursive display the short help of all the included packages
-V, --version display version information and exit
-q, --quiet, --silent do not print `checking ...' messages
--cache-file=FILE cache test results in FILE [disabled]
-C, --config-cache alias for `--cache-file=config.cache'
-n, --no-create do not create output files
--srcdir=DIR find the sources in DIR [configure dir or `..']
Installation directories:
--prefix=PREFIX install architecture-independent files in PREFIX
[/usr/local]
--exec-prefix=EPREFIX install architecture-dependent files in EPREFIX
[PREFIX]
By default, `make install' will install all the files in
`/usr/local/bin', `/usr/local/lib' etc. You can specify
an installation prefix other than `/usr/local' using `--prefix',
for instance `--prefix=$HOME'.
For better control, use the options below.
Program names:
--program-prefix=PREFIX prepend PREFIX to installed program names
--program-suffix=SUFFIX append SUFFIX to installed program names
--program-transform-name=PROGRAM run sed PROGRAM on installed program names
System types:
--build=BUILD configure for building on BUILD [guessed]
--host=HOST cross-compile to build programs to run on HOST [BUILD]
Optional Features:
--disable-option-checking ignore unrecognized --enable/--with options
--disable-FEATURE do not include FEATURE (same as --enable-FEATURE=no)
--enable-FEATURE[=ARG] include FEATURE [ARG=yes]
--disable-dependency-tracking speeds up one-time build
--enable-dependency-tracking do not reject slow dependency extractors
--enable-bfs-basic-profiling
It is related with instrumentation. Displays
statistical data on each BFS run. (Enabled by
default)
--enable-other-basic-profiling
It is related with instrumentation. Displays
gpugraph500 default statistics. (Enabled by default)
--enable-scorep It is related with instrumentation. Enables
instrumentation with Scalasca/ScoreP. ScoreP must be
detected by ./configure. (Disabled by default)]
--enable-compression It is related with data compression. Enables data
compression through the network. This option is
available only when --enable-cuda (BFS runs using
CUDA) is active (default). (Enabled by default)
--enable-simd It is related with data compression. MPI packets
will be sent compressed using the PFOR-delta D.
Lemire SIMDCompression library. It is only active if
--enable-compression is selected. It will be enabled
by default if --enable-compression is active and no
compression method is selected. (Enabled by default)
--enable-simd+ It is related with data compression. MPI packets
will be sent compressed using a PFOR-delta improved
library: Turbo-PFOR. It is only active if
--enable-compression is selected. (Disabled by
default)
--enable-simt It is related with data compression. Use CUDA
implementation for data compression. Not implemented
yet. (Disabled by default)
--enable-debug-compression
It is related with data compression. Shows
statistics of compression rate, time of compression,
codec, ETC. (Disabled by default)
--enable-verify-compression
It is related with data compression. Sends both
compressed and decompressed data through the
network. Checks decompression after transmission.
(Disabled by default)
--enable-aggressive-optimizations
It is related with optimizations. Enables aggressive
compiler optimizations on the compiler. (Disabled
by default)
--enable-openmp It is related with optimizations. Enables or
disables both --enable-cuda-openmp and
--enable-general-openmp. This option overrides both
openmp settings. (Not set by default)
--enable-cuda-openmp Related with optimizations. Selects whether OpenMP
will be enabled. This option applies to CUDA C
files. (Disabled by default)
--enable-general-openmp It is related with optimizations. Selects whether
OpenMP will be enabled. This option applies to
general C and C++ files. (Disabled by default)
--enable-cuda Use the CUDA implementation of the BFS runs.
Requires NVIDIA hardware support. Enabled by default
--enable-ptxa-optimizations
It is related with optimizations. Selects whether
CUDA assembly (PTXAS) will be optimized or not. This
option will only be used if --enable-cuda is present
(default). The default PTXAS optimization is -O3.
(Disabled by default)
--enable-nvidia-architecture= fermi|kepler|auto|detect
Selects the NVIDIA target architecture. Requires --enable-cuda to be selected (default). Default option is 'detect'. In case detection does not succeed 'all'
mode is selected.
--enable-debug Provides extra traces at runtime. (Disabled by
default)
--enable-debugging It is related with debugging. Enables -g option on
compiler (debugging). (Disabled by default)
--enable-quiet It is related with debugging. Disable compile
mensages when running make. (Disabled by default)
--enable-portable-binary
disable compiler optimizations that would produce
unportable binaries
--enable-cc-warnings= no|minimum|yes|maximum|error
Turn on C compiler warnings. Default selection is
maximum
--enable-iso-c Try to warn if code is not ISO C
--enable-cxx-warnings= no|minimum|yes|maximum|error
Turn on C++ compiler warnings. Default selection is
maximum
--enable-iso-cxx Try to warn if code is not ISO C++
Optional Packages:
--with-PACKAGE[=ARG] use PACKAGE [ARG=yes]
--without-PACKAGE do not use PACKAGE (same as --with-PACKAGE=no)
--with-mpi=<path> absolute path to the MPI root directory. It should
contain bin/ and include/ subdirectories.
--with-mpicc=mpicc name of the MPI C++ compiler to use (default mpicc)
--with-mpicxx=mpicxx name of the MPI C++ compiler to use (default mpicxx)
--with-cuda=<path> Use CUDA library. If argument is <empty> that means
the library is reachable with the standard search
path "/usr" or "/usr/local" (set as default).
Otherwise you give the <path> to the directory which
contain the library.
--with-gcc-arch=<arch> use architecture <arch> for gcc -march/-mtune,
instead of guessing
--with-opencl=<path> prefix to location of OpenCL include directory
[default=auto]
--with-scorep=<path> Use SCOREP profiler. If argument is <empty> that
means the library is reachable with the standard
search path (set as default). Otherwise you give the
<path> to the directory which contain the library.
Some influential environment variables:
CXXFLAGS C++ compiler flags
CFLAGS C compiler flags
CXX C++ compiler command
LDFLAGS linker flags, e.g. -L<lib dir> if you have libraries in a
nonstandard directory <lib dir>
LIBS libraries to pass to the linker, e.g. -l<library>
CPPFLAGS (Objective) C/C++ preprocessor flags, e.g. -I<include dir> if
you have headers in a nonstandard directory <include dir>
CC C compiler command
CPP C preprocessor
Use these variables to override the choices made by `configure' or to help
it to find libraries and programs with nonstandard names/locations.
Report bugs to the package provider.
- -s Number - (SCALE_FACTOR)
- -C Number - (2^SCALE_FACTOR) - This is also the value used in the the -np flag of
mpirun
- -gpus Number - Number of GPUs per node. Currently, only the value 1 is fully tested.
- -qs Number - Queue size as in B40C implementation, from 1 to 2 (e.g. 1.3).
- -be "Codec" - Codec used when compression is enabled (--enable-compression)
- -btc Number - Row Threshoold number: Frontier queue minimum size at which compression would start. Allows disabling compression for small queue sizes.
e.g. g500 -s 22 -C 4 -gpus 1 -qs 1.1 -be "s4-bp128-d4" -btc 64
- Q: In the .out file of Slurm/ Sbatch execution I get the text:
S=C=A=N: Abort: No SCOREP instrumentation found in target ../cpu_2d/g500
- A: The instrumentation is activated for the runtime execution (i.e: the binary is being run prefixed with scalasca).
Disable with:
$ export G500_ENABLE_RUNTIME_SCALASCA=no
Computer Engineering Group at Ruprecht-Karls University of Heidelberg,
and
School of Computer Science at Georgia Institute of Technology
- Duane Merrill's BC40 (back40computing) is licenced under BSD.
- SIMDcompressionAndIntersection is licenced under Apache 2 Licence.
- Alenka GPU database engine is licensed under Apache 2 License.
Copyright (c) 2016, Computer Engineering Group at Ruprecht-Karls University of Heidelberg, Germany. All rights reserved. Copyright (c) 2016, School of Computer Science at Georgia Institute of Technology, USA. All rights reserved.
Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:
- Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer.
- Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution.
- Neither the name of Ruprecht-Karls University of Heidelberg, Georgia Institute of Technology nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission.
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL COPYRIGHT HOLDER BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
-
Embed the compression encoding routines in the communication module (e.g. by using lambda functions). The current implementation does not allow code inlining by the Linker (due to the use of the
virtual
keyword). -
Remove type conversion in the compression calls (Using a GPU PFOR-compression implementation?). The associated paper demonstrates that while compression reduces overheads and data movement, the type conversion from 64 bit to 32 bit values for CPU-based compression can outweigh communication savings.
-
Add compression to the Predecessor List Reduction phase (this uncompressed data represents the ~90% of the total data movement once row and column data is sent compressed). Currently, this process is done as follows: (i) each node computes its Frontier Queue (FQ) vertices by aplying its FQ bitmap and transmits this, to the rest using a P2P communication fashion; the same process is done with the node's Predecessor List (PL) and the node's PL bitmap. (ii) this new integer sequence is transmitted using uncompressed MPI_AllGatherv. As result of applying the bitmap and then transmitting, the sent data results in a random-like, non ordered sequence of integers, and thus, it loses the data characteristics that enable this to achieve the reduction of the rows and columns transfer. A new approach, consisting on compressing first and then transfer would change this. Sending the four compressed data structures (FQ, FQ bitmap, PL, PL bitmap) using MPI_AllGatherv would enable each node to compute the general Predeccesor List, avoiding the P2P and MPI_AllGatherv uncompressed transfers.