[ISCA'24] PID-Comm: A Fast and Flexible Collective Communication Framework for Commodity Processing-in-DIMMs
This repository contains the tutorial code for PID-Comm, a fast and flexible collective communication framework for commodity Processing-in-DIMMs. Please cite the paper for more information.
- Multicore x86_64 CPU with support for AVX512 instructions.
- DDR4 channels equipped with UPMEM DIMMs
- g++
- python3.6
- matplotlib
- numpy
- Pandas
- Scipy
- UPMEM SDK driver (version 2021.3.0, available from the [UPMEM website](https://sdk.upmem.com/))
- tutorial/ #Tutorial code for PID-Comm
- pidcomm_lib/ #Implementations of supported communication primitives and DPU binary codes.
- upmem-2021.3.0_opt/ #Modified UPMEM driver for PID-Comm
- upmem-2021.3.0_opt/include/dpu/pidcomm_lib.h #Declarations of supported communication primitives
cd {your PID-Comm dir};
source ./upmem-2021.3.0_opt/upmem_env.sh
Collective communication is a communication pattern that involves interaction among the nodes within a communicator. PID-Comm supports eight communication primitives.
Many parallel applications, such as DNNs, require the reduced results from all nodes. The signature of pidcomm_all_reduce() is as follows:
void pidcomm_all_reduce(
hypercube_manager* manager,
char* comm_dimensions,
int total_data_size,
int src_mram_offset,
int dst_mram_offset,
int data_type,
PIDCOMM_OPERATOR reduction_type)
Assume there are eight nodes in communication, each containing 4 integers [0, 1, 2, 3]. After executing pidcomm_all_reduce(), all nodes will contain the same result [0, 8, 16, 24], the sum of the integers in the communicating nodes.
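The example above can be reproduced on the host with a small reference model. This is a minimal sketch that only mimics the result of a sum AllReduce in plain C. It does not use the UPMEM SDK or PID-Comm, and the function name reference_all_reduce_sum is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Host-side reference model of a sum AllReduce (hypothetical helper).
 * data is a row-major [nr_nodes][nr_elems] array; after the call every
 * node holds the element-wise sum over all nodes. */
void reference_all_reduce_sum(int32_t *data, size_t nr_nodes, size_t nr_elems)
{
    assert(data != NULL && nr_nodes > 0);

    for (size_t i = 0; i < nr_elems; i++) {
        int32_t sum = 0;

        /* Reduce: accumulate element i across all nodes. */
        for (size_t n = 0; n < nr_nodes; n++)
            sum += data[n * nr_elems + i];

        /* Distribute: every node receives the reduced value. */
        for (size_t n = 0; n < nr_nodes; n++)
            data[n * nr_elems + i] = sum;
    }
}
```

Calling it with eight nodes that each hold [0, 1, 2, 3] leaves [0, 8, 16, 24] in every node, matching the description above.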
The steps below show how to use PID-Comm.
First, include the PID-Comm header file with #include <pidcomm.h>.
Next, configure the hypercube settings.
Here is an example:
uint32_t nr_dpus = 1024; //The number of DPUs
uint32_t dimension=3; //Dimension of the hypercube
uint32_t axis_len[dimension]; //The number of DPUs for each axis of the hypercube
axis_len[0]=32; //x-axis
axis_len[1]=32; //y-axis
axis_len[2]=1; //z-axis
Then, set the remaining variables required by PID-Comm.
uint32_t start_offset=0; //Offset of source.
uint32_t target_offset=0; //Offset of destination.
uint32_t data_size_per_dpu = 64*axis_len[0]; //Data size for each node
uint32_t buffer_offset=1024*1024*32; //For efficient communication, PID-Comm requires a buffer. The buffer's offset must be greater than the sum of start_offset and data_size_per_dpu.
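The buffer constraint stated in the comment above can be checked before any communication is issued. The following is a minimal sketch in plain C; both helper names are hypothetical and are not part of PID-Comm.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* True when buffer_offset lies strictly beyond start_offset + data_size_per_dpu.
 * The sum is computed in 64 bits so it cannot wrap around. */
bool buffer_offset_is_valid(uint32_t start_offset, uint32_t data_size_per_dpu,
                            uint32_t buffer_offset)
{
    uint64_t end_of_source = (uint64_t)start_offset + data_size_per_dpu;
    return (uint64_t)buffer_offset > end_of_source;
}

/* Fail fast (in builds with assertions enabled) if the constraint is violated. */
void require_valid_buffer_offset(uint32_t start_offset,
                                 uint32_t data_size_per_dpu,
                                 uint32_t buffer_offset)
{
    assert(buffer_offset_is_valid(start_offset, data_size_per_dpu, buffer_offset));
}
```

For the values used here (start_offset = 0, data_size_per_dpu = 64 * 32 = 2048, buffer_offset = 32 MiB) the check passes.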
Allocate the DPUs and then initialize the hypercube manager.
struct dpu_set_t dpu_set; //Handle to the allocated DPUs
DPU_ASSERT(dpu_alloc(nr_dpus, NULL, &dpu_set));
hypercube_manager* hypercube_manager = init_hypercube_manager(dpu_set, dimension, axis_len);
PID-Comm is now fully configured. The following line of code executes pidcomm_all_reduce(). The argument "100" specifies the axis used in communication, which is the x-axis in this case.
pidcomm_all_reduce(hypercube_manager, "100", data_size_per_dpu, start_offset, target_offset, buffer_offset, sizeof(T), 0);
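Only the "100" case is described above. The sketch below assumes that each character of the string maps to one axis in order (x, y, z) and that '1' marks an axis that takes part in the communication. Under that assumption it computes how many DPUs share one communication group. This reading of the string and the helper name comm_group_size are illustrative assumptions, not part of the PID-Comm API.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Number of DPUs per communication group for a given axis-selection string.
 * Assumption: character d is '1' if axis d participates and '0' otherwise.
 * Returns 0 if the string has the wrong length or an unexpected character. */
uint32_t comm_group_size(const char *comm_dimensions, uint32_t dimension,
                         const uint32_t *axis_len)
{
    assert(comm_dimensions != NULL && axis_len != NULL);

    if (strlen(comm_dimensions) != dimension)
        return 0;

    uint32_t size = 1;
    for (uint32_t d = 0; d < dimension; d++) {
        if (comm_dimensions[d] == '1')
            size *= axis_len[d];       /* axis d is part of the group */
        else if (comm_dimensions[d] != '0')
            return 0;                  /* neither '0' nor '1' */
    }
    return size;
}
```

With axis lengths {32, 32, 1}, "100" gives groups of 32 DPUs along the x-axis, and under the same assumption "110" would span all 1024 DPUs.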
Note that a dummy binary file, DPU_BINARY_USER, is loaded in the DPUs for the tutorial. A custom binary file may be used to replace our current dummy binary file.
A script is also available to test the tutorial code.
cd tutorial;
./AllReduce_demo.sh;
GNN_RSAR is a GNN inference variant that uses ReduceScatter and AllReduce between layers. GNN_ARAG is a GNN inference variant that uses AllReduce and AllGather between layers. Our version takes a COO graph matrix as input and preprocesses the graph first (partitioning, zero-padding, etc.). The base structure of the code is from SparseP, available at (https://github.com/CMU-SAFARI/SparseP).
The first kernel computes SpMM between the given sparse graph matrix and the feature matrix. The second kernel computes GEMM between the intermediate result matrix and the weight matrix. The result of the second computation is used in the next layer as the new feature matrix.
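The two computations of one layer can be sketched on the host as follows. This is a single-threaded reference in plain C for illustration only. It is not the DPU kernel code, it omits the partitioning and collective communication steps, and the names coo_entry, spmm_coo, and gemm are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One nonzero of a sparse matrix in COO (coordinate) format. */
typedef struct {
    uint32_t row;
    uint32_t col;
    int32_t  val;
} coo_entry;

/* SpMM: mid = A * X. A has n_rows rows and nnz COO entries,
 * X is row-major [cols of A][feat], mid is row-major [n_rows][feat]. */
void spmm_coo(const coo_entry *a, size_t nnz, size_t n_rows,
              const int32_t *x, size_t feat, int32_t *mid)
{
    assert(a != NULL && x != NULL && mid != NULL);

    for (size_t i = 0; i < n_rows * feat; i++)
        mid[i] = 0;

    /* Each nonzero adds a scaled row of X to one row of mid. */
    for (size_t k = 0; k < nnz; k++)
        for (size_t f = 0; f < feat; f++)
            mid[a[k].row * feat + f] += a[k].val * x[a[k].col * feat + f];
}

/* GEMM: out = mid * W. mid is [n_rows][feat], W is [feat][out_feat]. */
void gemm(const int32_t *mid, const int32_t *w, size_t n_rows,
          size_t feat, size_t out_feat, int32_t *out)
{
    assert(mid != NULL && w != NULL && out != NULL);

    for (size_t r = 0; r < n_rows; r++)
        for (size_t c = 0; c < out_feat; c++) {
            int32_t acc = 0;
            for (size_t f = 0; f < feat; f++)
                acc += mid[r * feat + f] * w[f * out_feat + c];
            out[r * out_feat + c] = acc;
        }
}
```

The output of gemm would then be passed to spmm_coo as the feature matrix of the next layer.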
There are two additional kernels, data_relocate_comm and data_relocate_AG, used by GNN_RSAR and GNN_ARAG respectively. They relocate data after computation so that collective communication for the next layer can be executed easily.
A sample run.sh script is available. The script allows the user to change the number of PEs, the feature/weight row size, the datatype, and whether to use PID-Comm. To avoid errors or miscalculations when running the script, be sure to
- use 64, 256, or 1024 as the number of PEs
- use a large enough feature size ((sizeof(datatype) * feature_size) >= 256)
- use INT8 or INT32 as the datatype
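The three conditions above can be expressed as a single check. The sketch below is written in plain C for illustration; run.sh itself is a shell script, and the struct gnn_run_config and the function gnn_run_config_is_valid are hypothetical names that do not exist in the repository.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical container for the parameters that run.sh lets the user change. */
typedef struct {
    uint32_t nr_pes;         /* number of PEs */
    size_t   datatype_size;  /* sizeof(datatype) in bytes */
    size_t   feature_size;   /* number of features per row */
} gnn_run_config;

/* True when the configuration satisfies the three conditions listed above. */
bool gnn_run_config_is_valid(const gnn_run_config *cfg)
{
    assert(cfg != NULL);

    bool pes_ok = cfg->nr_pes == 64 || cfg->nr_pes == 256 || cfg->nr_pes == 1024;
    bool type_ok = cfg->datatype_size == sizeof(int8_t) ||
                   cfg->datatype_size == sizeof(int32_t);
    if (!pes_ok || !type_ok)
        return false;

    /* datatype_size is 1 or 4 here, so this is the same test as
     * sizeof(datatype) * feature_size >= 256 without any overflow risk. */
    size_t min_features = 256 / cfg->datatype_size;
    return cfg->feature_size >= min_features;
}
```

For example, 1024 PEs with INT32 and a feature size of 64 passes, while INT8 with a feature size of 128 fails because it yields only 128 bytes per row.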
The pubmed and citeseer graphs are provided as input matrices. To test other matrices, supply them in COO format, which the application code requires.
To add other benchmarks later on...