
[ISCA'24] PID-Comm: A Fast and Flexible Collective Communication Framework for Commodity Processing-in-DIMMs

This repository contains the tutorial code for PID-Comm, a fast and flexible collective communication framework for commodity Processing-in-DIMMs. Please refer to the paper for more details.

Required Hardware

  • Multicore x86_64 CPU with support for AVX512 instructions.
  • DDR4 channels equipped with UPMEM DIMMs.

Prerequisites

  • g++
  • Python 3.6
  • matplotlib
  • NumPy
  • pandas
  • SciPy
  • UPMEM SDK driver (version 2021.3.0, available from the [UPMEM website](https://sdk.upmem.com/))

Directories

  • tutorial/ # Tutorial code for PID-Comm
  • pidcomm_lib/ # Implementations of the supported communication primitives and DPU binary code
  • upmem-2021.3.0_opt/ # Modified UPMEM driver for PID-Comm
  • upmem-2021.3.0_opt/include/dpu/pidcomm_lib.h # Declarations of the supported communication primitives

Environment Setup

cd {your PID-Comm dir};
source ./upmem-2021.3.0_opt/upmem_env.sh

What is Collective Communication?

Collective communication is a communication pattern that involves coordinated interaction among the nodes within a communicator. PID-Comm supports eight communication primitives, including the AllReduce, ReduceScatter, and AllGather used in the tutorial and benchmarks below; see upmem-2021.3.0_opt/include/dpu/pidcomm_lib.h for the full list.

Tutorial: pidcomm_all_reduce()

Many parallel applications, such as DNNs, require the reduced results from all nodes. The signature of pidcomm_all_reduce() is as follows:

void pidcomm_all_reduce(
    hypercube_manager* manager,      //Hypercube manager returned by init_hypercube_manager()
    char* comm_dimensions,           //Axis mask for the communication, e.g. "100" for the x-axis
    int total_data_size,             //Data size per DPU (in bytes)
    int src_mram_offset,             //MRAM offset of the source data
    int dst_mram_offset,             //MRAM offset of the destination data
    int buffer_offset,               //MRAM offset of the internal communication buffer
    int data_type,                   //Element size in bytes, e.g. sizeof(uint32_t)
    PIDCOMM_OPERATOR reduction_type) //Reduction operator

Assume there are eight nodes participating in the communication, each containing four integers [0, 1, 2, 3]. After executing pidcomm_all_reduce(), every node contains the same result [0, 8, 16, 24]: the element-wise sum of the integers across the communicating nodes.

The steps below show how to use PID-Comm. First, include the PID-Comm header with #include <pidcomm.h>. Then configure the hypercube settings. Here is an example:

uint32_t nr_dpus = 1024;      //The number of DPUs
uint32_t dimension = 3;       //Dimension of the hypercube
uint32_t axis_len[dimension]; //The number of DPUs along each axis of the hypercube
axis_len[0] = 32; //x-axis
axis_len[1] = 32; //y-axis
axis_len[2] = 1;  //z-axis

Then set the remaining variables required by PID-Comm.

uint32_t start_offset = 0;  //MRAM offset of the source data
uint32_t target_offset = 0; //MRAM offset of the destination data
uint32_t data_size_per_dpu = 64*axis_len[0]; //Data size per DPU (in bytes)
uint32_t buffer_offset = 1024*1024*32; //PID-Comm requires an internal buffer for efficient communication; its offset must be greater than start_offset + data_size_per_dpu.
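
As a quick sanity check (an assumption based on the example values above, not a documented requirement), the axis lengths should multiply to the number of allocated DPUs, and the internal buffer must not overlap the source data:

assert(axis_len[0] * axis_len[1] * axis_len[2] == nr_dpus); //32 * 32 * 1 == 1024
assert(buffer_offset > start_offset + data_size_per_dpu);   //Buffer placed after the source data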

Allocate the DPUs and then initialize the hypercube manager.

struct dpu_set_t dpu_set;
DPU_ASSERT(dpu_alloc(nr_dpus, NULL, &dpu_set));
hypercube_manager* manager = init_hypercube_manager(dpu_set, dimension, axis_len);

PID-Comm is now fully set up. The following line executes pidcomm_all_reduce(). The argument "100" specifies the axis along which communication takes place, the x-axis in this case.

pidcomm_all_reduce(manager, "100", data_size_per_dpu, start_offset, target_offset, buffer_offset, sizeof(uint32_t), 0); //Element size and reduction operator (0 corresponds to the sum in this tutorial)
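
To make the data flow concrete, the sketch below shows one way to stage the per-DPU input before the call and read back the reduced result afterwards using the standard UPMEM host API (dpu_prepare_xfer/dpu_push_xfer). Using DPU_MRAM_HEAP_POINTER_NAME as the transfer symbol is an assumption about how the tutorial binary lays out its MRAM; see tutorial/ for the exact code.

uint32_t elems = data_size_per_dpu / sizeof(uint32_t);
uint32_t *input  = malloc(nr_dpus * data_size_per_dpu); //Host-side staging buffers,
uint32_t *output = malloc(nr_dpus * data_size_per_dpu); //one slice per DPU.
struct dpu_set_t dpu;
uint32_t each_dpu;

//Fill every DPU's slice with 0, 1, 2, ... as in the small example above.
for (uint32_t i = 0; i < nr_dpus * elems; i++) input[i] = i % elems;

//Copy each slice into the corresponding DPU's MRAM at start_offset.
DPU_FOREACH(dpu_set, dpu, each_dpu) {
    DPU_ASSERT(dpu_prepare_xfer(dpu, &input[each_dpu * elems]));
}
DPU_ASSERT(dpu_push_xfer(dpu_set, DPU_XFER_TO_DPU, DPU_MRAM_HEAP_POINTER_NAME,
                         start_offset, data_size_per_dpu, DPU_XFER_DEFAULT));

//... pidcomm_all_reduce(...) as above ...

//Read the result back from target_offset; each element now holds the sum across the x-axis DPUs.
DPU_FOREACH(dpu_set, dpu, each_dpu) {
    DPU_ASSERT(dpu_prepare_xfer(dpu, &output[each_dpu * elems]));
}
DPU_ASSERT(dpu_push_xfer(dpu_set, DPU_XFER_FROM_DPU, DPU_MRAM_HEAP_POINTER_NAME,
                         target_offset, data_size_per_dpu, DPU_XFER_DEFAULT));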

Note that a dummy binary, DPU_BINARY_USER, is loaded onto the DPUs for the tutorial; you may replace it with a custom binary, as sketched below.
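
Loading a binary with the UPMEM host API looks roughly like this (a sketch; DPU_BINARY_USER is assumed to expand to the path of the binary, as in the tutorial code):

//Load the dummy (or a custom) DPU binary onto all allocated DPUs before communicating.
DPU_ASSERT(dpu_load(dpu_set, DPU_BINARY_USER, NULL));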

A script is also available to test the tutorial code.

cd tutorial;
./AllReduce_demo.sh;

Benchmarks: GNN_RSAR & GNN_ARAG

GNN_RSAR is a GNN inference variant that uses ReduceScatter and AllReduce between layers, while GNN_ARAG uses AllReduce and AllGather between layers. Both take a COO graph matrix as input and first preprocess the graph (partitioning, zero-padding, etc.). The base structure of the code comes from SparseP, available at https://github.com/CMU-SAFARI/SparseP.

The first kernel computes an SpMM between the given sparse graph matrix and the feature matrix. The second kernel computes a GEMM between the intermediate result matrix and the weight matrix. The result of the second computation is used in the next layer as the new feature matrix.

There are two additional kernels, data_relocate_comm and data_relocate_AG, used by GNN_RSAR and GNN_ARAG respectively. They relocate data after computation so that the collective communication for the next layer can be executed easily; a rough per-layer sketch follows below.
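
Putting these pieces together, one layer of GNN_ARAG roughly follows the flow below. This is only a sketch based on the description above; the exact placement of the collectives and the kernel interfaces may differ in the actual code.

//Rough per-layer flow for GNN_ARAG (pseudocode):
//  SpMM kernel      : M  = A x H   (sparse graph matrix times feature matrix)
//  AllReduce        : combine the partial results of M across DPUs
//  GEMM kernel      : H' = M x W   (intermediate result times weight matrix)
//  data_relocate_AG : reposition H', then AllGather it as the next layer's feature matrix
//GNN_RSAR follows the same structure with ReduceScatter/AllReduce and data_relocate_comm.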

A sample run.sh script is available. The script lets the user change the number of PEs, the feature/weight row size, the datatype, and whether to use PID-Comm. When running the script, be sure to

  • use 64/256/1024 as the number of PEs,
  • use a large enough feature size (sizeof(datatype) * feature_size >= 256),
  • use INT8 or INT32 as the datatype to avoid errors or miscalculations.

Pubmed and Citeseer are provided as input matrices. To test other matrices, make sure they are in COO format so that the application code works.
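
For reference, a COO matrix stores one (row, column, value) triple per non-zero entry. A minimal sketch of such an entry (illustrative only, not the struct actually used by the benchmark code):

//One non-zero entry of a COO (coordinate-format) sparse matrix.
struct coo_entry {
    uint32_t row; //Row index
    uint32_t col; //Column index
    int32_t  val; //Non-zero value (INT8 or INT32 in these benchmarks)
};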

TO DO

Add other benchmarks later on.