This repo explores improving the inference performance of Mixture-of-Experts (MoE) models, which exploit sparsity to reach high model capacity at moderate computational cost. We address the load-imbalanced computation caused by dynamic routing within MoE layers, which limits real-world deployment on hardware accelerators such as GPUs and TPUs. Our implementation builds on existing libraries such as cuBLAS and cuSPARSE. The design features an MoE layer with 64 experts, each processing a variable-sized batch of inputs. We use cuBLAS's cublasSgemmGroupedBatched function to execute batched GEMMs with variable-sized operands, transforming inputs into column-major format and segregating them into uniform-sized groups. This lets us quantify the benefit of batching non-uniform GEMM operations against a baseline implementation. We evaluate the approach across three model sizes with varying numbers of experts and find that variable-sized batching consistently outperforms the baseline. We also observe that increasing the batch size reduces computation time thanks to better parallelization.
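As a rough illustration of the grouped, variable-sized GEMM call described above, the sketch below sets up one GEMM group per expert (each with its own token count) and launches them in a single call. The expert count, hidden sizes, token counts, and buffer names are made up for brevity, and the exact argument and pointer-array conventions of `cublasSgemmGroupedBatched` (available in recent cuBLAS 12.x releases) should be checked against the cuBLAS documentation for your version.

```cpp
// grouped_gemm_sketch.cu -- illustrative only; sizes and names are assumptions.
#include <cublas_v2.h>
#include <cuda_runtime.h>
#include <cstdio>
#include <vector>

// One GEMM group per expert: Y_e (d_out x n_e) = W_e (d_out x d_in) * X_e (d_in x n_e),
// where n_e is the number of tokens routed to expert e (variable per expert).
int main() {
    const int num_experts = 4;          // the full model uses 64 experts
    const int d_in = 8, d_out = 8;      // tiny illustrative hidden sizes
    std::vector<int> tokens_per_expert = {3, 1, 5, 2};  // variable-sized batches

    cublasHandle_t handle;
    cublasCreate(&handle);

    // Per-group GEMM descriptors (one entry per expert/group).
    std::vector<cublasOperation_t> transa(num_experts, CUBLAS_OP_N);
    std::vector<cublasOperation_t> transb(num_experts, CUBLAS_OP_N);
    std::vector<int> m(num_experts, d_out), n(tokens_per_expert), k(num_experts, d_in);
    std::vector<int> lda(num_experts, d_out), ldb(num_experts, d_in), ldc(num_experts, d_out);
    std::vector<float> alpha(num_experts, 1.f), beta(num_experts, 0.f);
    std::vector<int> group_size(num_experts, 1);  // one GEMM problem per group here

    // Column-major device buffers for expert weights, routed inputs, and outputs.
    // (Left uninitialized; this sketch only exercises the call sequence.)
    std::vector<const float*> A(num_experts), B(num_experts);
    std::vector<float*> C(num_experts);
    for (int e = 0; e < num_experts; ++e) {
        float *W, *X, *Y;
        cudaMalloc(&W, sizeof(float) * d_out * d_in);
        cudaMalloc(&X, sizeof(float) * d_in * tokens_per_expert[e]);
        cudaMalloc(&Y, sizeof(float) * d_out * tokens_per_expert[e]);
        A[e] = W; B[e] = X; C[e] = Y;
    }

    // A single call launches all variable-sized expert GEMMs.
    // Note: the pointer arrays are passed as host arrays of device pointers here;
    // verify this convention against the cuBLAS docs for your release.
    cublasStatus_t st = cublasSgemmGroupedBatched(
        handle, transa.data(), transb.data(),
        m.data(), n.data(), k.data(),
        alpha.data(), A.data(), lda.data(),
        B.data(), ldb.data(), beta.data(),
        C.data(), ldc.data(),
        num_experts, group_size.data());
    printf("grouped batched GEMM status: %d\n", (int)st);

    // (Device buffer cleanup omitted for brevity.)
    cublasDestroy(handle);
    return 0;
}
```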
- Pull the CUDA 12.4 Singularity image for CentOS 7 using Singularity: `singularity pull docker://nvidia/cuda:12.4.0-runtime-centos7`
- Update the image path in `run-cuda-12.4.bash`; it is currently set to `/tmp/cuda-12.4.sif`.
- Create the `build` directory and set up the build with CMake: `mkdir build && cd build && cmake ..`
- Launch a CUDA 12.4 environment using `./run-cuda-12.4.bash`.
- Go to the build directory: `cd build`
- Run `make` to compile the code.
- Run `./test` to run the basic tests.
- Run `./train_moemlp` to run inference on the MoE model.
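The `train_moemlp` binary exercises the MoE layer described in the overview. As a rough, hypothetical illustration (not the repo's actual code) of how tokens might be segregated by expert into contiguous column-major batches before the grouped GEMM, consider the host-side sketch below; the routing choices, sizes, and variable names are made up.

```cpp
// dispatch_sketch.cpp -- hypothetical illustration of per-expert token packing.
#include <cstdio>
#include <vector>

int main() {
    const int num_experts = 4, d_in = 3, num_tokens = 6;
    // Router output: expert id assigned to each token (top-1 routing assumed).
    std::vector<int> expert_id = {2, 0, 2, 1, 0, 2};
    // Token features, row-major: token t occupies X[t*d_in .. t*d_in + d_in).
    std::vector<float> X(num_tokens * d_in, 1.0f);

    // Count tokens per expert -> the variable batch sizes n_e.
    std::vector<int> count(num_experts, 0);
    for (int e : expert_id) ++count[e];

    // Pack each expert's tokens into its own column-major buffer
    // (d_in rows x count[e] columns), ready for a per-expert GEMM.
    std::vector<std::vector<float>> packed(num_experts);
    std::vector<int> fill(num_experts, 0);
    for (int e = 0; e < num_experts; ++e) packed[e].resize(d_in * count[e]);
    for (int t = 0; t < num_tokens; ++t) {
        int e = expert_id[t], col = fill[e]++;
        for (int r = 0; r < d_in; ++r)
            packed[e][col * d_in + r] = X[t * d_in + r];  // column-major write
    }

    for (int e = 0; e < num_experts; ++e)
        printf("expert %d receives %d tokens\n", e, count[e]);
    return 0;
}
```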