Legion is a GPU-initiated system with multi-GPU cache for large-scale GNN training.

$ git clone https://github.com/RC4ML/Legion.git

1. Hardware

Hardware Used in Our Paper

All platforms are bare-metal machines. Table 1

Platform	CPU-Info	#sockets	#NUMA nodes	CPU Memory	PCIe	GPUs	NVLinks
DGX-V100	96*Intel(R) Xeon(R) Platinum 8163 CPU @2.5GHZ	2	1	384GB	PCIe 3.0x16, 4*PCIe switches, each connecting 2 GPUs	8x16GB-V100	NVLink Bridges, Kc = 2, Kg = 4
Siton	104*Intel(R) Xeon(R) Gold 5320 CPU @2.2GHZ	2	2	1TB	PCIe 4.0x16, 2*PCIe switches, each connecting 4 GPUs	8x40GB-A100	NVLink Bridges, Kc = 4, Kg = 2
DGX-A100	128*Intel(R) Xeon(R) Platinum 8369B CPU @2.9GHZ	2	1	1TB	PCIe 4.0x16, 4*PCIE switches, each connecting 2 GPUs	8x80GB-A100	NVSwitch, Kc = 1, Kg = 8

Kc means the number of groups in which GPUs connect each other. And Kg means the number of GPUs in each group.

2. Software

Legion's software is light-weighted and portable. Here we list some tested environment.

Nvidia Driver Version: 515.43.04
CUDA 11.7
GCC/G++ 11.4.0
OS: Ubuntu(other linux systems are ok)
Intel PCM(according to OS version)

$ wget https://download.opensuse.org/repositories/home:/opcm/xUbuntu_18.04/amd64/pcm_0-0+651.1_amd64.deb

pytorch-cu117, torchmetrics

$ pip3 install torch-cu1xx

dgl 1.1.0

$ pip3 install  dgl -f https://data.dgl.ai/wheels/cu1xx/repo.html

MPI-3.1

3. Prepare Datasets and Graph Partitioning

Datasets are from OGB (https://ogb.stanford.edu/), Standford-snap (https://snap.stanford.edu/), and Webgraph (https://webgraph.di.unimi.it/). Here is an example of preparing datasets for Legion.

Uk-Union Datasets

Refer to README in dataset directory for more instructions

$ bash prepare_datasets.sh

Partition Uk-Union

gpu_num represents all gpu numbers you want to use, Legion will partition the graph according to underlying NVlink topology Note that this step would consume a large volume of CPU memory.

$ python graph_partitioning.py --dataset_name 'ukunion' --gpu_num 2

4. Build Legion from Source

$ bash build.sh

4. Run Legion

There are three steps to train a GNN model in Legion. In these steps, you need to change to root user for PCM. (2024.3.11, to solve PCM bugs for general platforms, I disable PCM for now)

Step 1. Open msr by root for PCM

$ modprobe msr

Step 2. Start Legion Server

$ python legion_server.py --dataset_path 'dataset' --dataset_name ukunion --train_batch_size 8000 --fanout [25,10] --gpu_number 2 --epoch 2 --cache_memory 38000000

Step 3. Run Legion Training

After Legion outputs "System is ready for serving", then start training by:

$ python training_backend/legion_graphsage.py --class_num 2  --features_num 128 --hidden_dim 256 --hops_num 2 --gpu_number 2 --epoch 2

I will continusly work on this to improve the running process for easier use.

Cite this work

If you use it in your paper, please cite our work

@inproceedings {sun2023legion,
author = {Jie Sun and Li Su and Zuocheng Shi and Wenting Shen and Zeke Wang and Lei Wang and Jie Zhang and Yong Li and Wenyuan Yu and Jingren Zhou and Fei Wu},
title = {Legion: Automatically Pushing the Envelope of Multi-GPU System for Billion-Scale GNN Training},
booktitle = {2023 USENIX Annual Technical Conference (USENIX ATC 23)},
year = {2023},
pages = {165--179}
}

Future Features of Legion

We will open-source SSD support for Legion in the future.

Name		Name	Last commit message	Last commit date
Latest commit History 36 Commits
dataset		dataset
detail_parameter_settings		detail_parameter_settings
sampling_server		sampling_server
training_backend		training_backend
.gitignore		.gitignore
README.md		README.md
build.sh		build.sh
graph_partitioning.py		graph_partitioning.py
legion_server.py		legion_server.py
meta_config		meta_config
prepare_dataset.sh		prepare_dataset.sh

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Legion is a GPU-initiated system with multi-GPU cache for large-scale GNN training.

1. Hardware

Hardware Used in Our Paper

2. Software

3. Prepare Datasets and Graph Partitioning

Uk-Union Datasets

Partition Uk-Union

4. Build Legion from Source

4. Run Legion

Step 1. Open msr by root for PCM

Step 2. Start Legion Server

Step 3. Run Legion Training

Cite this work

Future Features of Legion

About

Releases

Packages

Languages

RC4ML/Legion

Folders and files

Latest commit

History

Repository files navigation

Legion is a GPU-initiated system with multi-GPU cache for large-scale GNN training.

1. Hardware

Hardware Used in Our Paper

2. Software

3. Prepare Datasets and Graph Partitioning

Uk-Union Datasets

Partition Uk-Union

4. Build Legion from Source

4. Run Legion

Step 1. Open msr by root for PCM

Step 2. Start Legion Server

Step 3. Run Legion Training

Cite this work

Future Features of Legion

About

Topics

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages