This package implements a layer-parallel approach to training neural ODEs and neural networks. Algorithmically, multigrid-in-time is used to expose parallelism in the forward and backward propagation phases used to compute the gradient. The neural network interface is built on PyTorch, while the backend uses XBraid (a C library) for multigrid-in-time.
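As background for the layer-parallel idea, the sketch below (plain PyTorch only, not the TorchBraid interface) shows how a residual network's forward pass can be read as forward-Euler time stepping of an ODE; it is this sequence of time steps that multigrid-in-time parallelizes across processors.

```python
# Conceptual sketch only (plain PyTorch, not the TorchBraid interface):
# a residual network's forward pass viewed as forward-Euler time stepping
# of du/dt = f(u).
import torch
import torch.nn as nn

class StepLayer(nn.Module):
    """One ODE right-hand side f(u); one of these per time step (layer)."""
    def __init__(self, width):
        super().__init__()
        self.lin = nn.Linear(width, width)

    def forward(self, u):
        return torch.tanh(self.lin(u))

def forward_euler(layers, u0, Tf):
    """Serial forward propagation: u_{k+1} = u_k + dt * f_k(u_k)."""
    dt = Tf / len(layers)
    u = u0
    for f in layers:  # this sequential sweep is what layer-parallelism targets
        u = u + dt * f(u)
    return u

layers = nn.ModuleList([StepLayer(16) for _ in range(8)])
out = forward_euler(layers, torch.randn(4, 16), Tf=1.0)
print(out.shape)  # torch.Size([4, 16])
```

In TorchBraid the sequential sweep above is handled by XBraid's multigrid-in-time iteration, which is what allows the steps to be distributed across MPI ranks.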
If you are having trouble with the GPU/MPI interaction, including slow communication as a result of unnecessary host-device transfers (e.g., not using GPU-aware MPI), please see GPU Direct Communication.
- Optional: create a new virtual environment
python -m venv tb-env
source tb-env/bin/activate
- Install using pip. From inside the TorchBraid directory, run
pip install .
If a development environment is desired, run
pip install -e .
Then all changes to the .py files take effect in the installation immediately. Changes to .pyx files require a re-installation. You can also install directly from GitHub using
pip install git+ssh://git@github.com/Multilevel-NN/TorchBraid.git
or the HTTPS equivalent.
- Test the installation: see Run Unit Tests.
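Before running the unit tests, a quick sanity check that the package can be imported is (assuming the package installs under the module name torchbraid):
python -c "import torchbraid; print(torchbraid.__file__)"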
With conda, the easiest path is to use pip to install TorchBraid:
conda create -n tb-env python=3.10
conda activate tb-env
pip install git+ssh://git@github.com/Multilevel-NN/TorchBraid.git # or local equivalent
For testing, see Run Unit Tests.
- Make
make tests tests-serial
- Tox (you may need to install tox)
tox
The tox package is used for continuous-integration style testing and automatically creates and populates a new Python environment. However, if you have an environment that already satisfies the dependency requirements, you can run the test commands directly using tox-direct:
- Install tox-direct: `pip install tox-direct`
- Run the commands
tox --direct
- Test run
cd examples/mnist/
mpirun -n 2 python mnist_script.py --percent-data 0.01
TorchBraid uses direct GPU communication when running simulations on GPUs. For this, TorchBraid requires a CUDA-aware MPI installation (see here or here for more information). A simple first test to determine whether your system supports CUDA-aware MPI is to execute the command
ompi_info --parsable -l 9 --all | grep mpi_built_with_cuda_support:value
This command returns a string ending in true or false. However, in our experience, a value of true was not always sufficient to guarantee working direct GPU communication. One way to test whether direct GPU communication works on your system is to run:
make tests-direct-gpu
If the test works, your MPI version supports direct GPU communication. You should see output like the following (note the header, which explains how the test result should be interpreted).
******************************************************************
* This script is to be run with two MPI ranks and *
* tests the availability of GPU/MPI direct *
* communication. This is _required_ for TorchBraid when *
* GPUs are used. This test will fail if either: *
* *
* 1. Torch was not built with GPUs, or GPUs are unavailable *
* 2. GPU aware MPI is not available (NVLINK with Nvidia) *
* *
* If the test is successful, the last line on rank 0 will output *
* *
* "PASSED: GPU aware MPI is available" *
* *
* While failures are indicated by: *
* *
* "FAILED: GPU aware MPI is NOT available" *
* *
* Followed by a brief explaination of the type of failure seen. *
* It's possible that a segfault can occur on some untested. *
* platforms. That should be viewed as GPU aware MPI not being *
* available. *
******************************************************************
Check For GPU-Direct Support
-- compile time: This MPI library has CUDA-aware support.
-- run time:This MPI library has CUDA-aware support.
Check For GPU-Direct Support
-- compile time: This MPI library has CUDA-aware support.
-- run time:This MPI library has CUDA-aware support.
PASSED: GPU aware MPI is available
If the final line says FAILED, then your MPI version does not support direct GPU communication. For instance, if you don't have CUDA enabled, the error will look like:
******************************************************************
* This script is to be run with two MPI ranks and *
* tests the availability of GPU/MPI direct *
* communication. This is _required_ for TorchBraid when *
* GPUs are used. This test will fail if either: *
* *
* 1. Torch was not built with GPUs, or GPUs are unavailable *
* 2. GPU aware MPI is not available (NVLINK with Nvidia) *
* *
* If the test is successful, the last line on rank 0 will output *
* *
* "PASSED: GPU aware MPI is available" *
* *
* While failures are indicated by: *
* *
* "FAILED: GPU aware MPI is NOT available" *
* *
* Followed by a brief explaination of the type of failure seen. *
* It's possible that a segfault can occur on some untested. *
* platforms. That should be viewed as GPU aware MPI not being *
* available. *
******************************************************************
Check For GPU-Direct Support
-- compile time: This MPI library does NOT have CUDA-aware support.
-- run time:This MPI library does not have CUDA-aware support.
Check For GPU-Direct Support
-- compile time: This MPI library does NOT have CUDA-aware support.
-- run time:This MPI library does not have CUDA-aware support.
FAILED: GPU aware MPI is NOT available - "MPIX_Query_cuda_support" test failed.
We also check the MPIX_Query_cuda_support function, which is available in most MPI libraries. Finally, due to the range of MPI implementations, it is possible that the script will segfault if direct GPU communication is not supported. If such a case arises, feel free to reach out with a description of the MPI implementation and version, the CUDA version, and the platform being run on.
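If you want a quick manual check outside of the Makefile target, the sketch below (not TorchBraid's actual test script) sends a small CUDA tensor between two ranks with mpi4py. It assumes mpi4py >= 3.1 (for CUDA array interface support), a CUDA-enabled torch build, and a CUDA-aware MPI; without GPU-aware MPI it will typically error out or segfault, consistent with the behavior described above.

```python
# gpu_direct_check.py -- a minimal manual sketch, NOT TorchBraid's test script.
# Assumes mpi4py >= 3.1 (CUDA array interface support), a CUDA build of torch,
# and a CUDA-aware MPI.  Run with:  mpirun -n 2 python gpu_direct_check.py
from mpi4py import MPI
import torch

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
assert comm.Get_size() == 2, "run with exactly two ranks"

device = torch.device("cuda", rank % torch.cuda.device_count())
buf = torch.full((4,), float(rank), device=device)

torch.cuda.synchronize(device)  # MPI is not CUDA-stream aware; finish GPU work first
if rank == 0:
    comm.Send(buf, dest=1, tag=0)   # hands the device pointer directly to MPI
else:
    comm.Recv(buf, source=0, tag=0)
    print("received on GPU:", buf)  # expect zeros if GPU-direct communication works
```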
- Moon, Gordon Euhyun, and Eric C. Cyr. "Parallel Training of GRU Networks with a Multi-Grid Solver for Long Sequences." ICLR, 2022. Arxiv Link
- Cyr, Eric C., Stefanie Günther, and Jacob B. Schroder. "Multilevel Initialization for Layer-Parallel Deep Neural Network Training." arXiv preprint arXiv:1912.08974 (2019). Arxiv Link
- Günther, Stefanie, Lars Ruthotto, Jacob B. Schroder, Eric C. Cyr, and Nicolas R. Gauger. "Layer-parallel training of deep residual neural networks." SIAM Journal on Mathematics of Data Science 2, no. 1 (2020): 1-23. Link