
Commit e805677

Merge pull request #469 from JuliaHealth/distributed

Include more specific SLURM instructions

cncastillo authored Aug 23, 2024
2 parents: a26d598 + a6a1b28
Showing 1 changed file with 26 additions and 21 deletions: docs/src/how-to/4-run-distributed-simulations.md
The following two examples demonstrate how to use Distributed.jl to run a simulation using multiple GPUs, and using multiple nodes in an HPC cluster.

## Using Multiple GPUs

To run a simulation using multiple GPUs, the phantom object can be divided using the `kfoldperm` function. Distributed.jl can then be used to start one Julia worker process per available device so that each device simulates a different part of the object. The results can then be fetched asynchronously by the main process and combined to produce a final signal. This is shown in the following diagram:

```@raw html
<p align="center"><img width="90%" src="../../assets/KomamultiGPU.svg"/></p>
```

The SBATCH script below requests 4 GPUs, all on a single node. The Julia code shown after it distributes the simulation work across the GPUs:

!!! details "SLURM Script Requesting Multiple GPUs"

```sh
#!/bin/bash
#SBATCH --job-name KomaDistributed               # Job name
#SBATCH -t 0-00:30                               # Max runtime for job (D-HH:MM)
#SBATCH -p batch                                 # Partition on which to run the job
#SBATCH --ntasks=1                               # Run a single task
#SBATCH --cpus-per-task=1                        # Request 1 CPU
#SBATCH --gpus=4                                 # Request 4 GPUs
#SBATCH -o /mnt/workspace/%u/slurm-out/%test.out # File path to write stdout to
#SBATCH -e /mnt/workspace/%u/slurm-out/%test.err # File path to write stderr to

module load julia/1.10.2
julia script.jl
```

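The Julia code for this example is collapsed in the diff above. The following is a minimal sketch of the approach described in the text, not the repository's exact code: it assumes KomaMRI.jl and CUDA.jl are installed, uses the default `Scanner` and the bundled `brain_phantom2D` phantom purely for illustration, and `"sequence.seq"` is a hypothetical file path.

```julia
using Distributed
using CUDA

# Start one Julia worker process per available CUDA device
addprocs(length(devices()))

@everywhere begin
    using KomaMRI, CUDA
    sys = Scanner()                 # default scanner (illustrative)
    obj = brain_phantom2D()         # example phantom (illustrative)
    seq = read_seq("sequence.seq")  # hypothetical sequence file
    # Divide the phantom spins into one part per worker
    parts = kfoldperm(length(obj), nworkers())
end

# Each worker simulates its part of the phantom on a different GPU;
# the (+) reducer sums the partial signals on the main process
signal = @distributed (+) for i in 1:nworkers()
    device!(i - 1)  # CUDA devices are zero-indexed
    simulate(obj[parts[i]], seq, sys;
             sim_params=Dict{String,Any}("return_type"=>"mat"))
end
```

Requesting `"return_type"=>"mat"` makes each worker return its raw signal as a matrix, so the partial results can be combined by simple summation, since each spin contributes additively to the total signal.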

## Using Multiple Nodes in an HPC Cluster

This example uses the package ClusterManagers.jl to initialize worker processes on a SLURM cluster based on the number of tasks specified in the `#SBATCH --ntasks` directive. This can be useful for dividing simulation work among multiple compute nodes if the problem is too large to fit into memory on a single computer, or if the number of desired workers is greater than the typical number of CPU cores available. An illustration of this is shown below:

```@raw html
<p align="center"><img width="90%" src="../../assets/KomamultiNodeCPU.svg"/></p>
```

This SBATCH script requests 20 separate nodes, each running a single task. The Julia code is similar to the multiple-GPU example, but initializes the worker processes slightly differently:

!!! details "SLURM Script Requesting Multiple Nodes"

```sh
#!/bin/bash
#SBATCH --job-name KomaDistributed               # Job name
#SBATCH -t 0-00:30                               # Max runtime for job (D-HH:MM)
#SBATCH -p batch                                 # Partition on which to run the job
#SBATCH --nodes=20                               # Request 20 nodes
#SBATCH --ntasks=20                              # 20 tasks, equal to the number of nodes
#SBATCH --ntasks-per-node=1                      # Run each task on a separate node
#SBATCH --cpus-per-task=4                        # 4 CPU threads per task
#SBATCH -o /mnt/workspace/%u/slurm-out/%test.out # File path to write stdout to
#SBATCH -e /mnt/workspace/%u/slurm-out/%test.err # File path to write stderr to

module load julia/1.10.2
julia script.jl
```

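The corresponding Julia code is also collapsed in the diff. Below is a minimal sketch under the same assumptions as the multi-GPU example, with ClusterManagers.jl spawning the workers from the SLURM allocation. Note two assumptions here: SLURM sets the `SLURM_NTASKS` environment variable from the `--ntasks` directive, and `exeflags="--threads=4"` is assumed to be how each worker is given the 4 CPUs requested per task.

```julia
using Distributed
using ClusterManagers

# Spawn one worker per SLURM task. The exeflags option starts each
# worker with 4 Julia threads (assumed to match --cpus-per-task=4).
ntasks = parse(Int, ENV["SLURM_NTASKS"])
addprocs(SlurmManager(ntasks); exeflags="--threads=4")

@everywhere begin
    using KomaMRI
    sys = Scanner()                 # default scanner (illustrative)
    obj = brain_phantom2D()         # example phantom (illustrative)
    seq = read_seq("sequence.seq")  # hypothetical sequence file
    # Divide the phantom spins into one part per worker
    parts = kfoldperm(length(obj), nworkers())
end

# Each node simulates its part of the phantom on its CPU threads;
# the (+) reducer sums the partial signals on the main process
signal = @distributed (+) for i in 1:nworkers()
    simulate(obj[parts[i]], seq, sys;
             sim_params=Dict{String,Any}("return_type"=>"mat"))
end
```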
