NCCL only multi-gpu multi-node training without MPI #426
base: master
Conversation
Very cool! I'll take a look and also see if I can find a slurm cluster to play with this on.
Notice in particular the use of
Thanks for the suggestions. I haven't taken in the grad accumulation changes yet. Waiting for the updated numbers.
@karpathy I feel like the total number of tokens should be more than 0.5M. GPT-2 claims they used 256 GPUs with a context length of 1024.
I thought Open MPI supports Slurm: https://docs.open-mpi.org/en/main/launching-apps/slurm.html. Can you give some insight into why it didn't work for you?
@PeterZhizhin These are the issues I'm facing when I try to dispatch MPI-dependent jobs using Slurm.
Generally in a cluster setup, Slurm is used as nothing more than a tool to dispatch processes across nodes; PMIx support is an extra, as far as I understand. In particular, when we look at torchrun, Lightning, and Accelerate, one main process gets launched per node, and that main process uses Python multiprocessing to spawn one child process per GPU. As far as I understand, these frameworks run without any job-scheduler dependencies. Ideally we should remove the MPI and SLURM dependencies and implement a similar socket-based (or other) server-client interface to synchronize the process group independently. That would also let us add features like fault tolerance (right now, if one GPU or node goes down, training hangs forever). But since we are not there yet, we need either the MPI or the SLURM dependency. Up for discussion.
@chinthysl thank you! Something we could also do, instead of relying on MPI for single-node, is to just spawn processes via
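For illustration only, a rough sketch of that kind of per-node launcher (not part of this PR): it forks one child per GPU and passes the `-pn`/`-pr` flags this branch introduces; the `gpus_per_node` count and the `./train_gpt2cu` path are assumptions.

```c
#include <stdio.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void) {
    int gpus_per_node = 8;  // assumption: one process per GPU on this node
    for (int local_rank = 0; local_rank < gpus_per_node; local_rank++) {
        pid_t pid = fork();
        if (pid == 0) {
            // child: pass world size and rank via the -pn / -pr flags this PR adds
            char rank_str[8], size_str[8];
            snprintf(rank_str, sizeof(rank_str), "%d", local_rank);
            snprintf(size_str, sizeof(size_str), "%d", gpus_per_node);
            execl("./train_gpt2cu", "./train_gpt2cu", "-pn", size_str, "-pr", rank_str, (char*)NULL);
            perror("execl");  // only reached if exec fails
            _exit(1);
        }
    }
    for (int i = 0; i < gpus_per_node; i++) { wait(NULL); }  // parent waits for all ranks
    return 0;
}
```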
train_gpt2.cu (Outdated)

```c
if (multi_gpu_config->slurm_managed) { // If the process is managed by slurm we don't need to use MPI
    if (d_buffer == NULL) cudaCheck(cudaMalloc(&d_buffer, sizeof(float)));
    cudaCheck(cudaMemcpy(d_buffer, &value, sizeof(float), cudaMemcpyHostToDevice));
```
The reason why I was asking if we could use MPI on Slurm clusters is because of this. This feels a bit ugly, while a single MPI_AllReduce is a lot cleaner.
I agree, this one doesn't look great. I'm also thinking about alternatives.
@karpathy @PeterZhizhin I introduced a CUDA unified memory buffer with cudaMemPrefetchAsync. But internally, additional host-to-device and device-to-host memcopies still happen, so MPI is cleaner in that sense.
On the other hand, I'm thinking about a cleaner way to get rid of the MPI and SLURM dependencies by introducing an independent server-client interface to manage the distributed setup.
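For reference, a minimal sketch of what an NCCL-only average could look like. This is an assumption-laden sketch, not this PR's code: it presumes `MultiGpuConfig` carries an `nccl_comm` alongside the `nccl_stream` shown further down, and that the NCCL version in use provides `ncclAvg`.

```c
// Sketch: average a host float across all processes without MPI.
// Assumes config->nccl_comm and config->nccl_stream exist and that NCCL accepts
// the managed-memory scalar; the error-checking macros follow the repo's style.
float multi_gpu_float_mean(float value, MultiGpuConfig* config) {
#ifdef MULTI_GPU
    static float* unified_buffer = NULL;
    if (unified_buffer == NULL) cudaCheck(cudaMallocManaged(&unified_buffer, sizeof(float)));
    *unified_buffer = value;
    ncclCheck(ncclAllReduce(unified_buffer, unified_buffer, 1, ncclFloat, ncclAvg,
                            config->nccl_comm, config->nccl_stream));
    cudaCheck(cudaStreamSynchronize(config->nccl_stream));  // wait for the reduction to finish
    return *unified_buffer;
#else
    return value;
#endif
}
```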
d5e4477 to c00738c
Currently doing multi-node testing on a Slurm cluster using FineWeb 10B for now; training is being done on two gpu2 nodes (2x A100 80GB per node). I am getting this error, which stops the job:
A complete log file for a job can be found below.
I added
Note: I set
An sbatch script for the two-node job can be found below. I plan to eventually do a twelve-node (2x A100 80GB) job using FineWeb 100B.
233f11d to b4fd422
@0xAlita I added the alternative for the MPI_Barrier issue. It should work now. Please try without --mpi=pmix if you are using this branch.
@karpathy Interesting observation here. I was able to get some numbers from multi-node training for the d48 model. Training was done on DGX H100s.
Sample slurm run - llm.c
Sample slurm run - pytorch
b4fd422 to e88f901
I removed the MPI dependencies and tested for performance changes. There are no visible performance differences between master and this branch.
751bec9 to 100bba0
@karpathy I was able to train the 1.5B model using 59 DGX H100 nodes on FineWeb 350B.
hey @chinthysl, do you have the ckpt somewhere? :)
Thank you for posting @chinthysl, very cool. We had a small discussion about it on our Discord with the core devs; please join us sometime on the CUDA MODE Discord, cheers!
```c
ncclCheck(ncclGetUniqueId(&nccl_id));
idFile = fopen(filename, "wb");
```
fopenCheck
generally, all file ops here should use the checked versions instead
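For illustration, a sketch of what the checked version of the id exchange could look like; `freadCheck`/`fwriteCheck`/`fcloseCheck` are assumed to be `fopenCheck`'s siblings from llm.c's utils, and the polling wait is a simplification.

```c
#include <unistd.h>  // usleep, for the simple polling wait

// Rank 0 creates the NCCL unique id and writes it to the shared filesystem;
// every other rank waits for the file to appear and reads the id back.
void exchange_nccl_id(ncclUniqueId* nccl_id, const char* filename, int process_rank) {
    if (process_rank == 0) {
        ncclCheck(ncclGetUniqueId(nccl_id));
        FILE* idFile = fopenCheck(filename, "wb");
        fwriteCheck(nccl_id, sizeof(ncclUniqueId), 1, idFile);
        fcloseCheck(idFile);
        // a robust version would write to a temp file and rename() it,
        // so readers never see a partially written id
    } else {
        FILE* idFile = NULL;
        while ((idFile = fopen(filename, "rb")) == NULL) { usleep(100000); }  // wait for rank 0
        freadCheck(nccl_id, sizeof(ncclUniqueId), 1, idFile);
        fcloseCheck(idFile);
    }
}
```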
```c
#ifdef MULTI_GPU
if (multi_gpu_config->num_processes > 1) {
    mpiCheck(MPI_Barrier(MPI_COMM_WORLD));
    if (unified_buffer == NULL) cudaCheck(cudaMallocManaged(&unified_buffer, sizeof(float)));
```
I'm not sure if a "barrier" function should have any business conditionally allocating memory that needs to be manually freed later on
```c
    cudaStream_t nccl_stream;      // CUDA Stream to perform NCCL operations.
    cudaEvent_t compute_nccl_sync; // Event used to synchronize NCCL with the compute
#endif
} MultiGpuConfig;

MultiGpuConfig multi_gpu_config_init(int num_processes, int process_rank, int gpus_per_node, char *dfs_path) {
```
could the other code handle a different number of GPUs per node? Not sure how often that is realistic, but if so, it should at least be a conscious decision to remove that functionality
```c
    result.num_processes = num_processes;
    result.device_idx = process_rank % gpus_per_node;

    FILE* idFile;
```
Does this file get cleaned up?
No!
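A cleanup could be small, e.g. this sketch, which assumes every rank has already read the id (say, after the NCCL communicator is created and a barrier has passed):

```c
// sketch: rank 0 deletes the exchange file once all ranks have the id
if (process_rank == 0) {
    if (remove(filename) != 0) { perror("remove nccl id file"); }  // remove() is from <stdio.h>
}
```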
````diff
@@ -21,7 +21,7 @@ Long story short, try `-r 1` (recompute GeLU, trading off speed and memory) to c
 It might be that you only have one GPU and not a whole box of them. Every script is fairly easy to change for just a single GPU. For llm.c, simply change line 1 to line 2 and leave everything else the same:
 
 ```bash
-mpirun -np 8 ./train_gpt2cu \
+mpirun -np 8 bach -c './train_gpt2cu -pn 8 -pr $OMPI_COMM_WORLD_RANK'
````
bach? bash?
```c
if (unified_buffer == NULL) cudaCheck(cudaMallocManaged(&unified_buffer, sizeof(float)));
*unified_buffer = value;
cudaCheck(cudaMemPrefetchAsync(unified_buffer, sizeof(float), multi_gpu_config->device_idx, 0));
```
does this prefetch actually give any advantage, given that the access already happens on the next line?
```diff
 for (int i = 1; i < argc; i+=2) {
     if (i + 1 >= argc) { error_usage(); } // must have arg after flag
     if (argv[i][0] != '-') { error_usage(); } // must start with dash
-    if (strlen(argv[i]) != 2) { error_usage(); } // must be -x (one dash, one letter)
+    if (!(strlen(argv[i]) == 2 || strlen(argv[i]) == 3)) { error_usage(); } // must be -x (one dash, one letter)
```
comment is outdated
(also argparse docs above would be)
```diff
@@ -36,7 +37,9 @@ while true; do
     -u 700 \
     -n 5000 \
     -y 1 \
-    -e "d12"
+    -e "d12" \
+    -pn 8 \
```
OMPI_COMM_WORLD_SIZE maybe?
```diff
@@ -325,6 +325,7 @@ typedef struct {
     int* targets; // the target tokens for the current forward pass
     float mean_loss; // after a forward pass with targets, will be populated with the mean loss
     float accumulated_mean_loss; // Mean loss after aggregating it on all GPUs
+    float* unified_buffer; // GPU buffer to avg loss across process
```
It doesn't seem like the unified buffer should be part of the model; could it maybe live "outside", in int main, or as a global inside the zero.cuh file or so?
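One shape that could take, sketched under the assumption that zero.cuh owns the scalar as a file-scope static (the `multi_gpu_buffers_init`/`multi_gpu_buffers_free` names are made up):

```c
// zero.cuh (sketch): the scratch scalar lives here instead of inside the model struct
static float* unified_buffer = NULL;

void multi_gpu_buffers_init(void) {
    if (unified_buffer == NULL) cudaCheck(cudaMallocManaged(&unified_buffer, sizeof(float)));
}

void multi_gpu_buffers_free(void) {
    if (unified_buffer != NULL) { cudaCheck(cudaFree(unified_buffer)); unified_buffer = NULL; }
}
```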
Scheduling jobs using Slurm seems much easier in a multi-node training setup compared to setting up MPI for the cluster.
This draft contains the changes to use mpirun for single-node training and Slurm for multi-node training.
PyTorch uses one of the Gloo, MPI, or NCCL backends for DDP. Maybe we don't need to use both MPI and NCCL together; it should be either CUDA-aware MPI or NCCL. Something to discuss further.
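As a concrete illustration of the Slurm-only path, a small sketch (not this PR's code): each task reads its rank and world size from the standard Slurm environment variables instead of MPI; `env_int` is a made-up helper.

```c
#include <stdio.h>
#include <stdlib.h>

// hypothetical helper: read an integer environment variable with a fallback
static int env_int(const char* name, int fallback) {
    const char* s = getenv(name);
    return s ? atoi(s) : fallback;
}

int main(void) {
    // under `srun`, every task sees its own values for these standard Slurm variables
    int process_rank  = env_int("SLURM_PROCID", 0);   // global rank of this task
    int num_processes = env_int("SLURM_NTASKS", 1);   // world size
    int local_rank    = env_int("SLURM_LOCALID", 0);  // GPU index on this node
    printf("rank %d/%d on local GPU %d\n", process_rank, num_processes, local_rank);
    // these values would feed multi_gpu_config_init(num_processes, process_rank, ...) in place of MPI calls
    return 0;
}
```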
I got some interesting performance numbers in a large-scale training setup using llm.c.
I used the Ahrefs DGX H100 (80GB) Superpod. The cluster uses NVLink for intra-node communication and InfiniBand RDMA for inter-node communication.
Numbers were taken just before #421 from @ngc92; the current master should reach higher tokens/sec with lower latency.
Without cuDNN we can reach up to batch_size=12
With cuDNN we can reach up to batch_size=24