Any example script to run multi-node training for slurm? #1378

Open
wavy-jung opened this issue Jul 20, 2024 · 7 comments

Labels
enhancement New feature or request

Comments
@wavy-jung

Hi, I'm trying to run multi-node training on slurm nodes, but I have no idea how to configure the composer launcher's arguments and commands.
Is there any example script for running training on slurm nodes with composer?

wavy-jung added the enhancement (New feature or request) label on Jul 20, 2024
@dakinggg
Collaborator

dakinggg commented Jul 21, 2024

We don't have a slurm example, but here are the environment variables that the composer launcher sets/requires: https://github.com/mosaicml/composer/blob/6d4628a1043d1f118dc38eb359ede5524e0a9aa0/composer/cli/launcher.py#L344-L352. It should just be the normal torch distributed env vars.

And here are the env vars that mcli sets for you: https://docs.mosaicml.com/projects/mcli/en/latest/quick_start/environment.html#runtime-environment-variables
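
For reference, here is a minimal sketch of deriving those standard torch distributed variables from Slurm's own environment (an illustration only, assuming 8 GPUs per node and one launcher invocation per node; the port number is arbitrary):

GPUS_PER_NODE=8
export MASTER_ADDR=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)
export MASTER_PORT=29500                       # any free port reachable from every node
export WORLD_SIZE=$(( GPUS_PER_NODE * SLURM_NNODES ))
export LOCAL_WORLD_SIZE=$GPUS_PER_NODE
export NODE_RANK=$SLURM_NODEID                 # 0..NNODES-1; per-node when evaluated inside an srun task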

@wavy-jung
Author

wavy-jung commented Jul 22, 2024

Thanks for helping me! @dakinggg
I configured the environment variables as described in the link you provided and ran the job.
Below is the script I used for training:

#!/bin/bash
#SBATCH --job-name=wavy-llmfoundry-test
#SBATCH --nodes=16
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=32
#SBATCH --mem-per-cpu=8G
#SBATCH --gres=gpu:8
#SBATCH --output=slurm-logs/%x-%j.out

GPUS_PER_NODE=8
NNODES=$SLURM_NNODES
WORLD_SIZE=$(($GPUS_PER_NODE*$NNODES))
MASTER_PORT=19963
MASTER_ADDR=$(scontrol show hostnames $SLURM_JOB_NODELIST | head -n 1)
WORK_DIR="/mnt/datafs/ib-a100-cluster-a-pri/lmt/users/wavy/llm-foundry"

export CUDA_DEVICE_MAX_CONNECTIONS=1
export CUDA_LAUNCH_BLOCKING=1
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
export NCCL_DEBUG=INFO

export RANK=$NNODES
export WORLD_SIZE=$WORLD_SIZE
export MASTER_ADDR=$MASTER_ADDR
export MASTER_PORT=$MASTER_PORT
export LOCAL_WORLD_SIZE=$GPUS_PER_NODE
export NUM_NODES=$NNODES


export LAUNCHER="composer --world_size $WORLD_SIZE \
    --master_addr $MASTER_ADDR \
    --master_port 19963"

export CMD="$WORK_DIR/scripts/train/train.py \
    $WORK_DIR/scripts/train/yamls/pretrain/llama3-8b.yaml"

srun \
    --container-image /mnt/datafs/ib-a100-cluster-a-pri/lmt/images/wavy-llm-foundry-v0.10.0.sqsh \
    --container-mounts /mnt/datafs:/mnt/datafs \
    --container-workdir $WORK_DIR \
    --jobid $SLURM_JOBID \
    bash -c "export NODE_RANK=$SLURM_PROCID && $LAUNCHER --node_rank $SLURM_PROCID $CMD \
        save_folder=/mnt/datafs/ib-a100-cluster-a-pri/lmt/users/wavy/checkpoints/composer/llama3-8b-slurm"

However, the error below was thrown:
[screenshot of the error traceback]

So I tried with the torchrun launcher instead; it gets past the initialization stage, but gets stuck in the tokenizer-building stage, as shown below:

# export LAUNCHER="composer --world_size $WORLD_SIZE \
#     --master_addr $MASTER_ADDR \
#     --master_port 19963"

export LAUNCHER="torchrun \
    --nproc_per_node $GPUS_PER_NODE \
    --nnodes $NNODES \
    --rdzv_endpoint $MASTER_ADDR:$MASTER_PORT \
    --rdzv_backend c10d "

[screenshot of the hang during tokenizer building]
I would like to ask if there are any potential causes you can think of.
+) The yaml file used for training is very similar to the example files, so I assume the problem has nothing to do with the yaml file!
+) The sqsh image used for training was built on the latest docker image, with a layer added on top that runs pip install -e ".[gpu]" to set up llm-foundry.
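
One quoting detail worth double-checking in the srun command above (an observation about shell expansion, not a confirmed cause of the failure): inside the double-quoted bash -c string, $SLURM_PROCID is expanded by the batch script's shell at submission time, where it is 0, so every node ends up with NODE_RANK=0 and --node_rank 0. Escaping the variable defers expansion to each srun task, where it equals that node's rank (with --ntasks-per-node=1), for example:

srun ... \
    bash -c "export NODE_RANK=\$SLURM_PROCID && $LAUNCHER --node_rank \$SLURM_PROCID $CMD \
        save_folder=..."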

@dakinggg
Collaborator

Ah looks like an issue on a shared fs (see #1253 (comment) for more discussion of this). I haven't quite finished fixing that yet.
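
In the meantime, one mitigation sometimes used for lock/cache contention on shared filesystems (an assumption on my part, not a workaround confirmed in this thread) is to point the Hugging Face caches at node-local storage before launching, e.g.:

# node-local scratch instead of the shared mount (paths are illustrative)
export HF_HOME=/tmp/$USER/hf_home
export HF_DATASETS_CACHE=/tmp/$USER/hf_datasets
mkdir -p "$HF_HOME" "$HF_DATASETS_CACHE"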

@dakinggg
Collaborator

dakinggg commented Jul 22, 2024

Could you try this PR: #1381? You may also need Composer with this PR: mosaicml/composer#3485
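
One way to try unmerged PRs like those (commands are illustrative, adjust to your environment) is to check out the pull request refs directly:

# llm-foundry with PR #1381
git clone https://github.com/mosaicml/llm-foundry.git && cd llm-foundry
git fetch origin pull/1381/head:pr-1381 && git checkout pr-1381
pip install -e ".[gpu]"

# Composer with PR #3485
pip install "git+https://github.com/mosaicml/composer.git@refs/pull/3485/head"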

@wavy-jung
Author

@dakinggg Thanks! I'll try with those PRs

@dmakhervaks

dmakhervaks commented Jul 25, 2024

@dakinggg It seems that #1381 was reverted -> 221d3e2

I tried pulling the latest docker image (mosaicml/llm-foundry:2.3.1_cu121-e882658) but I am still getting this error when trying to run in a multi-node setting:

[rank7]: torch.distributed.DistBackendError: NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1970, unhandled system error (run with NCCL_DEBUG=INFO for details), NCCL version 2.20.5
[rank7]: ncclSystemError: System call (e.g. socket, malloc) or external library call failed or device error.
[rank7]: Last error:
[rank7]: socketStartConnect: Connect to 172.20.1.119<43385> failed : Software caused connection abort

Is this expected? Thanks in advance!

@dakinggg
Collaborator

Yes, we will reapply it soon, but you can still try with that PR. The unhandled error seems different, though, and suggests your distributed env is not set up correctly.
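
For that socketStartConnect failure, a few NCCL settings commonly used to diagnose or pin down the network configuration (general NCCL debugging practice, not specific to llm-foundry; interface names are examples):

export NCCL_DEBUG=INFO                 # log which interfaces/transports NCCL selects
export NCCL_SOCKET_IFNAME=eth0         # pin the interface NCCL should use, e.g. ib0, bond0
# export NCCL_IB_DISABLE=1             # temporarily force TCP to rule out InfiniBand issues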
