How to run on a single GPU? #28

Open
xbzjsj opened this issue Apr 17, 2024 · 7 comments

@xbzjsj

xbzjsj commented Apr 17, 2024

How can I run this on a single GPU? The code runs on 8 GPUs with distributed training, and I can't find a single-GPU entry point (there is no .sh file for one GPU and no single-GPU command line).

@xbzjsj
Author

xbzjsj commented Apr 17, 2024

Will running on a single GPU cause an overflow?

@ariellubonja

Hi! You can adapt the launch script to use 1 GPU like so:

Replace something that looks like this:

(trap 'kill 0' SIGINT; \
python dist_inference_runner.py $(echo ${ARGS}) --cuda-id 0 --rank 0 \
    &
python dist_inference_runner.py $(echo ${ARGS}) --cuda-id 1 --rank 1 \
    &
python dist_inference_runner.py $(echo ${ARGS}) --cuda-id 2 --rank 2 \
    &
python dist_inference_runner.py $(echo ${ARGS}) --cuda-id 3 --rank 3 \
    &
python dist_inference_runner.py $(echo ${ARGS}) --cuda-id 4 --rank 4 \
    &
python dist_inference_runner.py $(echo ${ARGS}) --cuda-id 5 --rank 5 \
    &
python dist_inference_runner.py $(echo ${ARGS}) --cuda-id 6 --rank 6 \
    &
python dist_inference_runner.py $(echo ${ARGS}) --cuda-id 7 --rank 7 \
    & \
wait)

With this:

python dist_inference_runner.py $(echo ${ARGS}) --cuda-id 0 --rank 0

@2455DD

2455DD commented May 20, 2024

As suggested above, replace the eight-process launch block with the single command:

python dist_inference_runner.py $(echo ${ARGS}) --cuda-id 0 --rank 0

and don't forget to change world-size and pipeline-group-size
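
For illustration, here is a minimal sketch of what a single-GPU launch script might look like with both changes applied. It assumes that "change world-size and pipeline-group-size" means setting each group size to 1, which nobody in this thread has confirmed to work. OTHER_ARGS and DRY_RUN are hypothetical names introduced only for this sketch; the flag names are the ones used by the launch scripts in this thread.

```shell
#!/usr/bin/env bash
# Sketch of a single-GPU launch; not confirmed to work by anyone in this thread.

# OTHER_ARGS is a hypothetical placeholder: put the flags you already pass
# (model name, data file, and so on) here, unchanged.
OTHER_ARGS=""

# Assumption: with a single process, every group size is set to 1.
ARGS="${OTHER_ARGS} \
--world-size 1 --pipeline-group-size 1 --data-group-size 1"

# One process only: GPU 0, rank 0. No '&', 'wait', or trap is needed.
CMD="python dist_inference_runner.py $(echo ${ARGS}) --cuda-id 0 --rank 0"

# DRY_RUN is also hypothetical: by default the command is only printed.
# Run with DRY_RUN=0 to actually launch it.
if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "${CMD}"
else
    ${CMD}
fi
```

Printing the command first makes it easy to check that exactly one rank is launched and that the group sizes match before starting a long run.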

@rhmaaa

rhmaaa commented Jul 25, 2024

Re: the suggestion above to replace the eight-process launch block with the single --cuda-id 0 --rank 0 command and to change world-size and pipeline-group-size:

It still gives some errors related to NCCL.

@rhmaaa

rhmaaa commented Jul 25, 2024

file=./c4_train/c4_train.jsonl
    
echo "start running ${file}"

# ARGS="--model-name /lustre/fsw/nvresearch/ldm/diffusion/checkpoint/opt-175b-new \
ARGS="--model-name /root/shared/opt-30b-new \
--model-type opt \
--seed 42 \
--fp16 \
--num-layers 24 \
--max-layers 48 \
--budget 22800 \
--num-iters 2000 \
--dist-url tcp://127.0.0.1:9032 \
--token-micro-batch-size 1 \
--world-size 2 --pipeline-group-size 2 --data-group-size 1 \
--pp-mode pipe_sync_sample_mask_token_pipe \
--infer-data ${file}"

(trap 'kill 0' SIGINT; \
python dist_inference_runner.py $(echo ${ARGS}) --cuda-id 0 --rank 0 \
    &
python dist_inference_runner.py $(echo ${ARGS}) --cuda-id 1 --rank 1 \
    &
# python dist_inference_runner.py $(echo ${ARGS}) --cuda-id 2 --rank 2 \
#     &
# python dist_inference_runner.py $(echo ${ARGS}) --cuda-id 3 --rank 3 \
#     &
# python dist_inference_runner.py $(echo ${ARGS}) --cuda-id 4 --rank 4 \
#     &
# python dist_inference_runner.py $(echo ${ARGS}) --cuda-id 5 --rank 5 \
#     &
# python dist_inference_runner.py $(echo ${ARGS}) --cuda-id 6 --rank 6 \
#     &
# python dist_inference_runner.py $(echo ${ARGS}) --cuda-id 7 --rank 7 \
#     & \
wait)

@rhmaaa

rhmaaa commented Jul 25, 2024

@2455DD @ariellubonja, could you give me some ideas? I'm still encountering some NCCL-related errors.

@ariellubonja

Hi @rhmaaa! Can you try re-creating the Docker container from the Dockerfile? It sounds like a missing-library type of error.
