Use all gpus available for training #293

Abecid · 2023-05-19T08:16:12Z

Use all gpus available for training

finetune/adapter.py

awaelchli

Thanks for the suggestion @Abecid. In general the changes look good, but they won't make much difference because we keep the global batch size constant to make results easy to reproduce. In fact, if training fits on a single GPU and torch.cuda.device_count() selects more devices, training may actually be slower because of the reduced per-device batch size and the added communication overhead.

I vote for keeping the values as they are and instead advising users to scale both devices and batch size to their setup to maximize efficiency. We could add this info to the how-to guides. Let me know what you think.
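
To make the trade-off concrete, here is a minimal sketch of how a fixed global batch size divides across devices. The function name and the numbers used in the comments (global batch size 64, micro-batch size 4) are assumptions for illustration, not values confirmed by this PR.

```python
def per_device_schedule(global_batch_size, devices, micro_batch_size):
    """Split a fixed global batch across devices (hypothetical helper).

    Returns (samples per device per optimizer step,
             gradient accumulation steps per device).
    """
    if devices < 1 or global_batch_size % devices != 0:
        raise ValueError("global batch size must divide evenly across devices")
    per_device = global_batch_size // devices
    if per_device % micro_batch_size != 0:
        raise ValueError("per-device batch must be a multiple of the micro-batch")
    return per_device, per_device // micro_batch_size


# Assumed numbers: the global batch stays at 64 no matter how many GPUs are used.
single_gpu = per_device_schedule(64, devices=1, micro_batch_size=4)
four_gpus = per_device_schedule(64, devices=4, micro_batch_size=4)
```

With these assumed numbers, going from 1 to 4 devices shrinks each device's share from 64 to 16 samples per optimizer step, so every GPU does less work per step while gradient synchronization between devices is added. Scaling the global batch size together with the device count is what keeps each GPU busy.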

awaelchli · 2023-05-20T12:44:51Z

finetune/full.py

@@ -31,7 +31,7 @@
 save_interval = 1000
 eval_iters = 100
 log_interval = 100
-devices = 4
+devices = torch.cuda.device_count()


Suggested change

-devices = torch.cuda.device_count()
+devices = "auto"
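
One way to follow the advice of letting users pick the device count for their own setup is to accept either "auto" or an explicit number instead of hardcoding a value. The sketch below is an assumption about how that could look; `parse_devices` is a hypothetical helper and is not part of the scripts in this PR.

```python
from typing import Union


def parse_devices(value: str) -> Union[int, str]:
    """Turn a user-supplied string into a `devices` setting (hypothetical helper).

    "auto" is passed through unchanged; anything else must be a positive integer.
    """
    if value == "auto":
        return "auto"
    count = int(value)  # raises ValueError for non-numeric input
    if count < 1:
        raise ValueError("devices must be 'auto' or a positive integer")
    return count


# The result is meant to be forwarded as the `devices` argument, for example:
# fabric = L.Fabric(accelerator="cuda", devices=parse_devices("auto"))
```

Keeping the value user-configurable leaves the default reproducible while still allowing a run on every visible GPU when that is actually wanted.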

awaelchli · 2023-05-20T12:45:54Z

finetune/lora.py

+    devices = torch.cuda.device_count()
+    fabric = L.Fabric(accelerator="cuda", devices=devices, precision="bf16-true")


Suggested change

-    devices = torch.cuda.device_count()
-    fabric = L.Fabric(accelerator="cuda", devices=devices, precision="bf16-true")
+    fabric = L.Fabric(accelerator="cuda", devices="auto", precision="bf16-true")

@Abecid Did multi-gpu training work with this script and how many did you use? Is the loss convergence comparable to single gpu training?

juliosanz · 2023-05-23

I ran this with two NVIDIA Tesla T4 cards and devices="auto" and got this error:

RuntimeError: [1] is setting up NCCL communicator and retrieving ncclUniqueId from [0] via c10d key-value store by key '0', but store->get('0') got error: Broken pipe. This may indicate a possible application crash on rank 0 or a network set up issue.
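
The message above only says that rank 1 could not reach rank 0, not why rank 0 failed. One way to collect more detail, sketched here under assumptions, is to rerun with verbose distributed logging. `NCCL_DEBUG` and `TORCH_DISTRIBUTED_DEBUG` are environment variables read by NCCL and PyTorch respectively; the helper name and the launch command in the comment are hypothetical.

```python
import os


def with_distributed_debug(base_env=None):
    """Return a copy of the environment with verbose NCCL/PyTorch logging enabled."""
    env = dict(os.environ if base_env is None else base_env)
    env["NCCL_DEBUG"] = "INFO"                 # NCCL prints communicator setup details
    env["TORCH_DISTRIBUTED_DEBUG"] = "DETAIL"  # PyTorch adds extra distributed checks and logs
    return env


# Hypothetical usage (the script path is an assumption):
# import subprocess, sys
# subprocess.run([sys.executable, "finetune/lora.py"], env=with_distributed_debug())
```

This does not fix the error; it only makes an earlier failure on rank 0 easier to spot in the logs.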

Multi GPU Training

7c781d7

Abecid requested review from awaelchli, carmocca and lantiga as code owners May 19, 2023 08:16

carmocca reviewed May 19, 2023


finetune/adapter.py (resolved)

awaelchli reviewed May 20, 2023


juliosanz May 23, 2023
