Llama3 8B with QLoRA ends in: You can't train a model that has been loaded in 8-bit precision on a different device than the one you're training on
#2713 · Closed · oroojlooy opened this issue on Apr 25, 2024 · 1 comment · Fixed by #2714
System Info
torch==2.1.2
transformers==4.37.2
accelerate==0.26.1
numpy==1.26.3
OS type and version: fedora, 7.9
Python version: 3.11.0
I am using 8×A100-40GB GPUs.
Related bug: [1515](https://github.com/huggingface/accelerate/issues/1515)
I also tried the same code in a brand-new environment with the latest version of every package; the same error happens there.
Information
- The official example scripts
- My own modified scripts

Tasks
- One of the scripts in the examples/ folder of Accelerate or an officially supported no_trainer script in the examples folder of the transformers repo (such as run_no_trainer_glue.py)
- My own task or dataset (give details below)
Reproduction
I am running QLoRA fine-tuning with meta-llama/Meta-Llama-3-8B-Instruct, using an extended version of https://github.com/jondurbin/qlora.git. The main parameters to set are:
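(The actual parameter list did not survive in this copy of the issue, so it is not reproduced here. Purely as a hypothetical sketch of a comparable setup -- the model name is from the issue, but the BitsAndBytesConfig flags and LoRA hyperparameters below are assumptions -- the relevant point is the k-bit load with device_map="auto", which produces the sharded placement discussed under Expected behavior.)

```python
# Hypothetical sketch only -- not the issue's actual parameter list.
# A comparable k-bit + LoRA load with device_map="auto".
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

model_name = "meta-llama/Meta-Llama-3-8B-Instruct"
bnb_config = BitsAndBytesConfig(load_in_8bit=True)  # int8 weights, matching the dtype log further below

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map="auto",            # lets accelerate shard the model across GPUs 0-7
    torch_dtype=torch.bfloat16,
)
model = prepare_model_for_kbit_training(model)
model = get_peft_model(model, LoraConfig(r=64, lora_alpha=16, lora_dropout=0.05, task_type="CAUSAL_LM"))
```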
Expected behavior
The model should start fine-tuning, but it looks like Accelerate reserves most of the memory of cuda:0 for some future operation, so the remaining memory is not enough to load any part of the model there. As a result, when accelerate/accelerator.py:1329 checks

# if on the first device (GPU 0) we don't care
if (self.device.index is not None) or (current_device_index != 0):

it finds that current_device_index != 0 and throws the error:
raise ValueError(
"You can't train a model that has been loaded in 8-bit precision on a different device than the one "
"you're training on. Make sure you loaded the model on the correct device using for example `device_map={'':torch.cuda.current_device() or device_map={'':torch.xpu.current_device()}"
)
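One way to see why this condition trips (a hypothetical inspection snippet, not taken from the issue): with device_map="auto" the loaded model records the placement of every module in model.hf_device_map, and under a multi-GPU launch those indices do not match the GPU the training process runs on.

```python
# Illustrative only: inspect where the sharded load actually put the modules.
# model.hf_device_map is populated by transformers whenever a device_map is used.
import torch

print(model.hf_device_map)
# e.g. {'model.embed_tokens': 1, 'model.layers.0': 1, ..., 'lm_head': 7}

print(torch.cuda.current_device())  # the GPU this training process uses, e.g. 0
```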
To add more detail, I set verbose=True in accelerate/utils/modeling.py:infer_auto_device_map(), in the while len(modules_to_treat) > 0: loop that builds the device_map. It reports the following parameter allocation:
Not enough space on 0 to put model (space available 167116948.79999995, module size 7572035584).
Splitting model.
Treating module model.embed_tokens.
Not enough space on 0 to put model.embed_tokens (space available 167116948.79999995, module size 525336576).
This module cannot be split, going to the next device.
Treating module model.embed_tokens.
Putting model.embed_tokens (size=525336576) on 1 (available=1217790100.8).
Treating module model.layers.
Not enough space on 1 to put model.layers (space available 692453524.8, module size 7046694912).
Splitting model.layers.
Treating module model.layers.0.
Putting model.layers.0 (size=220209216) on 1 (available=692453524.8).
Treating module model.layers.1.
Putting model.layers.1 (size=220209216) on 1 (available=472244308.79999995).
Treating module model.layers.2.
Putting model.layers.2 (size=220209216) on 1 (available=252035092.79999995).
Treating module model.layers.3.
Not enough space on 1 to put model.layers.3 (space available 31825876.799999952, module size 220209216).
This module cannot be split, going to the next device.
Treating module model.layers.3.
Putting model.layers.3 (size=220209216) on 2 (available=1217790100.8).
Treating module model.layers.4.
Putting model.layers.4 (size=220209216) on 2 (available=997580884.8).
Treating module model.layers.5.
Putting model.layers.5 (size=220209216) on 2 (available=777371668.8).
Treating module model.layers.6.
Putting model.layers.6 (size=220209216) on 2 (available=557162452.8).
Treating module model.layers.7.
Putting model.layers.7 (size=220209216) on 2 (available=336953236.79999995).
Treating module model.layers.8.
Not enough space on 2 to put model.layers.8 (space available 116744020.79999995, module size 220209216).
This module cannot be split, going to the next device.
Treating module model.layers.8.
Putting model.layers.8 (size=220209216) on 3 (available=1217790100.8).
Treating module model.layers.9.
Putting model.layers.9 (size=220209216) on 3 (available=997580884.8).
Treating module model.layers.10.
Putting model.layers.10 (size=220209216) on 3 (available=777371668.8).
Treating module model.layers.11.
Putting model.layers.11 (size=220209216) on 3 (available=557162452.8).
Treating module model.layers.12.
Putting model.layers.12 (size=220209216) on 3 (available=336953236.79999995).
Treating module model.layers.13.
Not enough space on 3 to put model.layers.13 (space available 116744020.79999995, module size 220209216).
This module cannot be split, going to the next device.
Treating module model.layers.13.
Putting model.layers.13 (size=220209216) on 4 (available=1217790100.8).
Treating module model.layers.14.
Putting model.layers.14 (size=220209216) on 4 (available=997580884.8).
Treating module model.layers.15.
Putting model.layers.15 (size=220209216) on 4 (available=777371668.8).
Treating module model.layers.16.
Putting model.layers.16 (size=220209216) on 4 (available=557162452.8).
Treating module model.layers.17.
Putting model.layers.17 (size=220209216) on 4 (available=336953236.79999995).
Treating module model.layers.18.
Not enough space on 4 to put model.layers.18 (space available 116744020.79999995, module size 220209216).
This module cannot be split, going to the next device.
Treating module model.layers.18.
Putting model.layers.18 (size=220209216) on 5 (available=1217790100.8).
Treating module model.layers.19.
Putting model.layers.19 (size=220209216) on 5 (available=997580884.8).
Treating module model.layers.20.
Putting model.layers.20 (size=220209216) on 5 (available=777371668.8).
Treating module model.layers.21.
Putting model.layers.21 (size=220209216) on 5 (available=557162452.8).
Treating module model.layers.22.
Putting model.layers.22 (size=220209216) on 5 (available=336953236.79999995).
Treating module model.layers.23.
Not enough space on 5 to put model.layers.23 (space available 116744020.79999995, module size 220209216).
This module cannot be split, going to the next device.
Treating module model.layers.23.
Putting model.layers.23 (size=220209216) on 6 (available=1217790100.8).
Treating module model.layers.24.
Putting model.layers.24 (size=220209216) on 6 (available=997580884.8).
Treating module model.layers.25.
Putting model.layers.25 (size=220209216) on 6 (available=777371668.8).
Treating module model.layers.26.
Putting model.layers.26 (size=220209216) on 6 (available=557162452.8).
Treating module model.layers.27.
Putting model.layers.27 (size=220209216) on 6 (available=336953236.79999995).
Treating module model.layers.28.
Not enough space on 6 to put model.layers.28 (space available 116744020.79999995, module size 220209216).
This module cannot be split, going to the next device.
Treating module model.layers.28.
Putting model.layers.28 (size=220209216) on 7 (available=72000000000.0).
Treating module model.layers.29.
Putting model.layers.29 (size=220209216) on 7 (available=71779790784.0).
Treating module model.layers.30.
Putting model.layers.30 (size=220209216) on 7 (available=71559581568.0).
Treating module model.layers.31.
Putting model.layers.31 (size=220209216) on 7 (available=71339372352.0).
Treating module model.norm.
Putting model.norm (size=4096) on 7 (available=71119163136.0).
Treating module lm_head.
Putting lm_head (size=1050673152) on 7 (available=71119159040.0).
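For reference, a similar allocation trace can be produced outside the trainer through accelerate's public API (a hypothetical sketch; the issue itself patched verbose=True inside modeling.py, and the exact byte counts differ because transformers also passes a max_memory computed from the live GPU state):

```python
# Hypothetical sketch: build an auto device map for an empty Llama-3-8B model
# and print the same kind of "Putting ... on ..." decisions as above.
import torch
from accelerate import infer_auto_device_map, init_empty_weights
from transformers import AutoConfig, AutoModelForCausalLM

config = AutoConfig.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")
with init_empty_weights():
    empty_model = AutoModelForCausalLM.from_config(config)

device_map = infer_auto_device_map(
    empty_model,
    no_split_module_classes=["LlamaDecoderLayer"],  # keep decoder layers whole
    dtype=torch.int8,
    verbose=True,  # prints the allocation decisions
)
print(device_map)
```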
After this step, I get the following logs and then the error mentioned above:
Loading checkpoint shards: 100%|██████████████████| 4/4 [00:14<00:00, 3.61s/it]
adding LoRA modules...
loaded model
Filter: 100%|█████████████████████████| 427/427 [00:02<00:00, 168.65 examples/s]
Detected kernel version 5.4.17, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.
trainable params: 167772160 || all params: 8198557696 || trainable: 2.046361887309209
torch.bfloat16 1051463680 0.12824983600627698
torch.int8 6979321856 0.851286545120631
torch.float32 167772160 0.02046361887309209
Traceback (most recent call last):
File "/home/afshin/.pycharm_helpers/pydev/pydevd.py", line 1496, in _exec
pydev_imports.execfile(file, globals, locals) # execute the script
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/afshin/.pycharm_helpers/pydev/_pydev_imps/_pydev_execfile.py", line 18, in execfile
exec(compile(contents+"\n", file, 'exec'), glob, loc)
File "/home/afshin/codes/qlora/train.py", line 1490, in <module>
train()
File "/home/afshin/codes/qlora/train.py", line 1393, in train
train_result = trainer.train()
^^^^^^^^^^^^^^^
File "/home/afshin/miniconda3/envs/mixtral/lib/python3.11/site-packages/transformers/trainer.py", line 1539, in train
return inner_training_loop(
^^^^^^^^^^^^^^^^^^^^
File "/home/afshin/miniconda3/envs/mixtral/lib/python3.11/site-packages/transformers/trainer.py", line 1687, in _inner_training_loop
model, self.optimizer = self.accelerator.prepare(self.model, self.optimizer)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/afshin/miniconda3/envs/mixtral/lib/python3.11/site-packages/accelerate/accelerator.py", line 1227, in prepare
result = tuple(
^^^^^^
File "/home/afshin/miniconda3/envs/mixtral/lib/python3.11/site-packages/accelerate/accelerator.py", line 1228, in <genexpr>
self._prepare_one(obj, first_pass=True, device_placement=d) for obj, d in zip(args, device_placement)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/afshin/miniconda3/envs/mixtral/lib/python3.11/site-packages/accelerate/accelerator.py", line 1104, in _prepare_one
return self.prepare_model(obj, device_placement=device_placement)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/afshin/miniconda3/envs/mixtral/lib/python3.11/site-packages/accelerate/accelerator.py", line 1330, in prepare_model
raise ValueError(
ValueError: You can't train a model that has been loaded in 8-bit precision on a different device than the one you're training on. Make sure you loaded the model on the correct device using for example `device_map={'':torch.cuda.current_device() or device_map={'':torch.xpu.current_device()}
=====================================================================================
If I add a manual device_map like below, it works as expected:
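(The exact device_map used is not shown in this copy of the issue. As a hypothetical example of the kind of manual map the error message itself suggests, pinning the entire model to the local process's GPU avoids the sharded placement:)

```python
# Hypothetical example of a manual device_map -- the reporter's exact map is
# not shown here. It places the whole model on this process's GPU, as the
# error message suggests for k-bit training under a multi-GPU launch.
import os
import torch

local_rank = int(os.environ.get("LOCAL_RANK", "0"))
torch.cuda.set_device(local_rank)
device_map = {"": torch.cuda.current_device()}  # e.g. {"": 0} on rank 0

# model = AutoModelForCausalLM.from_pretrained(
#     "meta-llama/Meta-Llama-3-8B-Instruct",
#     quantization_config=bnb_config,   # bnb_config as in the reproduction sketch above
#     device_map=device_map,
# )
```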