Llama3 8B with QLoRA ends in: You can't train a model that has been loaded in 8-bit precision on a different device than the one you're training on
#2713 · Closed · oroojlooy opened this issue on Apr 25, 2024 · 1 comment · Fixed by #2714
System Info
torch==2.1.2
transformers==4.37.2
accelerate==0.26.1
numpy==1.26.3
OS type and version: fedora, 7.9
Python version: 3.11.0
I am using 8×A100-40GB GPUs.
Related bug: [1515](https://github.com/huggingface/accelerate/issues/1515)
I also tried the same code in a brand-new environment with the latest version of every package; the same error happens there.
Information
- The official example scripts
- My own modified scripts

Tasks
- One of the scripts in the examples/ folder of Accelerate or an officially supported no_trainer script in the examples folder of the transformers repo (such as run_no_trainer_glue.py)
- My own task or dataset (give details below)
Reproduction
I am running QLoRA fine-tuning with meta-llama/Meta-Llama-3-8B-Instruct, using an extended version of https://github.com/jondurbin/qlora.git. The main parameters to set are:
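(The actual parameter list did not survive in this copy of the issue, so it is not reproduced here. Purely as a hypothetical sketch of a comparable setup -- the model name is from the issue, but the BitsAndBytesConfig flags and LoRA hyperparameters below are assumptions -- the relevant point is the k-bit load with device_map="auto", which produces the sharded placement discussed under Expected behavior.)

```python
# Hypothetical sketch only -- not the issue's actual parameter list.
# A comparable k-bit + LoRA load with device_map="auto".
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

model_name = "meta-llama/Meta-Llama-3-8B-Instruct"
bnb_config = BitsAndBytesConfig(load_in_8bit=True)  # int8 weights, matching the dtype log further below

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map="auto",            # lets accelerate shard the model across GPUs 0-7
    torch_dtype=torch.bfloat16,
)
model = prepare_model_for_kbit_training(model)
model = get_peft_model(model, LoraConfig(r=64, lora_alpha=16, lora_dropout=0.05, task_type="CAUSAL_LM"))
```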
Expected behavior
The model should start fine-tuning, but it looks like Accelerate reserves most of the memory of cuda:0 for some future operation, so the remaining memory is not enough to load any part of the model there. As a result, when accelerate/accelerator.py:1329 checks

# if on the first device (GPU 0) we don't care
if (self.device.index is not None) or (current_device_index != 0):

it finds that current_device_index != 0 and throws the error:
raise ValueError(
"You can't train a model that has been loaded in 8-bit precision on a different device than the one "
"you're training on. Make sure you loaded the model on the correct device using for example `device_map={'':torch.cuda.current_device() or device_map={'':torch.xpu.current_device()}"
)
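One way to see why this condition trips (a hypothetical inspection snippet, not taken from the issue): with device_map="auto" the loaded model records the placement of every module in model.hf_device_map, and under a multi-GPU launch those indices do not match the GPU the training process runs on.

```python
# Illustrative only: inspect where the sharded load actually put the modules.
# model.hf_device_map is populated by transformers whenever a device_map is used.
import torch

print(model.hf_device_map)
# e.g. {'model.embed_tokens': 1, 'model.layers.0': 1, ..., 'lm_head': 7}

print(torch.cuda.current_device())  # the GPU this training process uses, e.g. 0
```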
To add more detail, I set verbose=True in accelerate/utils/modeling.py:infer_auto_device_map(), in the while len(modules_to_treat) > 0: loop that builds the device_map. It reports the following parameter allocation:
Not enough space on 0 to put model (space available 167116948.79999995, module size 7572035584).
Splitting model.
Treating module model.embed_tokens.
Not enough space on 0 to put model.embed_tokens (space available 167116948.79999995, module size 525336576).
This module cannot be split, going to the next device.
Treating module model.embed_tokens.
Putting model.embed_tokens (size=525336576) on 1 (available=1217790100.8).
Treating module model.layers.
Not enough space on 1 to put model.layers (space available 692453524.8, module size 7046694912).
Splitting model.layers.
Treating module model.layers.0.
Putting model.layers.0 (size=220209216) on 1 (available=692453524.8).
Treating module model.layers.1.
Putting model.layers.1 (size=220209216) on 1 (available=472244308.79999995).
Treating module model.layers.2.
Putting model.layers.2 (size=220209216) on 1 (available=252035092.79999995).
Treating module model.layers.3.
Not enough space on 1 to put model.layers.3 (space available 31825876.799999952, module size 220209216).
This module cannot be split, going to the next device.
Treating module model.layers.3.
Putting model.layers.3 (size=220209216) on 2 (available=1217790100.8).
Treating module model.layers.4.
Putting model.layers.4 (size=220209216) on 2 (available=997580884.8).
Treating module model.layers.5.
Putting model.layers.5 (size=220209216) on 2 (available=777371668.8).
Treating module model.layers.6.
Putting model.layers.6 (size=220209216) on 2 (available=557162452.8).
Treating module model.layers.7.
Putting model.layers.7 (size=220209216) on 2 (available=336953236.79999995).
Treating module model.layers.8.
Not enough space on 2 to put model.layers.8 (space available 116744020.79999995, module size 220209216).
This module cannot be split, going to the next device.
Treating module model.layers.8.
Putting model.layers.8 (size=220209216) on 3 (available=1217790100.8).
Treating module model.layers.9.
Putting model.layers.9 (size=220209216) on 3 (available=997580884.8).
Treating module model.layers.10.
Putting model.layers.10 (size=220209216) on 3 (available=777371668.8).
Treating module model.layers.11.
Putting model.layers.11 (size=220209216) on 3 (available=557162452.8).
Treating module model.layers.12.
Putting model.layers.12 (size=220209216) on 3 (available=336953236.79999995).
Treating module model.layers.13.
Not enough space on 3 to put model.layers.13 (space available 116744020.79999995, module size 220209216).
This module cannot be split, going to the next device.
Treating module model.layers.13.
Putting model.layers.13 (size=220209216) on 4 (available=1217790100.8).
Treating module model.layers.14.
Putting model.layers.14 (size=220209216) on 4 (available=997580884.8).
Treating module model.layers.15.
Putting model.layers.15 (size=220209216) on 4 (available=777371668.8).
Treating module model.layers.16.
Putting model.layers.16 (size=220209216) on 4 (available=557162452.8).
Treating module model.layers.17.
Putting model.layers.17 (size=220209216) on 4 (available=336953236.79999995).
Treating module model.layers.18.
Not enough space on 4 to put model.layers.18 (space available 116744020.79999995, module size 220209216).
This module cannot be split, going to the next device.
Treating module model.layers.18.
Putting model.layers.18 (size=220209216) on 5 (available=1217790100.8).
Treating module model.layers.19.
Putting model.layers.19 (size=220209216) on 5 (available=997580884.8).
Treating module model.layers.20.
Putting model.layers.20 (size=220209216) on 5 (available=777371668.8).
Treating module model.layers.21.
Putting model.layers.21 (size=220209216) on 5 (available=557162452.8).
Treating module model.layers.22.
Putting model.layers.22 (size=220209216) on 5 (available=336953236.79999995).
Treating module model.layers.23.
Not enough space on 5 to put model.layers.23 (space available 116744020.79999995, module size 220209216).
This module cannot be split, going to the next device.
Treating module model.layers.23.
Putting model.layers.23 (size=220209216) on 6 (available=1217790100.8).
Treating module model.layers.24.
Putting model.layers.24 (size=220209216) on 6 (available=997580884.8).
Treating module model.layers.25.
Putting model.layers.25 (size=220209216) on 6 (available=777371668.8).
Treating module model.layers.26.
Putting model.layers.26 (size=220209216) on 6 (available=557162452.8).
Treating module model.layers.27.
Putting model.layers.27 (size=220209216) on 6 (available=336953236.79999995).
Treating module model.layers.28.
Not enough space on 6 to put model.layers.28 (space available 116744020.79999995, module size 220209216).
This module cannot be split, going to the next device.
Treating module model.layers.28.
Putting model.layers.28 (size=220209216) on 7 (available=72000000000.0).
Treating module model.layers.29.
Putting model.layers.29 (size=220209216) on 7 (available=71779790784.0).
Treating module model.layers.30.
Putting model.layers.30 (size=220209216) on 7 (available=71559581568.0).
Treating module model.layers.31.
Putting model.layers.31 (size=220209216) on 7 (available=71339372352.0).
Treating module model.norm.
Putting model.norm (size=4096) on 7 (available=71119163136.0).
Treating module lm_head.
Putting lm_head (size=1050673152) on 7 (available=71119159040.0).
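For reference, a similar allocation trace can be produced outside the trainer through accelerate's public API (a hypothetical sketch; the issue itself patched verbose=True inside modeling.py, and the exact byte counts differ because transformers also passes a max_memory computed from the live GPU state):

```python
# Hypothetical sketch: build an auto device map for an empty Llama-3-8B model
# and print the same kind of "Putting ... on ..." decisions as above.
import torch
from accelerate import infer_auto_device_map, init_empty_weights
from transformers import AutoConfig, AutoModelForCausalLM

config = AutoConfig.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")
with init_empty_weights():
    empty_model = AutoModelForCausalLM.from_config(config)

device_map = infer_auto_device_map(
    empty_model,
    no_split_module_classes=["LlamaDecoderLayer"],  # keep decoder layers whole
    dtype=torch.int8,
    verbose=True,  # prints the allocation decisions
)
print(device_map)
```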
After this step, I get the following logs and then the error mentioned above:
Loading checkpoint shards: 100%|██████████████████| 4/4 [00:14<00:00, 3.61s/it]
adding LoRA modules...
loaded model
Filter: 100%|█████████████████████████| 427/427 [00:02<00:00, 168.65 examples/s]
Detected kernel version 5.4.17, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.
trainable params: 167772160 || all params: 8198557696 || trainable: 2.046361887309209
torch.bfloat16 1051463680 0.12824983600627698
torch.int8 6979321856 0.851286545120631
torch.float32 167772160 0.02046361887309209
Traceback (most recent call last):
File "/home/afshin/.pycharm_helpers/pydev/pydevd.py", line 1496, in _exec
pydev_imports.execfile(file, globals, locals) # execute the script
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/afshin/.pycharm_helpers/pydev/_pydev_imps/_pydev_execfile.py", line 18, in execfile
exec(compile(contents+"\n", file, 'exec'), glob, loc)
File "/home/afshin/codes/qlora/train.py", line 1490, in <module>
train()
File "/home/afshin/codes/qlora/train.py", line 1393, in train
train_result = trainer.train()
^^^^^^^^^^^^^^^
File "/home/afshin/miniconda3/envs/mixtral/lib/python3.11/site-packages/transformers/trainer.py", line 1539, in train
return inner_training_loop(
^^^^^^^^^^^^^^^^^^^^
File "/home/afshin/miniconda3/envs/mixtral/lib/python3.11/site-packages/transformers/trainer.py", line 1687, in _inner_training_loop
model, self.optimizer = self.accelerator.prepare(self.model, self.optimizer)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/afshin/miniconda3/envs/mixtral/lib/python3.11/site-packages/accelerate/accelerator.py", line 1227, in prepare
result = tuple(
^^^^^^
File "/home/afshin/miniconda3/envs/mixtral/lib/python3.11/site-packages/accelerate/accelerator.py", line 1228, in <genexpr>
self._prepare_one(obj, first_pass=True, device_placement=d) for obj, d in zip(args, device_placement)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/afshin/miniconda3/envs/mixtral/lib/python3.11/site-packages/accelerate/accelerator.py", line 1104, in _prepare_one
return self.prepare_model(obj, device_placement=device_placement)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/afshin/miniconda3/envs/mixtral/lib/python3.11/site-packages/accelerate/accelerator.py", line 1330, in prepare_model
raise ValueError(
ValueError: You can't train a model that has been loaded in 8-bit precision on a different device than the one you're training on. Make sure you loaded the model on the correct device using for example `device_map={'':torch.cuda.current_device() or device_map={'':torch.xpu.current_device()}
=====================================================================================
If I add a manual device_map like below, it works as expected:
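(The exact device_map used is not shown in this copy of the issue. As a hypothetical example of the kind of manual map the error message itself suggests, pinning the entire model to the local process's GPU avoids the sharded placement:)

```python
# Hypothetical example of a manual device_map -- the reporter's exact map is
# not shown here. It places the whole model on this process's GPU, as the
# error message suggests for k-bit training under a multi-GPU launch.
import os
import torch

local_rank = int(os.environ.get("LOCAL_RANK", "0"))
torch.cuda.set_device(local_rank)
device_map = {"": torch.cuda.current_device()}  # e.g. {"": 0} on rank 0

# model = AutoModelForCausalLM.from_pretrained(
#     "meta-llama/Meta-Llama-3-8B-Instruct",
#     quantization_config=bnb_config,   # bnb_config as in the reproduction sketch above
#     device_map=device_map,
# )
```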