
Llama3 8B with QLoRA ends in: You can't train a model that has been loaded in 8-bit precision on a different device than the one you're training on #2713

Closed
oroojlooy opened this issue Apr 25, 2024 · 1 comment · Fixed by #2714



oroojlooy commented Apr 25, 2024

System Info

torch==2.1.2
transformers==4.37.2
accelerate==0.26.1
numpy==1.26.3
OS type and version: fedora, 7.9
Python version: 3.11.0

I am using 8×A100-40GB GPUs.

Related bug: [1515](https://github.com/huggingface/accelerate/issues/1515)

I also tried the same code in a brand new environment with the latest versions of all packages; the same error happens there.

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • One of the scripts in the examples/ folder of Accelerate or an officially supported no_trainer script in the examples folder of the transformers repo (such as run_no_trainer_glue.py)
  • My own task or dataset (give details below)

Reproduction

I am running an extension of QLoRA with meta-llama/Meta-Llama-3-8B-Instruct, using an extended version of https://github.com/jondurbin/qlora.git. The main parameters to set are:

--model_name_or_path  meta-llama/Meta-Llama-3-8B-Instruct   --do_train   --lora_modules all   --bf16   --bits 4   --double_quant   --quant_type nf4 
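
For reference, in the qlora script these flags roughly translate to a 4-bit NF4 load with double quantization and bfloat16 compute. A minimal sketch of an equivalent from_pretrained call (not the script's exact loading code, which builds its own config and max_memory limits):

    import torch
    from transformers import AutoModelForCausalLM, BitsAndBytesConfig

    # Rough equivalent of --bits 4 --double_quant --quant_type nf4 --bf16.
    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_use_double_quant=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.bfloat16,
    )
    model = AutoModelForCausalLM.from_pretrained(
        "meta-llama/Meta-Llama-3-8B-Instruct",
        quantization_config=bnb_config,
        device_map="auto",  # this is what ends up calling infer_auto_device_map
        torch_dtype=torch.bfloat16,
    )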

Expected behavior

The model should start fine-tuning, but it looks like Accelerate reserves most of the memory on cuda:0 for some future operation, so the remaining memory there is not enough to load any part of the model. As a result, when accelerate/accelerator.py:1329 checks

# if on the first device (GPU 0) we don't care
if (self.device.index is not None) or (current_device_index != 0):

it finds that current_device_index != 0 and raises the error:

raise ValueError(
                        "You can't train a model that has been loaded in 8-bit precision on a different device than the one "
                        "you're training on. Make sure you loaded the model on the correct device using for example `device_map={'':torch.cuda.current_device() or device_map={'':torch.xpu.current_device()}"
                    )
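
To make the guard concrete, here is a hedged paraphrase of its logic (not the exact accelerate source; hf_device_map and training_device are illustrative stand-ins for the quantized model's device map and the Accelerator's device):

    import torch

    def fails_quantized_device_check(hf_device_map: dict, training_device: torch.device) -> bool:
        """Rough paraphrase of the check in accelerate's prepare_model(); illustrative only."""
        model_devices = set(hf_device_map.values())
        current_device_index = list(model_devices)[0]  # arbitrary element; in this run none of them is 0
        if torch.device(current_device_index) != training_device:
            # "if on the first device (GPU 0) we don't care"
            return (training_device.index is not None) or (current_device_index != 0)
        return False

    # GPU 0 holds nothing, so whichever device the set yields first, the check fires:
    print(fails_quantized_device_check({"model.embed_tokens": 1}, torch.device("cuda")))  # True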

To add more details, I set verbose=True in accelerate/utils/modeling.py:infer_auto_device_map(), which builds the device_map in its while len(modules_to_treat) > 0: loop (a standalone sketch of that call follows the log below). It reports the following parameter allocation:

Not enough space on 0 to put model (space available 167116948.79999995, module size 7572035584).
Splitting model.

Treating module model.embed_tokens.
Not enough space on 0 to put model.embed_tokens (space available 167116948.79999995, module size 525336576).
This module cannot be split, going to the next device.

Treating module model.embed_tokens.
Putting model.embed_tokens (size=525336576) on 1 (available=1217790100.8).

Treating module model.layers.
Not enough space on 1 to put model.layers (space available 692453524.8, module size 7046694912).
Splitting model.layers.

Treating module model.layers.0.
Putting model.layers.0 (size=220209216) on 1 (available=692453524.8).

Treating module model.layers.1.
Putting model.layers.1 (size=220209216) on 1 (available=472244308.79999995).

Treating module model.layers.2.
Putting model.layers.2 (size=220209216) on 1 (available=252035092.79999995).

Treating module model.layers.3.
Not enough space on 1 to put model.layers.3 (space available 31825876.799999952, module size 220209216).
This module cannot be split, going to the next device.

Treating module model.layers.3.
Putting model.layers.3 (size=220209216) on 2 (available=1217790100.8).

Treating module model.layers.4.
Putting model.layers.4 (size=220209216) on 2 (available=997580884.8).

Treating module model.layers.5.
Putting model.layers.5 (size=220209216) on 2 (available=777371668.8).

Treating module model.layers.6.
Putting model.layers.6 (size=220209216) on 2 (available=557162452.8).

Treating module model.layers.7.
Putting model.layers.7 (size=220209216) on 2 (available=336953236.79999995).

Treating module model.layers.8.
Not enough space on 2 to put model.layers.8 (space available 116744020.79999995, module size 220209216).
This module cannot be split, going to the next device.

Treating module model.layers.8.
Putting model.layers.8 (size=220209216) on 3 (available=1217790100.8).

Treating module model.layers.9.
Putting model.layers.9 (size=220209216) on 3 (available=997580884.8).

Treating module model.layers.10.
Putting model.layers.10 (size=220209216) on 3 (available=777371668.8).

Treating module model.layers.11.
Putting model.layers.11 (size=220209216) on 3 (available=557162452.8).

Treating module model.layers.12.
Putting model.layers.12 (size=220209216) on 3 (available=336953236.79999995).

Treating module model.layers.13.
Not enough space on 3 to put model.layers.13 (space available 116744020.79999995, module size 220209216).
This module cannot be split, going to the next device.

Treating module model.layers.13.
Putting model.layers.13 (size=220209216) on 4 (available=1217790100.8).

Treating module model.layers.14.
Putting model.layers.14 (size=220209216) on 4 (available=997580884.8).

Treating module model.layers.15.
Putting model.layers.15 (size=220209216) on 4 (available=777371668.8).

Treating module model.layers.16.
Putting model.layers.16 (size=220209216) on 4 (available=557162452.8).

Treating module model.layers.17.
Putting model.layers.17 (size=220209216) on 4 (available=336953236.79999995).

Treating module model.layers.18.
Not enough space on 4 to put model.layers.18 (space available 116744020.79999995, module size 220209216).
This module cannot be split, going to the next device.

Treating module model.layers.18.
Putting model.layers.18 (size=220209216) on 5 (available=1217790100.8).

Treating module model.layers.19.
Putting model.layers.19 (size=220209216) on 5 (available=997580884.8).

Treating module model.layers.20.
Putting model.layers.20 (size=220209216) on 5 (available=777371668.8).

Treating module model.layers.21.
Putting model.layers.21 (size=220209216) on 5 (available=557162452.8).

Treating module model.layers.22.
Putting model.layers.22 (size=220209216) on 5 (available=336953236.79999995).

Treating module model.layers.23.
Not enough space on 5 to put model.layers.23 (space available 116744020.79999995, module size 220209216).
This module cannot be split, going to the next device.

Treating module model.layers.23.
Putting model.layers.23 (size=220209216) on 6 (available=1217790100.8).

Treating module model.layers.24.
Putting model.layers.24 (size=220209216) on 6 (available=997580884.8).

Treating module model.layers.25.
Putting model.layers.25 (size=220209216) on 6 (available=777371668.8).

Treating module model.layers.26.
Putting model.layers.26 (size=220209216) on 6 (available=557162452.8).

Treating module model.layers.27.
Putting model.layers.27 (size=220209216) on 6 (available=336953236.79999995).

Treating module model.layers.28.
Not enough space on 6 to put model.layers.28 (space available 116744020.79999995, module size 220209216).
This module cannot be split, going to the next device.

Treating module model.layers.28.
Putting model.layers.28 (size=220209216) on 7 (available=72000000000.0).

Treating module model.layers.29.
Putting model.layers.29 (size=220209216) on 7 (available=71779790784.0).

Treating module model.layers.30.
Putting model.layers.30 (size=220209216) on 7 (available=71559581568.0).

Treating module model.layers.31.
Putting model.layers.31 (size=220209216) on 7 (available=71339372352.0).

Treating module model.norm.
Putting model.norm (size=4096) on 7 (available=71119163136.0).

Treating module lm_head.
Putting lm_head (size=1050673152) on 7 (available=71119159040.0).
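
For reference, the same allocation pass can be run standalone; a minimal sketch, using an illustrative max_memory dict in place of the per-GPU limits Accelerate actually computed for this run:

    from accelerate import infer_auto_device_map, init_empty_weights
    from transformers import AutoConfig, AutoModelForCausalLM

    # Instantiate the model on the meta device so no real memory is used,
    # then ask Accelerate how it would shard it across the GPUs.
    config = AutoConfig.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")
    with init_empty_weights():
        meta_model = AutoModelForCausalLM.from_config(config)

    device_map = infer_auto_device_map(
        meta_model,
        max_memory={i: "1GiB" for i in range(8)},      # illustrative limits only
        no_split_module_classes=["LlamaDecoderLayer"],
        verbose=True,                                  # prints the "Putting ..." lines above
    )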

After this step, I get the following logs followed by the error mentioned above:

Loading checkpoint shards: 100%|██████████████████| 4/4 [00:14<00:00,  3.61s/it]
adding LoRA modules...
loaded model
Filter: 100%|█████████████████████████| 427/427 [00:02<00:00, 168.65 examples/s]
Detected kernel version 5.4.17, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.
trainable params: 167772160 || all params: 8198557696 || trainable: 2.046361887309209
torch.bfloat16 1051463680 0.12824983600627698
torch.int8 6979321856 0.851286545120631
torch.float32 167772160 0.02046361887309209
Traceback (most recent call last):
  File "/home/afshin/.pycharm_helpers/pydev/pydevd.py", line 1496, in _exec
    pydev_imports.execfile(file, globals, locals)  # execute the script
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/afshin/.pycharm_helpers/pydev/_pydev_imps/_pydev_execfile.py", line 18, in execfile
    exec(compile(contents+"\n", file, 'exec'), glob, loc)
  File "/home/afshin/codes/qlora/train.py", line 1490, in <module>
    train()
  File "/home/afshin/codes/qlora/train.py", line 1393, in train
    train_result = trainer.train()
                   ^^^^^^^^^^^^^^^
  File "/home/afshin/miniconda3/envs/mixtral/lib/python3.11/site-packages/transformers/trainer.py", line 1539, in train
    return inner_training_loop(
           ^^^^^^^^^^^^^^^^^^^^
  File "/home/afshin/miniconda3/envs/mixtral/lib/python3.11/site-packages/transformers/trainer.py", line 1687, in _inner_training_loop
    model, self.optimizer = self.accelerator.prepare(self.model, self.optimizer)
                            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/afshin/miniconda3/envs/mixtral/lib/python3.11/site-packages/accelerate/accelerator.py", line 1227, in prepare
    result = tuple(
             ^^^^^^
  File "/home/afshin/miniconda3/envs/mixtral/lib/python3.11/site-packages/accelerate/accelerator.py", line 1228, in <genexpr>
    self._prepare_one(obj, first_pass=True, device_placement=d) for obj, d in zip(args, device_placement)
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/afshin/miniconda3/envs/mixtral/lib/python3.11/site-packages/accelerate/accelerator.py", line 1104, in _prepare_one
    return self.prepare_model(obj, device_placement=device_placement)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/afshin/miniconda3/envs/mixtral/lib/python3.11/site-packages/accelerate/accelerator.py", line 1330, in prepare_model
    raise ValueError(
ValueError: You can't train a model that has been loaded in 8-bit precision on a different device than the one you're training on. Make sure you loaded the model on the correct device using for example `device_map={'':torch.cuda.current_device() or device_map={'':torch.xpu.current_device()}
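
The fix the message itself suggests, pinning the whole quantized model to the current device instead of sharding it, would look roughly like this (only viable when a single GPU can hold the full 4-bit model, e.g. one model copy per process):

    import torch
    from transformers import AutoModelForCausalLM, BitsAndBytesConfig

    # What the error message asks for: load every module on the device this
    # process trains on, instead of spreading the model across GPUs.
    model = AutoModelForCausalLM.from_pretrained(
        "meta-llama/Meta-Llama-3-8B-Instruct",
        quantization_config=BitsAndBytesConfig(load_in_4bit=True),
        device_map={"": torch.cuda.current_device()},
    )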

=====================================================================================
If I add a manual device_map like below, it works as expected:

    # embed_tokens goes on GPU 0 (which the automatic map left empty), layers 0-2
    # on GPU 1, five layers each on GPUs 2-6, and layers 28-31 plus norm and
    # lm_head on GPU 7.
    device_map = {"model.embed_tokens": 0, "model.norm": 7, "lm_head": 7}
    layer_devices = [1] * 3 + [2] * 5 + [3] * 5 + [4] * 5 + [5] * 5 + [6] * 5 + [7] * 4
    for layer_idx, gpu in enumerate(layer_devices):
        device_map[f"model.layers.{layer_idx}"] = gpu
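
which I then pass through to from_pretrained in place of device_map="auto" (a sketch; the quantization arguments stay as in the script):

    from transformers import AutoModelForCausalLM, BitsAndBytesConfig

    # Illustrative only; the qlora script routes this through its own loading helper.
    model = AutoModelForCausalLM.from_pretrained(
        "meta-llama/Meta-Llama-3-8B-Instruct",
        quantization_config=BitsAndBytesConfig(load_in_4bit=True),
        device_map=device_map,  # the manual map built above
    )
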
@muellerzr (Collaborator) commented:

cc @SunMarc
