load_in_8bit doesn't work when set device_map #29691

Closed

Vinkle-hzt opened this issue Mar 16, 2024 · 2 comments · Fixed by #29958

Vinkle-hzt commented Mar 16, 2024

System Info

platform nvidia/cuda:12.1.0-devel-ubuntu22.04
python 3.10.3
transformers 4.38.2

Who can help?

@ArthurZucker and @younesbelkada

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

  1. Download the model: yahma/llama-7b-hf
  2. Load the model with the following code:
import torch

from transformers import BitsAndBytesConfig, LlamaForCausalLM

path = "/home/localmodel_path"
args = {
    "device_map": {
        "model.embed_tokens": "cpu",
        "model.layers.0": "cpu",
        "model.layers.1": "cpu",
        "model.layers.2": "cpu",
        "model.layers.3": "cpu",
        "model.layers.4": "cpu",
        "model.layers.5": "cpu",
        "model.layers.6": "cpu",
        "model.layers.7": "cpu",
        "model.layers.8": "cpu",
        "model.layers.9": "cpu",
        "model.layers.10": "cpu",
        "model.layers.11": "cuda:0",
        "model.layers.12": "cuda:0",
        "model.layers.13": "cuda:0",
        "model.layers.14": "cuda:0",
        "model.layers.15": "cuda:0",
        "model.layers.16": "cuda:0",
        "model.layers.17": "cuda:0",
        "model.layers.18": "cuda:0",
        "model.layers.19": "cuda:0",
        "model.layers.20": "cuda:0",
        "model.layers.21": "cuda:0",
        "model.layers.22": "cuda:0",
        "model.layers.23": "cpu",
        "model.layers.24": "cpu",
        "model.layers.25": "cpu",
        "model.layers.26": "cpu",
        "model.layers.27": "cpu",
        "model.layers.28": "cpu",
        "model.layers.29": "cpu",
        "model.layers.30": "cpu",
        "model.layers.31": "cpu",
        "model.norm": "cpu",
        "lm_head": "cpu",
    },
    "torch_dtype": torch.float32,
    "quantization_config": BitsAndBytesConfig(
        llm_int8_enable_fp32_cpu_offload=True,
        bnb_4bit_compute_dtype=torch.float16,
        bnb_4bit_use_double_quant=True,
        bnb_4bit_quant_type="nf4",
        load_in_4bit=False,
        load_in_8bit=True,
    ),
}

model = LlamaForCausalLM.from_pretrained(path, **args)
  3. Get a warning message and unexpectedly high GPU memory use (9 GB+; expected: about 2.4 GB):
You are loading your model in 8bit or 4bit but no linear modules were found in your model. Please double check your model architecture, or submit an issue on github if you think this is a bug.
  4. Using pdb (or the dtype check sketched after this list), you can see that model.layers.11 has not been loaded in 8-bit:
(Pdb) p model.model.layers[11].self_attn.q_proj.weight.dtype
torch.float32
  5. However, when I map only the top layers to the GPU, it works correctly:
args = {
    "device_map": {
        "model.embed_tokens": "cuda:0",
        "model.layers.0": "cuda:0",
        "model.layers.1": "cuda:0",
        "model.layers.2": "cuda:0",
        "model.layers.3": "cuda:0",
        "model.layers.4": "cuda:0",
        "model.layers.5": "cuda:0",
        "model.layers.6": "cuda:0",
        "model.layers.7": "cuda:0",
        "model.layers.8": "cuda:0",
        "model.layers.9": "cuda:0",
        "model.layers.10": "cuda:0",
        "model.layers.11": "cpu",
        "model.layers.12": "cpu",
        "model.layers.13": "cpu",
        "model.layers.14": "cpu",
        "model.layers.15": "cpu",
        "model.layers.16": "cpu",
        "model.layers.17": "cpu",
        "model.layers.18": "cpu",
        "model.layers.19": "cpu",
        "model.layers.20": "cpu",
        "model.layers.21": "cpu",
        "model.layers.22": "cpu",
        "model.layers.23": "cpu",
        "model.layers.24": "cpu",
        "model.layers.25": "cpu",
        "model.layers.26": "cpu",
        "model.layers.27": "cpu",
        "model.layers.28": "cpu",
        "model.layers.29": "cpu",
        "model.layers.30": "cpu",
        "model.layers.31": "cpu",
        "model.norm": "cpu",
        "lm_head": "cpu",
    },
    "torch_dtype": torch.float32,
    "quantization_config": BitsAndBytesConfig(
        llm_int8_enable_fp32_cpu_offload=True,
        bnb_4bit_compute_dtype=torch.float16,
        bnb_4bit_use_double_quant=True,
        bnb_4bit_quant_type="nf4",
        load_in_4bit=False,
        load_in_8bit=True,
    ),
}

model = LlamaForCausalLM.from_pretrained(path, **args)
(Pdb) p model.model.layers[0].self_attn.q_proj.weight.dtype
torch.int8
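
As a non-interactive alternative to the pdb check in step 4, here is a minimal sketch. It assumes model was loaded via LlamaForCausalLM.from_pretrained(path, **args) as in the reproduction above, and that every decoder layer exposes the same self_attn.q_proj attribute used in the pdb session:

# Sketch only: print the dtype and device of each layer's q_proj weight.
# Layers that were actually quantized by load_in_8bit should report torch.int8.
for i, layer in enumerate(model.model.layers):
    weight = layer.self_attn.q_proj.weight
    print(f"layer {i:2d}: dtype={weight.dtype}, device={weight.device}")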

Expected behavior

I expect that when I set both device_map and quantization_config, the quantized model is loaded correctly, even if only the middle layers are mapped to the GPU. In particular, in my example I want layers 11 through 22 to be loaded in int8.
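
For reference, the device_map from the reproduction can also be built programmatically. This is only a sketch under the assumption of the 32-layer yahma/llama-7b-hf layout used above, with layers 11 through 22 on cuda:0 and everything else offloaded to the CPU:

import torch
from transformers import BitsAndBytesConfig, LlamaForCausalLM

path = "/home/localmodel_path"  # same local path as in the reproduction

# Map layers 11..22 to the GPU and everything else to the CPU.
device_map = {"model.embed_tokens": "cpu", "model.norm": "cpu", "lm_head": "cpu"}
device_map.update(
    {f"model.layers.{i}": "cuda:0" if 11 <= i <= 22 else "cpu" for i in range(32)}
)

model = LlamaForCausalLM.from_pretrained(
    path,
    device_map=device_map,
    torch_dtype=torch.float32,
    quantization_config=BitsAndBytesConfig(
        load_in_8bit=True,
        llm_int8_enable_fp32_cpu_offload=True,
    ),
)

With a fix in place, the GPU-mapped layers (11 through 22) should come back with int8 weights, while the CPU-offloaded layers stay in float32 because of llm_int8_enable_fp32_cpu_offload.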

ArthurZucker (Collaborator) commented

Pinging @SunMarc as well here!

SunMarc (Member) commented Mar 29, 2024

Hi @Vinkle-hzt, thanks for reporting. This should be fixed in the PR above (#29958)!
