load_in_8bit doesn't work when set device_map #29691

Closed

Vinkle-hzt opened this issue Mar 16, 2024 · 2 comments · Fixed by #29958

Vinkle-hzt commented Mar 16, 2024

System Info

platform nvidia/cuda:12.1.0-devel-ubuntu22.04
python 3.10.3
transformers 4.38.2

Who can help?

@ArthurZucker and @younesbelkada

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

  1. Download the model: yahma/llama-7b-hf
  2. Load the model with the following code:
import torch

from transformers import BitsAndBytesConfig, LlamaForCausalLM

path = "/home/localmodel_path"
args = {
    "device_map": {
        "model.embed_tokens": "cpu",
        "model.layers.0": "cpu",
        "model.layers.1": "cpu",
        "model.layers.2": "cpu",
        "model.layers.3": "cpu",
        "model.layers.4": "cpu",
        "model.layers.5": "cpu",
        "model.layers.6": "cpu",
        "model.layers.7": "cpu",
        "model.layers.8": "cpu",
        "model.layers.9": "cpu",
        "model.layers.10": "cpu",
        "model.layers.11": "cuda:0",
        "model.layers.12": "cuda:0",
        "model.layers.13": "cuda:0",
        "model.layers.14": "cuda:0",
        "model.layers.15": "cuda:0",
        "model.layers.16": "cuda:0",
        "model.layers.17": "cuda:0",
        "model.layers.18": "cuda:0",
        "model.layers.19": "cuda:0",
        "model.layers.20": "cuda:0",
        "model.layers.21": "cuda:0",
        "model.layers.22": "cuda:0",
        "model.layers.23": "cpu",
        "model.layers.24": "cpu",
        "model.layers.25": "cpu",
        "model.layers.26": "cpu",
        "model.layers.27": "cpu",
        "model.layers.28": "cpu",
        "model.layers.29": "cpu",
        "model.layers.30": "cpu",
        "model.layers.31": "cpu",
        "model.norm": "cpu",
        "lm_head": "cpu",
    },
    "torch_dtype": torch.float32,
    "quantization_config": BitsAndBytesConfig(
        llm_int8_enable_fp32_cpu_offload=True,
        bnb_4bit_compute_dtype=torch.float16,
        bnb_4bit_use_double_quant=True,
        bnb_4bit_quant_type="nf4",
        load_in_4bit=False,
        load_in_8bit=True,
    ),
}

model = LlamaForCausalLM.from_pretrained(path, **args)
  3. Get a warning message and unexpectedly high GPU memory use (9 GB+; expected: about 2.4 GB):
You are loading your model in 8bit or 4bit but no linear modules were found in your model. Please double check your model architecture, or submit an issue on github if you think this is a bug.
  4. Using pdb (or the dtype check sketched after this list), you can see that model.layers.11 has not been loaded in 8-bit:
(Pdb) p model.model.layers[11].self_attn.q_proj.weight.dtype
torch.float32
  5. However, when I map only the top layers to the GPU, it works correctly:
args = {
    "device_map": {
        "model.embed_tokens": "cuda:0",
        "model.layers.0": "cuda:0",
        "model.layers.1": "cuda:0",
        "model.layers.2": "cuda:0",
        "model.layers.3": "cuda:0",
        "model.layers.4": "cuda:0",
        "model.layers.5": "cuda:0",
        "model.layers.6": "cuda:0",
        "model.layers.7": "cuda:0",
        "model.layers.8": "cuda:0",
        "model.layers.9": "cuda:0",
        "model.layers.10": "cuda:0",
        "model.layers.11": "cpu",
        "model.layers.12": "cpu",
        "model.layers.13": "cpu",
        "model.layers.14": "cpu",
        "model.layers.15": "cpu",
        "model.layers.16": "cpu",
        "model.layers.17": "cpu",
        "model.layers.18": "cpu",
        "model.layers.19": "cpu",
        "model.layers.20": "cpu",
        "model.layers.21": "cpu",
        "model.layers.22": "cpu",
        "model.layers.23": "cpu",
        "model.layers.24": "cpu",
        "model.layers.25": "cpu",
        "model.layers.26": "cpu",
        "model.layers.27": "cpu",
        "model.layers.28": "cpu",
        "model.layers.29": "cpu",
        "model.layers.30": "cpu",
        "model.layers.31": "cpu",
        "model.norm": "cpu",
        "lm_head": "cpu",
    },
    "torch_dtype": torch.float32,
    "quantization_config": BitsAndBytesConfig(
        llm_int8_enable_fp32_cpu_offload=True,
        bnb_4bit_compute_dtype=torch.float16,
        bnb_4bit_use_double_quant=True,
        bnb_4bit_quant_type="nf4",
        load_in_4bit=False,
        load_in_8bit=True,
    ),
}

model = LlamaForCausalLM.from_pretrained(path, **args)
(Pdb) p model.model.layers[0].self_attn.q_proj.weight.dtype
torch.int8
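
As a non-interactive alternative to the pdb check in step 4, here is a minimal sketch. It assumes model was loaded via LlamaForCausalLM.from_pretrained(path, **args) as in the reproduction above, and that every decoder layer exposes the same self_attn.q_proj attribute used in the pdb session:

# Sketch only: print the dtype and device of each layer's q_proj weight.
# Layers that were actually quantized by load_in_8bit should report torch.int8.
for i, layer in enumerate(model.model.layers):
    weight = layer.self_attn.q_proj.weight
    print(f"layer {i:2d}: dtype={weight.dtype}, device={weight.device}")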

Expected behavior

I expect that when I set both device_map and quantization_config, the quantized model is loaded correctly, even if only the middle layers are mapped to the GPU. In particular, in my example I want layers 11 through 22 to be loaded in int8.
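
For reference, the device_map from the reproduction can also be built programmatically. This is only a sketch under the assumption of the 32-layer yahma/llama-7b-hf layout used above, with layers 11 through 22 on cuda:0 and everything else offloaded to the CPU:

import torch
from transformers import BitsAndBytesConfig, LlamaForCausalLM

path = "/home/localmodel_path"  # same local path as in the reproduction

# Map layers 11..22 to the GPU and everything else to the CPU.
device_map = {"model.embed_tokens": "cpu", "model.norm": "cpu", "lm_head": "cpu"}
device_map.update(
    {f"model.layers.{i}": "cuda:0" if 11 <= i <= 22 else "cpu" for i in range(32)}
)

model = LlamaForCausalLM.from_pretrained(
    path,
    device_map=device_map,
    torch_dtype=torch.float32,
    quantization_config=BitsAndBytesConfig(
        load_in_8bit=True,
        llm_int8_enable_fp32_cpu_offload=True,
    ),
)

With a fix in place, the GPU-mapped layers (11 through 22) should come back with int8 weights, while the CPU-offloaded layers stay in float32 because of llm_int8_enable_fp32_cpu_offload.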

ArthurZucker (Collaborator) commented

Pinging @SunMarc as well here!

SunMarc (Member) commented Mar 29, 2024

Hi @Vinkle-hzt, thanks for reporting. This should be fixed in the PR above (#29958)!
