
DeepSpeed ZeRO Stage 3 not compatible with bnb quantization, leads to shape error #29266

Closed

Qizhang-Feng opened this issue Feb 24, 2024 · 4 comments
System Info

  • transformers version: 4.38.1
  • Platform: Linux-5.10.209-198.812.amzn2.x86_64-x86_64-with-glibc2.26
  • Python version: 3.10.13
  • Huggingface_hub version: 0.20.3
  • Safetensors version: 0.4.2
  • Accelerate version: 0.27.2
  • Accelerate config: - compute_environment: LOCAL_MACHINE
    - distributed_type: DEEPSPEED
    - mixed_precision: fp16
    - use_cpu: False
    - debug: True
    - num_processes: 8
    - machine_rank: 0
    - num_machines: 1
    - rdzv_backend: static
    - same_network: True
    - main_training_function: main
    - deepspeed_config: {'gradient_accumulation_steps': 1, 'offload_optimizer_device': 'none', 'offload_param_device': 'none', 'zero3_init_flag': True, 'zero3_save_16bit_model': True, 'zero_stage': 3}
    - downcast_bf16: no
    - tpu_use_cluster: False
    - tpu_use_sudo: False
    - tpu_env: []
  • PyTorch version (GPU?): 2.0.1 (True)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using GPU in script?:
  • Using distributed or parallel set-up in script?:

Who can help?

@pacman100 @SunMarc @younesbelkada

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_8bit=True,
    # Note: the bnb_4bit_* options below only take effect with
    # load_in_4bit=True; with load_in_8bit=True they are ignored.
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

base_model = AutoModelForCausalLM.from_pretrained(
    script_args.model_name,
    quantization_config=bnb_config,
    trust_remote_code=True,
    # use_auth_token=True,  # deprecated in favor of `token`
    cache_dir=script_args.model_cache_dir,
    token=script_args.access_token,
)

Accelerate config YAML (DeepSpeed ZeRO Stage 3):

compute_environment: LOCAL_MACHINE
debug: true
deepspeed_config:
  gradient_accumulation_steps: 1
  offload_optimizer_device: none
  offload_param_device: none
  zero3_init_flag: true
  zero3_save_16bit_model: true
  zero_stage: 3
distributed_type: DEEPSPEED
downcast_bf16: 'no'
machine_rank: 0
main_training_function: main
mixed_precision: fp16
num_machines: 1
num_processes: 8
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
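
Assuming this config is saved as ds_z3_config.yaml and the training entry point is train.py (both names hypothetical), the run would be started with something like:

accelerate launch --config_file ds_z3_config.yaml train.py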

    return model_class.from_pretrained(
  File "/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/transformers/modeling_utils.py", line 3502, in from_pretrained
    ) = cls._load_pretrained_model(
  File "/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/transformers/modeling_utils.py", line 3926, in _load_pretrained_model
    new_error_msgs, offload_index, state_dict_index = _load_state_dict_into_meta_model(
  File "/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/transformers/modeling_utils.py", line 805, in _load_state_dict_into_meta_model
    set_module_tensor_to_device(model, param_name, param_device, **set_module_kwargs)
  File "/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/accelerate/utils/modeling.py", line 345, in set_module_tensor_to_device
    raise ValueError(
ValueError: Trying to set a tensor of shape torch.Size([32000, 4096]) in "weight" (which has shape torch.Size([0])), this look incorrect.
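(Context for the error, added by the editor, not from the original report: under ZeRO Stage 3 with zero3_init_flag: true, parameter storage is partitioned at model construction, so a not-yet-gathered weight presents as an empty tensor of shape torch.Size([0]); the quantized loading path then tries to copy the full [32000, 4096] embedding into that placeholder. A hypothetical pre-flight check along these lines would fail fast instead:)

# Hypothetical guard, not part of transformers: refuse bnb quantization
# when DeepSpeed ZeRO-3 is active, instead of failing deep inside
# _load_state_dict_into_meta_model with a shape mismatch.
from transformers.integrations import is_deepspeed_zero3_enabled

def check_bnb_zero3_compat(quantization_config) -> None:
    if quantization_config is not None and is_deepspeed_zero3_enabled():
        raise ValueError(
            "bitsandbytes quantization is not supported together with "
            "DeepSpeed ZeRO Stage 3: ZeRO-3 partitions weights into empty "
            "placeholders that the quantized loading path cannot fill."
        )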

Expected behavior

Hello,

I encountered an issue while loading the LLaMA 2 7B model with the bnb quantization config shown above under DeepSpeed ZeRO Stage 3. The error occurs during from_pretrained, and I suspect it is related to ZeRO-3's empty parameter placeholders, but I am not sure whether this behavior is a bug or an intended limitation.

Could you please clarify if this is a known issue or if there are any suggested workarounds? Thank you!

@younesbelkada
Contributor

Hi @Qizhang-Feng,
@pacman100 just merged an officially working PEFT script covering various DeepSpeed configurations: https://github.com/huggingface/peft/blob/main/examples/sft/README.md. Can you have a look there and report back if there is any issue? In particular, you should use the same Accelerate config as the one shown there: https://github.com/huggingface/peft/blob/main/examples/sft/configs/deepspeed_config.yaml
Check out also: https://huggingface.co/docs/peft/accelerate/deepspeed
I think quantization + ZeRO-3 are not compatible though - perhaps @pacman100 can confirm.

@pacman100
Contributor

Hello, yes, DeepSpeed and bitsandbytes aren't compatible with each other.


This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

@github-actions github-actions bot closed this as completed Apr 3, 2024
@younesbelkada
Contributor

Hi!
Thanks to @pacman100 and the team's work, DeepSpeed and bnb are now compatible - please see https://huggingface.co/docs/peft/accelerate/deepspeed for more details.
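
For readers landing here: per those docs, the supported combination is QLoRA-style 4-bit quantization whose packed weights are stored in the training dtype. Below is a sketch, assuming transformers >= 4.39 and bitsandbytes >= 0.43.0; the model id is only an example.

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# 4-bit NF4 config whose quantized weights are *stored* as bfloat16, so
# DeepSpeed ZeRO-3 can flatten and partition them like ordinary
# bfloat16 parameters.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_quant_storage=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",  # example model id
    quantization_config=bnb_config,
    torch_dtype=torch.bfloat16,
)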
