
DeepSpeed ZeRO Stage 3 not compatible with bnb quantization, leads to shape error #29266

Closed

Qizhang-Feng opened this issue Feb 24, 2024 · 4 comments
System Info

  • transformers version: 4.38.1
  • Platform: Linux-5.10.209-198.812.amzn2.x86_64-x86_64-with-glibc2.26
  • Python version: 3.10.13
  • Huggingface_hub version: 0.20.3
  • Safetensors version: 0.4.2
  • Accelerate version: 0.27.2
  • Accelerate config: - compute_environment: LOCAL_MACHINE
    - distributed_type: DEEPSPEED
    - mixed_precision: fp16
    - use_cpu: False
    - debug: True
    - num_processes: 8
    - machine_rank: 0
    - num_machines: 1
    - rdzv_backend: static
    - same_network: True
    - main_training_function: main
    - deepspeed_config: {'gradient_accumulation_steps': 1, 'offload_optimizer_device': 'none', 'offload_param_device': 'none', 'zero3_init_flag': True, 'zero3_save_16bit_model': True, 'zero_stage': 3}
    - downcast_bf16: no
    - tpu_use_cluster: False
    - tpu_use_sudo: False
    - tpu_env: []
  • PyTorch version (GPU?): 2.0.1 (True)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using GPU in script?:
  • Using distributed or parallel set-up in script?:

Who can help?

@pacman100 @SunMarc @younesbelkada

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_8bit=True,
    # Note: the bnb_4bit_* options below only take effect with
    # load_in_4bit=True; with load_in_8bit=True they are ignored.
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

base_model = AutoModelForCausalLM.from_pretrained(
    script_args.model_name,
    quantization_config=bnb_config,
    trust_remote_code=True,
    # use_auth_token=True,  # deprecated in favor of `token`
    cache_dir=script_args.model_cache_dir,
    token=script_args.access_token,
)

Accelerate config YAML (DeepSpeed ZeRO Stage 3):

compute_environment: LOCAL_MACHINE
debug: true
deepspeed_config:
  gradient_accumulation_steps: 1
  offload_optimizer_device: none
  offload_param_device: none
  zero3_init_flag: true
  zero3_save_16bit_model: true
  zero_stage: 3
distributed_type: DEEPSPEED
downcast_bf16: 'no'
machine_rank: 0
main_training_function: main
mixed_precision: fp16
num_machines: 1
num_processes: 8
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
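
Assuming this config is saved as ds_z3_config.yaml and the training entry point is train.py (both names hypothetical), the run would be started with something like:

accelerate launch --config_file ds_z3_config.yaml train.py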

    return model_class.from_pretrained(
  File "/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/transformers/modeling_utils.py", line 3502, in from_pretrained
    ) = cls._load_pretrained_model(
  File "/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/transformers/modeling_utils.py", line 3926, in _load_pretrained_model
    new_error_msgs, offload_index, state_dict_index = _load_state_dict_into_meta_model(
  File "/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/transformers/modeling_utils.py", line 805, in _load_state_dict_into_meta_model
    set_module_tensor_to_device(model, param_name, param_device, **set_module_kwargs)
  File "/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/accelerate/utils/modeling.py", line 345, in set_module_tensor_to_device
    raise ValueError(
ValueError: Trying to set a tensor of shape torch.Size([32000, 4096]) in "weight" (which has shape torch.Size([0])), this look incorrect.
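(Context for the error, added by the editor, not from the original report: under ZeRO Stage 3 with zero3_init_flag: true, parameter storage is partitioned at model construction, so a not-yet-gathered weight presents as an empty tensor of shape torch.Size([0]); the quantized loading path then tries to copy the full [32000, 4096] embedding into that placeholder. A hypothetical pre-flight check along these lines would fail fast instead:)

# Hypothetical guard, not part of transformers: refuse bnb quantization
# when DeepSpeed ZeRO-3 is active, instead of failing deep inside
# _load_state_dict_into_meta_model with a shape mismatch.
from transformers.integrations import is_deepspeed_zero3_enabled

def check_bnb_zero3_compat(quantization_config) -> None:
    if quantization_config is not None and is_deepspeed_zero3_enabled():
        raise ValueError(
            "bitsandbytes quantization is not supported together with "
            "DeepSpeed ZeRO Stage 3: ZeRO-3 partitions weights into empty "
            "placeholders that the quantized loading path cannot fill."
        )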

Expected behavior

Hello,

I encountered an issue while loading the LLaMA 2 7B model with the bnb quantization config shown above under DeepSpeed ZeRO Stage 3. The error occurs during from_pretrained, and I suspect it is related to ZeRO-3's empty parameter placeholders, but I am not sure whether this behavior is a bug or an intended limitation.

Could you please clarify if this is a known issue or if there are any suggested workarounds? Thank you!

@younesbelkada
Contributor

Hi @Qizhang-Feng,
@pacman100 just merged an officially working PEFT script covering various DeepSpeed configurations: https://github.com/huggingface/peft/blob/main/examples/sft/README.md. Can you have a look there and report back if there is any issue? In particular, you should use the same Accelerate config as the one shown there: https://github.com/huggingface/peft/blob/main/examples/sft/configs/deepspeed_config.yaml
Check out also: https://huggingface.co/docs/peft/accelerate/deepspeed
I think quantization + ZeRO-3 are not compatible though - perhaps @pacman100 can confirm.

@pacman100
Contributor

Hello, yes, DeepSpeed and bitsandbytes aren't compatible with each other.


This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

@github-actions github-actions bot closed this as completed Apr 3, 2024
@younesbelkada
Contributor

Hi!
Thanks to @pacman100 and the team's work, DeepSpeed and bnb are now compatible - please see https://huggingface.co/docs/peft/accelerate/deepspeed for more details.
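
For readers landing here: per those docs, the supported combination is QLoRA-style 4-bit quantization whose packed weights are stored in the training dtype. Below is a sketch, assuming transformers >= 4.39 and bitsandbytes >= 0.43.0; the model id is only an example.

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# 4-bit NF4 config whose quantized weights are *stored* as bfloat16, so
# DeepSpeed ZeRO-3 can flatten and partition them like ordinary
# bfloat16 parameters.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_quant_storage=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",  # example model id
    quantization_config=bnb_config,
    torch_dtype=torch.bfloat16,
)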
