
[BUG] use bloomz + hybrid_engine, but AttributeError: 'DS_BloomContainer' object has no attribute 'set_params_wo_copy' #3518

@shenzhuo

Description


Describe the bug
When using the hybrid engine with bloomz under ZeRO stage 2, an AttributeError is raised. The error appears to indicate that BLOOM models do not yet support the hybrid engine.

Log output

Traceback (most recent call last):
  File "DeepSpeedRLHF/applications/DeepSpeed-Chat/training/step3_rlhf_finetuning/main.py", line 634, in <module>
    main()
  File "DeepSpeedRLHF/applications/DeepSpeed-Chat/training/step3_rlhf_finetuning/main.py", line 428, in main
    rlhf_engine = DeepSpeedRLHFEngine(
  File "DeepSpeedRLHF/applications/DeepSpeed-Chat/training/step3_rlhf_finetuning/rlhf_engine.py", line 54, in __init__
    self.actor = self._init_actor(
  File "DeepSpeedRLHF/applications/DeepSpeed-Chat/training/step3_rlhf_finetuning/rlhf_engine.py", line 122, in _init_actor
    actor_engine, *_ = deepspeed.initialize(model=actor_model,
  File "venv/lib/python3.9/site-packages/deepspeed/__init__.py", line 153, in initialize
    engine = DeepSpeedHybridEngine(args=args,
  File "venv/lib/python3.9/site-packages/deepspeed/runtime/hybrid_engine.py", line 52, in __init__
    self.create_inference_module()
  File "venv/lib/python3.9/site-packages/deepspeed/runtime/hybrid_engine.py", line 359, in create_inference_module
    self.create_inference_containers(self.module)
  File "venv/lib/python3.9/site-packages/deepspeed/runtime/hybrid_engine.py", line 308, in create_inference_containers
    self.create_inference_containers(child, layer_id=layer_id)
  File "venv/lib/python3.9/site-packages/deepspeed/runtime/hybrid_engine.py", line 308, in create_inference_containers
    self.create_inference_containers(child, layer_id=layer_id)
  File "venv/lib/python3.9/site-packages/deepspeed/runtime/hybrid_engine.py", line 288, in create_inference_containers
    self._inference_containers.append(self.inference_policies[child.__class__][0](
  File "venv/lib/python3.9/site-packages/deepspeed/runtime/hybrid_engine.py", line 111, in new_inference_container
    _container.set_params_wo_copy(Z3_enabled=self.Z3_enabled)
AttributeError: 'DS_BloomContainer' object has no attribute 'set_params_wo_copy'
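Stripped of the DeepSpeed machinery, the failure is a plain missing-method error: the hybrid engine calls `set_params_wo_copy` on every inference container it builds, and the BLOOM container does not define it. A minimal self-contained sketch of that pattern (the class and function names mirror the traceback, but this is illustrative, not DeepSpeed's real class hierarchy):

```python
class BaseContainer:
    """Stand-in for DeepSpeed's base inference container."""

class DS_BloomContainer(BaseContainer):
    """Stand-in for the BLOOM container, which lacks the
    hybrid-engine hook that other model containers provide."""

def new_inference_container(container):
    # Mirrors hybrid_engine.py: the engine assumes every
    # container implements set_params_wo_copy.
    container.set_params_wo_copy(Z3_enabled=False)

try:
    new_inference_container(DS_BloomContainer())
except AttributeError as exc:
    print(exc)  # 'DS_BloomContainer' object has no attribute 'set_params_wo_copy'
```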

To Reproduce
The run.sh is:

nohup sh training_scripts/single_node/run_bloom_1b7.sh \
  bigscience/bloomz-1b7 \
  bigscience/bloomz-1b7 \
  2 \
  2 \
  output_single_node_bloomz1b7 >train_test_zero2.log 2>&1 &

The run_bloom_1b7.sh is:

#!/bin/bash
# Copyright (c) Microsoft Corporation.
# SPDX-License-Identifier: Apache-2.0

# DeepSpeed Team
ACTOR_MODEL_PATH=$1
CRITIC_MODEL_PATH=$2
ACTOR_ZERO_STAGE=${3:-2}
CRITIC_ZERO_STAGE=${4:-2}
OUTPUT=${5:-'./output'}
NUM_GPUS=${6:-8}
NUM_NODES=${7:-1}
mkdir -p $OUTPUT

Num_Padding_at_Beginning=0 # this is model related

Actor_Lr=9.65e-6
Critic_Lr=5e-6
hostname='localhost'

export NCCL_SOCKET_IFNAME=eth
export NCCL_DEBUG=INFO
export TOKENIZERS_PARALLELISM=false

deepspeed --master_port 25303 --master_addr ${hostname} --num_gpus ${NUM_GPUS} --num_nodes ${NUM_NODES} --hostfile 'deepspeed_hostfile' main.py \
  --data_path Dahoas/rm-static \
  --data_split 2,4,4 \
  --actor_model_name_or_path $ACTOR_MODEL_PATH \
  --critic_model_name_or_path $CRITIC_MODEL_PATH \
  --num_padding_at_beginning 1 \
  --per_device_train_batch_size 1 \
  --per_device_mini_train_batch_size 1 \
  --generation_batch_numbers 1 \
  --ppo_epochs 1 \
  --max_answer_seq_len 256 \
  --max_prompt_seq_len 256 \
  --actor_learning_rate ${Actor_Lr} \
  --critic_learning_rate ${Critic_Lr} \
  --disable_actor_dropout \
  --num_train_epochs 1 \
  --lr_scheduler_type cosine \
  --gradient_accumulation_steps 1 \
  --num_warmup_steps 100 \
  --deepspeed --seed 1234 \
  --enable_hybrid_engine \
  --inference_tp_size ${NUM_NODES} \
  --tp_gather_partition_size ${NUM_GPUS} \
  --actor_zero_stage $ACTOR_ZERO_STAGE \
  --critic_zero_stage $CRITIC_ZERO_STAGE \
  --actor_gradient_checkpointing \
  --critic_gradient_checkpointing \
  --output_dir $OUTPUT |&
  tee $OUTPUT/training.log

Expected behavior
DS_BloomContainer should implement set_params_wo_copy so that BLOOM models can be trained with the hybrid engine.

ds_report output

--------------------------------------------------
DeepSpeed C++/CUDA extension op report
--------------------------------------------------
NOTE: Ops not installed will be just-in-time (JIT) compiled at
      runtime if needed. Op compatibility means that your system
      meet the required dependencies to JIT install the op.
--------------------------------------------------
JIT compiled ops requires ninja
ninja .................. [OKAY]
--------------------------------------------------
op name ................ installed .. compatible
--------------------------------------------------
async_io ............... [YES] ...... [OKAY]
cpu_adagrad ............ [NO] ....... [OKAY]
cpu_adam ............... [YES] ...... [OKAY]
fused_adam ............. [YES] ...... [OKAY]
fused_lamb ............. [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
random_ltd ............. [NO] ....... [OKAY]
 [WARNING]  please install triton==1.0.0 if you want to use sparse attention
sparse_attn ............ [NO] ....... [NO]
spatial_inference ...... [NO] ....... [OKAY]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
transformer_inference .. [NO] ....... [OKAY]
utils .................. [YES] ...... [OKAY]
--------------------------------------------------
DeepSpeed general environment info:
torch install path ............... ['/usr/local/venv/lib/python3.9/site-packages/torch']
torch version .................... 1.13.1+cu117
deepspeed install path ........... ['/usr/local/venv/lib/python3.9/site-packages/deepspeed']
deepspeed info ................... 0.9.3+194053b, 194053b, master
torch cuda version ............... 11.7
torch hip version ................ None
nvcc version ..................... 11.7
deepspeed wheel compiled w. ...... torch 1.13, cuda 11.7

Screenshots
None; the error text is included in the log output above.

System info (please complete the following information):

  • OS: Linux version 4.18.0-240.el8.x86_64. CentOS Linux 7 (Core).
  • GPU count and types: a single machine with 8x A100 GPUs
  • Python version: 3.9.13

Docker context
None.

Additional context
None.

@cmikeh2 @jeffra @lekurile @awan-10

Metadata

Labels: bug (Something isn't working), deepspeed-chat (Related to DeepSpeed-Chat), new_config (Related to new configurations and models)
