
[BUG] use bloomz + hybrid_engine, but AttributeError: 'DS_BloomContainer' object has no attribute 'set_params_wo_copy' #3518

@shenzhuo

Description


Describe the bug
When using the hybrid engine with bloomz under ZeRO stage 2, an AttributeError is raised. The error appears to indicate that BLOOM models do not yet support the hybrid engine.

Log output

Traceback (most recent call last):
  File "DeepSpeedRLHF/applications/DeepSpeed-Chat/training/step3_rlhf_finetuning/main.py", line 634, in <module>
    main()
  File "DeepSpeedRLHF/applications/DeepSpeed-Chat/training/step3_rlhf_finetuning/main.py", line 428, in main
    rlhf_engine = DeepSpeedRLHFEngine(
  File "DeepSpeedRLHF/applications/DeepSpeed-Chat/training/step3_rlhf_finetuning/rlhf_engine.py", line 54, in __init__
    self.actor = self._init_actor(
  File "DeepSpeedRLHF/applications/DeepSpeed-Chat/training/step3_rlhf_finetuning/rlhf_engine.py", line 122, in _init_actor
    actor_engine, *_ = deepspeed.initialize(model=actor_model,
  File "venv/lib/python3.9/site-packages/deepspeed/__init__.py", line 153, in initialize
    engine = DeepSpeedHybridEngine(args=args,
  File "venv/lib/python3.9/site-packages/deepspeed/runtime/hybrid_engine.py", line 52, in __init__
    self.create_inference_module()
  File "venv/lib/python3.9/site-packages/deepspeed/runtime/hybrid_engine.py", line 359, in create_inference_module
    self.create_inference_containers(self.module)
  File "venv/lib/python3.9/site-packages/deepspeed/runtime/hybrid_engine.py", line 308, in create_inference_containers
    self.create_inference_containers(child, layer_id=layer_id)
  File "venv/lib/python3.9/site-packages/deepspeed/runtime/hybrid_engine.py", line 308, in create_inference_containers
    self.create_inference_containers(child, layer_id=layer_id)
  File "venv/lib/python3.9/site-packages/deepspeed/runtime/hybrid_engine.py", line 288, in create_inference_containers
    self._inference_containers.append(self.inference_policies[child.__class__][0](
  File "venv/lib/python3.9/site-packages/deepspeed/runtime/hybrid_engine.py", line 111, in new_inference_container
    _container.set_params_wo_copy(Z3_enabled=self.Z3_enabled)
AttributeError: 'DS_BloomContainer' object has no attribute 'set_params_wo_copy'
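Stripped of the DeepSpeed machinery, the failure is a plain missing-method error: the hybrid engine calls `set_params_wo_copy` on every inference container it builds, and the BLOOM container does not define it. A minimal self-contained sketch of that pattern (the class and function names mirror the traceback, but this is illustrative, not DeepSpeed's real class hierarchy):

```python
class BaseContainer:
    """Stand-in for DeepSpeed's base inference container."""

class DS_BloomContainer(BaseContainer):
    """Stand-in for the BLOOM container, which lacks the
    hybrid-engine hook that other model containers provide."""

def new_inference_container(container):
    # Mirrors hybrid_engine.py: the engine assumes every
    # container implements set_params_wo_copy.
    container.set_params_wo_copy(Z3_enabled=False)

try:
    new_inference_container(DS_BloomContainer())
except AttributeError as exc:
    print(exc)  # 'DS_BloomContainer' object has no attribute 'set_params_wo_copy'
```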

To Reproduce
The run.sh is:

nohup sh training_scripts/single_node/run_bloom_1b7.sh \
  bigscience/bloomz-1b7 \
  bigscience/bloomz-1b7 \
  2 \
  2 \
  output_single_node_bloomz1b7 >train_test_zero2.log 2>&1 &

The run_bloom_1b7.sh is:

#!/bin/bash
# Copyright (c) Microsoft Corporation.
# SPDX-License-Identifier: Apache-2.0

# DeepSpeed Team
ACTOR_MODEL_PATH=$1
CRITIC_MODEL_PATH=$2
ACTOR_ZERO_STAGE=${3:-2}
CRITIC_ZERO_STAGE=${4:-2}
OUTPUT=${5:-'./output'}
NUM_GPUS=${6:-8}
NUM_NODES=${7:-1}
mkdir -p $OUTPUT

Num_Padding_at_Beginning=0 # this is model related

Actor_Lr=9.65e-6
Critic_Lr=5e-6
hostname='localhost'

export NCCL_SOCKET_IFNAME=eth
export NCCL_DEBUG=INFO
export TOKENIZERS_PARALLELISM=false

deepspeed --master_port 25303 --master_addr ${hostname} --num_gpus ${NUM_GPUS} --num_nodes ${NUM_NODES} --hostfile 'deepspeed_hostfile' main.py \
  --data_path Dahoas/rm-static \
  --data_split 2,4,4 \
  --actor_model_name_or_path $ACTOR_MODEL_PATH \
  --critic_model_name_or_path $CRITIC_MODEL_PATH \
  --num_padding_at_beginning 1 \
  --per_device_train_batch_size 1 \
  --per_device_mini_train_batch_size 1 \
  --generation_batch_numbers 1 \
  --ppo_epochs 1 \
  --max_answer_seq_len 256 \
  --max_prompt_seq_len 256 \
  --actor_learning_rate ${Actor_Lr} \
  --critic_learning_rate ${Critic_Lr} \
  --disable_actor_dropout \
  --num_train_epochs 1 \
  --lr_scheduler_type cosine \
  --gradient_accumulation_steps 1 \
  --num_warmup_steps 100 \
  --deepspeed --seed 1234 \
  --enable_hybrid_engine \
  --inference_tp_size ${NUM_NODES} \
  --tp_gather_partition_size ${NUM_GPUS} \
  --actor_zero_stage $ACTOR_ZERO_STAGE \
  --critic_zero_stage $CRITIC_ZERO_STAGE \
  --actor_gradient_checkpointing \
  --critic_gradient_checkpointing \
  --output_dir $OUTPUT |&
  tee $OUTPUT/training.log

Expected behavior
DS_BloomContainer should implement set_params_wo_copy so that BLOOM models can be trained with the hybrid engine.

ds_report output

--------------------------------------------------
DeepSpeed C++/CUDA extension op report
--------------------------------------------------
NOTE: Ops not installed will be just-in-time (JIT) compiled at
      runtime if needed. Op compatibility means that your system
      meet the required dependencies to JIT install the op.
--------------------------------------------------
JIT compiled ops requires ninja
ninja .................. [OKAY]
--------------------------------------------------
op name ................ installed .. compatible
--------------------------------------------------
async_io ............... [YES] ...... [OKAY]
cpu_adagrad ............ [NO] ....... [OKAY]
cpu_adam ............... [YES] ...... [OKAY]
fused_adam ............. [YES] ...... [OKAY]
fused_lamb ............. [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
random_ltd ............. [NO] ....... [OKAY]
 [WARNING]  please install triton==1.0.0 if you want to use sparse attention
sparse_attn ............ [NO] ....... [NO]
spatial_inference ...... [NO] ....... [OKAY]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
transformer_inference .. [NO] ....... [OKAY]
utils .................. [YES] ...... [OKAY]
--------------------------------------------------
DeepSpeed general environment info:
torch install path ............... ['/usr/local/venv/lib/python3.9/site-packages/torch']
torch version .................... 1.13.1+cu117
deepspeed install path ........... ['/usr/local/venv/lib/python3.9/site-packages/deepspeed']
deepspeed info ................... 0.9.3+194053b, 194053b, master
torch cuda version ............... 11.7
torch hip version ................ None
nvcc version ..................... 11.7
deepspeed wheel compiled w. ...... torch 1.13, cuda 11.7

Screenshots
None; the error text is included in the log output above.

System info (please complete the following information):

  • OS: Linux version 4.18.0-240.el8.x86_64. CentOS Linux 7 (Core).
  • GPU count and types: a single machine with 8x A100 GPUs
  • Python version: 3.9.13

Docker context
None.

Additional context
None.

@cmikeh2 @jeffra @lekurile @awan-10

Metadata

Labels: bug (Something isn't working), deepspeed-chat (Related to DeepSpeed-Chat), new_config (Related to new configurations and models)
