
Deepspeed integration for 7B models #19

Merged: 15 commits merged into main on Sep 17, 2023
Conversation

@vwxyzjn (Owner) commented Sep 6, 2023:

This PR attempts to bring DeepSpeed integration to enable tuning with 7B models. In the summarize-from-human-feedback paper, they experimented with 1.3B, 2.7B, and 6.7B models, so this PR would in principle allow us to replicate that work.

Some of the notable changes needed to make things work (a minimal sketch follows the list):

  • mixed_precision: 'bf16' turns out to be important, otherwise OOM.
  • Initialize all models on CPU first and use accelerator.prepare and deepspeed.initialize, otherwise OOM.
  • Enable bf16 for reward_model and ref_policy, otherwise OOM.
  • Do not log the histogram of ratio, otherwise OOM.
  • Enable gradient checkpointing, otherwise OOM.
  • In https://github.com/OpenLMLab/MOSS-RLHF/blob/40b91eb2f2b71b16919addede0341d2bef70825d/utils.py#L41-L43, they have an additional critic_model, and they ultimately offload reward_model, critic_model, and ref_policy to CPU, but that is not necessary in our case.
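
For reference, a minimal sketch of that pattern (the reward-model stand-in, optimizer, and hyperparameters below are assumptions for illustration, not the PR's code):

import torch
from accelerate import Accelerator
from transformers import AutoModelForCausalLM

accelerator = Accelerator(mixed_precision="bf16", gradient_accumulation_steps=64)

# Initialize everything on CPU first; moving 7B models to GPU before `prepare` tends to OOM.
policy = AutoModelForCausalLM.from_pretrained("tiiuae/falcon-7b")
ref_policy = AutoModelForCausalLM.from_pretrained("tiiuae/falcon-7b")
reward_model = AutoModelForCausalLM.from_pretrained("tiiuae/falcon-7b")  # stand-in for the reward model

# Gradient checkpointing trades compute for activation memory on the trained policy.
policy.gradient_checkpointing_enable()

# The frozen models only run forward passes, so bf16 weights are enough for them.
ref_policy = ref_policy.to(dtype=torch.bfloat16)
reward_model = reward_model.to(dtype=torch.bfloat16)

optimizer = torch.optim.AdamW(policy.parameters(), lr=1e-5)

# Only the trained policy and its optimizer go through accelerator.prepare; when launched
# with a DeepSpeed accelerate config, prepare() routes through deepspeed.initialize.
# The frozen models are moved to the device separately.
policy, optimizer = accelerator.prepare(policy, optimizer)
ref_policy = ref_policy.to(accelerator.device)
reward_model = reward_model.to(accelerator.device)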

Here is a training run https://wandb.ai/costa-huang/cleanRL/runs/kve7tu43/overview with

accelerate launch --config_file deepspeed.yaml lm_human_preference_details/train_policy_accelerate.py \
    --rewards.trained_model ''  \
    --base_model tiiuae/falcon-7b  \
    --no_use_tensorflow_adam \
    --ppo.gradient_accumulation_steps 64 \
    --track
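
The deepspeed.yaml referenced above is not shown in this conversation. An accelerate DeepSpeed config along the following lines would match the command (the ZeRO stage, process count, and offload settings here are assumptions, not the PR's actual file):

compute_environment: LOCAL_MACHINE
distributed_type: DEEPSPEED
mixed_precision: bf16
num_machines: 1
num_processes: 8
deepspeed_config:
  zero_stage: 2
  offload_optimizer_device: none
  offload_param_device: none
  zero3_init_flag: false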

Training results were pretty bad, but I think this is probably an issue related to model compatibility. To replicate the summarize-from-human-feedback paper, we should probably use the OPT models, which come in 1.3B, 2.7B, and 6.7B sizes.

CC @lewtun

deepspeed_states = AcceleratorState().deepspeed_plugin
deepspeed_states.deepspeed_config['train_micro_batch_size_per_gpu'] = args.ppo.local_micro_batch_size
deepspeed_states.deepspeed_config['checkpoint'] = {'use_node_local_storage': True}
off_load_device = "cpu"

@lewtun commented Sep 12, 2023:

Note that this will slow down your code significantly. I would make this an option that is inferred from the accelerate config, as I did here: huggingface/trl#758
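
A hedged sketch of what that inference could look like (not the TRL implementation; it just reads the ZeRO offload settings from the accelerate DeepSpeed plugin):

from accelerate.state import AcceleratorState

deepspeed_plugin = AcceleratorState().deepspeed_plugin
ds_config = deepspeed_plugin.deepspeed_config

# Reuse whatever parameter-offload device the accelerate config asks for,
# instead of hard-coding "cpu"; otherwise keep the frozen models on the GPU.
offload_param = ds_config.get("zero_optimization", {}).get("offload_param", {})
off_load_device = offload_param.get("device", "none")
if off_load_device not in ("cpu", "nvme"):
    off_load_device = None  # no offload configured; keep models on the accelerator device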


deepspeed_states = AcceleratorState().deepspeed_plugin
deepspeed_states.deepspeed_config['train_micro_batch_size_per_gpu'] = args.ppo.local_micro_batch_size
deepspeed_states.deepspeed_config['checkpoint'] = {'use_node_local_storage': True}

Review comment:

I think this flag is only needed if each node has a separate local filesystem. For the HFC case, you probably don't need it

import deepspeed

deepspeed_states = AcceleratorState().deepspeed_plugin
deepspeed_states.deepspeed_config['train_micro_batch_size_per_gpu'] = args.ppo.local_micro_batch_size

Review comment:

If I'm not mistaken, these config values are set automatically by the accelerator and don't need to be overridden

"bf16": {
"enabled": True
},
"prescale_gradients": False,

Review comment:

I think this flag and the one below are false by default, so probably don't need to be set either

@@ -755,7 +790,8 @@ def train(args: Args):
     )

     with torch.no_grad():
-        writer.add_histogram("ppo/val/ratio_hist", ratio, update)
+        if not args.deepspeed:  # for some reason there is a OOM with the `writer.add_histogram`
+            writer.add_histogram("ppo/val/ratio_hist", ratio, update)

Review comment:

FYI I was able to train 7B models in TRL with ZeRO-2 and didn't need to remove the histogram. On the other hand, that was for sentiment tuning, which is less memory-intensive than your application here.

@vwxyzjn marked this pull request as ready for review on September 16, 2023, 21:53.

@vwxyzjn (Owner, Author) commented Sep 17, 2023:

Confirmed that it can reasonably run 7B models (no benchmark results yet):


SAVE_PATH_REWARD="models/train_7b_$(date +%s)/reward.pt"
SAVE_PATH_POLICY="models/train_7b_$(date +%s)/policy.pt"
poetry run accelerate launch --config_file deepspeed.yaml  lm_human_preference_details/train_reward_accelerate.py \
    --base_model cerebras/Cerebras-GPT-6.7B \
    --no_use_tensorflow_adam \
    --gradient_accumulation_steps=4 \
    --local_rollout_batch_size=4 \
    --save_path=$SAVE_PATH_REWARD \
    --track && \
    poetry run accelerate launch --config_file deepspeed.yaml  lm_human_preference_details/train_policy_accelerate.py \
    --rewards.trained_model=$SAVE_PATH_REWARD \
    --base_model=cerebras/Cerebras-GPT-6.7B \
    --deepspeed \
    --no_use_tensorflow_adam \
    --ppo.gradient_accumulation_steps 64 \
    --track

https://wandb.ai/costa-huang/cleanRL/runs/hn9wtka9?workspace=user-costa-huang


@vwxyzjn merged commit 48f709d into main on Sep 17, 2023.