Extend DeepSpeed integration to ZeRO-{1,2,3} #758
Conversation
@@ -8,7 +8,7 @@ distributed_type: DEEPSPEED
 downcast_bf16: 'no'
 machine_rank: 0
 main_training_function: main
-mixed_precision: 'no'
+mixed_precision: 'bf16'
We can now set this as the default, since we initialise both the reference and active models with DeepSpeed.
@@ -38,6 +38,9 @@ class ScriptArguments:
     # NOTE: gpt2 models use Conv1D instead of Linear layers which are not yet supported in 8 bit mode
     # models like gpt-neo* models are more suitable.
     model_name: Optional[str] = field(default="lvwerra/gpt2-imdb", metadata={"help": "the model name"})
+    reward_model_name: Optional[str] = field(
I've added this arg to make it easier to configure which reward model the script runs with.
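As a sketch of how that argument slots into the script's existing dataclass pattern (the `reward_model_name` default shown here is illustrative — the actual default is truncated in this diff view):

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class ScriptArguments:
    model_name: Optional[str] = field(default="lvwerra/gpt2-imdb", metadata={"help": "the model name"})
    # Hypothetical default for illustration; the PR's real default is not visible above
    reward_model_name: Optional[str] = field(
        default="lvwerra/distilbert-imdb", metadata={"help": "the reward model name"}
    )
```

Defined this way, the field is picked up automatically by `HfArgumentParser`, so the reward model can be swapped from the command line without touching the script body.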
deepspeed_plugin = self.accelerator.state.deepspeed_plugin
batch_size_per_device = deepspeed_plugin.deepspeed_config["train_micro_batch_size_per_gpu"]
# See DeepSpeed docs for definition of these parameters: https://deepspeed.readthedocs.io/en/latest/zero3.html
config_kwargs = {
All these parameters are set automatically by accelerate, so they don't need duplicating. One check I need to make is the inclusion of gradient accumulation. Update: yes, train_batch_size does reflect the gradient accumulation as well, so this is fine to remove IMO.
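For reference, DeepSpeed's effective batch size folds in both gradient accumulation and data parallelism; a minimal sketch of the invariant it enforces (function name is mine):

```python
def effective_train_batch_size(micro_batch_per_gpu: int,
                               grad_accum_steps: int,
                               world_size: int) -> int:
    """DeepSpeed requires:
        train_batch_size == train_micro_batch_size_per_gpu
                            * gradient_accumulation_steps
                            * world_size
    so train_batch_size already reflects gradient accumulation."""
    return micro_batch_per_gpu * grad_accum_steps * world_size
```

This is why only `train_micro_batch_size_per_gpu` needs to be read back from the plugin config: the effective batch size is derived, not set independently.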
@lewtun nice work! I love that there are gpt2 runs across different ZeRO stages. Could you also test ZeRO 2 + 3 on larger models such as Falcon 7B or Cerebras-GPT 6.7B?

Yes, I'm running the Cerebras models as we speak and will report back when the runs are done :)
Update on running 3 x 6.7B models with DeepSpeed:

Here's the command I used to test:

accelerate launch --config_file=examples/accelerate_configs/deepspeed_zero{2,3}.yaml examples/scripts/sentiment_tuning.py --batch_size 32 --mini_batch_size 32 --log_with wandb --model_name cerebras/Cerebras-GPT-6.7B --reward_model_name cerebras/Cerebras-GPT-6.7B

Interestingly, although ZeRO-3 is less memory intensive, the savings aren't as high as I would have expected on a single node:
Looks really great to me, thanks a lot for the thorough investigation and for spending time on benchmarking and testing to make sure we now support DeepSpeed ZeRO 1, 2 and 3!
# Some tokenizers like GPT-2's don't have a padding token by default, so we set one here.
if sentiment_pipe.tokenizer.pad_token_id is None:
    sentiment_pipe.tokenizer.pad_token_id = tokenizer.pad_token_id

if sentiment_pipe.model.config.pad_token_id is None:
    sentiment_pipe.model.config.pad_token_id = tokenizer.pad_token_id
Do you know why this was not needed before?
It's usually not needed if you've already trained a proper reward model, because that comes with a proper padding token. However, if you want to plug and play with any causal LM from the Hub, then this is typically needed to avoid throwing errors in the pipeline.
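A minimal, self-contained sketch of that plug-and-play fallback — using stand-in objects rather than real `transformers` classes — where a missing pad token is backfilled from the EOS token, mirroring the pattern in the diff above:

```python
from types import SimpleNamespace

def ensure_pad_token_id(tokenizer, model_config):
    """Fall back to the EOS id when a causal LM ships without a pad token."""
    if tokenizer.pad_token_id is None:
        tokenizer.pad_token_id = tokenizer.eos_token_id
    if model_config.pad_token_id is None:
        model_config.pad_token_id = tokenizer.pad_token_id
    return model_config.pad_token_id

# Stand-ins mimicking a GPT-2-style tokenizer and model config with no pad token
tok = SimpleNamespace(pad_token_id=None, eos_token_id=50256)
cfg = SimpleNamespace(pad_token_id=None)
pad_id = ensure_pad_token_id(tok, cfg)
```

Without this, batched generation or classification pipelines raise an error as soon as they try to pad a batch of uneven-length sequences.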
Co-authored-by: Younes Belkada <49240599+younesbelkada@users.noreply.github.com>
Great work @lewtun

Thanks in advance!
Hi @uahmad235! Here are answers to your questions:
It will likely be tight to fit 3 x 7B models on 2 x A6000s, so one possibility would be to quantize the reward model by passing the appropriate flag.

Hope that helps!
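A rough back-of-envelope for why three ~7B models are tight on 2 x 48 GB A6000s — counting weights only, and ignoring gradients, optimizer states, activations, and any ZeRO partitioning (helper name is mine):

```python
def weights_gib(n_params: float, bytes_per_param: int = 2) -> float:
    """Memory for model weights alone, in GiB (bf16/fp16 = 2 bytes per parameter)."""
    return n_params * bytes_per_param / 2**30

# Three ~7B models (active policy, frozen reference, reward) in bf16:
per_model = weights_gib(7e9)   # ~13 GiB each
total = 3 * per_model          # ~39 GiB of weights before any optimizer state
```

With Adam optimizer states and activations on top of the ~39 GiB of weights, 96 GB total across two cards leaves little headroom, which is why quantizing the reward model (or moving to A100s) helps.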
Thanks for the info @lewtun. However, that didn't quite work out in my case. Seems like I might have to go for a pair of A100s.
* Generalise deepspeed
* Refactor
* Add reward model arg
* Fix pipeline tokenizer
* Fix deprecation
* Pin deepspeed lower
* Fix docs
* Revert top_k change
* Add ZeRO-3 context manager
* Revert docs change
* Fix docs
* Polish docs
* Update docs/source/customization.mdx

Co-authored-by: Younes Belkada <49240599+younesbelkada@users.noreply.github.com>
This PR extends the DeepSpeed initialization of the reference model to work with all stages of DeepSpeed ZeRO.
I'll share some plots of the GPT-2 runs on sentiment tuning shortly, but the code should be good for a review.
Tested with:
Here's the screenshots of the various runs on wandb: https://wandb.ai/huggingface/trl?workspace=user-lewtun
Overall, we get good agreement between the baseline (no DeepSpeed) and stages 1 & 2, while stage 3 shows a noticeable discrepancy in the value loss that is worth digging into in a separate issue IMO.