Extend DeepSpeed integration to ZeRO-{1,2,3} #758
Conversation
@@ -8,7 +8,7 @@ distributed_type: DEEPSPEED
 downcast_bf16: 'no'
 machine_rank: 0
 main_training_function: main
-mixed_precision: 'no'
+mixed_precision: 'bf16'
We can now set this as the default, since we initialise both the reference and active models with DeepSpeed.
@@ -38,6 +38,9 @@ class ScriptArguments:
     # NOTE: gpt2 models use Conv1D instead of Linear layers which are not yet supported in 8 bit mode
     # models like gpt-neo* models are more suitable.
     model_name: Optional[str] = field(default="lvwerra/gpt2-imdb", metadata={"help": "the model name"})
+    reward_model_name: Optional[str] = field(
I've added this arg to make it easier to configure which reward model the script runs with.
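As a sketch of how that argument slots into the script's existing dataclass pattern (the `reward_model_name` default shown here is illustrative — the actual default is truncated in this diff view):

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class ScriptArguments:
    model_name: Optional[str] = field(default="lvwerra/gpt2-imdb", metadata={"help": "the model name"})
    # Hypothetical default for illustration; the PR's real default is not visible above
    reward_model_name: Optional[str] = field(
        default="lvwerra/distilbert-imdb", metadata={"help": "the reward model name"}
    )
```

Defined this way, the field is picked up automatically by `HfArgumentParser`, so the reward model can be swapped from the command line without touching the script body.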
deepspeed_plugin = self.accelerator.state.deepspeed_plugin
batch_size_per_device = deepspeed_plugin.deepspeed_config["train_micro_batch_size_per_gpu"]
# See DeepSpeed docs for definition of these parameters: https://deepspeed.readthedocs.io/en/latest/zero3.html
config_kwargs = {
All these parameters are set automatically by accelerate, so they don't need duplicating. One check I need to make is the inclusion of gradient accumulation. Update: yes, train_batch_size does reflect the gradient accumulation as well, so this is fine to remove IMO.
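For reference, DeepSpeed's effective batch size folds in both gradient accumulation and data parallelism; a minimal sketch of the invariant it enforces (function name is mine):

```python
def effective_train_batch_size(micro_batch_per_gpu: int,
                               grad_accum_steps: int,
                               world_size: int) -> int:
    """DeepSpeed requires:
        train_batch_size == train_micro_batch_size_per_gpu
                            * gradient_accumulation_steps
                            * world_size
    so train_batch_size already reflects gradient accumulation."""
    return micro_batch_per_gpu * grad_accum_steps * world_size
```

This is why only `train_micro_batch_size_per_gpu` needs to be read back from the plugin config: the effective batch size is derived, not set independently.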
@lewtun nice work! I love that there are gpt2 runs across different ZeRO stages. Could you also test ZeRO 2 + 3 on larger models such as Falcon 7B or Cerebras-GPT 6.7B?

Yes, I'm running the Cerebras models as we speak and will report back when the runs are done :)
Update on running 3 x 6.7B models with DeepSpeed:

Here's the command I used to test:

accelerate launch --config_file=examples/accelerate_configs/deepspeed_zero{2,3}.yaml examples/scripts/sentiment_tuning.py --batch_size 32 --mini_batch_size 32 --log_with wandb --model_name cerebras/Cerebras-GPT-6.7B --reward_model_name cerebras/Cerebras-GPT-6.7B

Interestingly, although ZeRO-3 is less memory intensive, the savings aren't as high as I would have expected on a single node:
Looks really great to me, thanks a lot for the thorough investigation and for spending time on benchmarking and testing to make sure we now support DeepSpeed ZeRO 1, 2 and 3!
# Some tokenizers like GPT-2's don't have a padding token by default, so we set one here.
if sentiment_pipe.tokenizer.pad_token_id is None:
    sentiment_pipe.tokenizer.pad_token_id = tokenizer.pad_token_id

if sentiment_pipe.model.config.pad_token_id is None:
    sentiment_pipe.model.config.pad_token_id = tokenizer.pad_token_id
Do you know why this was not needed before?
It's usually not needed if you've already trained a proper reward model, because that comes with a proper padding token. However, if you want to plug and play with any causal LM from the Hub, then this is typically needed to avoid throwing errors in the pipeline.
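A minimal, self-contained sketch of that plug-and-play fallback — using stand-in objects rather than real `transformers` classes — where a missing pad token is backfilled from the EOS token, mirroring the pattern in the diff above:

```python
from types import SimpleNamespace

def ensure_pad_token_id(tokenizer, model_config):
    """Fall back to the EOS id when a causal LM ships without a pad token."""
    if tokenizer.pad_token_id is None:
        tokenizer.pad_token_id = tokenizer.eos_token_id
    if model_config.pad_token_id is None:
        model_config.pad_token_id = tokenizer.pad_token_id
    return model_config.pad_token_id

# Stand-ins mimicking a GPT-2-style tokenizer and model config with no pad token
tok = SimpleNamespace(pad_token_id=None, eos_token_id=50256)
cfg = SimpleNamespace(pad_token_id=None)
pad_id = ensure_pad_token_id(tok, cfg)
```

Without this, batched generation or classification pipelines raise an error as soon as they try to pad a batch of uneven-length sequences.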
Co-authored-by: Younes Belkada <49240599+younesbelkada@users.noreply.github.com>
Great work @lewtun

Thanks in advance!
Hi @uahmad235! Here are answers to your questions:
It will likely be tight to fit 3 x 7B models on 2 x A6000s, so one possibility would be to quantize the reward model by passing the appropriate flag.

Hope that helps!
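A rough back-of-envelope for why three ~7B models are tight on 2 x 48 GB A6000s — counting weights only, and ignoring gradients, optimizer states, activations, and any ZeRO partitioning (helper name is mine):

```python
def weights_gib(n_params: float, bytes_per_param: int = 2) -> float:
    """Memory for model weights alone, in GiB (bf16/fp16 = 2 bytes per parameter)."""
    return n_params * bytes_per_param / 2**30

# Three ~7B models (active policy, frozen reference, reward) in bf16:
per_model = weights_gib(7e9)   # ~13 GiB each
total = 3 * per_model          # ~39 GiB of weights before any optimizer state
```

With Adam optimizer states and activations on top of the ~39 GiB of weights, 96 GB total across two cards leaves little headroom, which is why quantizing the reward model (or moving to A100s) helps.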
Thanks for the info @lewtun. However, that didn't quite work out in my case. Seems like I might have to go for a pair of A100s.
* Generalise deepspeed
* Refactor
* Add reward model arg
* Fix pipeline tokenizer
* Fix deprecation
* Pin deepspeed lower
* Fix docs
* Revert top_k change
* Add ZeRO-3 context manager
* Revert docs change
* Fix docs
* Polish docs
* Update docs/source/customization.mdx

Co-authored-by: Younes Belkada <49240599+younesbelkada@users.noreply.github.com>
This PR extends the DeepSpeed initialization of the reference model to work with all stages of DeepSpeed ZeRO.
I'll share some plots of the GPT-2 runs on sentiment tuning shortly, but the code should be good for a review.
Tested with:
Here's the screenshots of the various runs on wandb: https://wandb.ai/huggingface/trl?workspace=user-lewtun
Overall, we get good agreement between the baseline (no DeepSpeed) and stages 1 & 2, while stage 3 shows a noticeable discrepancy in the value loss that is worth digging into in a separate issue IMO.