support deepspeed #1101
Conversation
Hello. So we add this, but what is your_stage here? What is zero_stage? Thank you.
your_stage means one of the ZeRO stages: 0, 1, 2, or 3. Details are in the ZeRO documentation. ZeRO partitions the training states (optimizer states, gradients, and parameters, depending on the stage) to shard them across multiple GPUs.
For 2 GPUs, which one do you suggest? Like --deepspeed --zero_stage=1? What about 3 GPUs? Currently it clones the entire training on each GPU as far as I know, so can you give some suggested guidelines?
I think ZeRO stage 2 is the optimal stage if you have a sufficient amount of VRAM, i.e. --deepspeed --zero_stage=2 only. Without offloading, the total amount of VRAM is still a major factor in deciding training parameters. You have to choose between training speed and saving VRAM, which is a trade-off. The number of GPUs matters, but deciding on the ZeRO stage is a strategy for choosing the method you prefer. And if you want to use cpu/nvme offload, you might hit some kind of error in this commit. I will fix it soon. But you can use ZeRO stage 2 without offloading in this commit.
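For reference, a 2-GPU ZeRO-2 launch would look roughly like this; the script name and config path are placeholders, not commands taken from this thread:

```bash
# Sketch only: train_network.py and my_config.toml are placeholders.
accelerate launch --num_processes=2 train_network.py \
  --config_file=my_config.toml \
  --deepspeed --zero_stage=2
```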
When I tested fp16 and bf16 on SD 1.5 I had horrible results. Are you able to get any decent results?
Any update here? @kohya-ss
Here is a report of DeepSpeed in sd-scripts.

Environment
- Model
- GPUs (no NVLink, which creates a bottleneck in GPU communication)
- Requirements
- Training settings

With 24GB VRAM, sd-scripts can barely run PEFT, and cannot run FT with DDP.

Experiment
First, I'm sorry for some missing elements of the tables; my budget is limited. Lower is better. Each table reports average VRAM usage (MB) and training speed (s/it).

- full_fp16, FT
- full_fp16, PEFT (*SIGABRT occurred; I don't know why.)
- full_bf16, FT (*Ideally this OOM should not happen.)
- full_bf16, PEFT (*Ideally this OOM should not happen.)
- bf16, FT
- bf16, PEFT (*I lost this element.)

Results
I think the only way to utilize multiple consumer GPUs is cloning the training on each one, if you don't have pro GPUs.
Why not use bf16 for these cards?
Could you please compare it with training on a single card? As far as I remember, I get roughly the same speed with just one card.
full_bf16 and bf16 are still running. For a single card, training speed is almost the same as DDP; DDP is slightly slower.
The results look very promising. I can help by providing different multi-GPU machines if that would help with your tests.
What is your effective batch size? Is this cloned on each GPU? Like, does 2 GPUs mean 4 * 16 * 2?
Thank you for the suggestion. But I think present-day diffusion models are not so big that they need to be run across multiple machines or nodes.
The effective batch calculation in sd-scripts is as follows:

[effective batch] = [number of machines] x [number of GPUs] x [train_batch_size] x [gradient_accumulation_steps]

For example, in my 2-GPU setting: effective batch = 1 machine x 2 GPUs x 4 train batch size x 16 gradient accumulation steps = 128.
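Restating that example as a quick check:

```python
# Effective batch size for the 2-GPU example above.
num_machines = 1
num_gpus = 2
train_batch_size = 4
gradient_accumulation_steps = 16
effective_batch = num_machines * num_gpus * train_batch_size * gradient_accumulation_steps
print(effective_batch)  # 128
```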
Actually, I was talking about multi-GPU, not multi-machine. If you want to test on 8x A100 and A10G, please send a message.
I tried it with a clean installation on a new machine to do a test, but I received warnings and errors that were too long to include here; unfortunately the result was unsuccessful. Could there be a requirement you missed?
Here is a simple but complete installation guide.

Installation
I recommend using a dedicated anaconda environment for DeepSpeed. First, you need to clone my deepspeed branch.

CONFIG_FILE is the path to the toml file described above.
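The exact commands from this comment are not preserved above; a minimal sketch of the described steps, where the repository URL, branch name, and paths are assumptions, would be:

```bash
# Sketch only: the repository URL, branch name, and file paths are assumptions.
conda create -n deepspeed python=3.10 -y
conda activate deepspeed
git clone -b branch-deepspeed https://github.com/BootsofLagrangian/sd-scripts.git  # assumed fork/branch
cd sd-scripts
pip install -r requirements.txt
DS_BUILD_OPS=0 pip install deepspeed
accelerate launch --num_processes=2 train_network.py --config_file=$CONFIG_FILE --deepspeed --zero_stage=2
```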
Thank you for this great PR! It looks very nice. However, I don't have an environment to test DeepSpeed. I know that I can test it with cloud environments, but I prefer to develop other features rather than testing DeepSpeed. In addition, the update to the scripts is not small, so it will be a little hard to maintain. Therefore, is it OK if I move the features into a single script that supports DeepSpeed as much as possible after merging? I will make a new branch for it, and I'd be happy if you test and review the branch. I think that if the script works well, I will not need to maintain the script in the future, and someone can update it if necessary.
Sounds good. It is fine to move the DeepSpeed features into a dev branch and to postpone merging.
@BootsofLagrangian I have a question about DeepSpeedWrapper. In my understanding, the Accelerate library does not support multiple models for DeepSpeed, so we need to wrap multiple models into a single model. If this is correct, the wrapper is what we pass to Accelerate. Therefore, should accelerator.accumulate also receive the wrapper rather than the individual models?
Yep, that is correct. Accelerate does something magical: the accumulate method accepts Accelerate-compatible modules. I tested it and it works.
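For context, a minimal sketch of the wrapping idea being discussed; this is not the actual DeepSpeedWrapper from the PR, and the class and method names here are illustrative:

```python
import torch.nn as nn

class DeepSpeedWrapper(nn.Module):
    """Illustrative only: bundles several models into one nn.Module so that
    Accelerate's DeepSpeed integration, which prepares a single model, can handle them."""

    def __init__(self, **models: nn.Module):
        super().__init__()
        # Registering the models in a ModuleDict exposes all their parameters to DeepSpeed/ZeRO.
        self.models = nn.ModuleDict(models)

    def get_model(self, name: str) -> nn.Module:
        return self.models[name]

# Usage sketch (accelerator, unet, text_encoder, optimizer, dataloader, loss_fn are assumed):
# wrapper = DeepSpeedWrapper(unet=unet, text_encoder=text_encoder)
# wrapper, optimizer, dataloader = accelerator.prepare(wrapper, optimizer, dataloader)
# with accelerator.accumulate(wrapper):  # pass the wrapper, per the discussion above
#     loss = loss_fn(wrapper.get_model("unet"), batch)
#     accelerator.backward(loss)
```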
Thank you for the clarification! I opened a new PR, #1139. I would appreciate your comments and suggestions.
I got very optimistic results! My dataset: 353 images. Training time is unbelievable: 3x 4090 is faster than 1x H100 PCIe. @BootsofLagrangian you are a wizard!
@storuky Wow, nice results!
Introduction
This PR adds DeepSpeed support via Accelerate to sd-scripts, aiming to improve multi-GPU training with ZeRO stages. I've made these changes in my fork under the branch-deepspeed branch, and I'm open to any feedback!

0. Environment
1. Install DeepSpeed
First, activate your virtual environment and install DeepSpeed with the following command:
```bash
DS_BUILD_OPS=0 pip install deepspeed
```
2. Configure Accelerate
You can easily set up your environment for DeepSpeed with accelerate config. It allows you to control basic DeepSpeed environment variables. You can also use command-line arguments for configuration. Here's how you can set up for ZeRO-2 stage using Accelerate:
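That is, run the interactive configuration command:

```bash
accelerate config
```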
Follow the prompts to select your environment settings, including using multi-GPU, enabling DeepSpeed, and setting ZeRO optimization stage to 2.
Your configuration will be saved in a YAML file, similar to the following example (path and values may vary):
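A minimal sketch of what the generated YAML might contain for a 2-GPU, ZeRO-2 setup; these specific values are illustrative assumptions, not the exact file from this PR:

```yaml
# Illustrative accelerate config for DeepSpeed ZeRO-2 on 2 GPUs; values are assumptions.
compute_environment: LOCAL_MACHINE
distributed_type: DEEPSPEED
deepspeed_config:
  gradient_accumulation_steps: 1
  offload_optimizer_device: none
  offload_param_device: none
  zero3_init_flag: false
  zero_stage: 2
machine_rank: 0
main_training_function: main
mixed_precision: bf16
num_machines: 1
num_processes: 2
use_cpu: false
```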
3. Use in Your Scripts
toml Configuration File
Add deepspeed=true and zero_stage=[zero_stage] to your toml config file. Refer to the ZeRO-stage and Accelerate DeepSpeed documentation for more details.

Bash Argument
Add --deepspeed --zero_stage=[zero_stage] to your script's command-line arguments. [zero_stage] can be 1, 2, or 3.
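For example, the relevant toml lines would look like this (a minimal sketch; the rest of your training options go in the same file):

```toml
# Enable DeepSpeed with ZeRO stage 2 (the stage value here is just an example).
deepspeed = true
zero_stage = 2
```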
CPU/NVMe offloading
Add the offloading argument to your toml or bash/batch script arguments.
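The exact flag is not preserved above; as an assumption based on Accelerate's DeepSpeed offload options, it would look something like the following (the flag name and value are assumptions, not confirmed from this PR):

```bash
# Assumed flag name; offloads optimizer state to CPU (or "nvme").
--deepspeed --zero_stage=2 --offload_optimizer_device=cpu
```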
full_fp16 training
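The details of this section are not preserved above; as an assumption, full fp16 training would combine sd-scripts' existing --full_fp16 option with the DeepSpeed arguments, for example:

```bash
# Assumed combination: --mixed_precision and --full_fp16 are existing sd-scripts options.
--deepspeed --zero_stage=2 --mixed_precision=fp16 --full_fp16
```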
Note
This PR aims to improve training efficiency in multi-GPU setups. It has been tested only in Linux environments and specifically for multi-GPU configurations. DeepSpeed support in Accelerate is still experimental, so please keep this in mind, and feel free to provide feedback or comments on this PR.
Test Done!!