Mixtral 16-bit LoRA OOM with DeepSpeed ZeRO stage 3 and DPO trainer on 4 80GB A100s #1268
It seems that the Mixtral model is being loaded in full onto every GPU rather than partitioned equally across your GPUs (A100s). Try loading it like this:

    model = AutoModelForCausalLM.from_pretrained(
        model_name_or_model_path,
        torch_dtype=torch.float16,
    )
Thanks a lot for the issue!
@younesbelkada Thanks for this solution. I am using the accelerate multi-GPU config and it works well for Mixtral with DPO. My GPUs are 8 A100 40GB. However, it goes OOM if the sequence length is larger than 1024, which is small; I need at least 2048. I have enabled gradient checkpointing, decreased the batch size to 1, and switched to paged AdamW 8-bit, but it still goes OOM. Is there anything else I can do? I am not sure whether the multi-GPU config allows CPU offload the way DeepSpeed does. I'd really appreciate your help. Thanks
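For reference, a minimal sketch of the memory-saving settings mentioned above (gradient checkpointing, a per-device batch size of 1, paged 8-bit AdamW), assuming a recent transformers with bitsandbytes installed; names and values are placeholders, not the poster's exact configuration:

    from transformers import TrainingArguments

    # Hypothetical settings illustrating the memory-saving flags discussed above.
    training_args = TrainingArguments(
        output_dir="dpo-mixtral",           # placeholder output path
        per_device_train_batch_size=1,      # micro-batch of 1 per GPU
        gradient_accumulation_steps=16,     # keep the effective batch size up
        gradient_checkpointing=True,        # trade compute for activation memory
        optim="paged_adamw_8bit",           # bitsandbytes paged 8-bit AdamW
        bf16=True,                          # 16-bit training on A100s
        learning_rate=5e-6,
        logging_steps=10,
    )
    # With trl's DPOTrainer, sequence length is capped via max_length / max_prompt_length
    # (e.g. max_length=2048, max_prompt_length=1024 in the trl 0.7.x API).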
@janphilippfranken Did DeepSpeed work for you? It does not work for me.
Not with Mixtral; I also ended up using device_map="auto" and just running python train.py for Mixtral (which seems very inefficient?). For Mistral etc. it does work.
I see, but it becomes very slow and does not use the GPUs' full capacity, i.e. GPU utilization is low.
@saeedkhaki92 To decrease the memory footprint of training your model, you might consider using Flash Attention 2: simply pass attn_implementation="flash_attention_2" when loading the model.
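For reference, a minimal sketch of such a load call, assuming transformers >= 4.36 with flash-attn installed and a half-precision dtype (Flash Attention 2 requires fp16 or bf16); the checkpoint name is a placeholder:

    import torch
    from transformers import AutoModelForCausalLM

    model = AutoModelForCausalLM.from_pretrained(
        "mistralai/Mixtral-8x7B-Instruct-v0.1",   # placeholder checkpoint
        torch_dtype=torch.bfloat16,               # FA2 needs fp16/bf16
        attn_implementation="flash_attention_2",  # memory-efficient attention kernels
        use_cache=False,                          # no KV cache needed during training
    )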
@younesbelkada Thanks. It still goes OOM. I added attn_implementation="flash_attention_2" and set use_cache=False. This is my training script and how I call it:
And this is the part of my code where I load the Mixtral model inside my script: rlhf_dpo_4bit.py
trl version: 0.7.11.dev0
@younesbelkada Could you please let us know if there is any other way around this, like CPU offloading? As far as I know, accelerate does not have CPU offload options. I tried DeepSpeed but I am getting errors. Thanks a lot
@younesbelkada
My understanding is that with ZeRO-2 + offloading we should not go OOM, because excess memory would be offloaded to the CPU. I'd appreciate it if you could comment on this. Thanks
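For context, a minimal sketch of what a ZeRO-2 configuration with optimizer CPU offload can look like, written here as a Python dict and dumped to the JSON file that the accelerate/Trainer DeepSpeed integration expects; the "auto" values are resolved by the Hugging Face integration, and the filename is a placeholder:

    import json

    # ZeRO stage 2 with optimizer state offloaded to CPU (a sketch, not the poster's config).
    ds_config = {
        "bf16": {"enabled": "auto"},
        "zero_optimization": {
            "stage": 2,
            "offload_optimizer": {"device": "cpu", "pin_memory": True},
            "overlap_comm": True,
            "contiguous_gradients": True,
        },
        "gradient_accumulation_steps": "auto",
        "gradient_clipping": "auto",
        "train_micro_batch_size_per_gpu": "auto",
        "train_batch_size": "auto",
    }

    with open("ds_zero2_offload.json", "w") as f:  # placeholder filename
        json.dump(ds_config, f, indent=2)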
Hi @saeedkhaki92
@younesbelkada Just a quick update: I managed to get it working with ZeRO-3 + offloading, and by adding a set_z3_leaf_modules call (sketched below) memory usage dropped significantly. Per the DeepSpeed documentation: set_z3_leaf_modules is particularly useful in the context of Mixture of Experts (MoE) models. In MoE models, the computation order of experts varies across forward passes. This variability can disrupt ZeRO3's functionality, as ZeRO3 relies on tracking the computation order of modules to prefetch parameters efficiently. By designating a module as a 'leaf' node, ZeRO3 will prefetch parameters for all child modules upon entering the module.
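A minimal sketch of what such a call can look like, assuming a recent DeepSpeed release that ships set_z3_leaf_modules and the Mixtral implementation in transformers; it is an illustration, not necessarily the exact code used above:

    from deepspeed.utils import set_z3_leaf_modules
    from transformers import AutoModelForCausalLM
    from transformers.models.mixtral.modeling_mixtral import MixtralSparseMoeBlock

    model = AutoModelForCausalLM.from_pretrained("mistralai/Mixtral-8x7B-v0.1")  # placeholder checkpoint

    # Treat each sparse MoE block as a single "leaf" so ZeRO-3 prefetches all expert
    # parameters at once instead of tracking the per-expert computation order.
    set_z3_leaf_modules(model, [MixtralSparseMoeBlock])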
Very nice, thanks for sharing!
Hi!
I am trying to use the DPO trainer to fine-tune a Mixtral 8x7B model in 16-bit precision (I've already completed fine-tuning a 4-bit model without issues, but unfortunately the quantized adapter performs worse than the 16-bit version of the model that I want to compare it to).
My goal is to finish training an adapter in 16-bit precision, and then merge and unload the model with the adapter so I can run inference with vLLM on the merged model.
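For reference, a minimal sketch of that merge-and-unload step with PEFT, assuming the trained LoRA adapter has been saved to disk; paths and checkpoint names are placeholders:

    import torch
    from peft import AutoPeftModelForCausalLM
    from transformers import AutoTokenizer

    # Load the base model together with the trained LoRA adapter in 16-bit.
    adapter_dir = "path/to/dpo-lora-adapter"  # placeholder adapter directory
    model = AutoPeftModelForCausalLM.from_pretrained(adapter_dir, torch_dtype=torch.bfloat16)

    # Fold the adapter weights into the base weights and drop the PEFT wrappers,
    # leaving a plain transformers model that vLLM can serve directly.
    merged = model.merge_and_unload()
    merged.save_pretrained("mixtral-dpo-merged")  # placeholder output directory

    # Save the tokenizer alongside the merged weights (base checkpoint is a placeholder).
    AutoTokenizer.from_pretrained("mistralai/Mixtral-8x7B-v0.1").save_pretrained("mixtral-dpo-merged")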
Unfortunately, I am running into OOM issues when trying to run
dpo_trainer.train()
for the following setup (any help would be much appreciated):
DeepSpeed config (from https://huggingface.co/blog/accelerate-deepspeed):
Accelerate config:
Training script:
Hardware: 4 80GB A100 GPUs
Command:
accelerate launch --config_file accelerate_config.yaml train_dpo.py
Error:
File "/scr/jphilipp/miniconda3/envs/scai-tuning/lib/python3.10/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 1210, in all_gather_coalesced
param_buffer = torch.empty(
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 224.00 MiB. GPU 2 has a total capacty of 79.15 GiB of which 171.25 MiB is free. Including non-PyTorch memory, this process has 78.87 GiB memory in use. Of the allocated memory 76.47 GiB is allocated by PyTorch, and 1.00 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF