[BUG] Loss scale already at minimum - Training LLaMA-2 7B via HF + DeepSpeed consistently fails #4017
Comments
Reducing the block size from 1024 to 256 partially solves this issue, but the loss scale is still constantly reduced.
What do you mean by bucket size? If you mean …
Try to use bf16.
What if my GPU does not support bf16? LLaMA-2 13B did not have this error when training with DeepSpeed; only 7B kept getting it.
How many GPUs do you use to train? For me, 1 or 2 is OK, while 3 or more hit this issue. V100, fp16.
I use 8 x A6000 GPUs. A small number of GPUs works? That's weird. I thought more GPUs means a bigger batch, and that means better stability.
You can try it; I found this in my training but don't know why.
I mean the actual size of the batch in bytes. When using the Hugging Face Trainer example code, you can specify "block_size" there (part of the DataTrainingArguments, e.g. in https://github.com/huggingface/transformers/blob/main/examples/pytorch/language-modeling/run_clm.py). In the LLaMA-2 paper they used 4096; the default is normally 1024. A 3-epoch training run worked with 256.
Sadly that is not possible, as I am using Nvidia V100 GPUs that do not support bf16, so I have to use fp16.
Interesting, and you did not run into any OOMs when using ZeRO-2 with 2 GPUs?
32 GB is enough for a LoRA-style finetune.
Do you mean …?
Sorry, I meant "block_size". I will edit my message now.
Where exactly is that param? I'm searching for it.
It is not part of the TrainingArguments. It groups the text into chunks of the specified size.
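For intuition, here is a minimal sketch of what that grouping step does (a simplified version of the `group_texts` preprocessing in `run_clm.py`, not the exact upstream code):

```python
# Simplified sketch of block_size grouping (illustrative, not the exact run_clm.py code).
def group_texts(examples: dict, block_size: int = 256) -> dict:
    # Concatenate the token lists of all examples for each field
    # (e.g. input_ids, attention_mask).
    concatenated = {k: sum(examples[k], []) for k in examples}
    total_length = len(concatenated["input_ids"])
    # Drop the remainder so every chunk is exactly block_size tokens long.
    total_length = (total_length // block_size) * block_size
    result = {
        k: [tokens[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, tokens in concatenated.items()
    }
    # Causal LM training uses the inputs themselves as labels.
    result["labels"] = result["input_ids"].copy()
    return result
```

Note that with this scheme, text longer than `block_size` is split across several chunks rather than dropped.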
Oh, that... I can't change that. My data are longer than 256 tokens; trimming them would only make the training results worse.
Training with only 2 GPUs also did not solve the problem for me.
Same issue. What's weird is that it happens with more than 98% probability on V100, but…
You may be interested in this HF doc: https://huggingface.co/docs/transformers/v4.15.0/performance
Yeah, bf16 is the priority to use, but V100 doesn't support it.
Yes, if we had bf16, this issue wouldn't bother us.
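If you are unsure whether a given GPU supports bf16, a quick check like the following (a sketch assuming PyTorch and the HF `TrainingArguments` API; paths are placeholders) can pick the precision automatically:

```python
import torch
from transformers import TrainingArguments

# bf16 needs Ampere (compute capability 8.0) or newer; V100 (Volta) does not qualify.
use_bf16 = torch.cuda.is_available() and torch.cuda.is_bf16_supported()

training_args = TrainingArguments(
    output_dir="out",            # placeholder
    bf16=use_bf16,               # bf16 avoids dynamic loss scaling entirely
    fp16=not use_bf16,           # fp16 relies on loss scaling and can hit this issue
    deepspeed="ds_config.json",  # hypothetical path to the DeepSpeed config
)
```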
@wxjiao do you have some kind of official reference that confirms this? I'm unable to find anything apart from the fact that the config.json for llamav2-7b says that the dtype is …
I encountered the same problem when training ViT. I set scale_window to a relatively small value (e.g. 100), so the loss scale has the opportunity to rise again after it decreases on some batches. That solved the problem, but you may need to choose an appropriate scale_window. And I am also using V100.
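For reference, the dynamic loss-scaling knobs mentioned here live in the `fp16` section of the DeepSpeed config; a minimal sketch with illustrative values, written as the Python dict you would pass to DeepSpeed:

```python
# Illustrative fp16 settings; tune loss_scale_window for your own setup.
ds_fp16_config = {
    "fp16": {
        "enabled": True,
        "loss_scale": 0,            # 0 enables dynamic loss scaling
        "initial_scale_power": 16,  # start at a scale of 2**16
        "loss_scale_window": 100,   # raise the scale again after this many overflow-free steps
        "hysteresis": 2,            # overflows tolerated before lowering the scale
        "min_loss_scale": 1,
    }
}
```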
I can confirm that running the same script with an identical configuration, but on Nvidia Tesla A30 GPUs with BF16 enabled, solves this issue.
Any updates here? I also face this problem on my 8×V100 machines.
None so far. V100 leads to overflow and a huge loss in performance, as per the latest evaluation.
The pull request is also still open and has had no recent changes.
Other options could be:
Both work according to my friends' practice.
@scorixear - can you take a look and see if the changes in #4141 help?
@loadams I have rerun my setup with the latest version of DeepSpeed (0.13.0) and noticed degraded model performance due to a still-present overflow of the loss scale. However, with the previous version I used (0.9.x) I couldn't train the Llama2 model beyond approximately 400 steps without running into the minimum-loss-scale issue explained above. That no longer occurs; training now seems to be stable around a loss scale of 64, so this update definitely did something. It is unfortunate that models are therefore hardware dependent, but I don't think that is an issue related to DeepSpeed. I would leave it at that here; maybe you can decide whether this issue should remain open (as it is still an issue in general) or whether it can be closed (as it isn't that related to DeepSpeed anymore).
…ia `load_module_only` (microsoft#4141)

This PR makes some fixes to the case where we want to resume training from a DeepSpeed ZeRO checkpoint and initialize a new optimizer, while not using the old optimizer in the checkpoint or relying on its existence at all.

In this situation, despite passing `load_module_only=True` and `load_optimizer_states=False` to `load_checkpoint()`, the previous behavior was that:

- `self._load_zero_checkpoint` would still be called, which attempts to load from the (in this case, nonexistent) checkpoint files. This PR stops this function from being called if using `load_module_only=True` and `load_optimizer_states=False`. Alternatively, calling this function may be alright if `"load_from_fp32_weights": true` is set in the DeepSpeed ZeRO config (reference: https://github.com/microsoft/DeepSpeed/blob/ff7d5275f2aa916cb5f320e0d817154e96f9cdb6/deepspeed/runtime/engine.py#L733), but this parameter does not seem to be documented in the docs for ZeRO config dicts.
- In `_load_checkpoint`, the following code block:

  ```
  if self.optimizer is not None and self.fp16_enabled():
      self.optimizer.refresh_fp32_params()
  ```

  results in `self.optimizer.refresh_fp32_params()` being called only if using FP16. As a result, the FP32 optimizer state is never initialized from the 16-bit model weights. This PR removes the fp16-specific condition.

Previously reported in: EleutherAI/gpt-neox#947, EleutherAI/gpt-neox#843
Should also close: microsoft#4017
Fixes: microsoft#4944 and microsoft#4017

This caused problems for a freshly-converted LLama checkpoint, which did not contain optimizer states, when trying to train with this model as initialization. I have confirmed the following fixes prevent this behavior.

cc @Quentin-Anthony @zhangir-azerbayev

Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
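For anyone resuming from a weights-only ZeRO checkpoint, a rough sketch of the call pattern this PR targets (the tiny model, config, and checkpoint path below are placeholders, not the setup from this issue):

```python
import torch
import deepspeed

# Placeholder model and config; in practice this would be the HF LLaMA-2 model
# and the full ZeRO/fp16 config discussed in this thread.
model = torch.nn.Linear(16, 16)
ds_config = {
    "train_micro_batch_size_per_gpu": 1,
    "optimizer": {"type": "Adam", "params": {"lr": 1e-5}},
    "zero_optimization": {"stage": 2},
    "fp16": {"enabled": True},
}

engine, optimizer, _, _ = deepspeed.initialize(
    model=model, model_parameters=model.parameters(), config=ds_config
)

# Load only the module weights; the freshly created optimizer is kept, so missing
# optimizer states in the checkpoint are no longer a problem after this PR.
load_path, client_state = engine.load_checkpoint(
    "/path/to/checkpoint_dir",   # placeholder
    load_module_only=True,
    load_optimizer_states=False,
    load_lr_scheduler_states=False,
)
```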
Describe the bug
When training the LLaMA-2 7B HF model with DeepSpeed on a single-node multi-GPU setup,
the loss_scale is consistently decreased to 1 (the minimum) and training exits with an error.
Exception: Current loss scale already at minimum - cannot decrease scale anymore. Exiting run.
This occurs both with ZeRO Stage 2 + CPU offload and with ZeRO Stage 3 + CPU offload.
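For reference, a minimal sketch of the kind of ZeRO Stage 2 + CPU offload setup this refers to (illustrative values only, not the exact config attached below):

```python
# Illustrative ZeRO-2 + CPU offload + fp16 config (example values, not the actual config).
ds_config = {
    "train_micro_batch_size_per_gpu": "auto",
    "gradient_accumulation_steps": "auto",
    "zero_optimization": {
        "stage": 2,
        "offload_optimizer": {"device": "cpu", "pin_memory": True},
        "overlap_comm": True,
        "contiguous_gradients": True,
    },
    "fp16": {"enabled": "auto", "loss_scale": 0},
}
```

The "auto" values assume the config is consumed through the HF Trainer integration, which fills them in from the TrainingArguments.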
To Reproduce
Expected behavior
Training completes.
ds_report output
System info (please complete the following information):
Launcher context
Launching with DeepSpeed as follows:
Docker context
No Docker
Additional context
DS Config:
Slurm Setup