[BUG] Loss scale already at minimum - Training LLaMA-2 7B via HF + DeepSpeed consistently fails #4017
Comments
Reducing the block size from 1024 to 256 partially solves this issue, but the loss scale is still constantly reduced.
What do you mean by bucket size? If you mean …
Try to use bf16.
What if my GPU does not support bf16? LLaMA-2 13B did not have this error when training with DeepSpeed; only 7B kept getting it.
How many GPUs do you use to train? For me, 1 or 2 is OK, while 3 or more hit this issue. V100, fp16.
I use 8 x A6000 GPUs. A small number of GPUs works? That's weird. I thought more GPUs means a bigger batch, and that means better stability.
You can try it; I found this in my training but don't know why.
I mean the actual size of the batch in bytes. When using the Hugging Face Trainer example code, you can specify "block_size" there (part of the DataTrainingArguments, e.g. in https://github.com/huggingface/transformers/blob/main/examples/pytorch/language-modeling/run_clm.py). In the LLaMA-2 paper they used 4096; the default is normally 1024. A 3-epoch training run worked with 256.
Sadly that is not possible, as I am using Nvidia V100 GPUs that do not support bf16, so I have to use fp16.
Interesting, and you did not run into any OOMs when using ZeRO-2 with 2 GPUs?
32 GB is enough for a LoRA-style finetune.
Do you mean …?
Sorry, I meant "block_size". I will edit my message now.
Where exactly is that param? I'm searching for it.
It is not part of the TrainingArguments. It groups the text into chunks of the specified size.
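For intuition, here is a minimal sketch of what that grouping step does (a simplified version of the `group_texts` preprocessing in `run_clm.py`, not the exact upstream code):

```python
# Simplified sketch of block_size grouping (illustrative, not the exact run_clm.py code).
def group_texts(examples: dict, block_size: int = 256) -> dict:
    # Concatenate the token lists of all examples for each field
    # (e.g. input_ids, attention_mask).
    concatenated = {k: sum(examples[k], []) for k in examples}
    total_length = len(concatenated["input_ids"])
    # Drop the remainder so every chunk is exactly block_size tokens long.
    total_length = (total_length // block_size) * block_size
    result = {
        k: [tokens[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, tokens in concatenated.items()
    }
    # Causal LM training uses the inputs themselves as labels.
    result["labels"] = result["input_ids"].copy()
    return result
```

Note that with this scheme, text longer than `block_size` is split across several chunks rather than dropped.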
Oh, that... I can't change that. My data are longer than 256 tokens; trimming them would only make the training results worse.
Training with only 2 GPUs also did not solve the problem for me.
Same issue. What's weird is that it happens with more than 98% probability on V100, but…
You may be interested in this HF doc: https://huggingface.co/docs/transformers/v4.15.0/performance
Yeah, bf16 is the priority to use, but V100 doesn't support it.
Yes, if we had bf16, this issue wouldn't bother us.
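If you are unsure whether a given GPU supports bf16, a quick check like the following (a sketch assuming PyTorch and the HF `TrainingArguments` API; paths are placeholders) can pick the precision automatically:

```python
import torch
from transformers import TrainingArguments

# bf16 needs Ampere (compute capability 8.0) or newer; V100 (Volta) does not qualify.
use_bf16 = torch.cuda.is_available() and torch.cuda.is_bf16_supported()

training_args = TrainingArguments(
    output_dir="out",            # placeholder
    bf16=use_bf16,               # bf16 avoids dynamic loss scaling entirely
    fp16=not use_bf16,           # fp16 relies on loss scaling and can hit this issue
    deepspeed="ds_config.json",  # hypothetical path to the DeepSpeed config
)
```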
@wxjiao do you have some kind of official reference that confirms this? I'm unable to find anything apart from the fact that the config.json for llamav2-7b says that the dtype is …
I encountered the same problem when training ViT. I set scale_window to a relatively small value (e.g. 100), so the loss scale has the opportunity to rise again after it decreases on some batches. That solved the problem, but you may need to choose an appropriate scale_window. And I am also using V100.
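For reference, the dynamic loss-scaling knobs mentioned here live in the `fp16` section of the DeepSpeed config; a minimal sketch with illustrative values, written as the Python dict you would pass to DeepSpeed:

```python
# Illustrative fp16 settings; tune loss_scale_window for your own setup.
ds_fp16_config = {
    "fp16": {
        "enabled": True,
        "loss_scale": 0,            # 0 enables dynamic loss scaling
        "initial_scale_power": 16,  # start at a scale of 2**16
        "loss_scale_window": 100,   # raise the scale again after this many overflow-free steps
        "hysteresis": 2,            # overflows tolerated before lowering the scale
        "min_loss_scale": 1,
    }
}
```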
I can confirm that running the same script with an identical configuration, but on Nvidia Tesla A30 GPUs with BF16 enabled, solves this issue.
Any updates here? I also face this problem on my 8×V100 machines.
None so far. V100 leads to overflow and a huge loss in performance, as per the latest evaluation.
The pull request is also still open and has had no recent changes.
Other options could be:
Both work according to my friends' practice.
@scorixear - can you take a look and see if the changes in #4141 help?
@loadams I have rerun my setup with the latest version of DeepSpeed (0.13.0) and noticed degraded model performance due to a still-present overflow of the loss scale. However, with the previous version I used (0.9.x) I couldn't train the Llama2 model beyond approximately 400 steps without running into the minimum-loss-scale issue explained above. That no longer occurs; training now seems to be stable around a loss scale of 64, so this update definitely did something. It is unfortunate that models are therefore hardware dependent, but I don't think that is an issue related to DeepSpeed. I would leave it at that here; maybe you can decide whether this issue should remain open (as it is still an issue in general) or whether it can be closed (as it isn't that related to DeepSpeed anymore).
…ia `load_module_only` (microsoft#4141)

This PR makes some fixes to the case where we want to resume training from a DeepSpeed ZeRO checkpoint and initialize a new optimizer, while not using the old optimizer in the checkpoint or relying on its existence at all.

In this situation, despite passing `load_module_only=True` and `load_optimizer_states=False` to `load_checkpoint()`, the previous behavior was that:

- `self._load_zero_checkpoint` would still be called, which attempts to load from the (in this case, nonexistent) checkpoint files. This PR stops this function from being called if using `load_module_only=True` and `load_optimizer_states=False`. Alternatively, calling this function may be alright if `"load_from_fp32_weights": true` is set in the DeepSpeed ZeRO config (reference: https://github.com/microsoft/DeepSpeed/blob/ff7d5275f2aa916cb5f320e0d817154e96f9cdb6/deepspeed/runtime/engine.py#L733), but this parameter does not seem to be documented in the docs for ZeRO config dicts.
- In `_load_checkpoint`, the following code block:

  ```
  if self.optimizer is not None and self.fp16_enabled():
      self.optimizer.refresh_fp32_params()
  ```

  results in `self.optimizer.refresh_fp32_params()` being called only if using FP16. As a result, the FP32 optimizer state is never initialized from the 16-bit model weights. This PR removes the fp16-specific condition.

Previously reported in: EleutherAI/gpt-neox#947, EleutherAI/gpt-neox#843
Should also close: microsoft#4017
Fixes: microsoft#4944 and microsoft#4017

This caused problems for a freshly-converted LLama checkpoint, which did not contain optimizer states, when trying to train with this model as initialization. I have confirmed the following fixes prevent this behavior.

cc @Quentin-Anthony @zhangir-azerbayev

Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
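For anyone resuming from a weights-only ZeRO checkpoint, a rough sketch of the call pattern this PR targets (the tiny model, config, and checkpoint path below are placeholders, not the setup from this issue):

```python
import torch
import deepspeed

# Placeholder model and config; in practice this would be the HF LLaMA-2 model
# and the full ZeRO/fp16 config discussed in this thread.
model = torch.nn.Linear(16, 16)
ds_config = {
    "train_micro_batch_size_per_gpu": 1,
    "optimizer": {"type": "Adam", "params": {"lr": 1e-5}},
    "zero_optimization": {"stage": 2},
    "fp16": {"enabled": True},
}

engine, optimizer, _, _ = deepspeed.initialize(
    model=model, model_parameters=model.parameters(), config=ds_config
)

# Load only the module weights; the freshly created optimizer is kept, so missing
# optimizer states in the checkpoint are no longer a problem after this PR.
load_path, client_state = engine.load_checkpoint(
    "/path/to/checkpoint_dir",   # placeholder
    load_module_only=True,
    load_optimizer_states=False,
    load_lr_scheduler_states=False,
)
```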
Describe the bug
When training the LLaMA-2 7B HF model with DeepSpeed on a single-node multi-GPU setup,
the loss_scale is consistently decreased to 1 (the minimum) and training exits with an error.
Exception: Current loss scale already at minimum - cannot decrease scale anymore. Exiting run.
This occurs both with ZeRO Stage 2 + CPU offload and with ZeRO Stage 3 + CPU offload.
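For reference, a minimal sketch of the kind of ZeRO Stage 2 + CPU offload setup this refers to (illustrative values only, not the exact config attached below):

```python
# Illustrative ZeRO-2 + CPU offload + fp16 config (example values, not the actual config).
ds_config = {
    "train_micro_batch_size_per_gpu": "auto",
    "gradient_accumulation_steps": "auto",
    "zero_optimization": {
        "stage": 2,
        "offload_optimizer": {"device": "cpu", "pin_memory": True},
        "overlap_comm": True,
        "contiguous_gradients": True,
    },
    "fp16": {"enabled": "auto", "loss_scale": 0},
}
```

The "auto" values assume the config is consumed through the HF Trainer integration, which fills them in from the TrainingArguments.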
To Reproduce
Expected behavior
Training completes.
ds_report output
System info (please complete the following information):
Launcher context
Launching with DeepSpeed as follows:
Docker context
No Docker
Additional context
DS Config:
Slurm Setup