
FSDP 0 loss immediately with LLAMA 2 #1763

Closed · winglian opened this issue Jul 24, 2023 · 21 comments

@winglian (Contributor)

System Info

- `Accelerate` version: 0.22.0.dev0
- Platform: Linux-5.4.0-148-generic-x86_64-with-glibc2.35
- Python version: 3.10.12
- Numpy version: 1.24.4
- PyTorch version (GPU?): 2.0.1+cu117 (True)
- PyTorch XPU available: False
- PyTorch NPU available: False
- System RAM: 1007.76 GB
- GPU type: NVIDIA A100 80GB PCIe
- `Accelerate` default config:
        Not found

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • One of the scripts in the examples/ folder of Accelerate or an officially supported no_trainer script in the examples folder of the transformers repo (such as run_no_trainer_glue.py)
  • My own task or dataset (give details below)

Reproduction

Training with the current accelerate HEAD and FSDP results in 0 loss from step 0:

{'loss': 0.0, 'learning_rate': 3.0000000000000004e-08, 'epoch': 0.0}
{'loss': 0.0, 'learning_rate': 1.2000000000000002e-07, 'epoch': 0.0}
{'loss': 0.0, 'learning_rate': 2.7e-07, 'epoch': 0.0}
  0%|          | 3/6180 [00:11<6:08:58,  3.58s/it]

Reinstalling only accelerate, pinned back to 0.21.0, gives the expected non-zero training loss:

{'loss': 7.4443, 'learning_rate': 3.0000000000000004e-08, 'epoch': 0.0}
{'loss': 7.263, 'learning_rate': 1.2000000000000002e-07, 'epoch': 0.0}
{'loss': 6.6462, 'learning_rate': 2.7e-07, 'epoch': 0.0}

This likely stems from either #1745 or #1753, both merged in the past 3 days; I wasn't having this issue on Thursday with accelerate installed from HEAD.

Expected behavior

Training loss should not be 0.0 on the first step.

@winglian (Contributor, Author)

Here are the settings I'm using in my trainer:

fsdp:
  - full_shard
  - auto_wrap
fsdp_config:
  fsdp_offload_params: true
  fsdp_state_dict_type: FULL_STATE_DICT
  fsdp_transformer_layer_cls_to_wrap: LlamaDecoderLayer

@winglian (Contributor, Author)

Additional debugging isolates this to the case where https://pytorch.org/docs/stable/generated/torch.nn.functional.scaled_dot_product_attention.html is used. The loss behaves as expected when it is not enabled, but this should still be a regression in accelerate, since it was working before.
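For illustration, a minimal sketch of the code path in question, i.e. torch SDPA run inside a bf16 autocast region (the shapes are arbitrary, a CUDA GPU is assumed, and this is not axolotl's actual attention patch):

import torch
import torch.nn.functional as F

# Toy query/key/value tensors: (batch, heads, seq_len, head_dim)
q = torch.randn(1, 32, 128, 64, device="cuda", dtype=torch.bfloat16)
k = torch.randn(1, 32, 128, 64, device="cuda", dtype=torch.bfloat16)
v = torch.randn(1, 32, 128, 64, device="cuda", dtype=torch.bfloat16)

# The combination reported above: SDPA executed under autocast, as it would be
# once accelerate wraps the model forward for mixed precision.
with torch.autocast("cuda", dtype=torch.bfloat16):
    out = F.scaled_dot_product_attention(q, k, v, is_causal=True)

print(out.shape)  # torch.Size([1, 32, 128, 64])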

@sgugger (Collaborator) commented Jul 24, 2023

cc @pacman100

@pacman100 (Contributor)

Hello, a minimal reproducer would be helpful.

@winglian (Contributor, Author)

I was able to narrow down the commit causing the regression further: SHA 2a289f6108e77a77a4efffb3f6316bc98538413b works correctly, and moving to SHA a6291e43b04a243d37146e34715adff28b3733b2 fails (which seems to be PR #1740).

@pacman100 (Contributor) commented Jul 24, 2023

cc @muellerzr, who worked on PR #1740

muellerzr self-assigned this Jul 24, 2023
@muellerzr (Collaborator)

@winglian we need a reproducer to test this please, as that PR maintains the same defaults as PyTorch (and the old defaults) and you shouldn't have seen an issue.

@teknium1

> @winglian we need a reproducer to test this please, as that PR maintains the same defaults as PyTorch (and the old defaults) and you shouldn't have seen an issue.

It was reproduced by me and one other person using fresh installs of the latest full releases of pytorch, transformers, etc.

What is a reproducer?

@muellerzr (Collaborator)

@teknium1 a reproducer is a minimal chunk of code we can run on our end to recreate the bug, so we can debug it and verify that a fix actually resolves the issue.
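For example, a sketch along these lines would qualify (the checkpoint, dataset, and hyperparameters here are placeholders, the FSDP settings simply mirror the ones posted above, and this is not the axolotl setup):

import torch
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

model_name = "meta-llama/Llama-2-7b-hf"  # placeholder checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)

# Any small text dataset is enough to see whether the step-0 loss is 0.0.
dataset = load_dataset("wikitext", "wikitext-2-raw-v1", split="train[:1%]")
dataset = dataset.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=512),
    batched=True,
    remove_columns=dataset.column_names,
)
dataset = dataset.filter(lambda example: len(example["input_ids"]) > 1)

args = TrainingArguments(
    output_dir="fsdp-repro",
    per_device_train_batch_size=1,
    bf16=True,
    logging_steps=1,
    max_steps=5,
    fsdp="full_shard auto_wrap",
    fsdp_config={
        "fsdp_offload_params": True,
        "fsdp_state_dict_type": "FULL_STATE_DICT",
        "fsdp_transformer_layer_cls_to_wrap": "LlamaDecoderLayer",
    },
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()  # with the regression, the logged loss is 0.0 from the first step

Launched on two or more GPUs with e.g. accelerate launch repro.py or torchrun --nproc_per_node=2 repro.py, the first few logged losses show whether the regression is present.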

@muellerzr (Collaborator) commented Jul 27, 2023

For the time being I can't reproduce this on my systems, due to an issue with bitsandbytes 😕 (bitsandbytes-foundation/bitsandbytes#620)

So for now it'll have to be a matter of me pushing some code and us seeing if it runs, I'm sorry 🙏

But as a first try, can either @teknium1 or @winglian install accelerate with pip install git+https://github.com/huggingface/accelerate@autocast and see if the solution in it works?

@winglian (Contributor, Author) commented Jul 28, 2023

That branch didn't seem to work either. Just adding for posterity the notes I posted on Twitter: this only appears to be an issue with Llama-2. It doesn't seem to affect legacy Llama, as simply switching back to that model makes the problem go away.

muellerzr changed the title from "FSDP 0 loss immediately" to "FSDP 0 loss immediately with LLAMA 2" on Aug 3, 2023
@winglian (Contributor, Author) commented Aug 8, 2023

@muellerzr commit SHA a6291e43b04a243d37146e34715adff28b3733b2 is definitely problematic on several fronts. I've also narrowed the earlier behavior down to using packed sequences with torch's scaled dot product attention.

When I switched to flash attention v2 (still with packed sequences), every commit up to HEAD is less performant than 2a289f6108e77a77a4efffb3f6316bc98538413b, and stepping backwards through the commit history again shows a6291e43b04a243d37146e34715adff28b3733b2 is the culprit. In fact, for the exact same workload, that commit OOMs after two steps on the backward pass (and is also slower). I can prevent it from OOMing by setting PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:2048 (I tried values between 512 and 2048), but that makes performance slower still.

[Screenshot 2023-08-08 at 9:39:51 AM]
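For reference, a minimal sketch of applying the allocator setting mentioned above (the value 2048 is just the one tried here; the variable has to be set before the process initializes CUDA, so exporting it in the launch shell is equivalent):

import os

# Must run before the CUDA caching allocator is initialized,
# i.e. before the first CUDA tensor is created.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:2048"

import torch  # imported after setting the env var on purpose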

@muellerzr (Collaborator)

@winglian again, any chance you could share the code you are using so I can test against it? Otherwise I'm flying blind trying to find the problem, which makes it rather difficult.

@muellerzr (Collaborator) commented Aug 8, 2023

Otherwise, since we may have to do some back and forth, try building via pip install git+https://github.com/huggingface/accelerate@autocast-fix, and I'll ping in here when to rebuild (on further commits beyond the two I just pushed).

@winglian (Contributor, Author) commented Aug 8, 2023

@muellerzr it's this trainer, https://github.com/openaccess-ai-collective/axolotl, on the packing-attn-limit-fa2-rebased branch. I'm using this config: https://gist.github.com/winglian/e803bf5e305d893de3e68b051189f346. My apologies; this particular "edge case" has a lot of nuance to it, so it's hard to boil it down to a simple reproducing script.

I'll give that branch a try shortly and report back.

@winglian (Contributor, Author) commented Aug 8, 2023

The autocast-fix branch results in:

  File "/workspace/axolotl/scripts/finetune.py", line 361, in train                                                                                                                                                                                                            
    trainer.train(resume_from_checkpoint=resume_from_checkpoint)
  File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/transformers/trainer.py", line 1531, in train
    return inner_training_loop(
  File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/transformers/trainer.py", line 1641, in _inner_training_loop
    self.model = self.accelerator.prepare(self.model)
  File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/accelerate/accelerator.py", line 1271, in prepare          
    result = tuple(         
  File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/accelerate/accelerator.py", line 1272, in <genexpr>            
    self._prepare_one(obj, first_pass=True, device_placement=d) for obj, d in zip(args, device_placement)
  File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/accelerate/accelerator.py", line 1084, in _prepare_one
    return self.prepare_model(obj, device_placement=device_placement)
  File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/accelerate/accelerator.py", line 1371, in prepare_model  
    new_forward = autocast_context(model_forward_func)                                                                                 
TypeError: 'nullcontext' object is not callable                                                                                        
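For context on that error (an illustration only, not accelerate's actual code): torch.autocast objects implement __call__ and can wrap a function directly, whereas contextlib.nullcontext objects cannot, which is exactly the TypeError in the traceback above.

import contextlib

import torch


def forward(x):
    return x * 2


# torch.autocast instances double as decorators, so this works:
wrapped = torch.autocast("cuda", dtype=torch.bfloat16)(forward)

# nullcontext instances are context managers only, so calling one fails:
try:
    contextlib.nullcontext()(forward)
except TypeError as err:
    print(err)  # 'nullcontext' object is not callable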

@winglian (Contributor, Author) commented Aug 8, 2023

@muellerzr Is there a reason DistributedType.FSDP should not be included in this clause: https://github.com/huggingface/accelerate/blob/autocast-fix/src/accelerate/utils/modeling.py#L1430-L1435 ?
Adding it there seems to have brought some parity with Flash Attention 2 as far as speed goes, but the losses are incorrect compared to the prior commit, so I guess that answers my own question :D

@muellerzr (Collaborator)

That error actually helps a ton, which is excellent :) I'll push some more fixes tomorrow morning for us to try.

@muellerzr (Collaborator)

@winglian let's try again please :) You should be able to just run the same pip install command

@winglian (Contributor, Author)

@muellerzr new error:

  File "/workspace/axolotl/scripts/finetune.py", line 361, in train
    trainer.train(resume_from_checkpoint=resume_from_checkpoint)
  File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/transformers/trainer.py", line 1531, in train
    return inner_training_loop(
  File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/transformers/trainer.py", line 1641, in _inner_training_loop
    self.model = self.accelerator.prepare(self.model)
  File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/accelerate/accelerator.py", line 1271, in prepare
    result = tuple(       
  File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/accelerate/accelerator.py", line 1272, in <genexpr>
    self._prepare_one(obj, first_pass=True, device_placement=d) for obj, d in zip(args, device_placement)
  File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/accelerate/accelerator.py", line 1084, in _prepare_one
    return self.prepare_model(obj, device_placement=device_placement)                     
  File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/accelerate/accelerator.py", line 1373, in prepare_model
    model.forward = MethodType(new_forward, model)
TypeError: first argument must be callable
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 1123693) of binary: /root/miniconda3/envs/py3.10/bin/python
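Same kind of illustration for this one: types.MethodType requires a callable first argument, so if new_forward ends up as something non-callable, binding it to the model raises exactly this TypeError.

from types import MethodType


class DummyModel:
    pass


model = DummyModel()

try:
    # new_forward being non-callable (e.g. None) reproduces the error above.
    model.forward = MethodType(None, model)
except TypeError as err:
    print(err)  # first argument must be callable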

@github-actions (bot) commented Sep 4, 2023

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.
