
FSDP 0 loss immediately with LLAMA 2 #1763

Closed · winglian opened this issue Jul 24, 2023 · 21 comments

@winglian (Contributor)

System Info

- `Accelerate` version: 0.22.0.dev0
- Platform: Linux-5.4.0-148-generic-x86_64-with-glibc2.35
- Python version: 3.10.12
- Numpy version: 1.24.4
- PyTorch version (GPU?): 2.0.1+cu117 (True)
- PyTorch XPU available: False
- PyTorch NPU available: False
- System RAM: 1007.76 GB
- GPU type: NVIDIA A100 80GB PCIe
- `Accelerate` default config:
        Not found

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • One of the scripts in the examples/ folder of Accelerate or an officially supported no_trainer script in the examples folder of the transformers repo (such as run_no_trainer_glue.py)
  • My own task or dataset (give details below)

Reproduction

Training with the current accelerate HEAD and FSDP results in 0 loss from step 0:

{'loss': 0.0, 'learning_rate': 3.0000000000000004e-08, 'epoch': 0.0}
{'loss': 0.0, 'learning_rate': 1.2000000000000002e-07, 'epoch': 0.0}
{'loss': 0.0, 'learning_rate': 2.7e-07, 'epoch': 0.0}
  0%|          | 3/6180 [00:11<6:08:58,  3.58s/it]

Reinstalling only accelerate, pinned back to 0.21.0, gives the expected non-zero training loss:

{'loss': 7.4443, 'learning_rate': 3.0000000000000004e-08, 'epoch': 0.0}
{'loss': 7.263, 'learning_rate': 1.2000000000000002e-07, 'epoch': 0.0}
{'loss': 6.6462, 'learning_rate': 2.7e-07, 'epoch': 0.0}

This likely stems from either #1745 or #1753, both merged in the past 3 days; I wasn't having this issue on Thursday with accelerate installed from HEAD.

Expected behavior

Training loss should not be 0.0 on the first step.

@winglian (Contributor, Author)

Here are the settings I'm using in my trainer:

fsdp:
  - full_shard
  - auto_wrap
fsdp_config:
  fsdp_offload_params: true
  fsdp_state_dict_type: FULL_STATE_DICT
  fsdp_transformer_layer_cls_to_wrap: LlamaDecoderLayer

@winglian (Contributor, Author)

Additional debugging isolates this to the case where https://pytorch.org/docs/stable/generated/torch.nn.functional.scaled_dot_product_attention.html is used. The loss behaves as expected when it is not enabled, but this should still be a regression in accelerate, since it was working before.
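For illustration, a minimal sketch of the code path in question, i.e. torch SDPA run inside a bf16 autocast region (the shapes are arbitrary, a CUDA GPU is assumed, and this is not axolotl's actual attention patch):

import torch
import torch.nn.functional as F

# Toy query/key/value tensors: (batch, heads, seq_len, head_dim)
q = torch.randn(1, 32, 128, 64, device="cuda", dtype=torch.bfloat16)
k = torch.randn(1, 32, 128, 64, device="cuda", dtype=torch.bfloat16)
v = torch.randn(1, 32, 128, 64, device="cuda", dtype=torch.bfloat16)

# The combination reported above: SDPA executed under autocast, as it would be
# once accelerate wraps the model forward for mixed precision.
with torch.autocast("cuda", dtype=torch.bfloat16):
    out = F.scaled_dot_product_attention(q, k, v, is_causal=True)

print(out.shape)  # torch.Size([1, 32, 128, 64])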

@sgugger (Collaborator) commented Jul 24, 2023

cc @pacman100

@pacman100 (Contributor)

Hello, a minimal reproducer would be helpful.

@winglian (Contributor, Author)

I was able to narrow down the commit causing the regression further: SHA 2a289f6108e77a77a4efffb3f6316bc98538413b works correctly, and moving to SHA a6291e43b04a243d37146e34715adff28b3733b2 fails (which seems to be PR #1740).

@pacman100 (Contributor) commented Jul 24, 2023

cc @muellerzr, who worked on PR #1740

muellerzr self-assigned this Jul 24, 2023
@muellerzr (Collaborator)

@winglian we need a reproducer to test this please, as that PR maintains the same defaults as PyTorch (and the old defaults) and you shouldn't have seen an issue.

@teknium1

> @winglian we need a reproducer to test this please, as that PR maintains the same defaults as PyTorch (and the old defaults) and you shouldn't have seen an issue.

It was reproduced by me and one other person using fresh installs of the latest full releases of pytorch, transformers, etc.

What is a reproducer?

@muellerzr (Collaborator)

@teknium1 a reproducer is a minimal chunk of code we can run on our end to recreate the bug, so we can debug it and verify that a fix actually resolves the issue.
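For example, a sketch along these lines would qualify (the checkpoint, dataset, and hyperparameters here are placeholders, the FSDP settings simply mirror the ones posted above, and this is not the axolotl setup):

import torch
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

model_name = "meta-llama/Llama-2-7b-hf"  # placeholder checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)

# Any small text dataset is enough to see whether the step-0 loss is 0.0.
dataset = load_dataset("wikitext", "wikitext-2-raw-v1", split="train[:1%]")
dataset = dataset.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=512),
    batched=True,
    remove_columns=dataset.column_names,
)
dataset = dataset.filter(lambda example: len(example["input_ids"]) > 1)

args = TrainingArguments(
    output_dir="fsdp-repro",
    per_device_train_batch_size=1,
    bf16=True,
    logging_steps=1,
    max_steps=5,
    fsdp="full_shard auto_wrap",
    fsdp_config={
        "fsdp_offload_params": True,
        "fsdp_state_dict_type": "FULL_STATE_DICT",
        "fsdp_transformer_layer_cls_to_wrap": "LlamaDecoderLayer",
    },
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()  # with the regression, the logged loss is 0.0 from the first step

Launched on two or more GPUs with e.g. accelerate launch repro.py or torchrun --nproc_per_node=2 repro.py, the first few logged losses show whether the regression is present.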

@muellerzr (Collaborator) commented Jul 27, 2023

For the time being I can't reproduce this on my systems, due to an issue with bitsandbytes 😕 (bitsandbytes-foundation/bitsandbytes#620)

So for now it'll have to be a matter of me pushing some code and us seeing if it runs, I'm sorry 🙏

But as a first try, can either @teknium1 or @winglian install accelerate with pip install git+https://github.com/huggingface/accelerate@autocast and see if the solution in it works?

@winglian (Contributor, Author) commented Jul 28, 2023

That branch didn't seem to work either. Just adding for posterity the notes I posted on Twitter: this only appears to be an issue with Llama-2. It doesn't seem to affect legacy Llama, as simply switching back to that model makes the problem go away.

muellerzr changed the title from "FSDP 0 loss immediately" to "FSDP 0 loss immediately with LLAMA 2" on Aug 3, 2023
@winglian (Contributor, Author) commented Aug 8, 2023

@muellerzr commit SHA a6291e43b04a243d37146e34715adff28b3733b2 is definitely problematic on several fronts. I've also narrowed the earlier behavior down to using packed sequences with torch's scaled dot product attention.

When I switched to flash attention v2 (still with packed sequences), every commit up to HEAD is less performant than 2a289f6108e77a77a4efffb3f6316bc98538413b, and stepping backwards through the commit history again shows a6291e43b04a243d37146e34715adff28b3733b2 is the culprit. In fact, for the exact same workload, that commit OOMs after two steps on the backward pass (and is also slower). I can prevent it from OOMing by setting PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:2048 (I tried values between 512 and 2048), but that makes performance slower still.

[Screenshot 2023-08-08 at 9:39:51 AM]
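For reference, a minimal sketch of applying the allocator setting mentioned above (the value 2048 is just the one tried here; the variable has to be set before the process initializes CUDA, so exporting it in the launch shell is equivalent):

import os

# Must run before the CUDA caching allocator is initialized,
# i.e. before the first CUDA tensor is created.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:2048"

import torch  # imported after setting the env var on purpose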

@muellerzr (Collaborator)

@winglian again, any chance you could share the code you are using so I can test against it? Otherwise I'm flying blind trying to find the problem, which makes it rather difficult.

@muellerzr (Collaborator) commented Aug 8, 2023

Otherwise, since we may have to do some back and forth, try building via pip install git+https://github.com/huggingface/accelerate@autocast-fix, and I'll ping in here when to rebuild (on further commits beyond the two I just pushed).

@winglian (Contributor, Author) commented Aug 8, 2023

@muellerzr it's this trainer, https://github.com/openaccess-ai-collective/axolotl, on the packing-attn-limit-fa2-rebased branch. I'm using this config: https://gist.github.com/winglian/e803bf5e305d893de3e68b051189f346. My apologies; this particular "edge case" has a lot of nuance to it, so it's hard to boil it down to a simple reproducing script.

I'll give that branch a try shortly and report back.

@winglian (Contributor, Author) commented Aug 8, 2023

The autocast-fix branch results in:

  File "/workspace/axolotl/scripts/finetune.py", line 361, in train                                                                                                                                                                                                            
    trainer.train(resume_from_checkpoint=resume_from_checkpoint)
  File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/transformers/trainer.py", line 1531, in train
    return inner_training_loop(
  File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/transformers/trainer.py", line 1641, in _inner_training_loop
    self.model = self.accelerator.prepare(self.model)
  File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/accelerate/accelerator.py", line 1271, in prepare          
    result = tuple(         
  File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/accelerate/accelerator.py", line 1272, in <genexpr>            
    self._prepare_one(obj, first_pass=True, device_placement=d) for obj, d in zip(args, device_placement)
  File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/accelerate/accelerator.py", line 1084, in _prepare_one
    return self.prepare_model(obj, device_placement=device_placement)
  File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/accelerate/accelerator.py", line 1371, in prepare_model  
    new_forward = autocast_context(model_forward_func)                                                                                 
TypeError: 'nullcontext' object is not callable                                                                                        
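For context on that error (an illustration only, not accelerate's actual code): torch.autocast objects implement __call__ and can wrap a function directly, whereas contextlib.nullcontext objects cannot, which is exactly the TypeError in the traceback above.

import contextlib

import torch


def forward(x):
    return x * 2


# torch.autocast instances double as decorators, so this works:
wrapped = torch.autocast("cuda", dtype=torch.bfloat16)(forward)

# nullcontext instances are context managers only, so calling one fails:
try:
    contextlib.nullcontext()(forward)
except TypeError as err:
    print(err)  # 'nullcontext' object is not callable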

@winglian (Contributor, Author) commented Aug 8, 2023

@muellerzr Is there a reason DistributedType.FSDP should not be included in this clause: https://github.com/huggingface/accelerate/blob/autocast-fix/src/accelerate/utils/modeling.py#L1430-L1435 ?
Adding it there seems to have brought some parity with Flash Attention 2 as far as speed goes, but the losses are incorrect compared to the prior commit, so I guess that answers my own question :D

@muellerzr (Collaborator)

That error actually helps a ton, which is excellent :) I'll push some more fixes tomorrow morning for us to try.

@muellerzr (Collaborator)

@winglian let's try again please :) You should be able to just run the same pip install command

@winglian (Contributor, Author)

@muellerzr new error:

  File "/workspace/axolotl/scripts/finetune.py", line 361, in train
    trainer.train(resume_from_checkpoint=resume_from_checkpoint)
  File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/transformers/trainer.py", line 1531, in train
    return inner_training_loop(
  File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/transformers/trainer.py", line 1641, in _inner_training_loop
    self.model = self.accelerator.prepare(self.model)
  File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/accelerate/accelerator.py", line 1271, in prepare
    result = tuple(       
  File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/accelerate/accelerator.py", line 1272, in <genexpr>
    self._prepare_one(obj, first_pass=True, device_placement=d) for obj, d in zip(args, device_placement)
  File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/accelerate/accelerator.py", line 1084, in _prepare_one
    return self.prepare_model(obj, device_placement=device_placement)                     
  File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/accelerate/accelerator.py", line 1373, in prepare_model
    model.forward = MethodType(new_forward, model)
TypeError: first argument must be callable
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 1123693) of binary: /root/miniconda3/envs/py3.10/bin/python
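Same kind of illustration for this one: types.MethodType requires a callable first argument, so if new_forward ends up as something non-callable, binding it to the model raises exactly this TypeError.

from types import MethodType


class DummyModel:
    pass


model = DummyModel()

try:
    # new_forward being non-callable (e.g. None) reproduces the error above.
    model.forward = MethodType(None, model)
except TypeError as err:
    print(err)  # first argument must be callable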

@github-actions (bot) commented Sep 4, 2023

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.
