FSDP 0 loss immediately with LLAMA 2 #1763
here are the settings I'm using in my trainer:
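For context, FSDP through accelerate is typically configured with something like the sketch below; the wrap policy, dtypes, and other values here are illustrative assumptions, not the actual settings referenced above.

```python
# Minimal sketch of an FSDP setup through accelerate's plugin API.
# The wrap policy and mixed-precision dtypes are illustrative assumptions,
# not the trainer settings referenced in this report.
import functools

import torch
from torch.distributed.fsdp import MixedPrecision
from torch.distributed.fsdp.wrap import transformer_auto_wrap_policy
from transformers.models.llama.modeling_llama import LlamaDecoderLayer
from accelerate import Accelerator, FullyShardedDataParallelPlugin

fsdp_plugin = FullyShardedDataParallelPlugin(
    # Wrap each Llama decoder layer as its own FSDP unit.
    auto_wrap_policy=functools.partial(
        transformer_auto_wrap_policy, transformer_layer_cls={LlamaDecoderLayer}
    ),
    mixed_precision_policy=MixedPrecision(
        param_dtype=torch.bfloat16,
        reduce_dtype=torch.bfloat16,
        buffer_dtype=torch.bfloat16,
    ),
)
accelerator = Accelerator(fsdp_plugin=fsdp_plugin)
```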
Additional debugging isolates this to the use case where https://pytorch.org/docs/stable/generated/torch.nn.functional.scaled_dot_product_attention.html is used. The loss behaves as expected when it is not enabled. This appears to be a regression in accelerate, since it was working before.
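To make the "enabled vs. not enabled" distinction concrete, here is a rough sketch of the two attention paths being compared; the shapes and the use_sdpa flag are hypothetical and only illustrate the toggle, not the trainer's actual attention code.

```python
# Sketch of the same causal attention computed with and without
# torch.nn.functional.scaled_dot_product_attention (SDPA).
import math

import torch
import torch.nn.functional as F


def attention(q, k, v, use_sdpa=True):
    # q, k, v: (batch, num_heads, seq_len, head_dim)
    if use_sdpa:
        # Fused-kernel path that the regression was narrowed down to.
        return F.scaled_dot_product_attention(q, k, v, is_causal=True)
    # Equivalent manual path, which reportedly trains with a normal loss.
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))
    causal_mask = torch.triu(
        torch.full(scores.shape[-2:], float("-inf"), device=q.device, dtype=q.dtype),
        diagonal=1,
    )
    return torch.softmax(scores + causal_mask, dim=-1) @ v
```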
cc @pacman100
Hello, a minimal reproducer would be helpful.
I'm able to further narrow down the commit that is causing the regression. SHA
@muellerzr, who worked on PR #1740
@winglian we need a reproducer to test this please, as that PR maintains the same defaults as PyTorch (and the old defaults) and you shouldn't have seen an issue.
It was reproduced by me and one other person using fresh installs of the latest full releases of pytorch, transformers, etc. What is a reproducer?
@teknium1 a reproducer is a minimal chunk of code we can run on our end to recreate the bug, to help us debug and verify we have fixed the issue when making a fix.
For the time being I can't reproduce this on my systems, due to an issue with bitsandbytes 😕 (bitsandbytes-foundation/bitsandbytes#620). So for now it'll have to be: I push some code and we see if it'll run, I'm sorry 🙏 But as a first try, can either @teknium1 or @winglian install accelerate with
That branch didn't seem to work either. Just going to add for posterity the notes I posted on Twitter about this only being an issue with Llama-2. This doesn't seem to be an issue with legacy Llama, as simply changing back to that model doesn't seem to have this problem.
@muellerzr commit SHA. When I switched to using flash attention v2 (still using packed sequences), every commit up to
@winglian again any chance you could share the code you are using for these for me to be able to test against? Otherwise I'm flying blind at finding the problems which makes it rather difficult.
Otherwise, since we may have to do a back and forth, try building via
@muellerzr it's this trainer: https://github.com/openaccess-ai-collective/axolotl on the packing-attn-limit-fa2-rebased branch. I'm using this config: https://gist.github.com/winglian/e803bf5e305d893de3e68b051189f346. My apologies, as this particular "edge case" has a lot of nuance to it, so it's hard to boil it down to a simple reproducing script. I'll give that branch a try shortly and report back.
the autocast-fix branch results in
@muellerzr Is there a reason
That error actually helps a ton, which is excellent :) I'll push some more fixes tomorrow morning for us to try.
@winglian let's try again please :) You should be able to just run the same pip install command
@muellerzr new error
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread. Please note that issues that do not follow the contributing guidelines are likely to be ignored.
System Info
Information
Tasks
- no_trainer script in the examples folder of the transformers repo (such as run_no_trainer_glue.py)
Reproduction
- training with current accelerate HEAD with FSDP results in 0 loss on step 0 (a minimal sketch of such a setup follows below)
- reinstalling only accelerate to 0.21.0 has the expected non-zero training loss
- this likely stems from either #1745 or #1753, which were merged in the past 3 days; I wasn't having this issue on Thursday with accelerate installed from HEAD.
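A minimal sketch of a script exercising this path (Llama-2 trained under FSDP via accelerate, printing the first-step loss). The model name, dummy batch, and hyperparameters are assumptions, and it assumes being launched with `accelerate launch` using an FSDP-enabled accelerate config.

```python
# Minimal sketch of a reproducer for the reported behavior: Llama-2 under FSDP
# via accelerate, printing the loss on the first optimizer step. Model name,
# dummy batch, and learning rate are illustrative assumptions.
# Launch with `accelerate launch` using an FSDP-enabled accelerate config.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from accelerate import Accelerator

accelerator = Accelerator()  # FSDP settings come from the accelerate config

model_name = "meta-llama/Llama-2-7b-hf"  # assumed; any Llama-2 checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)
model, optimizer = accelerator.prepare(model, optimizer)

batch = tokenizer(["FSDP first-step loss check"], return_tensors="pt").to(accelerator.device)
outputs = model(**batch, labels=batch["input_ids"])
accelerator.print(f"step 0 loss: {outputs.loss.item()}")  # reported: 0.0 on HEAD, non-zero on 0.21.0

accelerator.backward(outputs.loss)
optimizer.step()
optimizer.zero_grad()
```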
Expected behavior
training loss not to be 0.0 on the first step