New changes in PEFT break FSDP (RLHF) #1836
for whoever looks into this, the error is:
we usually run
also, here's the transformers diff from the commit that works: huggingface/transformers@0fdea86...v4.43.1
@maziyarpanahi just to clarify, does
also, did you reinstall transformers to whatever version is in requirements.txt at the axolotl commit that works?
I tried reverting axolotl to the git sha you pointed out as well as downgrading transformers, and Qwen2 still doesn't RL finetune. If you can provide more info on how to reproduce a working finetune on Qwen2, I think I can track down from there what is wrong.
Not just transformers; I checked out
that's strange. I launch a new RunPod instance and just run this script to make the DPO work:
everything was first downgraded to:
accelerate-0.32.0, axolotl-0.4.1, bitsandbytes-0.43.1, datasets-2.19.1, deepspeed-0.14.3+bc48371c, flash-attn-2.6.1, fsspec-2024.3.1, gcsfs-2024.3.1, peft-0.11.1, s3fs-2024.3.1, scikit-learn-1.2.2, transformers-4.43.1
then each package was upgraded individually and tested: peft@39c60ffca9c1d1cc606a16654cfe9cd66b363a70 ❌
@BenjaminBossan it seems huggingface/peft#1742 is the problematic changeset that breaks qlora+DPO+FSDP. Also, I checked regular bf16 lora+DPO+FSDP and that works fine, so it's only a 4-bit quantization issue with peft.
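For anyone repeating this bisection, here's a minimal sketch (not from the thread; assumes Python 3.8+ for importlib.metadata) to confirm the environment actually matches the pinned versions before each test run. The version map just mirrors the downgraded set above:

```python
# Sketch: verify that the environment matches the pinned versions
# before each bisection run (subset of the packages listed above).
from importlib.metadata import PackageNotFoundError, version

PINNED = {
    "accelerate": "0.32.0",
    "bitsandbytes": "0.43.1",
    "datasets": "2.19.1",
    "deepspeed": "0.14.3+bc48371c",
    "peft": "0.11.1",
    "transformers": "4.43.1",
}

for pkg, want in PINNED.items():
    try:
        got = version(pkg)
    except PackageNotFoundError:
        got = "<not installed>"
    print(f"{pkg}: want {want}, got {got}", "OK" if got == want else "MISMATCH")
```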
adding this to the end of the method in question: `self.to(device, dtype)`
@maziyarpanahi in the meantime, you can simply git clone the peft repo, uninstall peft, and then do a
Also, I believe this issue is specific to the Qwen2 models, as Llama didn't have this issue.
Thanks for making me aware of this issue. This should not happen, so let's do our best to resolve it. I cannot spin up a machine as suggested for reproducing the issue, so I tried something with my local setup of 2 GPUs. I basically took the peft/examples/sft script and made some changes to use
@winglian what is the bnb config that is effectively used when using this axolotl example?
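For reference, and as an assumption only (the actual config axolotl builds is exactly what the question above asks for), a QLoRA+FSDP quantization config typically looks like the following; `bnb_4bit_quant_storage` is the parameter that matters most under FSDP, since FSDP can only flatten and shard parameters of a uniform storage dtype:

```python
# Hypothetical QLoRA+FSDP BitsAndBytesConfig -- not confirmed to be what
# axolotl uses for this example.
import torch
from transformers import BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
    # FSDP shards flat parameters of one dtype, so QLoRA+FSDP setups usually
    # store the 4-bit weights in the same dtype as the compute dtype.
    bnb_4bit_quant_storage=torch.bfloat16,
)
```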
To give more context, the idea of that PR was to allow having multiple, say, LoRA adapters that reside on different devices. Before that, we would just move the whole module to the device (and potentially dtype) of the base model. The change you suggested would effectively restore that behavior. My suspicion of what must be happening is that some part of the Qwen module is not converted anymore after the change.
@winglian Would it be possible for you to store all the devices and dtypes from before and after your additional line and check if they're different, in which case they get printed? So something like this:

```python
# detect if some parameters have not yet been moved to the correct dtype/device
dtypes_before = [(n, p.dtype) for n, p in self.named_parameters()]
devices_before = [(n, p.device) for n, p in self.named_parameters()]
self.to(device, dtype)  # <= the line you added
dtypes_after = [(n, p.dtype) for n, p in self.named_parameters()]
devices_after = [(n, p.device) for n, p in self.named_parameters()]
if (dtypes_before != dtypes_after) or (devices_before != devices_after):
    print(self)
    print("dtypes before", dtypes_before)
    print("dtypes after", dtypes_after)
    print("devices before", devices_before)
    print("devices after", devices_after)
    1/0  # raise an error to exit
```

For me, this does not detect any change and thus passes.
Edit: Made some changes to the snippet to also show the param names.
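A standalone variant of the same check (a hypothetical helper, not from the thread) can be run on an already-loaded model: it reports any parameter whose dtype or device deviates from the most common one, which is the kind of stray module suspected here for Qwen2:

```python
# Hypothetical helper: flag parameters whose dtype/device deviates from the
# most common one in the model -- the symptom suspected for the Qwen2 modules.
from collections import Counter

def report_outlier_params(model):
    dtypes = Counter(p.dtype for p in model.parameters())
    devices = Counter(p.device for p in model.parameters())
    common_dtype = dtypes.most_common(1)[0][0]
    common_device = devices.most_common(1)[0][0]
    for name, param in model.named_parameters():
        if param.dtype != common_dtype or param.device != common_device:
            print(f"{name}: dtype={param.dtype}, device={param.device}")
```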
looks like the
Please check that this issue hasn't been reported before.
Expected Behavior
FSDP is supported and works properly with DPO up to the 608a2f3 commit tag. However, after this, it crashes.
Current behaviour
With the latest changes on main, it takes a very long time before reaching the `Map` step on the dataset. Then it crashes.
Steps to reproduce
Use the winglian/axolotl-runpod:main-latest template.
Config yaml
Possible solution
Must check out prior to the change that broke FSDP: `git checkout 608a2f3`
Which Operating Systems are you using?
Python Version
3.10
axolotl branch-commit
e299312
Acknowledgements