Accelerate test fails: Exception: Could not find the transformer layer class to wrap in the model #2872
Comments
@muellerzr I saw that
Maybe it is good to remove it. I ran through the code and it seems like that. But then, the issue for

Might be good to fix this, as running
Thanks for this detailed report. Debugging this type of issue can be really difficult, props for trying out a bunch of different things.

At first glance, I can't spot any obvious mistakes in your script. The model repr also looks good. From my own testing (same model and LoRA config), I can tell that the PEFT examples for FSDP QLoRA that you linked work for me (they use

One difference that I spotted is that I used

One idea to debug this issue further: could you edit your local accelerate code in this line and print the values for
Thanks for your response! I have just tested the Huggingface Bitsandbytes Guide. This script seems to work (and wow, it is much better than mine, haha). I use Llama 3 8B and everything can be saved locally, which is where my script fails, sending a callback to

I think I have found where the issue is in my script. I believe it's in PEFT: for some reason the model isn't seen as a PEFT model. When I use

This causes my script to error on this line in the trainer.py file. For some reason my model, which is given to the SFTTrainer alongside the LoraConfig, isn't truly a PEFT model. I have added this snippet in the train.py file of the guide:
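The check was along these lines (a sketch from memory rather than the verbatim snippet; it assumes the trainer object created in the guide's train.py):

```python
# Sketch of the kind of check I added (not the verbatim snippet):
# is the model the trainer ends up with actually wrapped as a PEFT model?
from peft import PeftModel

print(type(trainer.model))
print("is a PEFT model:", isinstance(trainer.model, PeftModel))
```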
This returns True!! So it seems that because my model isn't a PEFT model, it can't be saved. I'm not sure yet why my model isn't a PEFT model, but I'm guessing this has something to do with how I instantiate the model in my script:
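It was roughly of this shape (a reconstruction rather than my exact code; the quantization arguments are the ones from the guides, and the device_map value shown here is an assumption):

```python
# Rough reconstruction of the instantiation in my script (not the exact code).
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_quant_storage=torch.bfloat16,  # the guides set this for FSDP + QLoRA
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B",
    quantization_config=bnb_config,
    torch_dtype=torch.bfloat16,
    device_map="auto",        # the device_map I mention below
    low_cpu_mem_usage=True,
)
```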
I'm guessing that it's the device_map, or more specifically the
I had encountered this before and decided to prevent this ValueError by using this and this. I find it odd that
When I compared it to the nvidia-smi output of the Huggingface guide, I saw that the memory was perfectly divided across all GPUs; i.e. every GPU had the exact same amount of memory usage. This could of course also be something else in my script that is set on GPU 0.

Concerning the accelerate test issue:
Glad that you got it to work.
I'm confused: first you reference the bnb guide, but then you mention the PEFT guide. Which one is the one that worked? This might be of interest for future users who encounter the same issue.
How did you test that? The model is not a PEFT model after you call
And when you run the same snippet in your script, does it return
Hmm, I can't really help you with that, hopefully @muellerzr can tell us more once he's back at work.
Don't worry too much about that. As long as you get your model to train, we're good.
Hi! I've added the print statements on the line you asked about. The print statements are as follows:
With the following output when I run
So, it seems that the list is empty and therefore
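For completeness, the same lookup that accelerate performs when building the FSDP auto-wrap policy can be reproduced outside of its internals with something like this (a sketch; it assumes accelerate's get_module_class_from_name helper is importable as shown, and uses the model I'm fine-tuning):

```python
# Reproduce the layer-class lookup that accelerate does for the FSDP auto-wrap
# policy (sketch; assumes get_module_class_from_name is importable as below).
from accelerate.utils import get_module_class_from_name
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3-8B")

for name in ["LlamaDecoderLayer", "LlamaMLP", "LlamaDecoder"]:
    cls = get_module_class_from_name(model, name)
    print(name, "->", cls)  # None means the class can't be found in the model
```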
I used the bitsandbytes guide, which actually uses the PEFT example repo. I added a print statement here in train.py:
This is the example from the PEFT repo which the BNB guide references. In my own script I added the following print statement:
This also prints True. I made a small mistake where I checked the status of

I'm a bit lost as to what goes wrong when saving the model. For some reason it seems that trainer.model isn't recognised as a PEFT model, which is very contradictory, as the

This causes the ValueError to be raised here.
Here I am again, haha! But I have good news: I found the issue with my training script. When I removed the wandb reporting from my
Then I am able to save the model!! (Roughly what that change amounts to is sketched at the end of this comment.) Let me show you the issue; here is the full traceback:
You can see that trainer_callback.py is called when the model is done training:
Then, eventually, integration_utils.py is called and a fake_trainer is created here. But as you can see in this code snippet, the model is set to None, because the model isn't given as an argument in
This brings us back to the trainer.py file. Because the model is set to None, the following if statement will be true and the ValueError will be raised:
I'm not too sure, as I haven't debugged the issue yet, but I think it could be fixed if we just pass model in

This also makes sense as to why the guides work, because none of them use wandb! It could also be a good idea to mention this in the guides, but fixing it would be even better :)! Let me know what you think of this issue and my solution; we could open a PR. Please also let me know what we need to do with the
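For reference, the change on my side boiled down to something like this (a sketch, not my literal diff; report_to is the TrainingArguments field that controls which integrations are enabled):

```python
# Sketch of the change that made saving work for me (not my literal diff).
from transformers import TrainingArguments

# Before: wandb reporting enabled -- this is the setup that failed at save time.
args_with_wandb = TrainingArguments(output_dir="out", report_to=["wandb"])

# After: wandb reporting removed -- with this, saving the model works.
args_without_wandb = TrainingArguments(output_dir="out", report_to="none")
```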
Oh I see, I didn't know that.
Thanks a lot for this detailed investigation. I agree that this is certainly not the intended behavior. I'm not an expert on

As you suggest, one fix could be to ensure that the model is passed correctly to the fake trainer instance, so that the checks in
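In the meantime, a possible workaround if you want to keep wandb metric logging: as far as I can tell, the model-artifact upload that goes through the fake trainer is only triggered when the WANDB_LOG_MODEL environment variable is set to "end" or "checkpoint", so making sure it is unset should skip that code path. A minimal sketch:

```python
# Workaround sketch: keep report_to="wandb" for metrics, but make sure the
# model-artifact upload (gated on WANDB_LOG_MODEL being "end" or "checkpoint")
# is disabled, so the fake trainer is never constructed.
import os

os.environ.pop("WANDB_LOG_MODEL", None)  # ensure the upload path stays off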
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread. Please note that issues that do not follow the contributing guidelines are likely to be ignored.
System Info
Information
Tasks
no_trainer script in the examples folder of the transformers repo (such as run_no_trainer_glue.py)

Reproduction
Packages
Code
Expected behavior
tl;dr
I'm encountering an error when running accelerate test. This happens when running accelerate test with and without setting fsdp_transformer_layer_cls_to_wrap to LlamaDecoder, LlamaMLP.

The same issue seems to happen when I run my training script:

When I set fsdp_transformer_layer_cls_to_wrap to LlamaDecoder, LlamaMLP, I get the same error as when running accelerate test: Exception: Could not find the transformer layer class to wrap in the model.

When I remove fsdp_transformer_layer_cls_to_wrap from the config file, I get a different error when running my training script: ValueError: You cannot perform fine-tuning on purely quantized models. Please attach trainable adapters on top of the quantized model to correctly perform fine-tuning. Please see: https://huggingface.co/docs/transformers/peft for more details
Introduction
Hi!
I'm trying to fine-tune Llama 3 8B on a summarization dataset of about 1500 instances. The dataset contains long documents, often over 8K tokens. I want to use FSDP + QLoRA to try and fine-tune Llama 3 8B.
I'm following these guides as inspiration:
bitsandbytes Guide
Phil Schmid Guide
Huggingface Accelerate Guide
Huggingface PEFT Guide
Huggingface Bitsandbytes Guide
First issue:
At first, I didn't set fsdp_transformer_layer_cls_to_wrap, as it isn't defined in the mentioned guides. With my training script I am able to start training, but then I'm unable to save the model or push it to the hub after training. I have had a full training run before and could see the loss decreasing.

The YAML file looks as follows in this scenario:
When running:

I get the following output:

There is a warning, a print statement, and finally the ValueError.
The warning is repeated many times and seems to come from here.
It seems like the model isn't wrapped properly or that something is wrong with the quantization, as the ValueError indicates. But I don't think it has anything to do with quantization because when I print my model I see the following:
Second issue:
I couldn't really figure out what this issue was, so after a while I decided to check whether my accelerate config was set up properly by running:
And I get the following error:
I thought the issue was that I didn't set fsdp_transformer_layer_cls_to_wrap. So after checking out #1947 and tatsu-lab/stanford_alpaca#58, I figured that this was the issue and set fsdp_transformer_layer_cls_to_wrap to LlamaDecoderLayer, LlamaMLP (like in the system info above). I thought this would solve the issue, but when I run:

This error occurs:
This is again the same issue that occurs when running accelerate test, with and without setting fsdp_transformer_layer_cls_to_wrap.

As you can see, the issue has something to do with the transformer layer classes not being found.
I think this has something to do with how the model is wrapped in FSDP, but I'm not sure what is going on.
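For reference, one way to double-check which layer class names actually exist in the model, and therefore which values make sense for fsdp_transformer_layer_cls_to_wrap (a sketch; the model id is the one I'm fine-tuning):

```python
# List the module class names present in the model, to verify candidate values
# for fsdp_transformer_layer_cls_to_wrap (sketch).
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3-8B")
layer_classes = sorted({module.__class__.__name__ for module in model.modules()})
print(layer_classes)  # for Llama this includes "LlamaDecoderLayer" and "LlamaMLP"
```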
Please let me know if you have any ideas, as I'm getting stuck as of now! I also find it quite odd that accelerate test doesn't work either with or without setting fsdp_transformer_layer_cls_to_wrap.