
Adapter support for GPTNeoX #521

Open
ajesujoba opened this issue Mar 17, 2023 · 6 comments · May be fixed by #523
Labels
enhancement New feature or request

Comments


ajesujoba commented Mar 17, 2023

I have implemented adapter support for GPTNeoX following the instructions in the documentation. It passed all tests, but while training a language adapter, the prediction head was trained as well. Do you by chance have an idea why this is happening? Should I open a PR?

@ajesujoba ajesujoba added the question Further information is requested label Mar 17, 2023
calpt (Member) commented Mar 20, 2023

Hey @ajesujoba, this sounds great; it would be awesome to have GPTNeoX support integrated into the library, so feel free to open a PR!

Regarding your question on language adapter training, could you add some more context on what you observed and which behavior you expected (ideally with a code snippet)? Thank you!

@ajesujoba ajesujoba linked a pull request Mar 21, 2023 that will close this issue
ajesujoba (Author) commented Mar 21, 2023

Thanks for your response @calpt. I have opened a PR.
I was trying to train a German language adapter with the implemented GPTNeoX model using the script below:

LANG="de"
python run_clm.py \
        --model_name_or_path EleutherAI/pythia-70m \
        --train_file $DATADIR/train.txt \
        --validation_file  $DATADIR/dev.txt \
        --output_dir $OUTDIR/$LANG \
        --do_train \
        --do_eval \
        --per_device_train_batch_size 16 \
        --per_device_eval_batch_size 8 \
        --gradient_accumulation_steps 1 \
        --learning_rate 5e-5 \
        --max_steps 500 \
        --num_train_epochs 25 \
        --save_steps 10000000 \
        --overwrite_output_dir \
        --train_adapter \
        --adapter_config pfeiffer+inv \
        --evaluation_strategy steps \
        --eval_steps 1000000 \
        --load_best_model_at_end \
        --save_total_limit 1

The script ran successfully, but instead of training just the adapters, it trained both the adapter modules and the CLM head. The total number of trainable parameters was 26,087,360 instead of just 331,712:

[INFO|trainer.py:1650] 2023-03-21 18:13:53,209 >> ***** Running training *****
[INFO|trainer.py:1651] 2023-03-21 18:13:53,209 >>   Num examples = 386
[INFO|trainer.py:1652] 2023-03-21 18:13:53,209 >>   Num Epochs = 20
[INFO|trainer.py:1653] 2023-03-21 18:13:53,209 >>   Instantaneous batch size per device = 16
[INFO|trainer.py:1654] 2023-03-21 18:13:53,209 >>   Total train batch size (w. parallel, distributed & accumulation) = 16
[INFO|trainer.py:1655] 2023-03-21 18:13:53,209 >>   Gradient Accumulation steps = 1
[INFO|trainer.py:1656] 2023-03-21 18:13:53,209 >>   Total optimization steps = 500
[INFO|trainer.py:1657] 2023-03-21 18:13:53,210 >>   Number of trainable parameters = 26087360

I was able to manually freeze the CLM head using model.embed_out.requires_grad_(False) in run_clm.py, but this should not be necessary. Kindly let me know if I need to provide more context.
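
For illustration, here is a minimal sketch of that manual workaround (assuming model is the GPTNeoXForCausalLM loaded in run_clm.py with adapters already activated; embed_out is the untied CLM head mentioned above):

# Sketch of the manual workaround, not part of the library itself.
# Freeze the untied CLM output head so that only the adapter (and
# invertible adapter) parameters remain trainable.
model.embed_out.requires_grad_(False)

# Optional sanity check: count the remaining trainable parameters.
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"Trainable parameters: {trainable}")  # expected ~331,712 rather than ~26M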

calpt (Member) commented Mar 27, 2023

Thanks for providing the additional context (and of course thanks for opening the PR!).

After looking into this a bit more deeply, the cause of this behavior seems to be that GPT-NeoX does not tie the weights of its input and output projection layers. By default, adapter-transformers only freezes the weights of the base model, excluding the weights of any prediction head (since you usually want to fine-tune the head together with the adapter). For LM heads, freezing the output projection therefore relies on the fact that most models supported so far share these weights with the input projection (which is part of the base model and therefore frozen).

To ensure the expected behavior also for GPT-NeoX, we'd probably need to freeze the output projection manually somewhere in the code. Maybe adding it to the freeze_model() method in the model mixin via self.get_output_embeddings() would work.
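
A rough sketch of that idea (the body of freeze_model() below is an assumption about how the mixin freezes the base model, not the library's actual code):

# Hypothetical sketch, not the current adapter-transformers implementation:
# extend freeze_model() in the model mixin so that an untied output
# projection (such as GPT-NeoX's embed_out) is frozen along with the base model.
def freeze_model(self, freeze=True):
    """Freezes the base model and, if present, any untied output embeddings."""
    for param in self.base_model.parameters():
        param.requires_grad = not freeze
    output_embeddings = self.get_output_embeddings()  # standard transformers accessor
    if output_embeddings is not None:
        for param in output_embeddings.parameters():
            param.requires_grad = not freeze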

@calpt calpt linked a pull request Mar 27, 2023 that will close this issue
@calpt calpt added enhancement New feature or request and removed question Further information is requested labels Mar 27, 2023
ajesujoba (Author) commented

Hi @calpt, thanks for your feedback. I thought as much; I also noticed that they did not tie the weights of the input and output projection layers.

Yes, I agree that freezing the prediction head somewhere else, such as within freeze_model(), would be the best option. I guess this would be done on your end, right?

calpt (Member) commented Mar 27, 2023

You can directly integrate a fix for this into your PR with the model integration if you like. Otherwise, I could also add it independently.

ajesujoba (Author) commented

Checking again, it appears it is not feasible to fix this within freeze_model(): self within model_mixin.py is the base model without any prediction head (because the embeddings are not tied). I may be wrong. So I guess a possible place to fix this would be within training.py. If that is fine with you, you can add it independently so that I don't break a lot of things.
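
For reference, a hypothetical helper of the kind that could live in training.py (the name and placement are assumptions, not existing adapter-transformers code); it operates on the full model with head, so the untied LM head is reachable via get_output_embeddings():

# Hypothetical helper, not part of the existing library: freeze an untied
# LM head (e.g. GPT-NeoX's embed_out) when adapter training is set up.
def freeze_output_embeddings(model, freeze=True):
    output_embeddings = model.get_output_embeddings()
    if output_embeddings is not None:
        for param in output_embeddings.parameters():
            param.requires_grad = not freeze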
