
Adapter support for GPTNeoX #521

Open
ajesujoba opened this issue Mar 17, 2023 · 6 comments · May be fixed by #523
Labels
enhancement New feature or request

Comments


ajesujoba commented Mar 17, 2023

I have implemented adapter support for GPTNeoX following the instructions in the documentation. It passed all tests, but while training a language adapter, the prediction head was trained as well. Do you by chance have an idea why this is happening? Should I open a PR?

@ajesujoba ajesujoba added the question Further information is requested label Mar 17, 2023
calpt (Member) commented Mar 20, 2023

Hey @ajesujoba, this sounds great; it would be awesome to have GPTNeoX support integrated into the library, so feel free to open a PR!

Regarding your question on language adapter training, could you add some more context on what you observed and which behavior you expected (ideally with a code snippet)? Thank you!

@ajesujoba ajesujoba linked a pull request Mar 21, 2023 that will close this issue
ajesujoba (Author) commented Mar 21, 2023

Thanks for your response @calpt. I have opened a PR.
I was trying to train a German language adapter with the implemented GPTNeoX model using the script below:

LANG="de"
python run_clm.py \
        --model_name_or_path EleutherAI/pythia-70m \
        --train_file $DATADIR/train.txt \
        --validation_file  $DATADIR/dev.txt \
        --output_dir $OUTDIR/$LANG \
        --do_train \
        --do_eval \
        --per_device_train_batch_size 16 \
        --per_device_eval_batch_size 8 \
        --gradient_accumulation_steps 1 \
        --learning_rate 5e-5 \
        --max_steps 500 \
        --num_train_epochs 25 \
        --save_steps 10000000 \
        --overwrite_output_dir \
        --train_adapter \
        --adapter_config pfeiffer+inv \
        --evaluation_strategy steps \
        --eval_steps 1000000 \
        --load_best_model_at_end \
        --save_total_limit 1

The script ran successfully, but instead of training just the adapters, it trained both the adapter modules and the CLM head. The total number of trainable parameters was 26,087,360 instead of just 331,712:

[INFO|trainer.py:1650] 2023-03-21 18:13:53,209 >> ***** Running training *****
[INFO|trainer.py:1651] 2023-03-21 18:13:53,209 >>   Num examples = 386
[INFO|trainer.py:1652] 2023-03-21 18:13:53,209 >>   Num Epochs = 20
[INFO|trainer.py:1653] 2023-03-21 18:13:53,209 >>   Instantaneous batch size per device = 16
[INFO|trainer.py:1654] 2023-03-21 18:13:53,209 >>   Total train batch size (w. parallel, distributed & accumulation) = 16
[INFO|trainer.py:1655] 2023-03-21 18:13:53,209 >>   Gradient Accumulation steps = 1
[INFO|trainer.py:1656] 2023-03-21 18:13:53,209 >>   Total optimization steps = 500
[INFO|trainer.py:1657] 2023-03-21 18:13:53,210 >>   Number of trainable parameters = 26087360

I was able to manually freeze the CLM head using model.embed_out.requires_grad_(False) in run_clm.py, but this should not be necessary. Kindly let me know if I need to provide more context.
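
For illustration, here is a minimal sketch of that manual workaround (assuming model is the GPTNeoXForCausalLM loaded in run_clm.py with adapters already activated; embed_out is the untied CLM head mentioned above):

# Sketch of the manual workaround, not part of the library itself.
# Freeze the untied CLM output head so that only the adapter (and
# invertible adapter) parameters remain trainable.
model.embed_out.requires_grad_(False)

# Optional sanity check: count the remaining trainable parameters.
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"Trainable parameters: {trainable}")  # expected ~331,712 rather than ~26M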

calpt (Member) commented Mar 27, 2023

Thanks for providing the additional context (and of course thanks for opening the PR!).

After looking into this a bit more deeply, the cause of this behavior seems to be that GPT-NeoX does not tie the weights of its input and output projection layers. By default, adapter-transformers only freezes the weights of the base model, excluding the weights of any prediction head (since you usually want to fine-tune the head together with the adapter). For LM heads, freezing the output projection therefore relies on the fact that most models supported so far share these weights with the input projection (which is part of the base model and therefore frozen).

To ensure the expected behavior also for GPT-NeoX, we'd probably need to freeze the output projection manually somewhere in the code. Maybe adding it to the freeze_model() method in the model mixin via self.get_output_embeddings() would work.
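
A rough sketch of that idea (the body of freeze_model() below is an assumption about how the mixin freezes the base model, not the library's actual code):

# Hypothetical sketch, not the current adapter-transformers implementation:
# extend freeze_model() in the model mixin so that an untied output
# projection (such as GPT-NeoX's embed_out) is frozen along with the base model.
def freeze_model(self, freeze=True):
    """Freezes the base model and, if present, any untied output embeddings."""
    for param in self.base_model.parameters():
        param.requires_grad = not freeze
    output_embeddings = self.get_output_embeddings()  # standard transformers accessor
    if output_embeddings is not None:
        for param in output_embeddings.parameters():
            param.requires_grad = not freeze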

@calpt calpt linked a pull request Mar 27, 2023 that will close this issue
@calpt calpt added enhancement New feature or request and removed question Further information is requested labels Mar 27, 2023
ajesujoba (Author) commented

Hi @calpt, thanks for your feedback. I thought as much; I also noticed that they did not tie the weights of the input and output projection layers.

Yes, I agree that freezing the prediction head somewhere else, such as within freeze_model(), would be the best option. I guess this would be done on your end, right?

calpt (Member) commented Mar 27, 2023

You can directly integrate a fix for this into your PR with the model integration if you like. Otherwise, I could also add it independently.

ajesujoba (Author) commented

Checking again, it appears it is not feasible to fix this within freeze_model(): self within model_mixin.py is the base model without any prediction head (because the embeddings are not tied). I may be wrong. So I guess a possible place to fix this would be within training.py. If that is fine with you, you can add it independently so that I don't break a lot of things.
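
For reference, a hypothetical helper of the kind that could live in training.py (the name and placement are assumptions, not existing adapter-transformers code); it operates on the full model with head, so the untied LM head is reachable via get_output_embeddings():

# Hypothetical helper, not part of the existing library: freeze an untied
# LM head (e.g. GPT-NeoX's embed_out) when adapter training is set up.
def freeze_output_embeddings(model, freeze=True):
    output_embeddings = model.get_output_embeddings()
    if output_embeddings is not None:
        for param in output_embeddings.parameters():
            param.requires_grad = not freeze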
