Conversation
Yeah, looks pretty good to me. Do you mind giving us a benchmark on `-t internal:new_reddit` with 125M params, trained to, say, 10k steps?
```
    Override of ``TorchAgent.build_model``.
    """
    assert (
        self.opt['n_encoder_layers'] == -1
```
Wishlist: this would be a really great opportunity to finally implement the `parser.remove_arg` functionality we've desired for a long time.
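For reference, here is a rough sketch of what such a `parser.remove_arg` helper could look like on an argparse-based parser. This is hypothetical code, not part of this PR, and it leans on argparse internals (`_actions`, `_option_string_actions`, `_remove_action`):

```python
import argparse


def remove_arg(parser: argparse.ArgumentParser, option: str) -> None:
    """Hypothetical helper: unregister a previously added CLI argument.

    `option` is any of the argument's option strings, e.g. '--n-encoder-layers'.
    """
    for action in list(parser._actions):
        if option in action.option_strings:
            # Drop every alias of the option so it can no longer be parsed.
            for opt_str in action.option_strings:
                parser._option_string_actions.pop(opt_str, None)
            # Remove the action itself so it disappears from --help output.
            parser._remove_action(action)
            return
    raise ValueError(f'{option} is not a registered argument')
```

A decoder-only agent could then call something like `remove_arg(parser, '--n-encoder-layers')` instead of asserting that the option was left at its default.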
```
@@ -1393,3 +1393,10 @@ def error(self, message):
        self.print_help()
        _sys.stderr.write('\nParse Error: %s\n' % message)
        _sys.exit(2)


def default(val, default):
```
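The body of the new helper is not shown in this excerpt, but judging by its name it reads like a small None-coalescing utility; a plausible sketch (an assumption, not a quote of the actual implementation):

```python
def default(val, default):
    """Return `val` if it is not None, otherwise fall back to `default`."""
    return val if val is not None else default


# Example: resolve a possibly-unset option against a generic fallback.
opt = {'n_layers': 8, 'n_decoder_layers': None}
n_decoder_layers = default(opt['n_decoder_layers'], opt['n_layers'])  # -> 8
```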
(y)
Alright, it's a little slower, but good enough for today. Let Jason know it's done!
Hm, there's a distillation test failing which I don't understand yet...
cc @EricMichaelSmith: there are two distillation tests failing due to this PR; do you have any idea why that might be the case? It looks like very small (but consistent) differences in testing (that is, these tests do not seem to be failing on main but consistently fail on this branch).
Let's just let this sit a while. The regression failure is a little concerning. It's enough to make me worry the code has changed in a subtle way, but close enough that it's tempting to just `--force-regen` and forget about it.
Hmm, seconded. Yeah, @klshuster, I'm not sure offhand, but perhaps this does indicate that the forward pass has indeed changed in some subtle way?
This PR has not had activity in 30 days. Closing due to staleness.
I will resuscitate it next week :)
Finally got back to this today and figured out the issue! Apparently the distillation code is sensitive to the order in which modules are initialized 😔. I made a branch to demonstrate the minimal change needed to repro the test break: #4526. To my knowledge, module initialization order has no implications for training dynamics/model performance, so I'm inclined to call this a distillation bug and outside the scope of this PR. Anyone know differently? @EricMichaelSmith, if I don't hear any objections by Thursday, I'll merge as-is with the broken test and file an issue to fix the bug in the distillation code.
Chatted with @EricMichaelSmith offline. He pointed out that this could be a result of a different order of random operations done during module initialization. I was able to confirm that's the problem; details in #4526. So I'm just going to update the test numbers and merge.
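For anyone curious how initialization order alone can shift test numbers: module constructors draw from the global RNG, so building the same modules in a different order hands each module a different slice of the random stream, even under the same seed. A minimal PyTorch illustration (not the PR's actual models):

```python
import torch
from torch import nn


def build(order_swapped: bool) -> nn.ModuleDict:
    # Same seed and same modules; only the construction order differs.
    torch.manual_seed(0)
    if order_swapped:
        decoder = nn.Linear(16, 16)
        encoder = nn.Linear(16, 16)
    else:
        encoder = nn.Linear(16, 16)
        decoder = nn.Linear(16, 16)
    return nn.ModuleDict({'encoder': encoder, 'decoder': decoder})


a, b = build(order_swapped=False), build(order_swapped=True)
# Each encoder consumed a different slice of the RNG stream, so its weights differ.
print(torch.equal(a['encoder'].weight, b['encoder'].weight))  # False
```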
Supporting decoder-only transformer training:

Summary of changes
- `TransformerDecoderLayer` without encoder attention
- `TransformerDecoder` that concatenates input and "encoder state" (see the sketch after this list)
  - Needed to pass the dictionary to `TransformerDecoder` to properly do concatenation
- `PassThroughEncoder`, similar to `IdentityLayer` but compatible with the TransformerEncoder API
  - `TorchGeneratorAgent` assumes an encoder exists, so the path of most code reuse led to using a dummy encoder to satisfy those assumptions
- `DecoderAgent` to override `build_model`
- Drops `query_len` rows from incremental attention (`TransformerDecoderOnly.forward`) to account for the incremental step from the context to the first generated token
- `DecoderIncrState` and `DecoderLayerIncrState` type aliases
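A minimal sketch of the dummy-encoder-plus-concatenation idea above. This is illustrative only: `PassThroughEncoder` is named after the new module, but `ToyDecoderOnly` and its shapes are assumptions, not the PR's actual code:

```python
import torch
from torch import nn


class PassThroughEncoder(nn.Module):
    """Dummy 'encoder' that returns the context tokens unchanged, so agent code
    that expects an (encoder_states, ...) contract keeps working."""

    def forward(self, input_tokens: torch.LongTensor) -> torch.LongTensor:
        return input_tokens


class ToyDecoderOnly(nn.Module):
    """Prepends the passed-through context to the decoder input and runs the
    layer stack over the concatenated sequence."""

    def __init__(self, embedding: nn.Embedding, layers: nn.ModuleList):
        super().__init__()
        self.embedding = embedding
        self.layers = layers

    def forward(self, decoder_input: torch.LongTensor, encoder_state: torch.LongTensor) -> torch.Tensor:
        # "encoder state" here is just the context token IDs from PassThroughEncoder.
        full_sequence = torch.cat([encoder_state, decoder_input], dim=1)
        tensor = self.embedding(full_sequence)
        for layer in self.layers:  # causal self-attention layers in the real model
            tensor = layer(tensor)
        # Only positions belonging to decoder_input produce predictions.
        return tensor[:, encoder_state.size(1):, :]


# Usage with placeholder layers, just to show the shapes:
model = ToyDecoderOnly(nn.Embedding(100, 32), nn.ModuleList([nn.Identity(), nn.Identity()]))
context = torch.randint(0, 100, (1, 5))
label = torch.randint(0, 100, (1, 3))
out = model(label, PassThroughEncoder()(context))  # shape: (1, 3, 32)
```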

Summary of structural changes in `decoder.py`
Testing steps
Also ran some small training runs locally to sanity check:
Benchmark Comparison with Encoder-Decoder
Both trained to 10k steps with 120M parameters (8 layers) on 2x Quadro GP100 GPUs. The encoder/decoder is closer to 140M parameters due to the cross-attention.
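Back-of-the-envelope on why cross-attention accounts for roughly the 20M-parameter gap. The model width is not stated above, so `d_model = 768` below is purely an assumption that happens to land near that gap:

```python
# Each decoder layer's cross-attention block adds Q, K, V, and output
# projections of size d_model x d_model (ignoring biases and layer norms).
d_model, n_layers = 768, 8  # assumed width; layer count from the benchmark setup
extra_per_layer = 4 * d_model * d_model
total_extra = n_layers * extra_per_layer
print(f'extra parameters from cross-attention: {total_extra / 1e6:.1f}M')  # ~18.9M
```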