Transformer-based language models spread FLOPs uniformly across input sequences. In this work we demonstrate that transformers can instead learn to dynamically allocate FLOPs (or compute) to specific positions in a sequence, optimising the allocation along the sequence for different layers across the model depth. Our method enforces a total compute budget by capping the number of tokens ($k$) that can participate in the self-attention and MLP computations at a given layer. The tokens to be processed are determined by the network using a top-$k$ routing mechanism. Since $k$ is defined a priori, this simple procedure uses a static computation graph with known tensor sizes, unlike other conditional computation techniques. Nevertheless, since the identities of the $k$ tokens are fluid, this method can expend FLOPs non-uniformly across the time and model depth dimensions. Thus, compute expenditure is entirely predictable in sum total, but dynamic and context-sensitive at the token level. Not only do models trained in this way learn to dynamically allocate compute, they do so efficiently. These models match baseline performance for equivalent FLOPs and wall-clock times to train, but require a fraction of the FLOPs per forward pass, and can be upwards of 50% faster to step during post-training sampling.
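A minimal PyTorch sketch of the top-$k$ routing idea described in the abstract. The module name `MoDBlock`, the `capacity` parameter, the sigmoid gate, and the use of `nn.TransformerEncoderLayer` as the per-token block are illustrative assumptions, not the authors' implementation: a linear router scores each token, only the top-$k$ tokens pass through the attention/MLP block, their gated updates are scattered back, and the remaining tokens bypass the layer via the residual stream.

```python
import torch
import torch.nn as nn


class MoDBlock(nn.Module):
    """One transformer layer with top-k ("Mixture-of-Depths"-style) token routing."""

    def __init__(self, d_model: int, n_heads: int, capacity: int):
        super().__init__()
        self.capacity = capacity             # k: tokens per sequence given full compute
        self.router = nn.Linear(d_model, 1)  # scalar routing score per token
        self.block = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model)
        scores = self.router(x).squeeze(-1)                 # (batch, seq_len)
        k = min(self.capacity, x.size(1))
        top = scores.topk(k, dim=-1)                        # k fixed a priori -> static shapes
        idx = top.indices.unsqueeze(-1).expand(-1, -1, x.size(-1))  # (batch, k, d_model)

        # Run the expensive self-attention/MLP block only on the selected tokens.
        selected = torch.gather(x, 1, idx)                  # (batch, k, d_model)
        processed = self.block(selected)

        # Scale each selected token's change by its router score (keeping the
        # routing decision on the gradient path), scatter it back, and let the
        # unselected tokens skip the layer via the residual stream.
        gate = torch.sigmoid(top.values).unsqueeze(-1)      # (batch, k, 1)
        update = gate * (processed - selected)
        return x.scatter(1, idx, selected + update)
```

For example, `MoDBlock(d_model=64, n_heads=4, capacity=8)` applied to a `(2, 16, 64)` input runs the heavy block on only 8 of the 16 positions in each sequence, so the per-layer FLOP cost is fixed by `capacity` even though which tokens receive compute varies with context.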
Mixture-of-Depths: Dynamically allocating compute in transformer-based language models, David Raposo+, N/A, arXiv'24
URL
Affiliations
Abstract
Translation (by gpt-3.5-turbo)
Summary (by gpt-3.5-turbo)