Transformer-based language models spread FLOPs uniformly across input sequences. In this work we demonstrate that transformers can instead learn to dynamically allocate FLOPs (or compute) to specific positions in a sequence, optimising the allocation along the sequence for different layers across the model depth. Our method enforces a total compute budget by capping the number of tokens ($k$) that can participate in the self-attention and MLP computations at a given layer. The tokens to be processed are determined by the network using a top-$k$ routing mechanism. Since $k$ is defined a priori, this simple procedure uses a static computation graph with known tensor sizes, unlike other conditional computation techniques. Nevertheless, since the identities of the $k$ tokens are fluid, this method can expend FLOPs non-uniformly across the time and model depth dimensions. Thus, compute expenditure is entirely predictable in sum total, but dynamic and context-sensitive at the token level. Not only do models trained in this way learn to dynamically allocate compute, they do so efficiently. These models match baseline performance for equivalent FLOPs and wall-clock times to train, but require a fraction of the FLOPs per forward pass, and can be upwards of 50% faster to step during post-training sampling.
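A minimal PyTorch sketch of the top-$k$ routing idea described in the abstract. The module name `MoDBlock`, the `capacity` parameter, the sigmoid gate, and the use of `nn.TransformerEncoderLayer` as the per-token block are illustrative assumptions, not the authors' implementation: a linear router scores each token, only the top-$k$ tokens pass through the attention/MLP block, their gated updates are scattered back, and the remaining tokens bypass the layer via the residual stream.

```python
import torch
import torch.nn as nn


class MoDBlock(nn.Module):
    """One transformer layer with top-k ("Mixture-of-Depths"-style) token routing."""

    def __init__(self, d_model: int, n_heads: int, capacity: int):
        super().__init__()
        self.capacity = capacity             # k: tokens per sequence given full compute
        self.router = nn.Linear(d_model, 1)  # scalar routing score per token
        self.block = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model)
        scores = self.router(x).squeeze(-1)                 # (batch, seq_len)
        k = min(self.capacity, x.size(1))
        top = scores.topk(k, dim=-1)                        # k fixed a priori -> static shapes
        idx = top.indices.unsqueeze(-1).expand(-1, -1, x.size(-1))  # (batch, k, d_model)

        # Run the expensive self-attention/MLP block only on the selected tokens.
        selected = torch.gather(x, 1, idx)                  # (batch, k, d_model)
        processed = self.block(selected)

        # Scale each selected token's change by its router score (keeping the
        # routing decision on the gradient path), scatter it back, and let the
        # unselected tokens skip the layer via the residual stream.
        gate = torch.sigmoid(top.values).unsqueeze(-1)      # (batch, k, 1)
        update = gate * (processed - selected)
        return x.scatter(1, idx, selected + update)
```

For example, `MoDBlock(d_model=64, n_heads=4, capacity=8)` applied to a `(2, 16, 64)` input runs the heavy block on only 8 of the 16 positions in each sequence, so the per-layer FLOP cost is fixed by `capacity` even though which tokens receive compute varies with context.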
Mixture-of-Depths: Dynamically allocating compute in transformer-based language models, David Raposo+, N/A, arXiv'24
URL
Affiliations
Abstract
Translation (by gpt-3.5-turbo)
Summary (by gpt-3.5-turbo)