[RFC] Add LayerSkip to AO #633
Layer dropout during training looks like some form of Stochastic Depth. Some related implementations:
A glance at the LayerSkip paper suggests that they mask each sample in a batch independently, so it probably needs some tricks to see speedups? The torchtune PR implements it by indexing a subset of the batch, applying the function, and writing the result back. Curious to see if the extra overhead is outweighed by the reduced computation during training.
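For illustration, here is a minimal sketch of that index / apply / write-back idea as a generic wrapper module. The class name and interface are my own, not the torchtune PR's, and skipped samples simply pass through unchanged (identity):

```python
import torch
import torch.nn as nn


class PerSampleLayerDropout(nn.Module):
    """Wrap a layer and, during training, skip it independently per sample."""

    def __init__(self, layer: nn.Module, p_skip: float):
        super().__init__()
        self.layer = layer
        self.p_skip = p_skip

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # At eval time (or with p_skip == 0) just run the layer on the full batch.
        if not self.training or self.p_skip == 0.0:
            return self.layer(x)

        # Decide per sample whether to run the layer.
        keep = torch.rand(x.shape[0], device=x.device) >= self.p_skip
        out = x.clone()  # skipped samples pass through unchanged
        if keep.any():
            # Index the kept subset, apply the layer, and write the result back.
            out[keep] = self.layer(x[keep])
        return out


# Hypothetical usage: wrap a transformer block with a 20% per-sample skip rate.
block = PerSampleLayerDropout(
    nn.TransformerEncoderLayer(d_model=512, nhead=8, batch_first=True),
    p_skip=0.2,
)
```

Whether the clone-and-scatter overhead is outweighed by the saved compute on the smaller subset is exactly the open question above.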
Yup, the layer dropout aspect of LayerSkip is basically a version of stochastic depth; that's part of the reason I'm interested in having it in AO, since a generic stochastic depth function / module would be useful outside of just LLMs. IIRC, when talking to Mostafa, he said masking + rewriting is faster, but the speedups mostly come from the self-speculative decoding part of the technique. @mostafaelhoushi, can you share some benchmarks about the layer dropout implementation specifically when you update the issue? Thanks.
Sorry for the delay from my side.
Other Papers
I would like to mention other papers or models that used layer dropout (aka stochastic depth):
Other Implementations
Benchmark Results
On torchtune, I ran this command on a single A100 GPU
and got these measurements:
I also want to tag @danthe3rd, as he guided me to implement the per-sample layer dropout and has implemented it for DINOv2.
DropBP (https://arxiv.org/abs/2402.17812) may be an option too: it only skips layers in the backward pass, and the authors claim this leads to better accuracy.
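As a rough sketch of that backward-only skip idea (my own interpretation, not the DropBP authors' implementation), a residual block can compute its branch under `torch.no_grad()` with some probability: the forward output is unchanged, but no autograd graph is built for the branch, so its backward is skipped entirely for that step:

```python
import torch
import torch.nn as nn


class DropBPResidual(nn.Module):
    """Residual wrapper that sometimes skips the branch's backward pass only."""

    def __init__(self, branch: nn.Module, p_drop: float):
        super().__init__()
        self.branch = branch
        self.p_drop = p_drop

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        if self.training and torch.rand(()).item() < self.p_drop:
            # Forward output is identical, but no graph (or activations) are
            # recorded for the branch, so backprop only flows through the skip
            # connection for this step.
            with torch.no_grad():
                y = self.branch(x)
        else:
            y = self.branch(x)
        return x + y
```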
Tracker issue for adding LayerSkip to AO.
This is a training and inference optimization that is similar to layer-wise pruning. It's particularly interesting for LLM inference because it combines very cleanly with speculative decoding to provide up to a 1.86x speedup.
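To make the inference side concrete, below is a minimal greedy self-speculative decoding loop. It is a sketch under stated assumptions: `full_forward` and `early_exit_forward` are hypothetical callables standing in for the shared LayerSkip model (all layers vs. an early exit through the same LM head), batch size is 1, and the KV-cache reuse that the real technique relies on is omitted for brevity. The 1.86x figure comes from the paper, not from this sketch.

```python
import torch


def self_speculative_decode(full_forward, early_exit_forward, prompt,
                            max_new_tokens, num_draft=4):
    """Greedy self-speculative decoding sketch (batch size 1, no KV cache).

    full_forward(tokens)       -> logits (1, T, V) from the full-depth model
    early_exit_forward(tokens) -> logits (1, T, V) from the early-exit sub-model
    """
    tokens = prompt.clone()
    target_len = prompt.shape[-1] + max_new_tokens
    while tokens.shape[-1] < target_len:
        # 1) Draft a few tokens cheaply with the early-exit sub-model (greedy).
        draft = tokens
        for _ in range(num_draft):
            next_tok = early_exit_forward(draft)[..., -1, :].argmax(dim=-1, keepdim=True)
            draft = torch.cat([draft, next_tok], dim=-1)

        # 2) Verify all drafted tokens with a single full-model forward pass.
        logits = full_forward(draft)
        preds = logits[..., tokens.shape[-1] - 1:-1, :].argmax(dim=-1)  # full-model picks
        drafted = draft[..., tokens.shape[-1]:]                         # early-exit picks

        # 3) Accept the longest agreeing prefix, then take one token from the full model.
        agree = (preds == drafted).long().cumprod(dim=-1)
        n_accept = int(agree.sum())
        tokens = torch.cat([tokens, drafted[..., :n_accept]], dim=-1)
        if n_accept < drafted.shape[-1]:
            fix = preds[..., n_accept:n_accept + 1]   # full model's token at the mismatch
        else:
            fix = logits[..., -1:, :].argmax(dim=-1)  # bonus token after a full accept
        tokens = torch.cat([tokens, fix], dim=-1)
    return tokens[..., :target_len]
```

The draft and verify passes share the same weights, which is what makes this "self"-speculative; the speedup depends on how often the early-exit drafts are accepted.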
@mostafaelhoushi is interested in adding this to torchtune and in upstreaming a subset of the code to ao. See here for more details. In particular, he's interested in doing this without having to alter the module definition.
This is attractive because this part of LayerSkip is not unique to LLMs and can be used for other models. (@mostafaelhoushi to fill out with relevant results).
What is being proposed:
For LayerSkip there is a training recipe and an inference recipe: