This is the official GitHub repository for the paper Rethinking Kullback-Leibler Divergence in Knowledge Distillation for Large Language Models by Taiqiang Wu, Chaofan Tao, Jiahao Wang, Runming Yang, Zhe Zhao, and Ngai Wong.
TL;DR: We provide a deeper insight into forward KL (FKL) and reverse KL (RKL) in knowledge distillation (KD) for LLMs, and then propose a novel Adaptive KL (AKL) divergence based on this analysis.
Conclusion:
In KD for LLMs, the mean-seeking and mode-seeking behaviors do not hold for FKL and RKL, respectively. Instead, the two share the same optimization objective. Meanwhile, in the early epochs FKL focuses on the head part of the teacher distribution while RKL focuses on the tail part.
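As a quick illustration (a toy sketch, not code from this repo; the logits below are made up), the two divergences weight the same per-token log-ratio terms differently, which is where the head/tail behavior comes from:

```python
import torch
import torch.nn.functional as F

# Hypothetical teacher/student next-token logits over a tiny vocabulary.
teacher_logits = torch.tensor([4.0, 2.0, 1.0, 0.5, 0.1])
student_logits = torch.tensor([3.0, 2.5, 0.5, 1.0, 0.2])

p = F.softmax(teacher_logits, dim=-1)  # teacher distribution p
q = F.softmax(student_logits, dim=-1)  # student distribution q

# FKL = sum_x p(x) log(p(x)/q(x)): each term is weighted by p(x),
# so high-probability (head) tokens dominate the loss.
fkl = torch.sum(p * (p.log() - q.log()))

# RKL = sum_x q(x) log(q(x)/p(x)): student mass placed on low-p (tail)
# tokens is penalized heavily through the log ratio.
rkl = torch.sum(q * (q.log() - p.log()))

print(f"FKL = {fkl:.4f}, RKL = {rkl:.4f}")
```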
To reproduce the toy examples, refer to `toy_examples/FR_KL.ipynb` and `toy_examples/FR_compare.ipynb`.
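For a self-contained flavor of what the notebooks show (a minimal analogue under assumptions, not the notebooks themselves), the sketch below fits a fully expressive categorical student to a bimodal target with either divergence; both runs converge to the same distribution, consistent with the shared-objective conclusion above:

```python
import torch

torch.manual_seed(0)

# Bimodal target distribution over 20 discrete states (plus a tiny floor
# so log(target) is finite).
target = torch.zeros(20)
target[3], target[15] = 0.5, 0.5
target = (target + 1e-6) / (target + 1e-6).sum()

def fit(loss_name, steps=2000, lr=0.1):
    # Unconstrained student parameters: one logit per state.
    logits = torch.randn(20, requires_grad=True)
    opt = torch.optim.Adam([logits], lr=lr)
    for _ in range(steps):
        q = torch.softmax(logits, dim=-1)
        if loss_name == "fkl":   # KL(target || q)
            loss = torch.sum(target * (target.log() - q.log()))
        else:                    # KL(q || target)
            loss = torch.sum(q * (q.log() - target.log()))
        opt.zero_grad()
        loss.backward()
        opt.step()
    return torch.softmax(logits, dim=-1).detach()

# Both recover roughly 0.5 mass on each mode.
print(fit("fkl")[[3, 15]], fit("rkl")[[3, 15]])
```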
Please follow the minillm repository to set up the environment and prepare the datasets.
Introduce AKL into the KD setting; the change is mainly on this line. A sketch of the loss is given below.
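For orientation, here is a minimal sketch of an AKL-style loss (our reading of the paper, not the repository's implementation; the `head_mass` threshold and the exact gap-based weighting are assumptions, so treat the linked line above as authoritative):

```python
import torch
import torch.nn.functional as F

def akl_loss(teacher_logits, student_logits, head_mass=0.5, eps=1e-9):
    """Adaptive KL sketch: mix FKL and RKL according to where the
    teacher-student gap concentrates (head vs. tail of the teacher).
    `head_mass` (cumulative teacher probability defining the head) is
    an assumed hyperparameter, not a value from the paper."""
    p = F.softmax(teacher_logits, dim=-1)
    q = F.softmax(student_logits, dim=-1)

    # Head = highest-probability teacher tokens whose preceding cumulative
    # mass is below `head_mass` (the top-1 token is always in the head).
    p_sorted, idx = torch.sort(p, dim=-1, descending=True)
    before = torch.cumsum(p_sorted, dim=-1) - p_sorted
    head_sorted = before < head_mass
    head_mask = torch.zeros_like(p).scatter(-1, idx, head_sorted.float()).bool()

    # Gap between teacher and student, split into head and tail parts.
    gap = (p - q).abs()
    g_head = (gap * head_mask).sum(dim=-1)
    g_tail = (gap * ~head_mask).sum(dim=-1)
    mu = g_head / (g_head + g_tail + eps)  # larger head gap -> more FKL

    fkl = torch.sum(p * ((p + eps).log() - (q + eps).log()), dim=-1)
    rkl = torch.sum(q * ((q + eps).log() - (p + eps).log()), dim=-1)
    return (mu * fkl + (1.0 - mu) * rkl).mean()
```

A loss of this shape would slot in wherever the KD objective (FKL or RKL) is computed in the minillm training loop.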
Then run the experiments and evaluate the student models.
For results on WinoGrande, OpenBookQA, BoolQ, and ARC, please use this tool.
For questions, feel free to contact Taiqiang Wu: takiwu@connect.hku.hk
If you find this paper useful, please cite it using the following BibTeX entry:
@article{wu2024rethinking,
  title={Rethinking Kullback-Leibler Divergence in Knowledge Distillation for Large Language Models},
  author={Wu, Taiqiang and Tao, Chaofan and Wang, Jiahao and Yang, Runming and Zhao, Zhe and Wong, Ngai},
  journal={arXiv preprint arXiv:2404.02657},
  year={2024}
}