PyTorch implementations of algorithms for knowledge distillation.
$ docker build -t kd -f Dockerfile .
$ docker run -v local_data_path:/data -v project_path:/app -p 0.0.0.0:8084:8084 -it kd
- Task-specific distillation from BERT into a BiLSTM student. Data: SST-2 binary sentiment classification. A minimal sketch of the distillation objective follows the reference list below.
- Cristian Bucila, Rich Caruana, Alexandru Niculescu-Mizil "Model Compression" (2006).
- Victor Sanh, Lysandre Debut, Julien Chaumond, Thomas Wolf "DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter" (2019) https://arxiv.org/abs/1910.01108.
- Raphael Tang, Yao Lu, Linqing Liu, Lili Mou, Olga Vechtomova, Jimmy Lin "Distilling Task-Specific Knowledge from BERT into Simple Neural Networks" (2019) https://arxiv.org/abs/1903.12136.
- Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, Radu Soricut "ALBERT: A Lite BERT for Self-supervised Learning of Language Representations" (2019) https://arxiv.org/abs/1909.11942.
- Rafael Müller, Simon Kornblith, Geoffrey Hinton "Subclass Distillation" (2020) https://arxiv.org/abs/2002.03936.
- Iulia Turc, Ming-Wei Chang, Kenton Lee, Kristina Toutanova "Well-Read Students Learn Better: On the Importance of Pre-training Compact Models" (2019) https://arxiv.org/abs/1908.08962.
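Below is a minimal sketch of the task-specific distillation objective in the spirit of Tang et al. (2019): the BiLSTM student is trained to regress the BERT teacher's logits (MSE term) while also fitting the gold SST-2 labels (cross-entropy term). The function and parameter names (`distillation_loss`, `alpha`) are illustrative assumptions, not this repository's actual API.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits: torch.Tensor,
                      teacher_logits: torch.Tensor,
                      labels: torch.Tensor,
                      alpha: float = 0.5) -> torch.Tensor:
    """Task-specific distillation loss (sketch, after Tang et al. 2019).

    Combines the usual cross-entropy on gold labels with an MSE term
    that pushes the student's logits toward the teacher's logits.
    `alpha` balances the hard-label and soft-label terms.
    """
    # Hard-label term: standard classification loss (SST-2 has 2 classes).
    ce = F.cross_entropy(student_logits, labels)
    # Soft-label term: regress student logits onto the frozen teacher's logits.
    mse = F.mse_loss(student_logits, teacher_logits)
    return alpha * ce + (1.0 - alpha) * mse

# Example with dummy tensors (batch of 8, 2 classes):
student_logits = torch.randn(8, 2, requires_grad=True)   # from the BiLSTM student
teacher_logits = torch.randn(8, 2)                        # from the frozen BERT teacher
labels = torch.randint(0, 2, (8,))
loss = distillation_loss(student_logits, teacher_logits, labels)
loss.backward()
```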