Experiments with distilling large language models.
- alpha_ce : Linear weight for KLDivLoss between student and teacher token classification outputs.
- alpha_clm : Linear weight for CrossEntropyLoss between student token classification outputs and the labels.
- alpha_mse : Linear weight for MSELoss between student and teacher token classification logits.
- alpha_cos : Linear weight for CosineEmbeddingLoss between the student's and teacher's last-layer hidden states (only applicable when the student and teacher have the same hidden_size). See the loss-combination sketch after this list.
- Use create_data.py (for small datasets) to create a pickle of input_ids lists, or create_data_for_large_dataset.py (for large datasets) to create shards of .npy files; a minimal sketch of this step follows the list.
- Use run_distillation.py to start distillation training. For training on a large dataset, set the --train_on_large_dataset flag.
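
Conceptually, the four alpha weights scale four losses that are summed into a single training objective. The snippet below is a minimal sketch of that combination, not the actual code in run_distillation.py; the function signature, variable names, and the temperature softening on the KL term are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits,
                      student_hidden, teacher_hidden, labels,
                      alpha_ce=1.0, alpha_clm=1.0, alpha_mse=0.0, alpha_cos=0.0,
                      temperature=2.0):
    """Illustrative weighted sum of the four losses described above."""
    vocab_size = student_logits.size(-1)

    # alpha_ce: KL divergence between (temperature-softened) student and teacher distributions
    loss_ce = F.kl_div(
        F.log_softmax(student_logits.view(-1, vocab_size) / temperature, dim=-1),
        F.softmax(teacher_logits.view(-1, vocab_size) / temperature, dim=-1),
        reduction="batchmean",
    ) * (temperature ** 2)

    # alpha_clm: standard cross entropy between student outputs and the labels
    loss_clm = F.cross_entropy(student_logits.view(-1, vocab_size), labels.view(-1))

    # alpha_mse: MSE between raw student and teacher logits
    loss_mse = F.mse_loss(student_logits, teacher_logits)

    # alpha_cos: cosine embedding loss between last-layer hidden states
    # (only meaningful when student and teacher share the same hidden_size)
    s = student_hidden.view(-1, student_hidden.size(-1))
    t = teacher_hidden.view(-1, teacher_hidden.size(-1))
    loss_cos = F.cosine_embedding_loss(s, t, torch.ones(s.size(0), device=s.device))

    return (alpha_ce * loss_ce + alpha_clm * loss_clm
            + alpha_mse * loss_mse + alpha_cos * loss_cos)
```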
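
The data-preparation step amounts to tokenizing raw text and serializing the resulting input_ids. The sketch below shows the small-dataset format (a pickle of input_ids lists) under assumed file names and tokenizer choice; it is not the exact contents of create_data.py.

```python
import pickle
from transformers import AutoTokenizer

# Illustrative only: paths, model name, and any filtering are handled by create_data.py
tokenizer = AutoTokenizer.from_pretrained("gpt2")

with open("train.txt", encoding="utf-8") as f:
    texts = [line.strip() for line in f if line.strip()]

# One list of token ids per training example
input_ids = [tokenizer.encode(text) for text in texts]

with open("train_input_ids.pkl", "wb") as f:
    pickle.dump(input_ids, f)
```

For the large-dataset path, the token ids are instead written as shards of .npy files (e.g. via numpy.save), which run_distillation.py consumes when --train_on_large_dataset is set.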