Dynamic Vocabulary Pruning in Early-Exit LLMs
(NeurIPS ENLSP 2024)

J. Vincenti, K.A. Abdel Sadek, J. Velja, M. Nulli, M. Jazbec

This repository is cloned from the Fast_Robust_Early_Exit codebase (see their paper). Our research extends their work by implementing softmax-based early exiting with a reduced vocabulary size.


In order to set up the environment for reproducing our experiments, install the necessary packages with:

$ pip install -r requirements.txt

Or via the environment file:

conda env create --name environment_name -f environment.yml

The codebase downloads models and datasets automatically. Be aware of this when running the code for the first time!

Models and Checkpoints

We use T5-large as the baseline model for our experiments. The non-finetuned and finetuned model weights are available on HuggingFace, respectively at google and jvelja.

The code implementation of the model is available at models/deploying_t5.


We run evaluation experiments on two NLP tasks: summarization (SamSum dataset) and question answering (SQuAD dataset).

To reproduce the experiments, follow the guide below. Each file in the scripts directory can be run individually, by selecting the appropriate name, with the command below:

sh > jobname.out

To run all the scripts at once, for example to reproduce all results in one go, use the following command:

for job in *.job; do sbatch "$job"; done

Illustration of an Example Case

The explicit command below runs the experiments for softmax confidence with the adaptive pruning approach:

srun python \
    --model_name_or_path google-t5/t5-large \
    --do_eval \
    --dataset_name squad \
    --context_column context \
    --question_column question \
    --answer_column answers \
    --output_dir ./save/squad_t5-large/ \
    --per_device_eval_batch_size 1 \
    --deploy_scenario True \
    --use_synchronize True \
    --overwrite_output_dir \
    --predict_with_generate \
    --max_seq_length 512 \
    --use_early_exit True \
    --exit_conf_type softmax \
    --exit_conf_threshold 0.9 \
    --exit_min_layer 19 \
    --include_inputs_for_metrics False \
    --use_auth_token True \
    --type_vocab_reduct True \
    --k 256 \
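
When several values of `--k` need to be compared, the command above can also be assembled programmatically. The following is only a sketch: the module name `run_question_answering` is an assumption based on the `-m run_$TASK` convention described in the parameter list, and the sweep values are hypothetical. All other flags are copied from the example command.

```python
import shlex

# Flags that take a value, copied from the example command above.
VALUE_FLAGS = {
    "--model_name_or_path": "google-t5/t5-large",
    "--dataset_name": "squad",
    "--context_column": "context",
    "--question_column": "question",
    "--answer_column": "answers",
    "--output_dir": "./save/squad_t5-large/",
    "--per_device_eval_batch_size": "1",
    "--deploy_scenario": "True",
    "--use_synchronize": "True",
    "--max_seq_length": "512",
    "--use_early_exit": "True",
    "--exit_conf_type": "softmax",
    "--exit_conf_threshold": "0.9",
    "--exit_min_layer": "19",
    "--include_inputs_for_metrics": "False",
    "--use_auth_token": "True",
    "--type_vocab_reduct": "True",
}
# Flags that appear without a value in the example command.
SWITCH_FLAGS = ["--do_eval", "--overwrite_output_dir", "--predict_with_generate"]


def build_command(k, module="run_question_answering"):
    """Return the argument list for one evaluation run with pruned size k.

    `module` is an assumption: the example command does not show the script name.
    """
    cmd = ["python", "-m", module]
    for flag, value in VALUE_FLAGS.items():
        cmd += [flag, value]
    cmd += SWITCH_FLAGS
    cmd += ["--k", str(k)]
    return cmd


if __name__ == "__main__":
    for k in (64, 256, 1024):  # hypothetical sweep values
        print(shlex.join(build_command(k)))
```

The script only prints the commands; they can be pasted into a job file or passed to `subprocess.run` once the module name has been checked against the scripts in the repository.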

Parameters Explanation

In addition to the parameters previously implemented, we have introduced new ones specific to our tasks. For further details, please refer to the additional_args documentation. For convenience, we will also highlight the essential parameters from the previous implementation that are utilized in our current setup.

Essential Parameters:

Method agnostic parameters
  • -m: the file responsible for the task. Its structure is run_$TASK. Possible choices: question_answering, summarization.
  • --model_name_or_path: the model to be used for the task. Possible choices: google-t5/t5-large, jvelja/t5-squad, jvelja/t5-samsum.
  • --do_eval True: should always be True for evaluation.
  • --deploy_scenario True: should always be True so that deploying_[MODEL_NAME].py is used for our implementation.
  • --use_early_exit True: use the conventional early-exiting framework.
  • --exit_conf_threshold [float]: threshold value used to decide whether to exit. Our experiments used 0.9.
  • --exit_min_layer [int]: the minimum number of layers to forward before an exit decision is made.
  • --include_inputs_for_metrics: always set this to True to avoid a mismatch in output metrics.
  • --exit_conf_type softmax: set the confidence measure to softmax values.
  • --type_vocab_reduct [bool]: if True, the vocabulary matrix is pruned.
  • --k [int]: the number of values retained by the pruned vocabulary matrix.
  • --plotting_logits False: if set to True, plots the confidence, F1, and boxplots.
  • --final_flops False: if set to True, reports the number of FLOPs computed during confidence estimation.
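
How `--exit_conf_threshold`, `--exit_min_layer` and `--k` interact can be illustrated with a small, dependency-free sketch. It is not the code in models/deploying_t5: it assumes that confidence is the gap between the two largest softmax probabilities, that layers are counted from 1, and that the vocabulary is pruned once (at the first layer where exiting is allowed) and reused afterwards. All function names are hypothetical, and `k` must be at least 2.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def project(hidden, head_rows):
    """Logits for each vocabulary row: dot(row, hidden)."""
    return [sum(w * h for w, h in zip(row, hidden)) for row in head_rows]


def confidence(logits):
    """Assumed measure: gap between the two largest softmax probabilities."""
    top = sorted(softmax(logits), reverse=True)
    return top[0] - top[1]


def projection_macs(d_model, vocab_size):
    """Multiply-accumulates needed to project one hidden state onto the vocabulary."""
    return d_model * vocab_size


def early_exit(hidden_per_layer, head, k, threshold, min_layer):
    """Return (exit_layer, token_id) for one decoding step.

    hidden_per_layer: one hidden vector per layer (layer 1 first).
    head: full vocabulary matrix, one row per token.
    """
    token_ids = list(range(len(head)))  # rows currently kept
    active_head = head
    pruned = False
    last_layer = len(hidden_per_layer)
    for layer, hidden in enumerate(hidden_per_layer, start=1):
        if layer < min_layer:
            continue  # exiting is not allowed yet
        logits = project(hidden, active_head)
        best = max(range(len(logits)), key=lambda i: logits[i])
        if confidence(logits) >= threshold or layer == last_layer:
            return layer, token_ids[best]
        if not pruned:
            # Keep only the k highest-scoring tokens for the remaining layers.
            order = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
            token_ids = [token_ids[i] for i in order[:k]]
            active_head = [head[i] for i in token_ids]
            pruned = True
    raise ValueError("min_layer exceeds the number of layers")
```

After pruning, each confidence check projects onto `k` rows instead of the full vocabulary, so `projection_macs(d_model, k)` versus `projection_macs(d_model, vocab_size)` gives a rough idea of the saving that a FLOPs report for confidence estimation would reflect.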

Sample task-specific bash files can be found in the src/scripts directory.

W&B logging

To enable wandb logging of your results, follow the standard wandb login procedure. Then, in our code, uncomment the following line and change its value from "true" to "false":

os.environ["WANDB_DISABLED"] = "true" ---> os.environ["WANDB_DISABLED"] = "false"

This, together with the usual wandb.init(), will save every evaluation metric into your wandb project. This line of code can be found within run_question_answering / run_summarization.
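
The same switch can also be set from Python before the run starts. This is a minimal sketch, not code from the repository: the helper name and the project name in the usage note are hypothetical, while `WANDB_DISABLED` and `WANDB_PROJECT` are the environment variables read by the W&B integration.

```python
import os


def enable_wandb_logging(project=None):
    """Turn W&B logging on by clearing the WANDB_DISABLED switch.

    `project` is optional; when given it is exported as WANDB_PROJECT.
    """
    os.environ["WANDB_DISABLED"] = "false"
    if project is not None:
        os.environ["WANDB_PROJECT"] = project
```

Calling `enable_wandb_logging("my-early-exit-runs")` before the evaluation script sets up its trainer has the same effect as editing the line by hand, provided the script does not overwrite the variable afterwards.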


If you find this repo useful for your research, please consider citing our paper:

      title={Dynamic Vocabulary Pruning in Early-Exit LLMs}, 
      author={Jort Vincenti and Karim Abdel Sadek and Joan Velja and Matteo Nulli and Metod Jazbec},
