Framework for convenient In-context Learning (ICL) evaluations for different datasets, LLMs, and example selection methods. In particular, it is used to evaluate the in-context example selection methods proposed in the following papers:
- Coverage-based Example Selection for In-Context Learning - BERTScore-Recall (BSR), Set-BSR. Originally implemented in the icl-coverage repository.
- GistScore: Learning Better Representations for In-Context Example Selection with Gist Bottlenecks - GistScore, Set-GistScore. See also the gist-icl repository.
Apart from the above, it also supports the following selectors: Random, BM25, SentenceBERT (Cosine). See src/constants.py for a list of datasets and LLMs that have currently been evaluated.
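As a rough illustration of what a cosine-similarity selector does, the sketch below ranks candidate examples by cosine similarity to a test input and keeps the top k. It is a minimal, dependency-free sketch and not the code in src/selector/: the function names and the toy 2-dimensional vectors are assumptions, and a real selector would use sentence embeddings (e.g. from SentenceBERT) instead.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def select_top_k(query_vec, candidate_vecs, k):
    """Return indices of the k candidates most similar to the query.

    Ties are broken by the lower candidate index so the result is deterministic.
    """
    scored = [(cosine(query_vec, vec), idx) for idx, vec in enumerate(candidate_vecs)]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [idx for _, idx in scored[:k]]


# Toy stand-ins for embeddings of four candidate examples (hypothetical values).
candidates = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7], [-1.0, 0.0]]
query = [1.0, 0.1]

selected = select_top_k(query, candidates, k=2)
print(selected)
```

The selected indices would then be used to pick the in-context examples that go into the few-shot prompt.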
- Download the datasets that are unavailable on HuggingFace from https://1drv.ms/u/s!AqJNiE6C-nXuoawBxh-3rfUsSf4-8A?e=3o1YDK and store them in data/.
- Install Python 3.10.
- Install Python dependencies:

      pip install -r requirements.txt

- Set up some third-party repos:
  - qdecomp_with_dependency_graphs: required for the DROP dataset.

        mkdir icl-demo-selection/src/third_party
        git clone git@github.com:matanhasson/qdecomp_with_dependency_graphs.git icl-demo-selection/src/third_party/

- [Optional] LLM-specific setup:
  - For experiments with LLaMA models, set the path to the directory containing the downloaded LLaMA weights in langchain.llms.huggingface.get_model_cache_dir.
  - Experiments with some LLMs may require setting up a HuggingFace auth token by running huggingface-cli login.
  - Store the OpenAI key in openai_keys.txt in the root directory.
The repository is organized as follows:
icl
├── data (local datasets -- download from https://1drv.ms/u/s!AqJNiE6C-nXuoawBxh-3rfUsSf4-8A?e=3o1YDK)
├── results (icl experiment results and logs)
├── src (relevant source files described below)
└── openai_keys.txt (any openai keys, one per line)
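Since openai_keys.txt holds one key per line, reading it amounts to collecting the non-empty lines. The helper below is a hypothetical sketch of that parsing step, not the repository's own loader; the name load_api_keys and the placeholder key strings are assumptions made for illustration.

```python
import tempfile
from pathlib import Path


def load_api_keys(path):
    """Return the non-empty lines of a key file, stripped of surrounding whitespace."""
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    return [line.strip() for line in lines if line.strip()]


# Demonstrate on a throwaway file so no real key file is touched.
with tempfile.TemporaryDirectory() as tmp:
    key_file = Path(tmp) / "openai_keys.txt"
    # Placeholder strings, not real keys; the blank line is skipped on load.
    key_file.write_text("sk-placeholder-1\n\nsk-placeholder-2\n", encoding="utf-8")
    keys = load_api_keys(key_file)

print(len(keys))
```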
Important source files include:
- src/params.py defines experiment parameters.
- src/data_params.py defines the parameters for each dataset.
- src/constants.py defines some useful enums and constants.
- src/driver.py is the main file to run a single ICL experiment. Instead of running this file directly, use src/experiments.py: it takes care of many default parameters and makes it easy to run multiple experiments.
- src/eval.py is used within src/driver.py to run the ICL evaluation.
- src/experiments.py contains the code to run experiments, track experiment statuses, and aggregate results. Instead of running the experiments directly, it dumps the parameters for all the experiments to a file that is then used by src/run.py. Run python experiments.py --help to see help.
- src/exp_utils.py defines various default arguments.
- src/run.py is used to run one or more experiments sequentially or in parallel on one or more GPUs. It is the main file to run experiments.
- src/selector/ contains the implementations of the various selectors.
- src/prompts/ contains templates for single examples and few-shot prompts.
src/experiments.py and src/run.py are the main files to run ICL evaluations. The following are some example workflows:
- Generate the parameters for 8-shot ICL with all the datasets, Neo and LLaMA-7B LLMs, with in-context examples selected using the Cosine, BERTScore, and GistScore selectors, and dump them to params/all.jsonl. See experiments.main for detailed usage.

      python experiments.py --label "test" --seeds 0 \
        --datasets "QNLI;MNLI;RTE;SST2;YELP;MRPC;QQP;PAWS;COPA;PIQA;WINOGRANDE;WSC;CMSQA;COLA;COMMONGEN;E2ENLG;DART;SST5;AGNEWS;AESLC;SMCALFLOW_CS;BREAK;MTOP;COGS" \
        --selectors "cosine;bertscore;gist_bertscore" \
        --lms "llama-7B" \
        --n-shots 8 --baselines-exp \
        --paramsfile "params/all.jsonl" --run \
        --no-collate-results \
        --preview "logfiles"

- Run the experiments in params/all.jsonl in parallel on GPUs 0 and 1.

      python run.py --paramsfile "params/all.jsonl" --gpus "0,1"
NOTE: To run ICL evaluations with GistScore, see the gist-icl repo.
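The two-step workflow above (dump one parameter set per line to a JSONL file, then spread the experiments over GPUs) can be pictured with the sketch below. It is only an illustration under stated assumptions: the record fields ("dataset", "selector"), the helper names, and the round-robin policy are hypothetical, and the actual file contents and scheduling in src/run.py may differ.

```python
import json


def read_jsonl(text):
    """Parse JSON Lines text: one JSON object per non-empty line."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


def assign_round_robin(experiments, gpus):
    """Distribute experiments over GPU ids in round-robin order."""
    if not gpus:
        raise ValueError("at least one GPU id is required")
    queues = {gpu: [] for gpu in gpus}
    for i, exp in enumerate(experiments):
        queues[gpus[i % len(gpus)]].append(exp)
    return queues


# Hypothetical parameter records; the real params file has its own schema.
sample = "\n".join(
    json.dumps({"dataset": dataset, "selector": selector})
    for dataset in ("RTE", "SST2")
    for selector in ("cosine", "bertscore")
)

experiments = read_jsonl(sample)
queues = assign_round_robin(experiments, gpus=[0, 1])

for gpu, queue in queues.items():
    print(gpu, [(e["dataset"], e["selector"]) for e in queue])
```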
To add a new dataset:
- Update constants.Dataset and constants.category2datasets.
- Add a parameters class for it in src/data_params.py, similar to all the other datasets.
  - If it requires a new metric, add it to prompts/base.py.
  - Test it using data_params.test_dataset or data_params.test.
- For ICL evaluation, some of these might also be necessary (though rarely):
  - If it requires any default arguments, add them to exp_utils.dataset_args_d.
  - If it has more than one split, add them to exp_utils.ds2splits. If it has more than one test_split, those will be recorded in exp_utils.dataset_args_d (similar to COGS).
  - If it requires a new metric, add the name for that metric to the metric_cols lists in experiments.make_tables.
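The first step, registering a dataset in both an enum and a category mapping, might look roughly like the sketch below. Only the names Dataset and category2datasets come from this README; the enum values, the category names, and the MYDATA entry are hypothetical, and the real definitions in src/constants.py may be structured differently.

```python
from enum import Enum


class Dataset(str, Enum):
    """Hypothetical stand-in for constants.Dataset."""
    RTE = "rte"
    SST2 = "sst2"
    MYDATA = "mydata"  # hypothetical new dataset being added


# Hypothetical stand-in for constants.category2datasets.
category2datasets = {
    "nli": [Dataset.RTE],
    "sentiment": [Dataset.SST2, Dataset.MYDATA],
}


def uncategorized(datasets, mapping):
    """Return the datasets that appear in no category of the mapping."""
    categorized = {d for members in mapping.values() for d in members}
    return [d for d in datasets if d not in categorized]


# A quick consistency check after adding a dataset: nothing should be left out.
missing = uncategorized(Dataset, category2datasets)
print(missing)
```

A check of this kind catches the common mistake of adding an enum member while forgetting the category mapping.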
There are two different types of command lines in this repository:
- Typer: used for non-nested parameterization. It allows multiple commands in a single script, run as python <script> <command> <arguments>. The <command> only needs to be specified if the script has more than one command (e.g. src/data_params.py). The <arguments> are specified a bit differently, so try running with --help to see them.
  - src/experiments.py
  - src/run.py
  - src/data_params.py
- Hydra: used for more nested parameterization.
  - src/driver.py: parameters defined in src/params.py:AllParams
If you found this repository useful, please cite the following papers:
@inproceedings{gupta-etal-2023-coverage,
title = "Coverage-based Example Selection for In-Context Learning",
author = "Gupta, Shivanshu and
Gardner, Matt and
Singh, Sameer",
booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2023",
month = dec,
year = "2023",
address = "Singapore",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2023.findings-emnlp.930",
doi = "10.18653/v1/2023.findings-emnlp.930",
pages = "13924--13950",
}
@article{gupta2023gistscore,
title={GistScore: Learning Better Representations for In-Context Example Selection with Gist Bottlenecks},
author={Shivanshu Gupta and Clemens Rosenbaum and Ethan R. Elenberg},
year={2023},
eprint={2311.09606},
archivePrefix={arXiv},
primaryClass={cs.CL}
}