Use dharma to craft small or large benchmarking datasets that can be used during training or for fast evals. These datasets serve as indicators of performance on the benchmarks you care about, so make sure to craft a benchmark dataset appropriate for your use case. More benchmarks and features are in the works to give you even more control over your bench datasets. Dharma's core value is the idea of 'eval through time' during a training run: it sheds light on your model's performance as it processes and is optimized on your training data. This can be useful for training more powerful models that do exactly what you intend them to. Of course, MCQ-based benches do not tell us much about performance beyond this format, so dharma will expand to include non-MCQ benches as well. Stay tuned.
!pip install git+https://github.com/pharaouk/dharma
#SETUP config.yml file
#IN YOUR SCRIPT
import dharma
dharma.run_dharma('config.yml')
or
Clone and Setup:
git clone https://github.com/pharaouk/dharma.git
pip install -r requirements.txt
Configs:
output: #(string) dataset name, leave blank to use default
hf_namespace: #(string) hf username/namespace
hf_upload: false #(bool) upload the dataset to hf? T/F
hf_private: false #(bool) hf private? T/F
prompt_format: "Question: {questions}. {options} Answer:" #(string) prompt format to use for the eval datasets, not yet customizable
dataset_size: 2000 #(int) total target dataset size
data_seed: 42 #(int) dataset seed
force_dist: true #(bool) force even distribution for answers (i.e. A-25 B-25 C-25 D-25)
benchmarks: #determines which benchmarks are included and their counts/distributions in the target dataset. enter 0 to exclude a benchmark.
  mmlu:
    count: 1
  arc_c:
    count: 1
  arc_e:
    count: 1
  agieval:
    count: 1
  boolq:
    count: 1
  obqa:
    count: 1
  truthfulqa:
    count: 1
  winogrande:
    count: 1
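To make `prompt_format` and `force_dist` more concrete, here is a minimal sketch of how a multiple-choice record could be rendered with the template above and how a sample could be balanced across answer letters. This is an illustration only, not dharma's actual implementation: the record fields (`question`, `options`, `answer`) and the helpers `render_prompt` and `balanced_sample` are hypothetical names introduced for this example.

```python
import random
from collections import defaultdict

# Template copied from the config above.
PROMPT_FORMAT = "Question: {questions}. {options} Answer:"


def render_prompt(record, prompt_format=PROMPT_FORMAT):
    """Fill the template with a question and lettered options (hypothetical record schema)."""
    letters = "ABCDEFGH"
    options = " ".join(f"{letters[i]}. {opt}" for i, opt in enumerate(record["options"]))
    return prompt_format.format(questions=record["question"], options=options)


def balanced_sample(records, size, seed=42):
    """Pick up to `size` records with answer letters as evenly represented as possible."""
    rng = random.Random(seed)
    by_answer = defaultdict(list)
    for rec in records:
        by_answer[rec["answer"]].append(rec)
    for group in by_answer.values():
        rng.shuffle(group)
    picked = []
    # Round-robin over answer letters until the target size is reached or every pool is empty.
    while len(picked) < size and any(by_answer.values()):
        for letter in sorted(by_answer):
            if by_answer[letter] and len(picked) < size:
                picked.append(by_answer[letter].pop())
    return picked


# Toy data: 40 records whose gold answers cycle through A, B, C, D.
records = [
    {"question": f"q{i}", "options": ["w", "x", "y", "z"], "answer": "ABCD"[i % 4]}
    for i in range(40)
]
sample = balanced_sample(records, size=8)
prompt = render_prompt({"question": "What is 2+2", "options": ["3", "4"], "answer": "B"})
```

When one answer letter has too few records, the round-robin simply keeps drawing from the remaining letters, so the result is as even as the data allows rather than strictly 25/25/25/25.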
Run:
python dharma/dharma.py
or
python dharma/dharma.py --config <CONFIG_PATH>
How is Dharma used?
Example dharma-1 dataset: https://huggingface.co/datasets/pharaouk/dharma-1
Example axolotl implementation: https://github.com/OpenAccess-AI-Collective/axolotl/blob/638c2dafb54f1c7c61a5f7ad40f8cf6965bec896/src/axolotl/core/trainer_builder.py#L152
#On Axolotl (in config.yml for your training run)
do_bench_eval: true
bench_dataset: <LINK_TO_JSON> #default: "pharaouk/dharma-1/dharma_1_mini.json"
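The point of running a bench eval during training is to score the same fixed MCQ set at regular intervals and watch the trend. The sketch below shows that loop in isolation, under stated assumptions: `predict` stands in for whatever returns the model's answer letter at a given step, and the record fields `prompt` and `answer` are hypothetical. It does not reflect the internals of axolotl or dharma.

```python
def mcq_accuracy(records, predict):
    """Fraction of records where the predicted letter matches the gold answer."""
    if not records:
        return 0.0
    correct = sum(
        1 for rec in records
        if predict(rec["prompt"]).strip().upper() == rec["answer"]
    )
    return correct / len(records)


def eval_through_time(records, predictors_by_step):
    """Score the same bench set at each training step; returns {step: accuracy}."""
    return {
        step: mcq_accuracy(records, predict)
        for step, predict in sorted(predictors_by_step.items())
    }


# Toy example: two fake "checkpoints" (hypothetical stand-ins for a real model).
bench = [{"prompt": "p1", "answer": "A"}, {"prompt": "p2", "answer": "B"}]
history = eval_through_time(
    bench,
    {
        0: lambda prompt: "A",                              # always answers A
        100: lambda prompt: "A" if prompt == "p1" else "B",  # answers both correctly
    },
)
```

Keeping the bench set fixed across steps is what makes the numbers comparable; changing the sample between evaluations would mix model progress with sampling noise.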
TODOS
- bigbench compatibility. [in progress] (currently not optimal)
- Custom prompt formats (to replace standard one we've set)
- standardize dataset cleaning funcs (add sim search and subject based segmentation)
- Add a testing/eval script with local LLM with local lb
- Upload cleaned and corrected copies of all benchmark datasets to HF
- Fix uneven distributions
- CLIx updates (tqdm + cleanup)
- pip package
- New benchmarks, non MCQ
- HF Compatible Custom Callback library with customization options
- better selection algo for the benchmarks
- Randomize answers options (could be useful to evaluate/minimize bias in model)
- More languages