pharaouk/dharma

dharma: build your own tiny benchmark datasets

🤗 HF Repo • 🐦 Twitter

use dharma to craft small or large benchmark datasets for use during training or for fast evals. these serve as good indicators of performance on the benchmarks you care about, so make sure to craft a benchmark dataset appropriate for your use case. more benchmarks and features are in the works to give you even more control over your bench datasets.

dharma's core value is the idea of 'eval through time' during a training run: it sheds light on your model's performance as it processes and is optimized on your training data. this can help you train more powerful models that do exactly what you intend them to. of course, MCQ-based benches tell us little about performance beyond this format, so dharma will expand to include non-MCQ benches as well. stay tuned.
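to make 'eval through time' concrete, here is a minimal sketch that scores a small MCQ bench at several training steps. it is not dharma's own code: the record layout (`prompt`, `answer`), the `mcq_accuracy` and `eval_through_time` helpers, and the stand-in "models" are all assumptions made for illustration.

```python
from typing import Callable, Dict, List

# Hypothetical record layout: a formatted prompt plus the gold answer letter.
BENCH: List[dict] = [
    {"prompt": "Question: 2 + 2 = ?. A. 3 B. 4 C. 5 D. 6 Answer:", "answer": "B"},
    {"prompt": "Question: 3 * 3 = ?. A. 6 B. 9 C. 12 D. 8 Answer:", "answer": "B"},
    {"prompt": "Question: 5 - 4 = ?. A. 1 B. 2 C. 3 D. 4 Answer:", "answer": "A"},
    {"prompt": "Question: 8 / 2 = ?. A. 2 B. 3 C. 4 D. 6 Answer:", "answer": "C"},
]


def mcq_accuracy(predict: Callable[[str], str], bench: List[dict]) -> float:
    """Fraction of items where the predicted letter matches the gold letter."""
    if not bench:
        return 0.0
    correct = sum(
        1 for item in bench
        if predict(item["prompt"]).strip().upper() == item["answer"]
    )
    return correct / len(bench)


def eval_through_time(
    checkpoints: Dict[int, Callable[[str], str]], bench: List[dict]
) -> Dict[int, float]:
    """Score every checkpoint (keyed by training step) on the same bench."""
    return {
        step: mcq_accuracy(predict, bench)
        for step, predict in sorted(checkpoints.items())
    }


# Stand-ins for real checkpoints: one always answers "A", one always "B".
history = eval_through_time(
    {100: lambda prompt: "A", 200: lambda prompt: "B"},
    BENCH,
)
print(history)
```

in a real run, each `predict` would wrap a model checkpoint, and the resulting step-to-accuracy mapping is what you would log to your tracker.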

Quickstart

!pip install git+https://github.com/pharaouk/dharma

#SETUP config.yml file

#IN YOUR SCRIPT

import dharma
dharma.run_dharma('config.yml')

or

Clone and Setup:

git clone https://github.com/pharaouk/dharma.git
cd dharma
pip install -r requirements.txt

Configs:

output: #(string) dataset name, leave blank to use default

hf_namespace: #(string)  hf username/namespace
hf_upload: false  #(bool) upload the dataset to hf? T/F
hf_private: false #(bool) hf private? T/F

prompt_format: "Question: {questions}. {options} Answer:"  #(string) prompt format to use for the eval datasets, not yet customizable

dataset_size: 2000  #(int) total target dataset size

data_seed: 42  #(int) dataset seed

force_dist: true  #(bool) force even distribution for answers (i.e. A-25 B-25 C-25 D-25)

benchmarks: #this determines which benchmarks and counts/distributions are used for the target dataset. enter 0 if you don't want that dataset included.

  mmlu: 
    count: 1
  arc_c:
    count: 1
  arc_e:
    count: 1
  agieval:
    count: 1
  boolq:
    count: 1
  obqa:
    count: 1
  truthfulqa:
    count: 1
  winogrande:
    count: 1
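the sketch below illustrates two of the config options above: how a `prompt_format` string with `{questions}` and `{options}` placeholders could be filled in, and what `force_dist` (an even spread of answer letters) might look like in practice. this is an assumption-heavy illustration, not dharma's actual implementation: the `render_prompt` and `balanced_sample` helpers, the "A. ... B. ..." option layout, and the row layout are hypothetical.

```python
import random
import string
from collections import Counter, defaultdict
from typing import List

PROMPT_FORMAT = "Question: {questions}. {options} Answer:"


def render_prompt(question: str, options: List[str]) -> str:
    """Fill the template; the lettered option layout is an assumption."""
    option_text = " ".join(
        f"{string.ascii_uppercase[i]}. {opt}" for i, opt in enumerate(options)
    )
    return PROMPT_FORMAT.format(questions=question, options=option_text)


def balanced_sample(rows: List[dict], size: int, seed: int = 42) -> List[dict]:
    """Pick up to `size` rows, cycling over answer letters so that each
    letter is represented as evenly as the available rows allow."""
    rng = random.Random(seed)
    by_answer = defaultdict(list)
    for row in rows:
        by_answer[row["answer"]].append(row)
    for bucket in by_answer.values():
        rng.shuffle(bucket)

    letters = sorted(by_answer)
    picked: List[dict] = []
    # Round-robin: take one row per letter on each pass until done.
    while len(picked) < size and any(by_answer[letter] for letter in letters):
        for letter in letters:
            if by_answer[letter] and len(picked) < size:
                picked.append(by_answer[letter].pop())
    return picked


print(render_prompt("What color is the sky", ["Red", "Blue"]))

# A skewed pool: 40 rows answer "A", 10 each answer "B", "C", "D".
rows = [
    {"answer": letter}
    for letter, n in (("A", 40), ("B", 10), ("C", 10), ("D", 10))
    for _ in range(n)
]
sample = balanced_sample(rows, size=20, seed=42)
counts = Counter(row["answer"] for row in sample)
print(counts)  # each of A, B, C, D appears 5 times
```

the round-robin pick keeps the sample even when the source pool is skewed; if one letter runs out, the remaining letters fill the rest of the sample.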

Run:

python dharma/dharma.py

or

python dharma/dharma.py --config <CONFIG_PATH>

How is Dharma used?

Example dharma-1 dataset: https://huggingface.co/datasets/pharaouk/dharma-1

Example axolotl implementation: https://github.com/OpenAccess-AI-Collective/axolotl/blob/638c2dafb54f1c7c61a5f7ad40f8cf6965bec896/src/axolotl/core/trainer_builder.py#L152

#On Axolotl (in config.yml for your training run)
do_bench_eval: true
bench_dataset: <LINK_TO_JSON>  # default: "pharaouk/dharma-1/dharma_1_mini.json"

Example wandb: Wandb

TODOS

  1. bigbench compatibility. [in progress] (currently not optimal)
  2. Custom prompt formats (to replace standard one we've set)
  3. standardize dataset cleaning funcs (add sim search and subject based segmentation)
  4. Add a testing/eval script with local llm w local lb
  5. Upload cleaned and corrected copies of all benchmark datasets to HF
  6. Fix uneven distributions
  7. CLIx updates (tqdm + cleanup)
  8. pip package
  9. New benchmarks, non MCQ
  10. HF Compatible Custom Callback library with customization options
  11. better selection algo for the benchmarks
  12. Randomize answer options (could be useful to evaluate/minimize bias in model)
  13. More languages
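as an illustration of the answer-option randomization item in the list above, here is one way shuffling could work while keeping track of the correct answer. this feature is not implemented in dharma as far as this README states; the `shuffle_options` helper and its signature are hypothetical.

```python
import random
from typing import List, Tuple


def shuffle_options(
    options: List[str], answer_index: int, rng: random.Random
) -> Tuple[List[str], int]:
    """Return the options in a random order plus the new index of the
    correct answer, so the gold label stays attached to the right text."""
    order = list(range(len(options)))
    rng.shuffle(order)
    shuffled = [options[i] for i in order]
    # The correct option moved to wherever its old index landed in `order`.
    new_answer_index = order.index(answer_index)
    return shuffled, new_answer_index


rng = random.Random(42)
options = ["Paris", "Rome", "Madrid", "Berlin"]
shuffled, new_index = shuffle_options(options, answer_index=0, rng=rng)
print(shuffled, "correct letter:", "ABCD"[new_index])
```

scoring a model on several shuffles of the same question is one way to check whether it favors a particular answer position.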
