# Evaluate TinyLlama

## GPT4All Benchmarks

We evaluate TinyLlama's commonsense reasoning ability following the GPT4All evaluation suite, with Pythia as the baseline. We report `acc_norm` by default.

Base models:

| Model | Pretrain Tokens | HellaSwag | Obqa | WinoGrande | ARC_c | ARC_e | boolq | piqa | avg |
|---|---|---|---|---|---|---|---|---|---|
| Pythia-1.0B | 300B | 47.16 | 31.40 | 53.43 | 27.05 | 48.99 | 60.83 | 69.21 | 48.30 |
| TinyLlama-1.1B-intermediate-step-50K-104b | 103B | 43.50 | 29.80 | 53.28 | 24.32 | 44.91 | 59.66 | 67.30 | 46.11 |
| TinyLlama-1.1B-intermediate-step-240k-503b | 503B | 49.56 | 31.40 | 55.80 | 26.54 | 48.32 | 56.91 | 69.42 | 48.28 |
| TinyLlama-1.1B-intermediate-step-480k-1007B | 1007B | 52.54 | 33.40 | 55.96 | 27.82 | 52.36 | 59.54 | 69.91 | 50.22 |
| TinyLlama-1.1B-intermediate-step-715k-1.5T | 1.5T | 53.68 | 35.20 | 58.33 | 29.18 | 51.89 | 59.08 | 71.65 | 51.29 |
| TinyLlama-1.1B-intermediate-step-955k-2T | 2T | 54.63 | 33.40 | 56.83 | 28.07 | 54.67 | 63.21 | 70.67 | 51.64 |
| TinyLlama-1.1B-intermediate-step-1195k-2.5T | 2.5T | 58.96 | 34.40 | 58.72 | 31.91 | 56.78 | 63.21 | 73.07 | 53.86 |
| TinyLlama-1.1B-intermediate-step-1431k-3T | 3T | 59.20 | 36.00 | 59.12 | 30.12 | 55.25 | 57.83 | 73.29 | 52.99 |

Chat models:

| Model | Pretrain Tokens | HellaSwag | Obqa | WinoGrande | ARC_c | ARC_e | boolq | piqa | avg |
|---|---|---|---|---|---|---|---|---|---|
| TinyLlama-1.1B-Chat-v0.1 | 503B | 53.81 | 32.20 | 55.01 | 28.67 | 49.62 | 58.04 | 69.64 | 49.57 |
| TinyLlama-1.1B-Chat-v0.2 | 503B | 53.63 | 32.80 | 54.85 | 28.75 | 49.16 | 55.72 | 69.48 | 49.20 |
| TinyLlama-1.1B-Chat-v0.3 | 1T | 56.81 | 34.20 | 55.80 | 30.03 | 53.20 | 59.57 | 69.91 | 51.36 |
| TinyLlama-1.1B-Chat-v0.4 | 1.5T | 58.59 | 35.40 | 58.80 | 30.80 | 54.04 | 57.31 | 71.16 | 52.30 |

We observed large improvements once we fine-tuned the model. We attribute this to two factors: (1) the base model has not undergone learning-rate cool-down, and fine-tuning effectively serves as that cool-down; (2) the SFT stage better elicits the model's internal knowledge.

You can obtain the above scores by running lm-eval-harness:

```bash
# Swap the `pretrained=` path to score another checkpoint.
python main.py \
    --model hf-causal \
    --model_args pretrained=PY007/TinyLlama-1.1B-Chat-v0.1,dtype="float" \
    --tasks hellaswag,openbookqa,winogrande,arc_easy,arc_challenge,boolq,piqa \
    --device cuda:0 --batch_size 32
```
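
As a minimal sketch of how you might score one of the base checkpoints instead of the chat model, the same command works with a different `pretrained=` path. The `--output_path` flag for dumping per-task results to JSON is an assumption about your lm-eval-harness version; if your checkout does not support it, drop the flag and read the table printed to stdout instead.

```bash
# Sketch only: score a base checkpoint and save per-task acc_norm values to JSON.
# --output_path is assumed to exist in your lm-eval-harness version.
python main.py \
    --model hf-causal \
    --model_args pretrained=PY007/TinyLlama-1.1B-intermediate-step-480K-1T,dtype="float" \
    --tasks hellaswag,openbookqa,winogrande,arc_easy,arc_challenge,boolq,piqa \
    --device cuda:0 --batch_size 32 \
    --output_path results/tinyllama_480k_1T.json
```

The saved per-task scores let you recompute the averaged number reported in the tables above.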

## Instruct-Eval Benchmarks

We evaluate TinyLlama's problem-solving ability on the Instruct-Eval evaluation suite.

| Model | MMLU | BBH | HumanEval | DROP |
|---|---|---|---|---|
| Pythia-1.0B | 25.70 | 28.19 | 1.83 | 4.25 |
| TinyLlama-1.1B-intermediate-step-50K-104b | 26.45 | 28.82 | 5.49 | 11.42 |
| TinyLlama-1.1B-intermediate-step-240k-503b | 26.16 | 28.83 | 4.88 | 12.43 |
| TinyLlama-1.1B-intermediate-step-480K-1T | 24.65 | 29.21 | 6.10 | 13.03 |
| TinyLlama-1.1B-intermediate-step-715k-1.5T | 24.85 | 28.20 | 7.93 | 14.43 |
| TinyLlama-1.1B-intermediate-step-955k-2T | 25.97 | 29.07 | 6.71 | 13.14 |
| TinyLlama-1.1B-intermediate-step-1195k-token-2.5T | 25.92 | 29.32 | 9.15 | 15.45 |

You can obtain the above scores by running instruct-eval:

```bash
# Each task is launched on its own GPU; adjust CUDA_VISIBLE_DEVICES to match your hardware.
CUDA_VISIBLE_DEVICES=0 python main.py mmlu --model_name llama --model_path PY007/TinyLlama-1.1B-intermediate-step-480K-1T
CUDA_VISIBLE_DEVICES=1 python main.py bbh --model_name llama --model_path PY007/TinyLlama-1.1B-intermediate-step-480K-1T
CUDA_VISIBLE_DEVICES=2 python main.py drop --model_name llama --model_path PY007/TinyLlama-1.1B-intermediate-step-480K-1T
CUDA_VISIBLE_DEVICES=3 python main.py humaneval --model_name llama --n_sample 1 --model_path PY007/TinyLlama-1.1B-intermediate-step-480K-1T
```
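
If you only have a single GPU, a sketch of running the four tasks sequentially for one checkpoint looks like the following; the model path and all flags are taken from the commands above, and the loop itself is just a convenience.

```bash
# Sketch: run the four Instruct-Eval tasks back-to-back on one GPU for a single checkpoint.
MODEL=PY007/TinyLlama-1.1B-intermediate-step-480K-1T

for TASK in mmlu bbh drop; do
    CUDA_VISIBLE_DEVICES=0 python main.py "$TASK" --model_name llama --model_path "$MODEL"
done

# HumanEval additionally takes a sample count; n_sample=1 matches the numbers reported above.
CUDA_VISIBLE_DEVICES=0 python main.py humaneval --model_name llama --n_sample 1 --model_path "$MODEL"
```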