We evaluate TinyLlama's commonsense reasoning ability following the GPT4All evaluation suite, with Pythia-1.0B as the baseline. We report acc_norm by default.
Base models:
Model | Pretrain Tokens | HellaSwag | OBQA | WinoGrande | ARC-c | ARC-e | BoolQ | PIQA | Avg |
---|---|---|---|---|---|---|---|---|---|
Pythia-1.0B | 300B | 47.16 | 31.40 | 53.43 | 27.05 | 48.99 | 60.83 | 69.21 | 48.30 |
TinyLlama-1.1B-intermediate-step-50K-104b | 103B | 43.50 | 29.80 | 53.28 | 24.32 | 44.91 | 59.66 | 67.30 | 46.11 |
TinyLlama-1.1B-intermediate-step-240k-503b | 503B | 49.56 | 31.40 | 55.80 | 26.54 | 48.32 | 56.91 | 69.42 | 48.28 |
TinyLlama-1.1B-intermediate-step-480k-1007B | 1007B | 52.54 | 33.40 | 55.96 | 27.82 | 52.36 | 59.54 | 69.91 | 50.22 |
TinyLlama-1.1B-intermediate-step-715k-1.5T | 1.5T | 53.68 | 35.20 | 58.33 | 29.18 | 51.89 | 59.08 | 71.65 | 51.29 |
TinyLlama-1.1B-intermediate-step-955k-2T | 2T | 54.63 | 33.40 | 56.83 | 28.07 | 54.67 | 63.21 | 70.67 | 51.64 |
TinyLlama-1.1B-intermediate-step-1195k-2.5T | 2.5T | 58.96 | 34.40 | 58.72 | 31.91 | 56.78 | 63.21 | 73.07 | 53.86 |
TinyLlama-1.1B-intermediate-step-1431k-3T | 3T | 59.20 | 36.00 | 59.12 | 30.12 | 55.25 | 57.83 | 73.29 | 52.99 |
Chat models:
Model | Pretrain Tokens | HellaSwag | OBQA | WinoGrande | ARC-c | ARC-e | BoolQ | PIQA | Avg |
---|---|---|---|---|---|---|---|---|---|
TinyLlama-1.1B-Chat-v0.1 | 503B | 53.81 | 32.20 | 55.01 | 28.67 | 49.62 | 58.04 | 69.64 | 49.57 |
TinyLlama-1.1B-Chat-v0.2 | 503B | 53.63 | 32.80 | 54.85 | 28.75 | 49.16 | 55.72 | 69.48 | 49.20 |
TinyLlama-1.1B-Chat-v0.3 | 1T | 56.81 | 34.20 | 55.80 | 30.03 | 53.20 | 59.57 | 69.91 | 51.36 |
TinyLlama-1.1B-Chat-v0.4 | 1.5T | 58.59 | 35.40 | 58.80 | 30.80 | 54.04 | 57.31 | 71.16 | 52.30 |
We observed substantial improvements once the model was fine-tuned. We attribute this to two factors: (1) the base model has not undergone learning-rate cool-down, and fine-tuning effectively serves as that cool-down; (2) the SFT stage better elicits the knowledge already present in the base model.
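To make the first point concrete, the sketch below shows what a cool-down phase at the end of a decay schedule looks like. It is illustrative only: the schedule shape, `peak_lr`, `min_lr`, and the cool-down fraction are placeholder values, not TinyLlama's actual training hyperparameters.

```python
# Illustration only: a cosine decay followed by a short linear cool-down.
# All hyperparameter values below are placeholders, not TinyLlama's actual config.
import math

def lr_at(step, total_steps, peak_lr=4e-4, min_lr=4e-5, cooldown_frac=0.05):
    cooldown_start = int(total_steps * (1 - cooldown_frac))
    if step < cooldown_start:
        # Cosine decay from peak_lr down to min_lr.
        progress = step / cooldown_start
        return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * progress))
    # Cool-down phase: linearly anneal the remaining LR toward zero.
    remaining = (total_steps - step) / (total_steps - cooldown_start)
    return min_lr * remaining

# An intermediate checkpoint saved before cooldown_start still carries a
# comparatively high learning rate; fine-tuning at a small LR plays a
# similar role to the missing cool-down.
print(lr_at(step=1_195_000, total_steps=1_431_000))
```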
You can obtain the above scores by running lm-eval-harness:
```bash
python main.py \
    --model hf-causal \
    --model_args pretrained=PY007/TinyLlama-1.1B-Chat-v0.1,dtype="float" \
    --tasks hellaswag,openbookqa,winogrande,arc_easy,arc_challenge,boolq,piqa \
    --device cuda:0 --batch_size 32
```
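If you prefer to drive the harness from Python, a minimal sketch along the following lines should produce the same numbers. The function name and arguments follow the older (`hf-causal`-era) lm-eval-harness and may differ in newer releases, so adapt them to the version you have installed.

```python
# Minimal sketch of the same evaluation via lm-eval-harness's Python API.
# Signature assumed from the older harness; check your installed version.
from lm_eval import evaluator

results = evaluator.simple_evaluate(
    model="hf-causal",
    model_args="pretrained=PY007/TinyLlama-1.1B-Chat-v0.1,dtype=float",
    tasks=["hellaswag", "openbookqa", "winogrande",
           "arc_easy", "arc_challenge", "boolq", "piqa"],
    batch_size=32,
    device="cuda:0",
)

# Per-task metrics (acc, acc_norm, ...) live under results["results"];
# BoolQ reports only acc, hence the fallback.
for task, metrics in results["results"].items():
    print(task, metrics.get("acc_norm", metrics.get("acc")))
```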
We evaluate TinyLlama's problem-solving ability on the InstructEval evaluation suite.
Model | MMLU | BBH | HumanEval | DROP |
---|---|---|---|---|
Pythia-1.0B | 25.70 | 28.19 | 1.83 | 4.25 |
TinyLlama-1.1B-intermediate-step-50K-104b | 26.45 | 28.82 | 5.49 | 11.42 |
TinyLlama-1.1B-intermediate-step-240k-503b | 26.16 | 28.83 | 4.88 | 12.43 |
TinyLlama-1.1B-intermediate-step-480K-1T | 24.65 | 29.21 | 6.10 | 13.03 |
TinyLlama-1.1B-intermediate-step-715k-1.5T | 24.85 | 28.20 | 7.93 | 14.43 |
TinyLlama-1.1B-intermediate-step-955k-2T | 25.97 | 29.07 | 6.71 | 13.14 |
TinyLlama-1.1B-intermediate-step-1195k-token-2.5T | 25.92 | 29.32 | 9.15 | 15.45 |
You can obtain the above scores by running instruct-eval:
```bash
CUDA_VISIBLE_DEVICES=0 python main.py mmlu --model_name llama --model_path PY007/TinyLlama-1.1B-intermediate-step-480K-1T
CUDA_VISIBLE_DEVICES=1 python main.py bbh --model_name llama --model_path PY007/TinyLlama-1.1B-intermediate-step-480K-1T
CUDA_VISIBLE_DEVICES=2 python main.py drop --model_name llama --model_path PY007/TinyLlama-1.1B-intermediate-step-480K-1T
CUDA_VISIBLE_DEVICES=3 python main.py humaneval --model_name llama --n_sample 1 --model_path PY007/TinyLlama-1.1B-intermediate-step-480K-1T
```
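If you only have a single GPU, a small wrapper like the sketch below runs the same four tasks sequentially instead of spreading them across four devices. The task names and flags are taken verbatim from the commands above; double-check them against your instruct-eval checkout.

```python
# Convenience sketch: run the four InstructEval tasks one after another on one GPU.
import os
import subprocess

MODEL = "PY007/TinyLlama-1.1B-intermediate-step-480K-1T"
ENV = dict(os.environ, CUDA_VISIBLE_DEVICES="0")

TASKS = [
    ("mmlu", []),
    ("bbh", []),
    ("drop", []),
    ("humaneval", ["--n_sample", "1"]),
]

for task, extra_args in TASKS:
    cmd = ["python", "main.py", task,
           "--model_name", "llama",
           *extra_args,
           "--model_path", MODEL]
    subprocess.run(cmd, env=ENV, check=True)
```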