The following guidance will help you reproduce our results on short-context tasks.
We leverage the evaluation data and scripts from LEMA. The few-shot examples for GSM8K and MATH are selected from their respective training sets according to input similarity.
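For reference, similarity-based selection can be approximated as in the sketch below. TF-IDF cosine similarity is our assumption here, not necessarily the retriever used in LEMA, and `select_few_shot` is a hypothetical helper rather than a script in this repo.

```python
# Hypothetical sketch of similarity-based few-shot selection.
# TF-IDF is our assumption; the actual retriever may differ.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def select_few_shot(test_question, train_questions, k=8):
    """Return indices of the k training questions most similar to the test question."""
    vectorizer = TfidfVectorizer()
    train_vecs = vectorizer.fit_transform(train_questions)   # (n_train, n_features)
    test_vec = vectorizer.transform([test_question])         # (1, n_features)
    scores = cosine_similarity(test_vec, train_vecs)[0]      # similarity to each example
    return scores.argsort()[::-1][:k].tolist()               # top-k, most similar first
```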
Step 1: Inference with vLLM.
The test data in `./prompts/` have been formatted into the system template for FILM-7B.
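Since these files ship pre-formatted, no action is needed before running the commands below. If you want to format custom data the same way, one option is the tokenizer's chat template; a sketch, assuming the FILM-7B tokenizer ships a Mistral-style chat template:

```python
# Sketch: formatting a raw question with the model's chat template.
# Assumption: the FILM-7B tokenizer provides a (Mistral-style) chat template.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("In2Training/FILM-7B")
prompt = tokenizer.apply_chat_template(
    [{"role": "user", "content": "<few-shot examples>\n\nQuestion: ..."}],
    tokenize=False,
    add_generation_prompt=True,
)
print(prompt)  # e.g. "[INST] ... [/INST]" for Mistral-style templates
```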
```bash
# Extract Data
# Inference
export NCCL_IGNORE_DISABLED_P2P=1
python ../vllm_inference/vllm_inference.py --model_path In2Training/FILM-7B \
    --testdata_file gsm8k_8shot.jsonl \
    --testdata_folder ./prompts/ \
    --output_folder ./results/FILM-7B/ \
    --max_length 2048 \
    --tensor_parallel_size 8

python ../vllm_inference/vllm_inference.py --model_path In2Training/FILM-7B \
    --testdata_file math_4shot.jsonl \
    --testdata_folder ./prompts/ \
    --output_folder ./results/FILM-7B/ \
    --max_length 2048 \
    --tensor_parallel_size 8

python ../vllm_inference/vllm_inference.py --model_path In2Training/FILM-7B \
    --testdata_file csqa_0shot.jsonl \
    --testdata_folder ./prompts/ \
    --output_folder ./results/FILM-7B/ \
    --max_length 128 \
    --tensor_parallel_size 8
```
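For orientation, the core of `vllm_inference.py` presumably looks something like the sketch below. The `prompt`/`generation` field names and greedy decoding are our assumptions, as is mapping `--max_length` to the generation budget; check the script for the actual behavior.

```python
# Minimal sketch of the inference loop (field names and greedy decoding are
# assumptions, not taken from the repo's script).
import json
from vllm import LLM, SamplingParams

llm = LLM(model="In2Training/FILM-7B", tensor_parallel_size=8)
params = SamplingParams(temperature=0.0, max_tokens=2048)  # greedy decoding (assumed)

with open("./prompts/gsm8k_8shot.jsonl") as f:
    records = [json.loads(line) for line in f]

outputs = llm.generate([r["prompt"] for r in records], params)  # "prompt" field assumed
with open("./results/FILM-7B/gsm8k_8shot.jsonl", "w") as f:
    for rec, out in zip(records, outputs):
        rec["generation"] = out.outputs[0].text
        f.write(json.dumps(rec) + "\n")
```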
We provide our generation results in `./results/`, including outputs from both FILM-7B and Mistral-7B-Instruct-v0.2.
Step 2: Evaluation.
Run `evaluation.py` to calculate the evaluation metrics for the different tasks.
```bash
python evaluation.py
```
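As a rough illustration of the kind of metric involved, GSM8K-style scoring extracts the final number from each generation and checks exact match against the gold answer. The sketch below is our illustration, not the actual logic of `evaluation.py`:

```python
# Hypothetical sketch of GSM8K exact-match scoring; evaluation.py may parse
# answers differently.
import re

def extract_final_number(text):
    """Take the last number in the generation as the predicted answer."""
    numbers = re.findall(r"-?\d[\d,]*\.?\d*", text)
    return numbers[-1].replace(",", "") if numbers else None

def exact_match_accuracy(generations, golds):
    correct = sum(extract_final_number(g) == a for g, a in zip(generations, golds))
    return correct / len(golds)
```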
We use `lm_eval` (the EleutherAI LM Evaluation Harness) for the evaluation on MMLU, BoolQ, RACE-H, ARC-C, and HellaSwag. Results may vary slightly across `lm_eval` versions.
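If you prefer to drive the harness from Python, recent versions (>= 0.4) expose `simple_evaluate`. The task names below are our assumption for that version and should be checked against your installed release (in particular the exact name for RACE-H):

```python
# Sketch: programmatic lm_eval run (harness >= 0.4). Task names are assumed
# and should be verified against your installed version.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=In2Training/FILM-7B",
    tasks=["mmlu", "boolq", "race", "arc_challenge", "hellaswag"],
    batch_size=8,
)
print(results["results"])
```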