
# Needle Test Evaluation

We reconstructed the original "Needle In A Haystack - Pressure Test" code to add support for evaluating HuggingFace models. The evaluation procedure involves three steps: test prompt generation, model prediction, and scoring.

## Test prompt generation

Configure the test parameters in `config-prompt.yaml`, then run

```shell
python prompt.py
```

The test prompts will be generated under `prompts/`.
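The core idea behind the generated prompts is to embed a "needle" sentence at a controlled depth inside a long "haystack" context. The sketch below illustrates that insertion step; the function and parameter names are hypothetical, not the actual API of `prompt.py`, which is driven entirely by `config-prompt.yaml`.

```python
def insert_needle(haystack: str, needle: str, depth: float) -> str:
    """Place `needle` at `depth` (0.0 = start, 1.0 = end) of `haystack`.

    Illustrative helper only -- the real prompt generation is configured
    via config-prompt.yaml.
    """
    if not 0.0 <= depth <= 1.0:
        raise ValueError("depth must be in [0, 1]")
    pos = int(len(haystack) * depth)
    # Snap back to the end of the previous sentence so the needle
    # lands on a sentence boundary and reads naturally.
    cut = haystack.rfind(". ", 0, pos)
    if cut != -1:
        pos = cut + 2
    return haystack[:pos] + needle + " " + haystack[pos:]
```

Sweeping `depth` over a range of values (and the haystack over a range of context lengths) yields the grid of test prompts that the pressure test evaluates.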

## Model prediction

Set your model path (or HuggingFace path) in `config-pred.yaml`, then run

```shell
CUDA_VISIBLE_DEVICES=0 python pred.py
```

The model predictions will be saved under `pred/`.
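For orientation, a `config-pred.yaml` for this kind of setup might look roughly like the fragment below. The key names here are assumptions for illustration only; consult the repository's actual `config-pred.yaml` for the real schema.

```yaml
# Hypothetical config-pred.yaml sketch -- key names are illustrative,
# not the repository's actual schema.
model_path: meta-llama/Llama-2-7b-chat-hf  # local path or HuggingFace repo id
prompt_dir: prompts/     # where prompt.py wrote the test prompts
output_dir: pred/        # where predictions will be saved
max_new_tokens: 256
```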

## Scoring

Configure your scoring model (default: `gpt-4`) in `config-eval.yaml` and set your API key in `eval.py`, then run

```shell
python eval.py
```

The scoring results will be saved as JSON under `results/`.
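The repository scores with an LLM judge (`gpt-4` by default), but the question the judge answers is simple: did the model's answer recover the needle? As a simplified, deterministic stand-in for that judgment, one could score by keyword recall. This is an illustration of the scoring idea, not the method `eval.py` uses, and the names are hypothetical.

```python
def recall_score(prediction: str, needle_keywords: list[str]) -> float:
    """Fraction of needle keywords present in the prediction.

    A substring-recall stand-in for the GPT-4 judge, for illustration only.
    """
    pred = prediction.lower()
    hits = sum(1 for kw in needle_keywords if kw.lower() in pred)
    return hits / len(needle_keywords)
```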

Finally, visualize your results with

```shell
python vis.py
```
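The needle test is conventionally visualized as a heatmap of score over context length (x-axis) and needle depth (y-axis). A minimal sketch of arranging per-run results into such a grid is shown below; the record field names (`depth`, `length`, `score`) are assumptions about the JSON under `results/`, not the files' actual schema.

```python
from collections import defaultdict


def build_grid(results: list[dict]) -> dict:
    """Average score per (depth, length) cell, ready for a heatmap.

    Field names are hypothetical; adapt to the actual JSON in results/.
    """
    cells = defaultdict(list)
    for r in results:
        cells[(r["depth"], r["length"])].append(r["score"])
    return {cell: sum(scores) / len(scores) for cell, scores in cells.items()}
```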