
# Needle Test Evaluation

We reconstructed the original "Needle In A Haystack - Pressure Test" code to add support for evaluating HuggingFace models. The evaluation procedure involves three steps: test prompt generation, model prediction, and scoring.

## Test prompt generation

Configure the test parameters in `config-prompt.yaml`, then run

```shell
python prompt.py
```

The test prompts will be generated under `prompts/`.
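The core idea behind the generated prompts is to embed a "needle" sentence at a controlled depth inside a long "haystack" context. The sketch below illustrates that insertion step; the function and parameter names are hypothetical, not the actual API of `prompt.py`, which is driven entirely by `config-prompt.yaml`.

```python
def insert_needle(haystack: str, needle: str, depth: float) -> str:
    """Place `needle` at `depth` (0.0 = start, 1.0 = end) of `haystack`.

    Illustrative helper only -- the real prompt generation is configured
    via config-prompt.yaml.
    """
    if not 0.0 <= depth <= 1.0:
        raise ValueError("depth must be in [0, 1]")
    pos = int(len(haystack) * depth)
    # Snap back to the end of the previous sentence so the needle
    # lands on a sentence boundary and reads naturally.
    cut = haystack.rfind(". ", 0, pos)
    if cut != -1:
        pos = cut + 2
    return haystack[:pos] + needle + " " + haystack[pos:]
```

Sweeping `depth` over a range of values (and the haystack over a range of context lengths) yields the grid of test prompts that the pressure test evaluates.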

## Model prediction

Set your model path (or HuggingFace path) in `config-pred.yaml`, then run

```shell
CUDA_VISIBLE_DEVICES=0 python pred.py
```

The model predictions will be saved under `pred/`.
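For orientation, a `config-pred.yaml` for this kind of setup might look roughly like the fragment below. The key names here are assumptions for illustration only; consult the repository's actual `config-pred.yaml` for the real schema.

```yaml
# Hypothetical config-pred.yaml sketch -- key names are illustrative,
# not the repository's actual schema.
model_path: meta-llama/Llama-2-7b-chat-hf  # local path or HuggingFace repo id
prompt_dir: prompts/     # where prompt.py wrote the test prompts
output_dir: pred/        # where predictions will be saved
max_new_tokens: 256
```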

## Scoring

Configure your scoring model (default: `gpt-4`) in `config-eval.yaml` and set your API key in `eval.py`, then run

```shell
python eval.py
```

The scoring results will be saved as JSON under `results/`.
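The repository scores with an LLM judge (`gpt-4` by default), but the question the judge answers is simple: did the model's answer recover the needle? As a simplified, deterministic stand-in for that judgment, one could score by keyword recall. This is an illustration of the scoring idea, not the method `eval.py` uses, and the names are hypothetical.

```python
def recall_score(prediction: str, needle_keywords: list[str]) -> float:
    """Fraction of needle keywords present in the prediction.

    A substring-recall stand-in for the GPT-4 judge, for illustration only.
    """
    pred = prediction.lower()
    hits = sum(1 for kw in needle_keywords if kw.lower() in pred)
    return hits / len(needle_keywords)
```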

Finally, visualize your results with

```shell
python vis.py
```
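The needle test is conventionally visualized as a heatmap of score over context length (x-axis) and needle depth (y-axis). A minimal sketch of arranging per-run results into such a grid is shown below; the record field names (`depth`, `length`, `score`) are assumptions about the JSON under `results/`, not the files' actual schema.

```python
from collections import defaultdict


def build_grid(results: list[dict]) -> dict:
    """Average score per (depth, length) cell, ready for a heatmap.

    Field names are hypothetical; adapt to the actual JSON in results/.
    """
    cells = defaultdict(list)
    for r in results:
        cells[(r["depth"], r["length"])].append(r["score"])
    return {cell: sum(scores) / len(scores) for cell, scores in cells.items()}
```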