CodeHaluEval is a comprehensive benchmark for assessing the performance of Large Language Models (LLMs) on code generation tasks. It includes 8,883 samples drawn from 699 diverse programming tasks and is specifically designed to quantify and understand the tendency of LLMs to produce code hallucinations and other errors during code generation. Using our CodeHalu dynamic, execution-based detection algorithm, researchers can identify and categorize various types of code issues, improving models' effectiveness in real-world programming environments.
For a more detailed introduction to the data, please see the 🤗 Hugging Face Dataset.
If you want to use model APIs, you need to set the following variables in models.py:

erniebot_api_key: Your Paddle API key.
gemini_api_key: Your Google API key.
openai_api_key: Your OpenAI API key.
claude_api_key: Your Claude API key.
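A minimal sketch of how these keys might be declared in models.py (the variable names follow the list above; the placeholder values and the environment-variable fallback are assumptions, not the repository's actual code):

import os

# Placeholder API keys for the providers supported by models.py.
# Replace the placeholders (or export the corresponding environment
# variables) with your own credentials before running generation.
erniebot_api_key = os.getenv("ERNIEBOT_API_KEY", "<your_paddle_api_key>")
gemini_api_key = os.getenv("GEMINI_API_KEY", "<your_google_api_key>")
openai_api_key = os.getenv("OPENAI_API_KEY", "<your_openai_api_key>")
claude_api_key = os.getenv("CLAUDE_API_KEY", "<your_claude_api_key>")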
For example, to run generation with GPT-4:
python generation.py \
--model gpt4 \
--data_path <path_to_the_test_set> \
--save_path "results/gpt4_codehalu_test.jsonl"
To evaluate the results generated by GPT-4, run:
python eval.py \
--halu_type <The type of hallucination you want to evaluate.> \
--generation_file <File containing generations to be evaluated.>
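For example, to score the GPT-4 generations saved above (the hallucination type shown here is illustrative; use one of the types supported by eval.py):

python eval.py \
--halu_type logic \
--generation_file "results/gpt4_codehalu_test.jsonl"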
Please consider citing if you find our work useful:
@misc{tian2024codehaluinvestigatingcodehallucinations,
title={CodeHalu: Investigating Code Hallucinations in LLMs via Execution-based Verification},
author={Yuchen Tian and Weixiang Yan and Qian Yang and Xuandong Zhao and Qian Chen and Wen Wang and Ziyang Luo and Lei Ma and Dawn Song},
year={2024},
eprint={2405.00253},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2405.00253},
}