The performance of large language models on programming tasks is impressive, but many datasets suffer from data leakage, particularly in benchmarks like HumanEval and MBPP. To tackle this, we introduce the XCoder-Complexity-Scorer, which controls code instruction-tuning data quality across three key dimensions: instruction complexity, response quality, and diversity. We also train a Unit Test Model to generate unit test programs for each candidate solution. On this basis, we developed XCoder, a family of models fine-tuned from LLaMA3. Alongside the XCoder-80K Dataset, we release XCoder-8B and XCoder-70B. Our experiments show that XCoder achieves state-of-the-art performance with less training data, validating our data strategy.
📖 Paper • 🤖️ XCoder-8B Model • 🤖️ XCoder-70B Model • 🤗 XCoder-80K Dataset
• 👉 XCoder-Complexity-Scorer • 👉 Unit Test Model
📃 Read our Paper on arXiv.
📚 Get our Dataset on huggingface.
🕊 Try our Coder: Get XCoder-8B from huggingface or modelscope.
🕊 Try our Coder: Get XCoder-70B from huggingface or modelscope.
🐬 We trained a model to score the complexity of each instruction: Get Complexity Scorer from huggingface or modelscope. You can use the complexity inference file to infer the complexity of the query in each turn. Thanks to Deita!
🐋 We trained a model to generate unit test programs for each candidate solution: Get Unit Test Model from huggingface or modelscope.
The performance of large language models on programming tasks is impressive, but many datasets suffer from data leakage, especially on benchmarks like HumanEval and MBPP. To address this, we introduce the Test Leakage Indicator (TLI), which identifies high-leakage data so that it can be cleaned. We also evaluate on cleaner benchmarks, LiveCodeBench and BigCodeBench, using the filtered data to fine-tune LLaMA3. We release our high-quality
Our findings reveal that some widely used datasets, like Magicoder-Evol-Instruct, are less reliable than previously thought. Inspired by alignment and mathematical data selection works, we select training data based on instruction complexity, code pass rate, and diversity. With just 40K examples, our model XCoder matches top performance and surpasses prior results at 80K.
Beyond cleaner data, we aim to redefine what makes a good Code Instruction Tuning dataset, analyzing previous works through XCoder's three key dimensions: 🎉🎉 New Insights For Code Instruction Data Synthesis.
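```python
from complexity import Scorer

# Load the released complexity scorer (vLLM backend enabled).
model_name_or_path = "banksy235/XCoder-Complexity-Scorer"
scorer = Scorer(model_name_or_path, is_vllm=True)

# Score a single instruction.
query = "Your query"
complexity_score = scorer.infer_complexity(query)
```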
If your data has multiple turns, you can score each user turn independently, without the conversation history. For example, if the data is

```json
[{"role": "user", "value": "query1"}, {"role": "assistant", "value": "response1"}, {"role": "user", "value": "query2"}, {"role": "assistant", "value": "response2"}]
```

you should apply the scorer like this:

```python
complexity_score = [scorer.infer_complexity(query1), scorer.infer_complexity(query2)]
```
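The turn-by-turn scoring described above can be wrapped in a small helper. The sketch below is an illustration rather than part of the repository: `score_user_turns` is a hypothetical name, and it takes the scoring function as an argument so that `scorer.infer_complexity` can be passed in. It assumes the `role`/`value` message layout shown above.

```python
from typing import Callable, Dict, List


def score_user_turns(
    conversation: List[Dict[str, str]],
    score_fn: Callable[[str], float],
) -> List[float]:
    """Score every user turn independently, ignoring the dialogue history."""
    return [
        score_fn(message["value"])
        for message in conversation
        if message["role"] == "user"
    ]


conversation = [
    {"role": "user", "value": "query1"},
    {"role": "assistant", "value": "response1"},
    {"role": "user", "value": "query2"},
    {"role": "assistant", "value": "response2"},
]

# With the real scorer this would be:
#   complexity_scores = score_user_turns(conversation, scorer.infer_complexity)
# A stand-in scoring function (string length) keeps the sketch runnable
# without downloading the model.
complexity_scores = score_user_turns(conversation, lambda query: float(len(query)))
```

Passing the scoring function in, instead of hard-coding the scorer, also makes it easy to swap in a cached or batched variant later.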
```shell
python3 compute_TLI.py \
  --train_data_path {train_dataset} \
  --test_data_path {test_dataset} \
  --key_train {key name of the instruction in the training data JSON} \
  --key_test {key name of the instruction in the test data JSON} \
  --only_analysis true
```
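The internals of the script are not reproduced here. As a rough illustration of what an n-gram based leakage check can look like, the sketch below computes, for each test instruction, the highest n-gram overlap ratio against any training instruction and averages the result. The function names, the whitespace tokenization, and the choice of n = 3 are assumptions made for this example and may differ from what `compute_TLI.py` actually does.

```python
from typing import List, Set, Tuple


def ngrams(text: str, n: int = 3) -> Set[Tuple[str, ...]]:
    """Return the set of word-level n-grams of a lowercased string."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap(test_text: str, train_text: str, n: int = 3) -> float:
    """Fraction of the test sample's n-grams that also occur in the training sample."""
    test_grams = ngrams(test_text, n)
    if not test_grams:
        return 0.0
    return len(test_grams & ngrams(train_text, n)) / len(test_grams)


def leakage_score(train: List[str], test: List[str], n: int = 3) -> float:
    """Average, over test samples, of the maximum overlap with any training sample."""
    if not train or not test:
        return 0.0
    best = [max(overlap(t, tr, n) for tr in train) for t in test]
    return sum(best) / len(best)


train_instructions = [
    "Write a function that returns the sum of two integers",
    "Implement binary search over a sorted list",
]
test_instructions = [
    "Write a function that returns the sum of two integers",  # copied verbatim
    "Parse a CSV file and print every row",                   # unrelated
]

score = leakage_score(train_instructions, test_instructions)
```

In this toy example one of the two test instructions is fully covered by the training set and the other shares no trigram with it, so the averaged score sits halfway between no leakage and full leakage.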
We construct a data pool that includes many open-source code instruction fine-tuning datasets. The specific datasets are listed in the table below:
Dataset | Data Size | Instruction Source | Response Source |
---|---|---|---|
Code-290k-ShareGPT-Vicuna-Clean | 289k | - | - |
CodeExercise-Python-27k | 27k | GPT | GPT |
CodeUp | 19k | GPT(Self-Instruct) | GPT |
Glaive-code-assistant-v3 | 950k | Glaive | Glaive |
oa_leet_10k | 23k | - | - |
Code-Alpaca | 20k | GPT(Self-Instruct) | GPT |
Codefuse-Evol-Instruct-Clean | 66k | GPT(Evol-Instruct) | GPT |
DolphCoder | 79k | GPT(Evol-Instruct) | GPT |
Magicoder-Evol-Instruct-Clean | 110k | GPT(Evol-Instruct) | GPT |
Magicoder-OSS-Instruct | 75k | GPT(OSS-Instruct) | GPT |
CommitPackFT | 702k | GitHub | GitHub |
StarCoder2-Self-Align | 50k | StarCoder2(OSS-Instruct) | StarCoder2 |
Leet10k_alpaca | 10k | - | - |
Code-Feedback-Clean | 66k | GPT | GPT |
- A "Clean" suffix indicates that the original dataset contains data leakage; we use the cleaned version.
XCoder selects good samples based on three dimensions: instruction complexity, response quality, and instruction diversity.
- Instruction complexity: Users expect a Code LLM to write increasingly complex programs. Thus, we train a Complexity Scorer to measure the complexity of each sample.
- Response quality: We use the number of passed test cases as a measure of response quality. We train a Unit Test Model to generate a unit test program for each sample. Compared to asking a language model to judge code correctness directly, executing test cases provides real execution feedback and yields more reliable judgments.
- Instruction diversity: As a general principle, an advanced LLM should be able to handle a wide variety of human requests. We use a diversity-based sampling method to ensure the diversity of the selected data.
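One way to combine the three dimensions is sketched below: drop samples whose pass rate is too low, rank the rest by complexity, and then greedily keep a sample only if it is not too similar to anything already selected. This is a minimal illustration under stated assumptions, not XCoder's actual pipeline: the `Sample` fields, the thresholds, the cosine-similarity test, and the toy two-dimensional embeddings are all hypothetical.

```python
import math
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class Sample:
    instruction: str
    complexity: float           # e.g. the output of a complexity scorer
    pass_rate: float            # fraction of generated unit tests that passed
    embedding: Sequence[float]  # any vector representation of the instruction


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two vectors (0.0 if either has zero norm)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def select(
    pool: List[Sample],
    budget: int,
    min_pass_rate: float = 0.5,   # assumed quality threshold
    max_similarity: float = 0.9,  # assumed diversity threshold
) -> List[Sample]:
    # Quality: keep only samples that pass enough of their unit tests.
    candidates = [s for s in pool if s.pass_rate >= min_pass_rate]
    # Complexity: visit the most complex instructions first.
    candidates.sort(key=lambda s: s.complexity, reverse=True)
    # Diversity: skip samples that are near-duplicates of an already kept one.
    selected: List[Sample] = []
    for sample in candidates:
        if len(selected) >= budget:
            break
        if all(cosine(sample.embedding, kept.embedding) < max_similarity
               for kept in selected):
            selected.append(sample)
    return selected


pool = [
    Sample("implement an LRU cache", 4.5, 1.0, (1.0, 0.0)),
    Sample("implement an LRU cache class", 4.4, 1.0, (0.99, 0.05)),
    Sample("reverse a string", 1.2, 1.0, (0.0, 1.0)),
    Sample("solve the N-queens problem", 4.8, 0.2, (0.7, 0.7)),
]
chosen = [s.instruction for s in select(pool, budget=2)]
```

In this toy pool the N-queens sample is dropped for its low pass rate and the second LRU sample is skipped as a near-duplicate, so the two selected instructions cover different tasks.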
Dataset | Size | LiveCodeBench Pass@1 | LiveCodeBench Easy-Pass@1 | BigCodeBench Pass@1 | HumanEval Base-Pass@1 | HumanEval Plus-Pass@1 |
---|---|---|---|---|---|---|
Code-Alpaca | 20k | 0.0 | 0.0 | 11.9 | 30.5 | 25.6 |
StarCoder2-Self-Align | 50k | 9.5 | 24.7 | 14.5 | 37.8 | 34.8 |
Codefuse-Evol-Instruct* | 66k | 12.3 | 33.1 | 25.4 | 59.1 | 53.7 |
Magicoder-OSS-Instruct | 75k | 12.8 | 33.8 | 22.0 | 54.3 | 50.0 |
Magicoder-Evol-Instruct* | 100k | 13.0 | 34.5 | 21.8 | 65.9 | 59.8 |
Code-Feedback | 64k | 14.8 | 38.0 | 27.0 | 56.7 | 51.8 |
XCoder | 40k | 16.5 | 43.7 | 27.4 | 54.9 | 50.6 |
XCoder | 80k | 16.8 | 43.7 | 29.6 | 57.3 | 53.0 |
- \* means that the original dataset may have data leakage, so we performed n-gram decontamination.
We analyze XCoder's data composition, reassess various data sources, and gain new insights into data synthesis. Our key findings:
- Complexity: Training models to assess instruction complexity outperforms heuristic methods. Evol-Instruct is effective for enhancing complexity, especially with longer, multi-round contexts.
- Quality: Test case execution provides better feedback for judging code correctness than model-based heuristics. Stronger models also yield higher-quality synthesized data.
- Diversity: Diverse instruction tuning is crucial. Real-world data sampling leads to better diversity than expanding instructions from fixed seeds.
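To make the quality finding concrete, the sketch below measures a pass rate by actually running assertion-style test cases against a candidate solution. It is only an illustration: the helper name `pass_rate` and the one-assertion-per-test format are assumptions, and it relies on plain `exec`, which is unsafe for untrusted code. A real pipeline would need sandboxing and timeouts.

```python
from typing import List


def pass_rate(solution_code: str, test_cases: List[str]) -> float:
    """Run each test case against the solution and return the fraction that pass."""
    if not test_cases:
        return 0.0
    passed = 0
    for test in test_cases:
        namespace: dict = {}  # fresh namespace so tests cannot affect each other
        try:
            exec(solution_code, namespace)  # define the candidate solution
            exec(test, namespace)           # a failing assert raises AssertionError
            passed += 1
        except Exception:
            # Wrong answers, crashes, and syntax errors all count as failures.
            pass
    return passed / len(test_cases)


candidate = """
def add(a, b):
    return a + b
"""
buggy = """
def add(a, b):
    return a - b
"""
tests = [
    "assert add(1, 2) == 3",
    "assert add(0, 0) == 0",
    "assert add(-1, 1) == 0",
]

good_rate = pass_rate(candidate, tests)
bad_rate = pass_rate(buggy, tests)
```

The buggy candidate still passes the one test where subtraction and addition agree, which is why a pass rate over several tests is more informative than a single check.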
Please kindly cite our paper if it helps your research:
@misc{wang2024codellmsperformempowering,
title={How Do Your Code LLMs Perform? Empowering Code Instruction Tuning with High-Quality Data},
author={Yejie Wang and Keqing He and Dayuan Fu and Zhuoma Gongque and Heyang Xu and Yanxu Chen and Zhexu Wang and Yujia Fu and Guanting Dong and Muxi Diao and Jingang Wang and Mengdi Zhang and Xunliang Cai and Weiran Xu},
year={2024},
eprint={2409.03810},
archivePrefix={arXiv},
primaryClass={cs.SE},
url={https://arxiv.org/abs/2409.03810},
}