Skip to content

Elfsong/Mercury

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

61 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Mercury: A Code Efficiency Benchmark for Code Large Language Models 🪐

arXiv HuggingFace HuggingFace

  • Welcome to Mercury!
  • Mercury is the first code efficiency benchmark designed for code synthesis tasks.
  • It consists of 1,889 programming tasks covering diverse difficulty levels, along with test case generators that produce unlimited cases for comprehensive evaluation.

[October 8, 2024] Mercury has been accepted to NeurIPS 2024 🌟

[September 20, 2024] We release a way bigger dataset Venus, which supports more languages. It also provides Memory measurement other than Time.

[July 10, 2024] We are building Code Arena now for more efficient Code LLMs evaluation!

[June 24, 2024] We are currently working on the Multilingual Mercury 🚧

[May 26, 2024] Mercury is now available on BigCode 🌟

Mercury Datasets Access

We publish and maintain our datasets at Mercury@HF

How to use Mercury Evaluation

# Option 1 (with BigCode):
# See https://github.com/bigcode-project/bigcode-evaluation-harness/tree/main/docs#mercury
accelerate  launch --main_process_port 30003  main.py  \
    --model bigcode/starcoder2-7b   \
    --load_in_4bit   \
    --max_length_generation 2048   \
    --tasks mercury    \
    --n_samples 5  \
    --temperature 0.2  \
    --batch_size 5   \
    --allow_code_execution  \
    --save_generations  \
    --metric_output_path starcoder2-7b-mercury-result.json
# Option 2 (this library):
import os
os.environ["OPENAI_API_KEY"] = 'YOUR_OPENAI_KEY'

# Instantiate evaluator with model_name
# Set do_generate to True if you are going to load the specific language model during evaluator initialization.
from src import evaluator as Evaluator
evaluator = Evaluator.DistributeWiseEvaluator(model_name_or_path='openai/gpt-3.5-turbo-1106', do_generate=True)

# Generate code samples
evaluator.generate(num_samples_per_task=1)

# Evaluate code samples using the Mercury benchmark
evaluator.evaluate(num_samples_per_task=1)

Benchmark Visualization

mercury_benchmark

Citation

@inproceedings{du2024mercury,
  title={Mercury: A code efficiency benchmark for code large language models},
  author={Du, Mingzhe and Luu, Anh Tuan and Ji, Bin and Liu, Qian and Ng, See-Kiong},
  booktitle={The Thirty-eight Conference on Neural Information Processing Systems Datasets and Benchmarks Track},
  year={2024}
}

Questions?

Should you have any questions regarding this paper, please feel free to email us (mingzhe@nus.edu.sg). Thank you for your attention!