
Note: See the granite-completebench website to explore the data.

granite-completebench

This is a tool for evaluating how well LLMs perform code autocompletion. granite-completebench is used in the development of Granite.Code. In particular, we're interested in the combination of:

However, other models are included here for comparison.

CrossCodeEval relationship

The dataset and some of the evaluation and inference code come from CrossCodeEval. However, the usage here is somewhat different. In particular, we're looking at "fill in the middle" (FIM) behavior, where both a prefix and a suffix are provided to the model. While the CrossCodeEval dataset includes suffixes, they were not used during generation in the original work, which allowed models without FIM support to be evaluated. For this and other reasons, the metrics we measure here will differ from those reported for CrossCodeEval.
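
As an illustration of what a fill-in-the-middle request looks like, the model receives the text before and after the cursor and generates the missing span between them. A minimal Python sketch; the sentinel tokens below follow the StarCoder-style convention and are an assumption, not necessarily the exact tokens this tool emits:

    def build_fim_prompt(prefix: str, suffix: str) -> str:
        # StarCoder-style FIM sentinels; the actual token names depend on the model.
        return f"<fim_prefix>{prefix}<fim_suffix>{suffix}<fim_middle>"

    # The model is expected to generate the missing middle, e.g. "len(xs)".
    prompt = build_fim_prompt(
        prefix="def mean(xs):\n    return sum(xs) / ",
        suffix="\n",
    )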

Limitations

The CrossCodeEval dataset tests completion only in a very limited circumstance: completing from a point within a line to the end of that same line. In addition, the snippets included with the CrossCodeEval dataset do not typically contain enough information to determine an exactly matching completion. Model performance in other circumstances may differ.

The metrics do not execute or even parse the generated code; we only check for an exact match or textually similar code.
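
As a rough sketch of what such text-level metrics look like, here are the usual exact-match and edit-similarity definitions, written with the Python standard library; this mirrors the general approach, not necessarily this tool's exact implementation:

    from difflib import SequenceMatcher

    def exact_match(prediction: str, reference: str) -> bool:
        # Identical after trimming surrounding whitespace.
        return prediction.strip() == reference.strip()

    def edit_similarity(prediction: str, reference: str) -> float:
        # Similarity ratio in [0, 1]; 1.0 means the strings are identical.
        return SequenceMatcher(None, prediction.strip(), reference.strip()).ratio()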

Installation

  • Create a virtualenv and install dependencies:

    python3.11 -m venv .venv && . .venv/bin/activate
    pip install -e .
    # For inference via vllm
    pip install -e .[vllm]
    
  • Uncompress the CrossCodeEval data:

    tar -xvJf data/crosscodeeval_data.tar.xz -C data/
    

Generating model outputs

granite-codebench generate-vllm \
    --model=granite3.3:8b-base \
    --task=line_completion_rg1_openai_cosine_sim \
    --temperature=0 \
    --language=java \
    --template=comment

(The openai_cosine_sim task is used because, in the CrossCodeEval paper, this method of producing RAG snippets to include in the prompt worked slightly better than the alternatives they tested.)
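
For reference, cosine-similarity retrieval of this kind scores each candidate snippet's embedding against the query embedding and keeps the top k. A generic numpy sketch, with the embeddings assumed to be precomputed (how they are produced depends on the embedding model):

    import numpy as np

    def top_k_snippets(query_vec: np.ndarray, snippet_vecs: np.ndarray, k: int = 5) -> np.ndarray:
        # After normalizing, cosine similarity reduces to a dot product.
        q = query_vec / np.linalg.norm(query_vec)
        s = snippet_vecs / np.linalg.norm(snippet_vecs, axis=1, keepdims=True)
        scores = s @ q
        # Indices of the k most similar snippets, best first.
        return np.argsort(scores)[::-1][:k]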

Evaluating model outputs

granite-codebench evaluate \
    --model=granite3.3:8b-base \
    --task=line_completion_rg1_openai_cosine_sim \
    --language=java \
    --template=comment \
    --postprocess=truncate_suffix_comment
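
(The exact behavior of the truncate_suffix_comment postprocessor is defined by this tool; as a general illustration of line-completion postprocessing, the raw generation is typically cut down to a single line before scoring. A hypothetical sketch:

    def truncate_to_line(generation: str) -> str:
        # Keep only the text up to the first newline, since the benchmark
        # scores single-line completions.
        return generation.split("\n", 1)[0]

)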
