RepoHyper: Better Context Retrieval Is All You Need for Repository-Level Code Completion #749
Labels
AI-Agents
Autonomous AI agents using LLMs
Algorithms
Sorting, Learning or Classifying. All algorithms go here.
code-generation
code generation models and tools like copilot and aider
embeddings
vector embeddings and related tools
llm
Large Language Models
llm-benchmarks
testing and benchmarking large language models
llm-completions
large language models for completion tasks, e.g. copilot
llm-evaluation
Evaluating Large Language Models performance and behavior through human-written evaluation sets
llm-experiments
experiments with large language models
Models
LLM and ML model repos and links
multimodal-llm
LLMs that combine modes such as text and image recognition.
Papers
Research papers
prompt
Collection of llm prompts and notes
prompt-engineering
Developing and optimizing prompts to efficiently use language models for various applications and research topics
RepoHyper/README.md at main · FSoft-AI4Code/RepoHyper
RepoHyper: Better Context Retrieval Is All You Need for Repository-Level Code Completion
Introduction
We introduce RepoHyper, a novel framework that transforms code completion into a seamless end-to-end process for real-world repositories. Traditional approaches depend on integrating contexts into Code Language Models (CodeLLMs), often presuming these contexts to be inherently accurate. However, we have identified a gap: the standard benchmarks do not always present relevant contexts.
To address this, RepoHyper introduces three novel components: a Repo-level Semantic Graph (RSG) that models the repository, a search-then-expand retrieval algorithm over the RSG, and a GNN link predictor that reranks the retrieved contexts.
Our comprehensive evaluations show that RepoHyper sets a new standard on the RepoBench benchmark, outperforming other strong baselines.
Installation
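The installation steps are not included in this excerpt; a minimal sketch, assuming a standard Python project with a `requirements.txt` at the repository root (the file name is an assumption):

```bash
# Clone the repository and install its Python dependencies.
# Assumes a requirements.txt at the repo root (not confirmed by this excerpt).
git clone https://github.com/FSoft-AI4Code/RepoHyper.git
cd RepoHyper
pip install -r requirements.txt
```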
Architecture
RepoHyper is a two-stage model. The first stage runs a search-then-expand algorithm on the Repo-level Semantic Graph (RSG) and then uses a GNN link predictor to rerank the results retrieved by KNN search and graph expansion. The second stage is any code LLM that takes the retrieved context and predicts the next line of code.
Checkpoints
We provide the checkpoints for the GNN model here. The GNN model is trained on the RepoBench-R dataset with gold labels. We also provide the RepoBench-R RSGs to reproduce the results.
Usage
Data preparation
First, clone the RepoBench dataset into the data/repobench folder, then download all the unique repositories used in this dataset. The next step is to generate a call graph for each repository using PyCG; we use 60 processes to speed this up (peak RAM usage is around 350 GB).
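As a sketch, PyCG can be invoked per repository roughly as follows; RepoHyper's own driver script (and its 60-process parallelism) is not shown in this excerpt, so treat the paths as placeholders:

```bash
# Generate a call graph for one downloaded repository with PyCG.
# <repo_dir> and the output path are placeholders; the actual driver
# script that parallelizes this across 60 processes is not shown here.
pycg --package <repo_dir> $(find <repo_dir> -name "*.py") -o data/call_graphs/<repo_name>.json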
Next, we generate an embedding for each node and create the adjacency matrix by aligning Tree-sitter functions, classes, and methods with call-graph nodes.
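A hedged sketch of this step; the script name and flags below are hypothetical stand-ins for whatever the repository actually provides:

```bash
# Hypothetical invocation: embed each RSG node and build the adjacency
# matrix by aligning Tree-sitter definitions with call-graph nodes.
python3 scripts/generate_embeddings.py \
    --repos data/repobench/repos \
    --call-graphs data/call_graphs \
    --output data/repobench/repos_graphs
```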
The final step is labeling which node is optimal for predicting the next line, using the gold snippet from the RepoBench dataset. In this step we also generate the training data for the GNN by extracting subgraphs with KNN search and RSG expansion.
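Again as a sketch only, with a hypothetical script name and arguments:

```bash
# Hypothetical invocation: label the gold node per example and extract
# KNN + RSG-expansion subgraphs as training data for the GNN.
python3 scripts/label_and_extract.py \
    --graphs data/repobench/repos_graphs \
    --dataset data/repobench \
    --output data/repobench/gnn_training
```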
Training
We can train the GNN linker separately using the following script:
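The training command itself is missing from this excerpt; a hypothetical invocation, with the script name and data path as placeholders:

```bash
# Hypothetical invocation: train the GNN link predictor on the
# RepoBench-R gold-labeled subgraphs prepared above.
python3 scripts/train_gnn.py --data data/repobench/gnn_training
```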
Evaluation for RepoBench-P
We can evaluate the model using the following script:

```bash
python3 scripts/evaluate_llm.py --data data/repobench/repos_graphs_matched_retrieved --model "gpt3.5" --num-workers 8
```
URL: RepoHyper README