
RepoHyper: Better Context Retrieval Is All You Need for Repository-Level Code Completion #749

Open
irthomasthomas opened this issue Mar 16, 2024 · 1 comment

Comments

RepoHyper/README.md at main · FSoft-AI4Code/RepoHyper

RepoHyper: Better Context Retrieval Is All You Need for Repository-Level Code Completion

arXiv

Introduction

We introduce RepoHyper, a novel framework that transforms code completion into a seamless end-to-end process for real-world repositories. Traditional approaches depend on integrating contexts into Code Language Models (CodeLLMs), often presuming these contexts to be inherently accurate. However, we have identified a gap: the standard benchmarks do not always present relevant contexts.

To address this, RepoHyper introduces three novel steps:

  • Construction of a Code Property Graph, establishing a rich source of context.
  • A novel Search Algorithm for pinpointing the exact context needed.
  • The Expand Algorithm, designed to uncover implicit connections between code elements (akin to the Link Prediction problem on social network mining).
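The Expand step above can be pictured as neighbor expansion on a graph of code elements. The sketch below is a toy stand-in, not the paper's algorithm: `adj` is a hypothetical adjacency map over function names, and a plain breadth-first walk collects nodes within a bounded number of hops from the retrieved seeds.

```python
from collections import deque

def expand(adj, seeds, hops=2):
    """Collect all nodes reachable from `seeds` within `hops` edges.

    Toy stand-in for RepoHyper's Expand step: `adj` maps a node to
    its neighbors (e.g. callees); the real system works on a
    Repo-level Semantic Graph with richer edge types.
    """
    frontier = deque((s, 0) for s in seeds)
    seen = set(seeds)
    while frontier:
        node, depth = frontier.popleft()
        if depth == hops:
            continue  # do not walk past the hop budget
        for neighbor in adj.get(node, []):
            if neighbor not in seen:
                seen.add(neighbor)
                frontier.append((neighbor, depth + 1))
    return seen

# Hypothetical mini call graph: main -> {parse, run}, run -> save.
adj = {"main": ["parse", "run"], "run": ["save"], "parse": []}
print(sorted(expand(adj, {"main"}, hops=1)))  # one hop from "main"
```

With `hops=1` only the direct neighbors of `main` join the seed set; raising the budget to 2 would also pull in `save` through `run`, which is the "implicit connection" intuition.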

Our comprehensive evaluations reveal that RepoHyper sets a new standard, outperforming other strong baselines on the RepoBench benchmark.

Installation

pip install -r requirements.txt

Architecture

RepoHyper is a two-stage model. The first stage runs a search-then-expand algorithm on a Repo-level Semantic Graph (RSG), then uses a GNN link predictor to rerank the results retrieved by KNN search and graph expansion. The second stage is any code LLM that takes the retrieved context and predicts the next line of code.
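The two-stage shape can be sketched end to end. Everything below is illustrative: the embeddings are hand-written 2-d vectors rather than outputs of a trained encoder, the GNN reranker is omitted, and `build_prompt` merely concatenates context before the unfinished file for whatever code LLM is plugged in.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def knn_retrieve(query_vec, node_vecs, k=2):
    """Stage 1 (first half): KNN search over RSG node embeddings.

    `node_vecs` maps node name -> embedding; real RepoHyper would
    follow this with graph expansion and GNN reranking.
    """
    scored = sorted(node_vecs.items(),
                    key=lambda kv: cosine(query_vec, kv[1]),
                    reverse=True)
    return [name for name, _ in scored[:k]]

def build_prompt(context_snippets, unfinished_code):
    """Stage 2: hand retrieved context plus the unfinished file to
    any code LLM for next-line prediction."""
    return "\n\n".join(context_snippets) + "\n\n" + unfinished_code

# Hypothetical node embeddings (names and vectors are made up).
nodes = {"utils.save": [1.0, 0.0],
         "db.connect": [0.0, 1.0],
         "io.load": [0.9, 0.1]}
print(knn_retrieve([1.0, 0.0], nodes, k=2))
```

The query vector here stands in for the embedding of the unfinished code; the two nearest nodes become the context that stage 2 prepends to the prompt.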

Checkpoints

We provide the checkpoints for the GNN model here. The GNN model is trained on the RepoBench-R dataset with gold labels. We also provide the RepoBench-R RSGs to reproduce the results.

Usage

Data preparation

First, clone the RepoBench dataset into the data/repobench folder, then download all the unique repositories used in this dataset:

python3 -m scripts.data.download_repos --dataset data/repobench --output data/repobench/repos --num-processes 8

The next step is to generate a call graph for each repository using PyCG. We use the following command; 60 processes are used to speed up the process (peak RAM usage is around 350 GB).

python3 -m scripts.data.generate_call_graph --repos data/repobench/repos --output data/repobench/repos_call_graphs --num-processes 60

Now we generate an embedding for each node and build the adjacency matrix by aligning Tree-sitter functions, classes, and methods with call-graph nodes:

python3 -m scripts.data.repo_to_embeddings --repos data/repobench/repos --call-graphs data/repobench/repos_call_graphs --output data/repobench/repos_graphs --num-processes 60
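To make the adjacency-matrix step concrete: PyCG emits a JSON object mapping each function to the functions it calls. The sketch below turns such a dict into a node index and a dense 0/1 adjacency matrix; the Tree-sitter alignment that RepoHyper also performs is omitted, and the `pkg.*` names are invented for illustration.

```python
import json

# A PyCG-style call graph: caller -> list of callees (toy example).
call_graph = json.loads(
    '{"pkg.main": ["pkg.helper", "pkg.save"],'
    ' "pkg.helper": [], "pkg.save": []}'
)

# Index every node that appears as caller or callee.
nodes = sorted(set(call_graph) |
               {c for callees in call_graph.values() for c in callees})
index = {name: i for i, name in enumerate(nodes)}

# Dense adjacency matrix: adj[i][j] == 1 iff node i calls node j.
adj = [[0] * len(nodes) for _ in nodes]
for caller, callees in call_graph.items():
    for callee in callees:
        adj[index[caller]][index[callee]] = 1

print(nodes)
print(adj)
```

A sparse representation would be the sensible choice at repository scale; the dense matrix is used here only to keep the alignment between names and matrix rows easy to see.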

The final step labels the node that is optimal for predicting the next line, using the gold snippet from the RepoBench dataset. In this step we also generate the training data for the GNN by extracting subgraphs with KNN search and RSG expansion:

python3 -m scripts.data.matching_repobench_graphs --search_policy "knn-pattern" --rsg_path "YOUR RSG PATH" --output data/repobench/repos_graphs_labeled 
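The gold-labeling idea can be illustrated with a toy matcher: score each node's code against the RepoBench gold snippet and label the best match. Token-level Jaccard overlap below is a deliberately simple stand-in for the similarity the real pipeline uses, and the node names and snippets are made up.

```python
def jaccard(a, b):
    """Jaccard overlap between the whitespace-token sets of two strings."""
    sa, sb = set(a.split()), set(b.split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

def label_gold_node(gold_snippet, node_snippets):
    """Label the RSG node whose code best matches the gold snippet.

    Toy scoring only: the real pipeline matches against RepoBench
    gold labels with learned embeddings, not token overlap.
    """
    return max(node_snippets,
               key=lambda n: jaccard(gold_snippet, node_snippets[n]))

# Hypothetical node code bodies (pre-tokenized with spaces).
nodes = {
    "utils.save": "def save ( path , data ) : json.dump",
    "db.query": "def query ( sql ) : cursor.execute",
}
print(label_gold_node("json.dump ( data , path )", nodes))
```

The labeled node then serves as the positive target when training the GNN link predictor on the extracted subgraphs.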

Training

The GNN linker can be trained separately using the following script:

CUDA_VISIBLE_DEVICES=0 deepspeed train_gnn.py --deepspeed --deepspeed_config ds_config.json --arch GraphSage --layers 1 --data-path data/repobench/repos_graphs_labeled_cosine_radius_unix --output data/repobench/gnn_model --num-epochs 10 --batch-size 16

Evaluation for RepoBench-P

We can evaluate the model using the following script:

python3 scripts/evaluate_llm.py --data data/repobench/repos_graphs_matched_retrieved --model "gpt3.5" --num-workers 8

URL: RepoHyper README

@irthomasthomas

Related content

  • #498 (similarity score: 0.90)
  • #383 (similarity score: 0.90)
  • #734 (similarity score: 0.89)
  • #324 (similarity score: 0.89)
  • #662 (similarity score: 0.89)
  • #515 (similarity score: 0.89)
