Adding numbered hunks and code suggestions feature #50

mrT23 · 2023-07-15T07:08:30Z

Type of PR:
Enhancement

PR Description:
This PR introduces a new feature that converts the hunks in a patch to numbered hunks and generates code suggestions for a PR. It also includes the changes needed to support this feature in the GitHub and GitLab providers, and updates the CLI and configuration settings accordingly.

PR Main Files Walkthrough:
-pr_agent/algo/git_patch_processing.py: Added a new function convert_to_hunks_with_lines_numbers that converts a patch to numbered hunks. This function is used to generate more detailed diffs for code suggestions.
-pr_agent/algo/pr_processing.py: Updated the get_pr_diff function to support the addition of line numbers to hunks. This is used when generating the PR diff for code suggestions.
-pr_agent/cli.py: Added a new command line argument --pr_code_suggestions to trigger the code suggestions feature.
-pr_agent/git_providers/git_provider.py: Added a new abstract method publish_code_suggestion to the GitProvider base class, which is implemented by the GitHub and GitLab providers.
-pr_agent/git_providers/github_provider.py: Implemented the publish_code_suggestion method for the GitHub provider.
-pr_agent/git_providers/gitlab_provider.py: Added the publish_code_suggestion method for the GitLab provider; it currently raises a 'not implemented yet' exception.
-pr_agent/tools/pr_code_suggestions.py: Added a new class PRCodeSuggestions that handles the generation and publishing of code suggestions for a PR.
-pr_agent/settings/configuration.toml: Updated the configuration settings to include a new setting for the number of code suggestions to generate.
-pr_agent/settings/pr_code_suggestions_prompts.toml: Added a new file that contains the prompts for the code suggestions feature.
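
The provider changes in the walkthrough follow a common pattern: an abstract method on the base class, a working implementation in one provider, and a placeholder that raises in another. The sketch below illustrates that pattern only. The class names, the method parameters, and the in-memory list are assumptions made for the example and are not taken from the PR.

```python
from abc import ABC, abstractmethod
from typing import List, Tuple


class GitProviderSketch(ABC):
    """Hypothetical, trimmed-down stand-in for a provider base class."""

    @abstractmethod
    def publish_code_suggestion(self, body: str, relevant_file: str,
                                start_line: int, end_line: int) -> None:
        """Post one suggestion on a line range (parameter names are assumed)."""


class RecordingProvider(GitProviderSketch):
    """Stores suggestions in memory instead of calling a hosting API."""

    def __init__(self) -> None:
        self.published: List[Tuple[str, int, int, str]] = []

    def publish_code_suggestion(self, body, relevant_file, start_line, end_line):
        self.published.append((relevant_file, start_line, end_line, body))


class StubProvider(GitProviderSketch):
    """Satisfies the interface but is not functional yet."""

    def publish_code_suggestion(self, body, relevant_file, start_line, end_line):
        raise NotImplementedError("code suggestions are not implemented yet")
```

Because the method is abstract, a provider that forgets to define it cannot be instantiated, while a stub like the one above fails only when the feature is actually used.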

mrT23 · 2023-07-16T05:44:34Z

Preparing review...

mrT23 · 2023-07-16T05:59:42Z

pr_agent/algo/pr_processing.py

+def get_pr_diff(git_provider: Union[GithubProvider, Any], token_handler: TokenHandler,
+                add_line_numbers_to_hunks: bool = False, disable_extra_lines: bool =False) -> str:


Suggestion: Consider using a class or a data structure to encapsulate the parameters of the get_pr_diff function. This will make the function signature cleaner and easier to manage as the number of parameters grows.

Suggested change

-def get_pr_diff(git_provider: Union[GithubProvider, Any], token_handler: TokenHandler,
-                add_line_numbers_to_hunks: bool = False, disable_extra_lines: bool =False) -> str:
+class PrDiffParams:
+    def __init__(self, git_provider, token_handler, add_line_numbers_to_hunks=False, disable_extra_lines=False):
+        self.git_provider = git_provider
+        self.token_handler = token_handler
+        self.add_line_numbers_to_hunks = add_line_numbers_to_hunks
+        self.disable_extra_lines = disable_extra_lines
+
+def get_pr_diff(params: PrDiffParams) -> str:
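
A parameter object like the one suggested can also be written as a dataclass, which removes the hand-written __init__. This is a minimal sketch under assumptions: the field types are loosened to Any, and describe_diff_request is a hypothetical stand-in for get_pr_diff that only reports which options are active.

```python
from dataclasses import dataclass
from typing import Any


@dataclass(frozen=True)
class PrDiffParams:
    git_provider: Any          # real code would use the provider type
    token_handler: Any         # real code would use the token handler type
    add_line_numbers_to_hunks: bool = False
    disable_extra_lines: bool = False


def describe_diff_request(params: PrDiffParams) -> str:
    """Hypothetical stand-in for get_pr_diff: lists the active options."""
    options = []
    if params.add_line_numbers_to_hunks:
        options.append("numbered hunks")
    if params.disable_extra_lines:
        options.append("no extra context lines")
    return ", ".join(options) or "default diff"
```

New options then become new fields with defaults, and existing call sites keep working unchanged.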

mrT23 · 2023-07-16T05:59:43Z

pr_agent/tools/pr_code_suggestions.py

+        self.vars = {
+            "title": self.git_provider.pr.title,
+            "branch": self.git_provider.get_pr_branch(),
+            "description": self.git_provider.get_pr_description(),
+            "language": self.main_language,
+            "diff": "",  # empty diff for initial calculation
+            'num_code_suggestions': settings.pr_code_suggestions.num_code_suggestions,
+        }


Suggestion: Consider using a data class or a named tuple for the vars attribute in the PRCodeSuggestions class. This will make the code more readable and easier to maintain.

Suggested change

-        self.vars = {
-            "title": self.git_provider.pr.title,
-            "branch": self.git_provider.get_pr_branch(),
-            "description": self.git_provider.get_pr_description(),
-            "language": self.main_language,
-            "diff": "",  # empty diff for initial calculation
-            'num_code_suggestions': settings.pr_code_suggestions.num_code_suggestions,
-        }
+from dataclasses import dataclass
+
+@dataclass
+class PrVars:
+    title: str
+    branch: str
+    description: str
+    language: str
+    diff: str
+    num_code_suggestions: int
+
+self.vars = PrVars(
+    title=self.git_provider.pr.title,
+    branch=self.git_provider.get_pr_branch(),
+    description=self.git_provider.get_pr_description(),
+    language=self.main_language,
+    diff="",  # empty diff for initial calculation
+    num_code_suggestions=settings.pr_code_suggestions.num_code_suggestions
+)
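
If the variables are later handed to a prompt template as a mapping (an assumption, since the thread does not show how self.vars is consumed), the dataclass can be converted back with dataclasses.asdict. The sketch below also uses dataclasses.replace to fill in the diff after the initial empty value. All field values are made up for illustration.

```python
from dataclasses import asdict, dataclass, replace


@dataclass
class PrVars:
    title: str
    branch: str
    description: str
    language: str
    diff: str
    num_code_suggestions: int


# Illustrative values only; the real ones would come from the git provider.
initial = PrVars(
    title="Adding numbered hunks",
    branch="tr/numbered_hunks",
    description="",
    language="Python",
    diff="",  # empty diff for initial calculation
    num_code_suggestions=4,
)

# Produce a copy with the diff filled in, leaving the initial object untouched.
with_diff = replace(initial, diff="## src/file.ts\n--new hunk--\n881 line1")

# Plain dict, suitable for any API that expects keyword-style variables.
template_vars = asdict(with_diff)
```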

mrT23 · 2023-07-16T05:59:44Z

pr_agent/algo/git_patch_processing.py

+def convert_to_hunks_with_lines_numbers(patch: str, file) -> str:
+    # toDO: (maybe remove '-' and '+' from the beginning of the line)
+    """
+    ## src/file.ts
+--new hunk--
+881        line1
+882        line2
+883        line3
+884        line4
+885        line6
+886        line7
+887 +      line8
+888 +      line9
+889        line10
+890        line11
+...
+--old hunk--
+        line1
+        line2
+-       line3
+-       line4
+        line5
+        line6
+           ...
+
+    """
+    patch_with_lines_str = f"## {file.filename}\n"
+    import re
+    patch_lines = patch.splitlines()
+    RE_HUNK_HEADER = re.compile(
+        r"^@@ -(\d+)(?:,(\d+))? \+(\d+)(?:,(\d+))? @@[ ]?(.*)")
+    new_content_lines = []
+    old_content_lines = []
+    match = None
+    start1, size1, start2, size2 = -1, -1, -1, -1
+    for line in patch_lines:
+        if 'no newline at end of file' in line.lower():
+            continue
+
+        if line.startswith('@@'):
+            match = RE_HUNK_HEADER.match(line)
+            if match and new_content_lines:  # found a new hunk, split the previous lines
+                if new_content_lines:
+                    patch_with_lines_str += '\n--new hunk--\n'
+                    for i, line_new in enumerate(new_content_lines):
+                        patch_with_lines_str += f"{start2 + i} {line_new}\n"
+                if old_content_lines:
+                    patch_with_lines_str += '--old hunk--\n'
+                    for i, line_old in enumerate(old_content_lines):
+                        patch_with_lines_str += f"{line_old}\n"
+                new_content_lines = []
+                old_content_lines = []
+            start1, size1, start2, size2 = map(int, match.groups()[:4])
+        elif line.startswith('+'):
+            new_content_lines.append(line)
+        elif line.startswith('-'):
+            old_content_lines.append(line)
+        else:
+            new_content_lines.append(line)
+            old_content_lines.append(line)
+
+    # finishing last hunk
+    if match and new_content_lines:
+        if new_content_lines:
+            patch_with_lines_str += '\n--new hunk--\n'
+            for i, line_new in enumerate(new_content_lines):
+                patch_with_lines_str += f"{start2 + i} {line_new}\n"
+        if old_content_lines:
+            patch_with_lines_str += '\n--old hunk--\n'
+            for i, line_old in enumerate(old_content_lines):
+                patch_with_lines_str += f"{line_old}\n"
+


Suggestion: The convert_to_hunks_with_lines_numbers function is quite long and does multiple things. Consider breaking it down into smaller, more manageable functions. This will make the code easier to read and maintain.

Suggested change

+def convert_to_hunks_with_lines_numbers(patch: str, file) -> str:
+    # toDO: (maybe remove '-' and '+' from the beginning of the line)
+    ...
+    patch_with_lines_str = process_patch_lines(patch_lines)
+    return patch_with_lines_str.strip()
+
+def process_patch_lines(patch_lines):
+    ...
+    return patch_with_lines_str
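
One possible way to carry out the split is sketched below. It is not the code from the PR: number_hunks and format_hunk are hypothetical names, the function takes a filename string instead of a file object, and the behaviour differs in small ways. It ignores lines that appear before the first hunk header, uses the same '--old hunk--' separator everywhere, and reads only the new-side start line, so a header that omits a size (such as "@@ -10 +11,2 @@") needs no integer conversion of a missing group.

```python
import re
from typing import List

RE_HUNK_HEADER = re.compile(
    r"^@@ -(\d+)(?:,(\d+))? \+(\d+)(?:,(\d+))? @@[ ]?(.*)")


def format_hunk(start_new: int, new_lines: List[str], old_lines: List[str]) -> str:
    """Render one hunk: numbered new-side lines, then the old-side lines."""
    out = ""
    if new_lines:
        out += "\n--new hunk--\n"
        for offset, line in enumerate(new_lines):
            out += f"{start_new + offset} {line}\n"
    if old_lines:
        out += "--old hunk--\n"
        for line in old_lines:
            out += f"{line}\n"
    return out


def number_hunks(patch: str, filename: str) -> str:
    """Convert a unified-diff patch into numbered new/old hunk sections."""
    result = f"## {filename}\n"
    new_lines: List[str] = []
    old_lines: List[str] = []
    start_new = -1
    in_hunk = False
    for line in patch.splitlines():
        if "no newline at end of file" in line.lower():
            continue
        if line.startswith("@@"):
            match = RE_HUNK_HEADER.match(line)
            if match is None:
                continue  # malformed header: skip it rather than crash
            if in_hunk:  # a new hunk starts, so flush the previous one
                result += format_hunk(start_new, new_lines, old_lines)
                new_lines, old_lines = [], []
            start_new = int(match.group(3))
            in_hunk = True
        elif not in_hunk:
            continue  # file headers and other text before the first hunk
        elif line.startswith("+"):
            new_lines.append(line)
        elif line.startswith("-"):
            old_lines.append(line)
        else:  # context line: present on both sides
            new_lines.append(line)
            old_lines.append(line)
    if in_hunk:  # flush the last hunk
        result += format_hunk(start_new, new_lines, old_lines)
    return result.strip()
```

Keeping the rendering in format_hunk means the mid-loop flush and the final flush share one code path, which is where the two copies in the original function had drifted apart.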

mrT23 · 2023-07-16T05:59:46Z

pr_agent/cli.py

+    parser.add_argument('--pr_description', action='store_true', required=False)
+    parser.add_argument('--pr_code_suggestions', action='store_true', required=False)


Suggestion: Consider using subcommands instead of flags for the different modes of operation (pr_description, pr_code_suggestions). This will make the command line interface more intuitive and easier to use.

Suggested change

-    parser.add_argument('--pr_description', action='store_true', required=False)
-    parser.add_argument('--pr_code_suggestions', action='store_true', required=False)
+    subparsers = parser.add_subparsers(dest='command')
+    pr_description_parser = subparsers.add_parser('pr_description')
+    pr_code_suggestions_parser = subparsers.add_parser('pr_code_suggestions')
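
A self-contained version of the subcommand idea could look like the sketch below. The help strings and the --num_code_suggestions option are hypothetical additions made for illustration; only the two subcommand names come from the suggestion above.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="PR tool (illustrative sketch)")
    # required=True makes argparse reject an invocation with no subcommand.
    subparsers = parser.add_subparsers(dest="command", required=True)

    subparsers.add_parser("pr_description", help="generate a PR description")

    suggestions = subparsers.add_parser(
        "pr_code_suggestions", help="generate code suggestions")
    # Hypothetical per-command option, shown to motivate subcommands.
    suggestions.add_argument("--num_code_suggestions", type=int, default=4)
    return parser


args = build_parser().parse_args(["pr_code_suggestions", "--num_code_suggestions", "2"])
```

With flags, nothing stops a user from passing both modes at once. Subcommands make the modes mutually exclusive and give each one its own option namespace.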

full code suggestions

hussam789 · 2023-07-16T08:37:31Z

ToDo:

Add Code Suggestion to GitHub polling in handle_request

UjjawalKRoy · 2023-10-19T08:57:26Z

Preparing review...

ElonaZharri · 2024-04-17T18:47:21Z

Preparing review...

ElonaZharri · 2024-04-17T23:11:27Z

Preparing PR description...

ElonaZharri · 2024-04-17T23:33:08Z

Question: Why is the PR review failing for a local run from source code?

Answer:
The provided code is a Python implementation of a PR tool that generates a diff between two versions of a file, taking into account the language of the file and the token budget. Here's a breakdown of the code:

The get_pr_diff function takes in a GithubProvider object and a TokenHandler object as inputs. It returns a tuple containing the full diff string and the total number of tokens used in the diff.
The pr_generate_extended_diff function takes in a list of languages and a TokenHandler object as inputs. It generates a standard diff string with patch extension, and then prunes the diff if the total number of tokens exceeds the limit.
The pr_generate_compressed_diff function takes in a list of languages, a TokenHandler object, and an optional convert_hunks_to_line_numbers parameter. It generates a compressed diff string by applying Diff Minimization techniques to reduce the number of tokens.
The token_handler object is used to count the number of tokens in each file.
The pr_languages list contains the languages for which the diff is being generated.
The patches list stores the generated diff patches.
The total_tokens variable keeps track of the total number of tokens used in the diff.
The add_line_numbers_to_hunks parameter is used to convert hunks to line numbers if necessary.

The code uses several techniques to reduce the number of tokens in the diff, including:

Starting from the largest diff patch and pruning smaller ones.
Not using extended context lines around the diff.
Applying Diff Minimization techniques to reduce the number of tokens.
Converting hunks to line numbers if necessary.

The code also provides additional information about the files that were not processed due to insufficient token budget, and the modified files that were not included in the diff.
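
The answer does not show how the token budget is enforced. As a rough illustration of "starting from the largest patch" and reporting what was left out, here is a greedy sketch. The function name, the (filename, patch) tuple layout, and the whitespace-based token counter are all assumptions for the example, not the tool's implementation.

```python
from typing import Callable, List, Tuple


def select_patches(patches: List[Tuple[str, str]],
                   count_tokens: Callable[[str], int],
                   max_tokens: int) -> Tuple[List[str], List[str]]:
    """Greedily keep patches, largest first, while they fit in the budget.

    Returns (included filenames, skipped filenames).
    """
    included: List[str] = []
    skipped: List[str] = []
    used = 0
    ordered = sorted(patches, key=lambda item: count_tokens(item[1]), reverse=True)
    for filename, patch in ordered:
        cost = count_tokens(patch)
        if used + cost <= max_tokens:
            included.append(filename)
            used += cost
        else:
            skipped.append(filename)  # reported later as not processed
    return included, skipped


# Toy token counter: one token per whitespace-separated word.
word_count = lambda text: len(text.split())
kept, dropped = select_patches(
    [("a.py", "one two three four"), ("b.py", "one two"), ("c.py", "one two three")],
    word_count,
    max_tokens=6,
)
```

In this toy run the four-token patch is taken first, the three-token patch no longer fits, and the two-token patch fills the remaining budget.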

ElonaZharri · 2024-04-18T13:22:28Z

Preparing review...

tanmayshishodia · 2024-09-04T20:09:16Z

Preparing PR description...

mrT23 added do not merge yet WIP labels Jul 15, 2023

mrT23 requested a review from hussam789 July 15, 2023 13:10

mrT23 force-pushed the tr/numbered_hunks branch from c36426f to ce6d233 Compare July 15, 2023 18:37

mrT23 removed do not merge yet WIP labels Jul 15, 2023

mrT23 changed the title ~~Code suggestions as a separate tool~~ Adding numbered hunks and code suggestions feature Jul 15, 2023