Add TestGenEval benchmark #5534

Open · wants to merge 46 commits into base: main

Changes from 1 commit

Commits (46)
7a4729c  initial TestGenEval code (Nov 29, 2024)
c6206f5  Initial pass for TestGenEval (Nov 29, 2024)
280baa2  Licensing (Nov 29, 2024)
75fba59  Readability metrics (Nov 29, 2024)
f7f2531  Fixing testing dependencies (Dec 4, 2024)
30197e6  Add option for starting point (Dec 4, 2024)
791b7f9  Cleaning to not OOM (Dec 5, 2024)
bd66d09  Merge pull request #1 from All-Hands-AI/main (kjain14, Dec 6, 2024)
585dba9  TestGenEval MVP (Dec 11, 2024)
3af6025  + mutation testing (Dec 11, 2024)
b19f735  Merge pull request #2 from All-Hands-AI/main (kjain14, Dec 11, 2024)
7c81deb  Update README (Dec 11, 2024)
2cd64bc  Merge branch 'main' of https://github.com/kjain14/OpenHands (Dec 11, 2024)
b685c67  reset (Dec 12, 2024)
fb9bc87  testgeneval deps (Dec 12, 2024)
77a153e  Final update, now working on all projects (Dec 16, 2024)
3401bd6  Update TestGenEval README with comprehensive information (openhands-agent, Dec 25, 2024)
b47da9e  Merge branch 'main' of github.com:All-Hands-AI/OpenHands into kjain14… (neubig, Dec 25, 2024)
90422e5  Update lock file (neubig, Dec 25, 2024)
31b6967  Any and all pass (Jan 8, 2025)
1ded123  Reset to normal time (Jan 8, 2025)
efb525a  Refine postprocessing (Jan 9, 2025)
219a134  Refine prompt (Jan 10, 2025)
3f0f13d  Update prompt (Jan 10, 2025)
d1e8409  Update filtering (Jan 17, 2025)
8848e60  Only top level filtering (Jan 17, 2025)
3355bae  Merge branch 'main' of github.com:kjain14/OpenHands into kjain14-main (neubig, Jan 20, 2025)
9f9a65c  More updates (Jan 20, 2025)
f781bc8  Fix prompting (Jan 28, 2025)
c7d575b  Removing duplicate script (Jan 28, 2025)
64abd4a  Ablation outputs (Jan 30, 2025)
e7a8daf  Fixing code to handle ablations (Feb 4, 2025)
eef0ed3  Final prompt for final experiments (Feb 6, 2025)
8782e3a  Merge pull request #3 from All-Hands-AI/main (kjain14, Feb 6, 2025)
d8ad8ba  Merge branch 'main' into main (neubig, Feb 8, 2025)
326e75e  Remove prompt truncation (neubig, Feb 9, 2025)
4471002  Remove unneeded input (neubig, Feb 9, 2025)
fd53378  Rename eval-infer (neubig, Feb 10, 2025)
513dd97  Restore testgeneval poetry group (openhands-agent, Feb 10, 2025)
1290a25  Update lock (neubig, Feb 10, 2025)
ace9e6e  Update TestGenEval README to include dependency installation (openhands-agent, Feb 10, 2025)
eb36426  Adding so file for codebleu (kjain14, Feb 11, 2025)
09335d6  Merging (Feb 16, 2025)
367c8a9  Merging (Feb 16, 2025)
92ddc1b  Merge branch 'All-Hands-AI-main' (Feb 16, 2025)
4984bf6  Merge pull request #6 from All-Hands-AI/main (kjain14, Feb 17, 2025)

Final prompt for final experiments
Kush Dave Jain committed Feb 6, 2025
commit eef0ed3410d4edc00e4d16880e28dd22ededf52e
9 changes: 9 additions & 0 deletions evaluation/benchmarks/testgeneval/NOTES.md
@@ -1,3 +1,12 @@
codamosa_ids = ['pydata__xarray-4750-16496', 'pydata__xarray-3239-16458', 'pydata__xarray-4966-16515', 'pydata__xarray-3302-16459', 'pydata__xarray-5126-16518', 'pydata__xarray-4994-16516', 'pydata__xarray-3905-16478', 'pydata__xarray-4182-16484', 'pydata__xarray-5131-16520', 'pydata__xarray-5662-16532', 'pydata__xarray-3364-16461', 'pydata__xarray-5731-16534', 'pydata__xarray-3239-16457', 'pydata__xarray-7203-16577', 'pydata__xarray-3156-16454', 'pydata__xarray-5126-16519', 'pydata__xarray-5365-16529', 'pydata__xarray-4629-16492', 'pydata__xarray-4248-16486', 'pydata__xarray-4339-16487', 'pydata__xarray-3151-16453', 'pydata__xarray-3114-16452', 'pydata__xarray-5033-16517', 'pydata__xarray-4802-16505', 'pydata__xarray-5455-16530', 'pydata__xarray-6400-16539', 'pydata__xarray-3239-16456', 'pydata__xarray-4419-16488']

pynguin_ids = ['pydata__xarray-6548-16541', 'pydata__xarray-7003-16557', 'pydata__xarray-3114-16452', 'pydata__xarray-4339-16487', 'pydata__xarray-6889-16549', 'pydata__xarray-3239-16458', 'pydata__xarray-3364-16461', 'pydata__xarray-3239-16457', 'pydata__xarray-5365-16529', 'pydata__xarray-5131-16520', 'pydata__xarray-7229-16578', 'pydata__xarray-6461-16540', 'pydata__xarray-4419-16488', 'pydata__xarray-7147-16571', 'pydata__xarray-3151-16453', 'pydata__xarray-4966-16515', 'pydata__xarray-4629-16492', 'pydata__xarray-3239-16456', 'pydata__xarray-7400-16582', 'pydata__xarray-4994-16516', 'pydata__xarray-3302-16459', 'pydata__xarray-6601-16544', 'pydata__xarray-6882-16548', 'pydata__xarray-6135-16535', 'pydata__xarray-7393-16581', 'pydata__xarray-5731-16534', 'pydata__xarray-7203-16577']

ids = ['pydata__xarray-3114-16452', 'pydata__xarray-3151-16453', 'pydata__xarray-3156-16454', 'pydata__xarray-3239-16456', 'pydata__xarray-3239-16457', 'pydata__xarray-3239-16458', 'pydata__xarray-3302-16459', 'pydata__xarray-3364-16461', 'pydata__xarray-3677-16471', 'pydata__xarray-3905-16478', 'pydata__xarray-4182-16484', 'pydata__xarray-4248-16486', 'pydata__xarray-4339-16487', 'pydata__xarray-4419-16488', 'pydata__xarray-4629-16492', 'pydata__xarray-4750-16496', 'pydata__xarray-4802-16505', 'pydata__xarray-4966-16515', 'pydata__xarray-4994-16516', 'pydata__xarray-5033-16517', 'pydata__xarray-5126-16518', 'pydata__xarray-5126-16519', 'pydata__xarray-5131-16520', 'pydata__xarray-5365-16529', 'pydata__xarray-5455-16530', 'pydata__xarray-5662-16532', 'pydata__xarray-5731-16534', 'pydata__xarray-6135-16535', 'pydata__xarray-6135-16536', 'pydata__xarray-6386-16537', 'pydata__xarray-6394-16538', 'pydata__xarray-6400-16539', 'pydata__xarray-6461-16540', 'pydata__xarray-6548-16541', 'pydata__xarray-6599-16543', 'pydata__xarray-6601-16544', 'pydata__xarray-6882-16548', 'pydata__xarray-6889-16549', 'pydata__xarray-7003-16557', 'pydata__xarray-7147-16571', 'pydata__xarray-7150-16572', 'pydata__xarray-7203-16577', 'pydata__xarray-7229-16578', 'pydata__xarray-7393-16581', 'pydata__xarray-7400-16582']


Command eval (our approach):
poetry run ./evaluation/benchmarks/testgeneval/scripts/eval_infer_remote.sh evaluation/evaluation_outputs/outputs/kjain14__testgeneval-test/CodeActAgent/gpt-4o_maxiter_25_N_v0.20.0-no-hint-run_1/output.jsonl 10 kjain14/testgeneval test true

Command run (our approach):
./evaluation/benchmarks/testgeneval/scripts/run_infer.sh llm.eval_gpt HEAD CodeActAgent -1 25 10 kjain14/testgeneval test 1 ../TestGenEval/results/testgeneval/preds/gpt-4o-2024-08-06__testgeneval__0.2__test.jsonl
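
The ID lists above presumably pin down the instance subsets used when comparing against CodaMosa and Pynguin. A minimal, hypothetical sketch (not part of the PR) of selecting such a subset from the kjain14/testgeneval dataset referenced in the commands, assuming the split exposes an instance_id column:

# Hypothetical helper, not part of the PR: select a subset of TestGenEval
# instances by ID, e.g. to mirror the CodaMosa comparison subset above.
from datasets import load_dataset

codamosa_ids = {
    'pydata__xarray-4750-16496',
    'pydata__xarray-3239-16458',
    # ... remaining IDs from the list above
}

dataset = load_dataset('kjain14/testgeneval', split='test')
subset = dataset.filter(lambda row: row['instance_id'] in codamosa_ids)
print(f'selected {len(subset)} of {len(dataset)} instances')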
4 changes: 2 additions & 2 deletions evaluation/benchmarks/testgeneval/eval_infer.py
@@ -28,7 +28,7 @@
)
from evaluation.benchmarks.testgeneval.pygments_utils import tokenize_code
from evaluation.benchmarks.testgeneval.run_infer import get_instance_docker_image
from evaluation.benchmarks.testgeneval.test_filter import filter_passing_tests
from evaluation.benchmarks.testgeneval.test_filter import filter_tests
from evaluation.benchmarks.testgeneval.test_spec import (
TestGenEvalInstance,
TestSpec,
@@ -221,7 +221,7 @@ def grade_test_output(
)

logger.info('Calling filter unit tests')
filtered_content, passing_tests, failing_tests = filter_passing_tests(
filtered_content, passing_tests, failing_tests = filter_tests(
test_suite, unit_test_output, test_spec.repo
)

4 changes: 2 additions & 2 deletions evaluation/benchmarks/testgeneval/prompt.py
@@ -74,7 +74,7 @@
NOTE: if there is an error executing tests you MUST fix it before exiting. DO NOT install new packages.
NOTE: if outputting a revised test suite REPLACE {test_file} with the revised suite

**Output the final test suite** (20+ tests) for {test_file} in a single code block, no extra commentary.
**Output the final test suite** (20+ tests) for {test_file} in a single code block, no extra commentary. MAKE SURE you run the tests and ensure you can see which tests passed and failed BEFORE exiting.
"""

CODEACT_TESTGEN_PROMPT_ITERATE = """
@@ -110,5 +110,5 @@
NOTE: if there is an error executing tests you MUST fix it before exiting. DO NOT install new packages.
NOTE: if outputting a revised test suite REPLACE {test_file} with the revised suite

**Output the final test suite** (20+ tests) for {test_file} in a single code block, no extra commentary.
**Output the final test suite** (20+ tests) for {test_file} in a single code block, no extra commentary. MAKE SURE you run the tests and ensure you can see which tests passed and failed BEFORE exiting.
"""
58 changes: 32 additions & 26 deletions evaluation/benchmarks/testgeneval/run_infer.py
@@ -354,33 +354,37 @@ def complete_runtime(
If you need to do something in the sandbox to get the correctness metric after
the agent has run, modify this function.
"""
logger.info('-' * 30)
logger.info('BEGIN Runtime Completion Fn')
logger.info('-' * 30)
obs: CmdOutputObservation
workspace_dir_name = _get_swebench_workspace_dir_name(instance)

action = CmdRunAction(command=f'cd /workspace/{workspace_dir_name}')
action.set_hard_timeout(600)
logger.info(action, extra={'msg_type': 'ACTION'})
obs = runtime.run_action(action)
logger.info(obs, extra={'msg_type': 'OBSERVATION'})
assert_and_raise(
obs.exit_code == 0,
f'Failed to cd to /workspace/{workspace_dir_name}: {str(obs)}',
)
try:
logger.info('-' * 30)
logger.info('BEGIN Runtime Completion Fn')
logger.info('-' * 30)
obs: CmdOutputObservation
workspace_dir_name = _get_swebench_workspace_dir_name(instance)

action = CmdRunAction(command=f'cd /workspace/{workspace_dir_name}')
action.set_hard_timeout(600)
logger.info(action, extra={'msg_type': 'ACTION'})
obs = runtime.run_action(action)
logger.info(obs, extra={'msg_type': 'OBSERVATION'})
assert_and_raise(
obs.exit_code == 0,
f'Failed to cd to /workspace/{workspace_dir_name}: {str(obs)}',
)

action = CmdRunAction(command=f'cat {instance.test_file}')
action.set_hard_timeout(600)
logger.info(action, extra={'msg_type': 'ACTION'})
obs = runtime.run_action(action)
logger.info(obs, extra={'msg_type': 'OBSERVATION'})
assert_and_raise(
obs.exit_code == 0,
f'Failed to find file: {instance.test_file} in /workspace/{workspace_dir_name}',
)
action = CmdRunAction(command=f'cat {instance.test_file}')
action.set_hard_timeout(600)
logger.info(action, extra={'msg_type': 'ACTION'})
obs = runtime.run_action(action)
logger.info(obs, extra={'msg_type': 'OBSERVATION'})
assert_and_raise(
obs.exit_code == 0,
f'Failed to find file: {instance.test_file} in /workspace/{workspace_dir_name}',
)

test_suite = obs.content.strip()
test_suite = obs.content.strip()
except Exception:
print('Skipping, exeception in complete_runtime')
test_suite = instance['full_pred'] if instance['full_pred'] is not None else ''

# action = CmdRunAction(command='git add -A')
# action.set_hard_timeout(600)
@@ -471,7 +475,7 @@ def process_instance(

# Save the output
output = EvalOutput(
instance_id=instance.instance_id,
instance_id=instance.id,
instruction=instruction,
instance=_preprocess_instance(instance.to_dict()), # SWE Bench specific
test_result=test_result,
@@ -480,6 +484,8 @@ def process_instance(
metrics=metrics,
error=state.last_error if state and state.last_error else None,
)
print(output)
input()
return output
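
The net effect of the complete_runtime change above is a read-with-fallback: try to cat the generated test file from the sandbox, and fall back to the model's full prediction if anything fails. A simplified, hypothetical sketch of that pattern (condensed from the diff, not the actual function; assumes runtime.run_action returns an observation with exit_code and content):

# Simplified sketch of the fallback logic in complete_runtime: read the test
# file from the sandbox, fall back to the stored full_pred on any error.
def read_test_suite(runtime, instance) -> str:
    try:
        action = CmdRunAction(command=f"cat {instance['test_file']}")
        action.set_hard_timeout(600)
        obs = runtime.run_action(action)
        if obs.exit_code != 0:
            raise RuntimeError(f"failed to read {instance['test_file']}")
        return obs.content.strip()
    except Exception:
        # Skip the sandbox read and use the agent's full predicted test suite.
        return instance['full_pred'] if instance['full_pred'] is not None else ''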


86 changes: 84 additions & 2 deletions evaluation/benchmarks/testgeneval/test_filter.py
@@ -1,3 +1,4 @@
import ast
import re
from typing import List, Tuple

@@ -203,10 +204,10 @@ def filter_passing_tests(
method_full_name = (
method_name.split('.')[-1].split('(')[0].strip().split(' ')[-1]
)
# Check if the method name is in passing_tests or if any passing_test is in the method name
# Check if the method name is in failing_tests or if any failing_test is in the method name
if not (
any(method_full_name in failing_test for failing_test in failing_tests)
and not any(
or any(
failing_test in method_full_name for failing_test in failing_tests
)
):
@@ -241,3 +242,84 @@ def filter_passing_tests(
content_parts.append(func_body)

return '\n\n'.join(content_parts), passing_tests, failing_tests


def filter_tests(
test_content: str, test_output: str, repo: str
) -> Tuple[str, List[str], List[str]]:
"""
Filter tests using AST parsing to remove failing test functions from the test file.
Non-test functions (e.g. setup or helper methods) and classes (even if all test methods are failing)
are preserved.

If AST processing fails (for example, because the test file cannot be parsed),
this function falls back on the existing regex-based filtering (filter_passing_tests).

Returns:
Tuple containing:
- Modified test content (as a string) containing only passing tests.
- List of passing test names.
- List of failing test names.
"""
try:
# Attempt to parse the test file using the AST.
tree = ast.parse(test_content)

# Parse test results using the appropriate parser.
parser = MAP_REPO_TO_PARSER.get(repo, parse_log_pytest)
test_results = parser(test_output)
passing_tests = [
name
for name, status in test_results.items()
if status == TestStatus.PASSED.value
]
failing_tests = [
name
for name, status in test_results.items()
if status != TestStatus.PASSED.value
]

# Helper function to decide if a test name should be considered failing.
def is_failing(name: str) -> bool:
for ft in failing_tests:
if name in ft or ft in name:
return True
return False

new_body = []
for node in tree.body:
# For top-level function definitions, only filter those that look like tests.
if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
if node.name.startswith('test') and is_failing(node.name):
continue
new_body.append(node)
# For classes, filter out failing test methods but preserve other methods (e.g. setup).
elif isinstance(node, ast.ClassDef):
new_class_body = []
for subnode in node.body:
if isinstance(subnode, (ast.FunctionDef, ast.AsyncFunctionDef)):
# Only consider filtering if the method is a test.
qualified_name = f'{node.name}.{subnode.name}'
if is_failing(subnode.name) or is_failing(qualified_name):
continue
new_class_body.append(subnode)
else:
new_class_body.append(subnode)
# Always include the class even if no test methods remain, as it might contain
# setup, teardown, or other necessary logic.
node.body = new_class_body
new_body.append(node)
else:
new_body.append(node)

tree.body = new_body

# Reconstruct the source code from the filtered AST.
# (Requires Python 3.9+ for ast.unparse; otherwise an exception will trigger the fallback.)
new_test_content = ast.unparse(tree)
return new_test_content, passing_tests, failing_tests

except Exception:
print('AST processing failed; falling back on regex-based filtering.')
# If AST processing fails for any reason, fall back on the original regex-based filtering.
return filter_passing_tests(test_content, test_output, repo)
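
As a self-contained illustration of the AST-based idea (standard library only, without the module's log parsers), the following sketch drops a failing top-level test and a failing test method while keeping helpers and setup code:

# Standalone sketch of the AST filtering idea; assumes Python 3.9+ for ast.unparse.
import ast

source = '''
def helper():
    return 42

def test_ok():
    assert helper() == 42

class TestMath:
    def setup_method(self):
        self.x = 1

    def test_bad(self):
        assert self.x == 2
'''

failing = {'test_bad'}

def is_failing_test(node) -> bool:
    return (isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef))
            and node.name.startswith('test')
            and node.name in failing)

tree = ast.parse(source)
for node in tree.body:
    if isinstance(node, ast.ClassDef):
        # Drop failing test methods but keep setup/helper methods.
        node.body = [n for n in node.body if not is_failing_test(n)] or [ast.Pass()]
tree.body = [n for n in tree.body if not is_failing_test(n)]

print(ast.unparse(tree))  # test_bad is gone; helper, test_ok and setup_method remain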
1 change: 1 addition & 0 deletions evaluation/benchmarks/testgeneval/utils.py
@@ -28,6 +28,7 @@ def get_test_directives(instance: TestGenEvalInstance) -> list:

# For Django tests, remove extension + "tests/" prefix and convert slashes to dots (module referencing)
if instance['repo'] == 'django/django':
directives = [instance['test_file']]
directives_transformed = []
for d in directives:
d = d[: -len('.py')] if d.endswith('.py') else d
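
The hunk is truncated here, but the comment above describes the intended transformation: path-to-module conversion for Django test files. A hypothetical standalone sketch of that conversion (not the PR's code) might look like this:

# Hypothetical helper mirroring the Django directive conversion described above:
# strip the '.py' extension, drop a leading 'tests/' prefix, and turn path
# separators into dots for module-style referencing.
def to_django_directive(test_file: str) -> str:
    d = test_file[: -len('.py')] if test_file.endswith('.py') else test_file
    if d.startswith('tests/'):
        d = d[len('tests/'):]
    return d.replace('/', '.')

print(to_django_directive('tests/model_fields/test_integerfield.py'))  # model_fields.test_integerfield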