Switch ACR run 2 and 3

aorwall · May 13, 2024
1 parent c4b8b02
commit f3f74d2

Showing 489 changed files with 41,347 additions and 41,351 deletions.
diff --git a/README.md b/README.md
@@ -31,12 +31,8 @@ In AutoCodeRover's own benchmarks, they achieve 16.00% (48), 15.67% (47), and 16
 swe-bench-docker, the same predictions result in [18.00% (54)](https://github.com/aorwall/SWE-bench-docker/blob/main/evaluations/auto-code-rover-run-1), 
 [19% (57)](https://github.com/aorwall/SWE-bench-docker/blob/main/evaluations/auto-code-rover-run-2) and 
 [19% (57)](https://github.com/aorwall/SWE-bench-docker/blob/main/evaluations/auto-code-rover-run-3) resolved issues. 
-This adds up to a pass@3 of 26% (78) compared to 22.33% (67) reported in the [AutoCodeRover paper](https://arxiv.org/pdf/2404.05427). 
-In each individual run, there are also benchmark instances that fail in swe-bench-docker's evaluation but not in 
-AutoCodeRover's. Thus, it seems there are still false positives or negatives that are not detected when comparing with
-gold patches, likely due to incorrect dependency versions 
-
-But this suggests that other agents' benchmarks may show lower results than they actually achieve because it's challenging
+This adds up to a pass@3 of 26% (78) compared to 22.33% (67) reported in the [AutoCodeRover paper](https://arxiv.org/pdf/2404.05427).
+This suggests that other agents' benchmarks may show lower results than they actually achieve because it's challenging
 to conduct evaluations with completely accurate results.
 
 ## Docker images types

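The figures in the README hunk above are consistent with a benchmark of 300 instances (54/300 = 18.00%, 78/300 = 26%). That total is inferred from the numbers and is not stated in this diff. The sketch below assumes that pass@3 here means an instance counts as resolved if at least one of the three runs resolved it. The function names and the toy instance IDs are hypothetical and are not part of swe-bench-docker.

```python
# Assumption: 300 instances, inferred from 54 -> 18.00% and 78 -> 26%.
TOTAL_INSTANCES = 300


def percent(count, total=TOTAL_INSTANCES):
    """Return count as a percentage of total, rounded to two decimals."""
    return round(100.0 * count / total, 2)


def pass_at_k(runs, total=TOTAL_INSTANCES):
    """Percentage of instances resolved by at least one run.

    `runs` is an iterable of sets of resolved instance IDs, one set per run.
    """
    resolved_in_any_run = set().union(*runs)
    return percent(len(resolved_in_any_run), total)


if __name__ == "__main__":
    # Counts quoted in the README text.
    print(percent(54), percent(57), percent(78), percent(67))

    # Toy example with hypothetical instance IDs and a total of 10.
    run_1 = {"a", "b"}
    run_2 = {"b", "c"}
    run_3 = {"c"}
    print(pass_at_k([run_1, run_2, run_3], total=10))
```

Because the union only grows as runs are added, pass@3 is at least as high as the best single run, which matches 26% against 18% to 19% per run.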
diff --git a/evaluations/auto-code-rover-run-2/README.md b/evaluations/auto-code-rover-run-2/README.md
diff --git a/evaluations/auto-code-rover-run-2/logs/astropy__astropy-14182.gpt-4-0125-preview.eval.log b/evaluations/auto-code-rover-run-2/logs/astropy__astropy-14182.gpt-4-0125-preview.eval.log
@@ -61,7 +61,7 @@ ERROR conda.cli.main_run:execute(124): `conda run pytest --no-header -rA --tb=no
 [1m============================= test session starts ==============================[0m
 collected 10 items
 
-astropy/io/ascii/tests/test_rst.py [32m.[0m[32m.[0m[32m.[0m[32m.[0m[32m.[0m[32m.[0m[32m.[0m[32m.[0m[31mF[0m[31mF[0m[31m                            [100%][0m
+astropy/io/ascii/tests/test_rst.py [32m.[0m[32m.[0m[32m.[0m[32m.[0m[32m.[0m[32m.[0m[32m.[0m[32m.[0m[32m.[0m[31mF[0m[31m                            [100%][0m
 
 [36m[1m=========================== short test summary info ============================[0m
 [32mPASSED[0m astropy/io/ascii/tests/test_rst.py::[1mtest_read_normal[0m
@@ -72,9 +72,9 @@ astropy/io/ascii/tests/test_rst.py [32m.[0m[32m.[0m[32m.[0m[32m.[0m[32m
 [32mPASSED[0m astropy/io/ascii/tests/test_rst.py::[1mtest_read_unbounded_right_column_header[0m
 [32mPASSED[0m astropy/io/ascii/tests/test_rst.py::[1mtest_read_right_indented_table[0m
 [32mPASSED[0m astropy/io/ascii/tests/test_rst.py::[1mtest_trailing_spaces_in_row_definition[0m
-[31mFAILED[0m astropy/io/ascii/tests/test_rst.py::[1mtest_write_normal[0m - TypeError: can only concatenate list (not "Table") to list
+[32mPASSED[0m astropy/io/ascii/tests/test_rst.py::[1mtest_write_normal[0m
 [31mFAILED[0m astropy/io/ascii/tests/test_rst.py::[1mtest_rst_with_header_rows[0m - assert None == Unit("nm")
-[31m========================= [31m[1m2 failed[0m, [32m8 passed[0m[31m in 0.05s[0m[31m ==========================[0m
+[31m========================= [31m[1m1 failed[0m, [32m9 passed[0m[31m in 0.08s[0m[31m ==========================[0m
 
 
 [astropy__astropy__5.1] [astropy__astropy-14182] Return Code: 1 

diff --git a/...astropy-14365.gpt-4-0125-preview.eval.log → ...astropy-14365.gpt-4-0125-preview.eval.log b/...astropy-14365.gpt-4-0125-preview.eval.log → ...astropy-14365.gpt-4-0125-preview.eval.log
diff --git a/evaluations/auto-code-rover-run-2/logs/astropy__astropy-14995.gpt-4-0125-preview.eval.log b/evaluations/auto-code-rover-run-2/logs/astropy__astropy-14995.gpt-4-0125-preview.eval.log
@@ -61,7 +61,7 @@ collected 180 items
 astropy/nddata/mixins/tests/test_ndarithmetic.py [32m.[0m[32m.[0m[32m.[0m[32m.[0m[32m.[0m[32m.[0m[32m.[0m[32m.[0m[32m.[0m[32m.[0m[32m.[0m[32m.[0m[32m.[0m[32m.[0m[32m.[0m[32m.[0m[32m.[0m[32m.[0m[32m.[0m[32m.[0m[32m.[0m[32m.[0m[32m.[0m[32m [ 12%]
 [0m[32m.[0m[32m.[0m[32m.[0m[32m.[0m[32m.[0m[32m.[0m[32m.[0m[32m.[0m[32m.[0m[32m.[0m[32m.[0m[32m.[0m[32m.[0m[32m.[0m[32m.[0m[32m.[0m[32m.[0m[32m.[0m[32m.[0m[32m.[0m[32m.[0m[32m.[0m[32m.[0m[32m.[0m[32m.[0m[32m.[0m[32m.[0m[32m.[0m[32m.[0m[32m.[0m[32m.[0m[32m.[0m[32m.[0m[32m.[0m[32m.[0m[32m.[0m[32m.[0m[32m.[0m[32m.[0m[32m.[0m[32m.[0m[32m.[0m[32m.[0m[32m.[0m[32m.[0m[32m.[0m[32m.[0m[32m.[0m[32m.[0m[32m.[0m[32m.[0m[32m.[0m[32m.[0m[32m.[0m[32m.[0m[32m.[0m[32m.[0m[32m.[0m[32m.[0m[32m.[0m[32m.[0m[32m.[0m[32m.[0m[32m.[0m[32m.[0m[32m.[0m[32m.[0m[32m.[0m[32m.[0m[32m.[0m[32m.[0m[32m.[0m[32m [ 52%]
 [0m[32m.[0m[32m.[0m[32m.[0m[32m.[0m[32m.[0m[32m.[0m[32m.[0m[32m.[0m[32m.[0m[32m.[0m[32m.[0m[32m.[0m[32m.[0m[32m.[0m[32m.[0m[32m.[0m[32m.[0m[32m.[0m[32m.[0m[32m.[0m[32m.[0m[32m.[0m[32m.[0m[32m.[0m[32m.[0m[32m.[0m[32m.[0m[32m.[0m[32m.[0m[32m.[0m[32m.[0m[32m.[0m[32m.[0m[32m.[0m[32m.[0m[32m.[0m[32m.[0m[32m.[0m[32m.[0m[32m.[0m[32m.[0m[32m.[0m[32m.[0m[32m.[0m[32m.[0m[32m.[0m[32m.[0m[32m.[0m[32m.[0m[32m.[0m[32m.[0m[32m.[0m[32m.[0m[32m.[0m[32m.[0m[32m.[0m[32m.[0m[32m.[0m[32m.[0m[32m.[0m[32m.[0m[32m.[0m[32m.[0m[32m.[0m[32m.[0m[32m.[0m[32m.[0m[32m.[0m[32m.[0m[32m.[0m[32m.[0m[32m.[0m[32m [ 92%]
-[0m[32m.[0m[32m.[0m[32m.[0m[32m.[0m[32m.[0m[32m.[0m[32m.[0m[32m.[0m[32m.[0m[32m.[0m[32m.[0m[32m.[0m[31mF[0m[31m                                                            [100%][0m
+[0m[32m.[0m[32m.[0m[32m.[0m[32m.[0m[32m.[0m[32m.[0m[32m.[0m[32m.[0m[32m.[0m[32m.[0m[32m.[0m[32m.[0m[32m.[0m[32m                                                            [100%][0m
 
 [36m[1m=========================== short test summary info ============================[0m
 [32mPASSED[0m astropy/nddata/mixins/tests/test_ndarithmetic.py::[1mtest_arithmetics_data[data10-data20][0m
@@ -243,14 +243,13 @@ astropy/nddata/mixins/tests/test_ndarithmetic.py [32m.[0m[32m.[0m[32m.[0m
 [32mPASSED[0m astropy/nddata/mixins/tests/test_ndarithmetic.py::[1mtest_arithmetics_unknown_uncertainties[0m
 [32mPASSED[0m astropy/nddata/mixins/tests/test_ndarithmetic.py::[1mtest_psf_warning[0m
 [32mPASSED[0m astropy/nddata/mixins/tests/test_ndarithmetic.py::[1mtest_raise_method_not_supported[0m
-[31mFAILED[0m astropy/nddata/mixins/tests/test_ndarithmetic.py::[1mtest_nddata_bitmask_arithmetic[0m - AssertionError: 
-[31m======================== [31m[1m1 failed[0m, [32m179 passed[0m[31m in 0.94s[0m[31m =========================[0m
+[32mPASSED[0m astropy/nddata/mixins/tests/test_ndarithmetic.py::[1mtest_nddata_bitmask_arithmetic[0m
+[32m============================= [32m[1m180 passed[0m[32m in 1.80s[0m[32m ==============================[0m
 <frozen importlib._bootstrap>:228: RuntimeWarning: numpy.ndarray size changed, may indicate binary incompatibility. Expected 80 from C header, got 96 from PyObject
 
-ERROR conda.cli.main_run:execute(124): `conda run pytest --no-header -rA --tb=no -p no:cacheprovider astropy/nddata/mixins/tests/test_ndarithmetic.py` failed. (See above for error)
 
 
-[astropy__astropy__5.2] [astropy__astropy-14995] Return Code: 1 
+[astropy__astropy__5.2] [astropy__astropy-14995] Return Code: 0 
 
->>>>> Some Tests Failed
+>>>>> All Tests Passed
 [astropy__astropy__5.2] [astropy__astropy-14995] Test script run successful 
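The hunk above flips astropy__astropy-14995 from `Return Code: 1` / `>>>>> Some Tests Failed` to `Return Code: 0` / `>>>>> All Tests Passed`. As a minimal sketch for checking such flips across many logs, the hypothetical helper below reads only those two markers from a log's text. It is an illustration based on the lines visible in this diff, not the evaluation logic swe-bench-docker itself uses.

```python
import re

# Marker strings copied from the log lines shown in the diff above.
PASS_MARKER = ">>>>> All Tests Passed"
FAIL_MARKER = ">>>>> Some Tests Failed"
RETURN_CODE_RE = re.compile(r"Return Code: (\d+)")


def summarize_eval_log(text):
    """Return (status, return_code) for the text of one eval log.

    status is "passed", "failed", or "unknown" when neither marker is present;
    return_code is None when no "Return Code:" line is found.
    """
    match = RETURN_CODE_RE.search(text)
    return_code = int(match.group(1)) if match else None

    if PASS_MARKER in text:
        status = "passed"
    elif FAIL_MARKER in text:
        status = "failed"
    else:
        status = "unknown"
    return status, return_code
```

Comparing the result for the old and new version of a log gives a quick way to list which instances changed outcome in a commit like this one.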
diff --git a/evaluations/auto-code-rover-run-2/logs/astropy__astropy-6938.gpt-4-0125-preview.eval.log b/evaluations/auto-code-rover-run-2/logs/astropy__astropy-6938.gpt-4-0125-preview.eval.log
@@ -179,7 +179,7 @@ FAILED astropy/io/fits/tests/test_table.py::TestTableFunctions::test_numpy_ndarr
 FAILED astropy/io/fits/tests/test_table.py::TestTableFunctions::test_new_table_from_recarray
 FAILED astropy/io/fits/tests/test_table.py::test_regression_5383 - AttributeE...
 FAILED astropy/io/fits/tests/test_table.py::test_table_to_hdu - AttributeErro...
-============= 21 failed, 13 passed, 2 skipped, 72 errors in 1.40s ==============
+============= 21 failed, 13 passed, 2 skipped, 72 errors in 3.11s ==============
 ERROR conda.cli.main_run:execute(124): `conda run pytest --no-header -rA --tb=no -p no:cacheprovider astropy/io/fits/tests/test_checksum.py astropy/io/fits/tests/test_table.py` failed. (See above for error)