Switch ACR run 2 and 3
aorwall committed May 13, 2024
1 parent c4b8b02 commit f3f74d2
Showing 489 changed files with 41,347 additions and 41,351 deletions.
8 changes: 2 additions & 6 deletions README.md
@@ -31,12 +31,8 @@ In AutoCodeRover's own benchmarks, they achieve 16.00% (48), 15.67% (47), and 16
swe-bench-docker, the same predictions result in [18.00% (54)](https://github.com/aorwall/SWE-bench-docker/blob/main/evaluations/auto-code-rover-run-1),
[19% (57)](https://github.com/aorwall/SWE-bench-docker/blob/main/evaluations/auto-code-rover-run-2) and
[19% (57)](https://github.com/aorwall/SWE-bench-docker/blob/main/evaluations/auto-code-rover-run-3) resolved issues.
This adds up to a pass@3 of 26% (78) compared to 22.33% (67) reported in the [AutoCodeRover paper](https://arxiv.org/pdf/2404.05427).
In each individual run, there are also benchmark instances that fail in swe-bench-docker's evaluation but not in
AutoCodeRover's. Thus, it seems there are still false positives or negatives that are not detected when comparing with
gold patches, likely due to incorrect dependency versions

But this suggests that other agents' benchmarks may show lower results than they actually achieve because it's challenging
This adds up to a pass@3 of 26% (78) compared to 22.33% (67) reported in the [AutoCodeRover paper](https://arxiv.org/pdf/2404.05427).
This suggests that other agents' benchmarks may show lower results than they actually achieve because it's challenging
to conduct evaluations with completely accurate results.

## Docker images types
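
The pass@3 figure in the README hunk above can be reproduced with a short sketch. It assumes that pass@3 here means an instance counts as resolved if at least one of the three runs resolved it, and that the benchmark has 300 instances (inferred from 54 resolved being reported as 18.00%; the diff does not state the total). The function names and the instance IDs in the example are hypothetical.

```python
def percent(count, total=300):
    """Share of resolved instances as a percentage, rounded to two decimals.

    The default total of 300 is an assumption inferred from the README
    figures (54 resolved reported as 18.00%).
    """
    return round(100.0 * count / total, 2)


def pass_at_k(resolved_per_run, total=300):
    """Count instances resolved in at least one run (union over runs).

    resolved_per_run: iterable of sets of instance IDs, one set per run.
    Returns (resolved_count, resolved_percent).
    """
    resolved_any = set().union(*resolved_per_run)
    return len(resolved_any), percent(len(resolved_any), total)


if __name__ == "__main__":
    # Hypothetical instance IDs, for illustration only.
    runs = [{"a-1", "a-2"}, {"a-2", "a-3"}, {"a-4"}]
    print(pass_at_k(runs, total=10))
    # Counts quoted in the README hunk, expressed as percentages.
    print(percent(78), percent(67))
```

Because the union can only grow as runs are added, pass@3 is never lower than the best single run, which is consistent with 26% (78) sitting above the per-run 18% to 19%.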
114 changes: 55 additions & 59 deletions evaluations/auto-code-rover-run-2/README.md

Large diffs are not rendered by default.

@@ -61,7 +61,7 @@ ERROR conda.cli.main_run:execute(124): `conda run pytest --no-header -rA --tb=no
============================= test session starts ==============================
collected 10 items

astropy/io/ascii/tests/test_rst.py ........FF [100%]
astropy/io/ascii/tests/test_rst.py .........F [100%]

=========================== short test summary info ============================
PASSED astropy/io/ascii/tests/test_rst.py::test_read_normal
@@ -72,9 +72,9 @@ astropy/io/ascii/tests/test_rst.py ....
PASSED astropy/io/ascii/tests/test_rst.py::test_read_unbounded_right_column_header
PASSED astropy/io/ascii/tests/test_rst.py::test_read_right_indented_table
PASSED astropy/io/ascii/tests/test_rst.py::test_trailing_spaces_in_row_definition
FAILED astropy/io/ascii/tests/test_rst.py::test_write_normal - TypeError: can only concatenate list (not "Table") to list
PASSED astropy/io/ascii/tests/test_rst.py::test_write_normal
FAILED astropy/io/ascii/tests/test_rst.py::test_rst_with_header_rows - assert None == Unit("nm")
========================= 2 failed, 8 passed in 0.05s ==========================
========================= 1 failed, 9 passed in 0.08s ==========================


[astropy__astropy__5.1] [astropy__astropy-14182] Return Code: 1
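
Status changes like the one in this log (`test_write_normal` going from FAILED to PASSED) can be spotted by comparing the `-rA` short summary lines of two logs. The following is a minimal sketch, not the tooling used in this repository; the function names are hypothetical, and it assumes each summary line starts with `PASSED`, `FAILED`, or `ERROR` followed by a test ID containing `::`.

```python
import re

# Matches pytest short-summary lines such as
# "FAILED path/to/test_file.py::test_name - reason".
# Requiring "::" skips unrelated lines like "ERROR conda.cli.main_run:...".
STATUS_LINE = re.compile(r"^(PASSED|FAILED|ERROR)\s+(\S+::\S+)")


def parse_statuses(log_text):
    """Map each test ID found in the log to its reported status."""
    statuses = {}
    for line in log_text.splitlines():
        match = STATUS_LINE.match(line.strip())
        if match:
            statuses[match.group(2)] = match.group(1)
    return statuses


def changed_tests(old_log, new_log):
    """Return {test_id: (old_status, new_status)} for tests whose status differs."""
    old = parse_statuses(old_log)
    new = parse_statuses(new_log)
    return {
        test_id: (old.get(test_id), status)
        for test_id, status in new.items()
        if old.get(test_id) != status
    }


if __name__ == "__main__":
    old_log = (
        "FAILED astropy/io/ascii/tests/test_rst.py::test_write_normal - TypeError\n"
        "FAILED astropy/io/ascii/tests/test_rst.py::test_rst_with_header_rows - assert\n"
    )
    new_log = (
        "PASSED astropy/io/ascii/tests/test_rst.py::test_write_normal\n"
        "FAILED astropy/io/ascii/tests/test_rst.py::test_rst_with_header_rows - assert\n"
    )
    print(changed_tests(old_log, new_log))
```

Tests that appear only in the new log are reported with `None` as the old status, so a truncated or missing summary in one log shows up as a change and is not silently ignored.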
@@ -61,7 +61,7 @@ collected 180 items
astropy/nddata/mixins/tests/test_ndarithmetic.py ....................... [ 12%]
........................................................................ [ 52%]
........................................................................ [ 92%]
............F [100%]
............. [100%]

=========================== short test summary info ============================
PASSED astropy/nddata/mixins/tests/test_ndarithmetic.py::test_arithmetics_data[data10-data20]
@@ -243,14 +243,13 @@ astropy/nddata/mixins/tests/test_ndarithmetic.py ...
PASSED astropy/nddata/mixins/tests/test_ndarithmetic.py::test_arithmetics_unknown_uncertainties
PASSED astropy/nddata/mixins/tests/test_ndarithmetic.py::test_psf_warning
PASSED astropy/nddata/mixins/tests/test_ndarithmetic.py::test_raise_method_not_supported
FAILED astropy/nddata/mixins/tests/test_ndarithmetic.py::test_nddata_bitmask_arithmetic - AssertionError:
======================== 1 failed, 179 passed in 0.94s =========================
PASSED astropy/nddata/mixins/tests/test_ndarithmetic.py::test_nddata_bitmask_arithmetic
============================= 180 passed in 1.80s ==============================
<frozen importlib._bootstrap>:228: RuntimeWarning: numpy.ndarray size changed, may indicate binary incompatibility. Expected 80 from C header, got 96 from PyObject

ERROR conda.cli.main_run:execute(124): `conda run pytest --no-header -rA --tb=no -p no:cacheprovider astropy/nddata/mixins/tests/test_ndarithmetic.py` failed. (See above for error)


[astropy__astropy__5.2] [astropy__astropy-14995] Return Code: 1
[astropy__astropy__5.2] [astropy__astropy-14995] Return Code: 0

>>>>> Some Tests Failed
>>>>> All Tests Passed
[astropy__astropy__5.2] [astropy__astropy-14995] Test script run successful
@@ -179,7 +179,7 @@ FAILED astropy/io/fits/tests/test_table.py::TestTableFunctions::test_numpy_ndarr
FAILED astropy/io/fits/tests/test_table.py::TestTableFunctions::test_new_table_from_recarray
FAILED astropy/io/fits/tests/test_table.py::test_regression_5383 - AttributeE...
FAILED astropy/io/fits/tests/test_table.py::test_table_to_hdu - AttributeErro...
============= 21 failed, 13 passed, 2 skipped, 72 errors in 1.40s ==============
============= 21 failed, 13 passed, 2 skipped, 72 errors in 3.11s ==============
ERROR conda.cli.main_run:execute(124): `conda run pytest --no-header -rA --tb=no -p no:cacheprovider astropy/io/fits/tests/test_checksum.py astropy/io/fits/tests/test_table.py` failed. (See above for error)

