gh-128515: Add BOLT build to CI #128845

Merged
merged 8 commits into from
Jan 18, 2025

Conversation

zanieb
Contributor

@zanieb zanieb commented Jan 14, 2025

Adds BOLT test coverage to CI, which will allow us to prevent regressions and move towards stabilization of this feature.

Of note:


@zanieb zanieb force-pushed the zb/bolt branch 5 times, most recently from 29351fc to 1d7ab1e Compare January 14, 2025 20:35
Copied from the JIT workflow
@zanieb
Contributor Author

zanieb commented Jan 14, 2025

Interesting, test_pickle failing on the instrumented binaries. Will need to investigate that, as I haven't seen it before.

edit: This occurs because test_unpickle_module_race fails on a read-only file system. See c3a3800

@zanieb
Contributor Author

zanieb commented Jan 14, 2025

I encountered a couple of blockers for aarch64: a failed assertion in the instrumented binary

./python -m test --pgo --rerun --verbose3 --timeout=
python: ../cpython-ro-srcdir/Python/generated_cases.c.h:1074: _PyEval_EvalFrameDefault: Assertion `tp->tp_alloc == PyType_GenericAlloc' failed.
Aborted (core dumped)

and (after hacking past that) a segfault in BOLT

# Run bolt against the merged data to produce an optimized binary.
for bin in python; do \
  /usr/lib/llvm-19/bin/llvm-bolt "${bin}.prebolt" -o "${bin}.bolt" -data="${bin}.fdata" -update-debug-sections -skip-funcs=_PyEval_EvalFrameDefault,sre_ucs1_match/1,sre_ucs2_match/1,sre_ucs4_match/1  -reorder-blocks=ext-tsp -reorder-functions=cdsort -split-functions -icf=1 -inline-all -split-eh -reorder-functions-use-hot-size -peepholes=none -jump-tables=aggressive -inline-ap -indirect-call-promotion=all -dyno-stats -use-gnu-stack -frame-opt=hot ; \
  mv "${bin}.bolt" "${bin}"; \
done
BOLT-INFO: Target architecture: aarch64
BOLT-INFO: BOLT version: <unknown>
BOLT-INFO: first alloc address is 0x400000
BOLT-INFO: enabling relocation mode
BOLT-INFO: pre-processing profile using branch profile reader
BOLT-INFO: number of removed linker-inserted veneers: 0
BOLT-INFO: 8500 out of 12058 functions in the binary (70.5%) have non-empty execution profile
BOLT-INFO: 41 functions with profile could not be optimized
BOLT-INFO: profile for 1 objects was ignored
BOLT-INFO: removed 1 empty block
BOLT-INFO: ICF folded 678 out of 12439 functions in 5 passes. 0 functions had jump tables.
BOLT-INFO: Removing all identical functions will save 46.23 KB of code space. Folded functions were called 3909549484 times based on profile.
BOLT-INFO: ICP Total indirect calls = 1808544446, 153 callsites cover 99% of all indirect calls
 #0 0x0000aacc1be768cc (/usr/lib/llvm-19/bin/llvm-bolt+0x1ae68cc)
 #1 0x0000aacc1be74b80 (/usr/lib/llvm-19/bin/llvm-bolt+0x1ae4b80)
 #2 0x0000aacc1be77174 (/usr/lib/llvm-19/bin/llvm-bolt+0x1ae7174)
 #3 0x0000ff03feee37e0 (linux-vdso.so.1+0x7e0)
 #4 0x0000aacc1c397200 (/usr/lib/llvm-19/bin/llvm-bolt+0x2007200)
 #5 0x0000aacc1c39aa1c (/usr/lib/llvm-19/bin/llvm-bolt+0x200aa1c)
 #6 0x0000aacc1c39a9e4 (/usr/lib/llvm-19/bin/llvm-bolt+0x200a9e4)
 #7 0x0000aacc1c39a9e4 (/usr/lib/llvm-19/bin/llvm-bolt+0x200a9e4)
 #8 0x0000aacc1bf1ebc4 (/usr/lib/llvm-19/bin/llvm-bolt+0x1b8ebc4)
 #9 0x0000aacc1bf21328 (/usr/lib/llvm-19/bin/llvm-bolt+0x1b91328)
#10 0x0000aacc1becfe3c (/usr/lib/llvm-19/bin/llvm-bolt+0x1b3fe3c)
#11 0x0000aacc1aadf2f0 (/usr/lib/llvm-19/bin/llvm-bolt+0x74f2f0)
#12 0x0000ff03fe8684c4 __libc_start_call_main ./csu/../sysdeps/nptl/libc_start_call_main.h:74:3
#13 0x0000ff03fe868598 call_init ./csu/../csu/libc-start.c:128:20
#14 0x0000ff03fe868598 __libc_start_main ./csu/../csu/libc-start.c:347:5
#15 0x0000aacc1aadd4f0 (/usr/lib/llvm-19/bin/llvm-bolt+0x74d4f0)
PLEASE submit a bug report to https://github.com/llvm/llvm-project/issues/ and include the crash backtrace.
Stack dump:
0.	Program arguments: /usr/lib/llvm-19/bin/llvm-bolt python.prebolt -o python.bolt -data=python.fdata -update-debug-sections -skip-funcs=_PyEval_EvalFrameDefault,sre_ucs1_match/1,sre_ucs2_match/1,sre_ucs4_match/1 -reorder-blocks=ext-tsp -reorder-functions=cdsort -split-functions -icf=1 -inline-all -split-eh -reorder-functions-use-hot-size -peepholes=none -jump-tables=aggressive -inline-ap -indirect-call-promotion=all -dyno-stats -use-gnu-stack -frame-opt=hot
Segmentation fault (core dumped)

I dropped aarch64 in 684ece4 — we can add it later.

@corona10 corona10 self-assigned this Jan 14, 2025
@zanieb
Contributor Author

zanieb commented Jan 14, 2025

A few tests are failing after BOLT optimization. I'd appreciate some guidance on that.

test_sys_api (test.test_perf_profiler.TestPerfTrampoline.test_sys_api) ... FAIL
test_trampoline_works (test.test_perf_profiler.TestPerfTrampoline.test_trampoline_works) ... FAIL
test_trampoline_works_with_forks (test.test_perf_profiler.TestPerfTrampoline.test_trampoline_works_with_forks) ... FAIL

======================================================================
FAIL: test_sys_api (test.test_perf_profiler.TestPerfTrampoline.test_sys_api)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/runner/work/cpython/cpython-ro-srcdir/Lib/test/test_perf_profiler.py", line 203, in test_sys_api
    self.assertIn(f"py::spam:{script}", perf_file_contents)
    ~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
AssertionError: 'py::spam:/tmp/test_python_qxe1_ajb/tmpqablk9qp/perftest.py' not found in '7f2d97946000 80600b py::baz:/tmp/test_python_qxe1_ajb/tmpqablk9qp/perftest.py\n'

======================================================================
FAIL: test_trampoline_works (test.test_perf_profiler.TestPerfTrampoline.test_trampoline_works)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/runner/work/cpython/cpython-ro-srcdir/Lib/test/test_perf_profiler.py", line 91, in test_trampoline_works
    self.assertIsNotNone(
    ~~~~~~~~~~~~~~~~~~~~^
        perf_line, f"Could not find {expected_symbol} in perf file"
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    )
    ^
AssertionError: unexpectedly None : Could not find py::foo:/tmp/test_python_qxe1_ajb/tmpdd3d4w9f/perftest.py in perf file

======================================================================
FAIL: test_trampoline_works_with_forks (test.test_perf_profiler.TestPerfTrampoline.test_trampoline_works_with_forks)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/runner/work/cpython/cpython-ro-srcdir/Lib/test/test_perf_profiler.py", line 145, in test_trampoline_works_with_forks
    self.assertEqual(process.returncode, 0)
    ~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^
AssertionError: -11 != 0

----------------------------------------------------------------------
Ran 3 tests in 0.463s

FAILED (failures=3)
test test_perf_profiler failed
1 test failed again:
    test_perf_profiler
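
For readers unfamiliar with what these assertions inspect: the perf file contents quoted in the first failure are lines of the form `START SIZE NAME`, with the address and size in hexadecimal, and a negative `returncode` from `subprocess` means the child was killed by that signal number. The sketch below is illustrative only; `parse_perf_map_line` and `describe_returncode` are hypothetical helpers, not functions from `test_perf_profiler`.

```python
import signal
from typing import NamedTuple


class PerfMapEntry(NamedTuple):
    start: int  # start address of the mapped code
    size: int   # size of the mapped code in bytes
    name: str   # symbol name, e.g. "py::baz:/path/to/script.py"


def parse_perf_map_line(line: str) -> PerfMapEntry:
    """Parse one 'START SIZE NAME' line (hypothetical helper)."""
    # The name may itself contain spaces, so split at most twice.
    start, size, name = line.rstrip("\n").split(" ", 2)
    return PerfMapEntry(int(start, 16), int(size, 16), name)


def describe_returncode(returncode: int) -> str:
    """Name the signal behind a negative subprocess return code."""
    if returncode < 0:
        return signal.Signals(-returncode).name
    return f"exit status {returncode}"


# The single line quoted in the test_sys_api failure above.
contents = (
    "7f2d97946000 80600b "
    "py::baz:/tmp/test_python_qxe1_ajb/tmpqablk9qp/perftest.py\n"
)
entries = [parse_perf_map_line(line) for line in contents.splitlines()]
names = [entry.name for entry in entries]
```

With the quoted contents, `names` holds only the `py::baz` entry, which is why the lookup for `py::spam` fails. On Linux, `describe_returncode(-11)` gives `SIGSEGV`, so the `-11 != 0` failure indicates that the child process segfaulted.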

@zanieb
Contributor Author

zanieb commented Jan 14, 2025

The timing on this actually seems pretty reasonable at 13 minutes.

We could expand this to perform other build optimizations, e.g., PGO, to verify they're working as intended? Right now it's just BOLT though.

@corona10
Member

Two things

  • We should use this action for BOLT only since the test coverage is different from PGO+LTO build.
  • Let's skip the 3 failing tests by using @unittest.skipIf(support.check_bolt_optimized.

@zanieb
Contributor Author

zanieb commented Jan 15, 2025

We should use this action for BOLT only since the test coverage is different from PGO+LTO build.

Can you expand on this comment?

Let's skip the 3 failing tests by using @unittest.skipIf(support.check_bolt_optimized.

Sounds good to me — should I open an issue to investigate why they fail too? Like is the profiler actually broken?

@corona10
Member

Can you expand on this comment?

Because we skip several tests with the BOLTed binary, a combined job could not catch PGO + LTO regressions in the tests that are skipped. Currently, PGO + LTO is the standard optimization policy of the CPython project.
This is why I suggested handling the PGO + LTO build separately.

Sounds good to me — should I open an issue to investigate why they fail too? Like is the profiler actually broken?

Yeah, we should; maybe @pablogsal is interested in this issue.

@zanieb
Contributor Author

zanieb commented Jan 15, 2025

Created a tracking issue at #128883; skipped the tests in 01cb8d8
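
A minimal sketch of the skip pattern discussed above. It assumes a BOLT build can be detected from the `--enable-bolt` configure flag recorded in `CONFIG_ARGS`; the local `check_bolt_optimized` below is a stand-in written for illustration and is not necessarily how `support.check_bolt_optimized` is implemented. The skip message and the test body are placeholders.

```python
import sysconfig
import unittest


def check_bolt_optimized() -> bool:
    """Stand-in for support.check_bolt_optimized (assumed behaviour)."""
    # Assumption: a BOLT build was configured with --enable-bolt and the
    # configure arguments are recorded in CONFIG_ARGS (None on some platforms).
    config_args = sysconfig.get_config_var("CONFIG_ARGS") or ""
    return "--enable-bolt" in config_args


class TestPerfTrampoline(unittest.TestCase):
    @unittest.skipIf(check_bolt_optimized(), "fails on BOLT optimized binaries")
    def test_trampoline_works(self):
        # Placeholder body; the real test inspects the perf map file.
        self.assertTrue(True)


def run_suite() -> unittest.TestResult:
    """Run the sketch test case and return the collected result."""
    loader = unittest.defaultTestLoader
    suite = loader.loadTestsFromTestCase(TestPerfTrampoline)
    result = unittest.TestResult()
    suite.run(result)
    return result
```

On an interpreter that was not configured with `--enable-bolt`, the test runs normally; on a BOLT build it is reported as skipped rather than failed, which keeps the job green while the tracking issue stays open.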

@zanieb zanieb marked this pull request as ready for review January 15, 2025 15:05
@zanieb zanieb added the infra CI, GitHub Actions, buildbots, Dependabot, etc. label Jan 15, 2025
Comment on lines +253 to +258
# Do not test BOLT with free-threading, to conserve resources
- bolt: true
  free-threading: true
# BOLT currently crashes during instrumentation on aarch64
- os: ubuntu-24.04-aarch64
  bolt: true
Contributor Author

I don't have strong feelings about this pattern (using exclude instead of include), but liked that I could document why we're not running the additional cases.
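
To make the `exclude` pattern concrete, here is a rough Python model of how a build matrix expands and how partial-match exclude entries prune it. The axis values are assumptions for illustration: only `ubuntu-24.04-aarch64`, `bolt`, and `free-threading` appear in the snippet above, and `ubuntu-24.04` is a guess. `expand_matrix` is a hypothetical helper, not part of any GitHub tooling.

```python
from itertools import product


def expand_matrix(axes: dict, exclude: list) -> list:
    """Expand axes into all combinations, dropping excluded ones.

    A combination is dropped when every key/value pair of some exclude
    entry matches it (entries may name only a subset of the axes).
    """
    keys = list(axes)
    combos = [dict(zip(keys, values)) for values in product(*axes.values())]
    return [
        combo
        for combo in combos
        if not any(
            all(combo.get(key) == value for key, value in rule.items())
            for rule in exclude
        )
    ]


axes = {
    "os": ["ubuntu-24.04", "ubuntu-24.04-aarch64"],  # first value is assumed
    "bolt": [False, True],
    "free-threading": [False, True],
}
exclude = [
    # Do not test BOLT with free-threading, to conserve resources
    {"bolt": True, "free-threading": True},
    # BOLT currently crashes on aarch64
    {"os": "ubuntu-24.04-aarch64", "bolt": True},
]

jobs = expand_matrix(axes, exclude)
bolt_jobs = [job for job in jobs if job["bolt"]]
```

Under these assumed axes, five of the eight combinations survive and `bolt_jobs` has a single entry (`ubuntu-24.04`, no free-threading). Each exclude entry carries its own comment, which is the documentation benefit described above.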

@@ -246,10 +250,17 @@ jobs:
exclude:
Member

Not a strong opinion, but I would prefer to have just 1, 2, or 3 jobs with BOLT, unless more are absolutely needed.

We can move some very specific builds to buildbots, while maintaining the bare minimum in CI.

Contributor Author

This is just one job with BOLT — I think in the future we'd want a second job for aarch64 once that's unblocked. Are you suggesting I should frame this as an include instead? ref #128845 (comment)

Member

Yes! Sorry for not being clear :)

Contributor Author

I'll wait to change it until others have a chance to weigh in, but I'm not opposed.

Contributor Author

I started to do this but found it awkward since I know we want aarch64 testing here eventually. Since @hugovk 👍 my comment at #128845 (comment) I think I'll leave it for now.

run: >-
PROFILE_TASK='-m test --pgo --ignore test_unpickle_module_race'
Member

Interesting, why doesn't it raise any issues when we do a PGO build?

Contributor Author

I don't think there is a PGO build in CI

Contributor Author

(The read-only build file system looks specific to this CI setup)

Member

I don't think there is a PGO build in CI

FYI, there is no CI for PGO and LTO on GitHub Actions.
But we do run CI for PGO and LTO on the buildbots.
https://buildbot.python.org/#/builders/378/builds/1554

Member

(The read-only build file system looks specific to this CI setup)

Let me take a look in more detail. :)

Contributor Author

https://github.com/python/cpython/actions/runs/12776586124/job/35615597427

ERROR: test_unpickle_module_race (test.test_pickle.PyUnpicklerTests.test_unpickle_module_race)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/runner/work/cpython/cpython-ro-srcdir/Lib/test/support/import_helper.py", line 48, in forget
    unlink(source + 'c')
    ~~~~~~^^^^^^^^^^^^^^
  File "/home/runner/work/cpython/cpython-ro-srcdir/Lib/test/support/os_helper.py", line 345, in unlink
    _unlink(filename)
    ~~~~~~~^^^^^^^^^^
OSError: [Errno 30] Read-only file system: '/home/runner/work/cpython/cpython-ro-srcdir/Lib/locker.pyc'

I think it's fine to remove the test from the optimization suite. Some problems here seem likely given the read-only setup. It's known that the tests require a writable source directory:

- name: Remount sources writable for tests
  # some tests write to srcdir, lack of pyc files slows down testing
  run: sudo mount "$CPYTHON_RO_SRCDIR" -oremount,rw
- name: Tests
  working-directory: ${{ env.CPYTHON_BUILDDIR }}
  run: xvfb-run make ci
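
The `Errno 30` in the traceback is `EROFS` on Linux. As a sketch of how cleanup code could tolerate a read-only source tree, one option is to swallow only that specific error. This is not what `os_helper.unlink` does and not what this PR changes (the PR ignores the test in the profiling run instead); `unlink_unless_read_only` is a hypothetical name.

```python
import errno
import os


def unlink_unless_read_only(path: str) -> bool:
    """Remove path; return False if the file system is read-only.

    Hypothetical helper: missing files are ignored, EROFS is reported
    through the return value, and any other OSError propagates.
    """
    try:
        os.unlink(path)
    except FileNotFoundError:
        return True  # nothing to remove
    except OSError as exc:
        if exc.errno == errno.EROFS:
            return False  # e.g. a read-only bind mount of the source tree
        raise
    return True
```

A caller could use the `False` result to skip the remainder of a test that needs to write `.pyc` files next to the sources.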

Contributor Author

@zanieb zanieb Jan 16, 2025

#29904 added the read-only out of tree builds.

Contributor Author

(We could disable the read-only builds for the BOLT job but it seems more painful than it's worth)

Member

Ah, okay, let’s exclude the test the current way and file an issue about the test suite problem. Thank you for the investigation :)

Member

@hugovk Do we have any reason to mount the file system as read-only?

Zanie found it was added in #29904, which gives this reason:

The Ubuntu test runner now configures and compiles CPython out of tree.
The source directory is a read-only bind mount to ensure that the build
cannot create or modify any files in the source tree.

Member

@hugovk hugovk left a comment

Looks good!

Member

@corona10 corona10 left a comment

LGTM! Thank you for the work!

@corona10 corona10 enabled auto-merge (squash) January 18, 2025 07:26
@corona10 corona10 merged commit 9ed7bf2 into python:main Jan 18, 2025
44 checks passed
srinivasreddy pushed a commit to srinivasreddy/cpython that referenced this pull request Jan 21, 2025
Labels
infra CI, GitHub Actions, buildbots, Dependabot, etc. skip news