Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BOLT] Optimized binary segfaults when using --use-old-text #89117

Closed
Zentrik opened this issue Apr 17, 2024 · 14 comments
Closed

[BOLT] Optimized binary segfaults when using --use-old-text #89117

Zentrik opened this issue Apr 17, 2024 · 14 comments
Labels

Comments

@Zentrik
Copy link
Contributor

Zentrik commented Apr 17, 2024

I'm using BOLT built from LLVM 18.1.4, I also see a segfault on 18.1.3 and 16.

I run bolt libjulia-internal.so.1.12.0.original -o libjulia-internal.so.1.12.0 --use-old-text -no-huge-pages (-no-huge-pages is needed otherwise it doesn't use the old text). Here's the log

BOLT-INFO: shared object or position-independent executable detected
BOLT-INFO: Target architecture: x86_64
BOLT-INFO: BOLT version: e6c3289804a67ea0bb6a86fadbe454dd93b8d855
BOLT-INFO: first alloc address is 0x0
BOLT-INFO: creating new program header table at address 0x61e000, offset 0x61e000
BOLT-WARNING: debug info will be stripped from the binary. Use -update-debug-sections to keep it.
BOLT-INFO: enabling relocation mode
BOLT-INFO: enabling -align-macro-fusion=all since no profile was specified
BOLT-WARNING: Failed to analyze 22 relocations
BOLT-INFO: forcing -jump-tables=move as PIC jump table was detected in function scm_to_julia_/1(*2)
BOLT-INFO: 0 out of 5651 functions in the binary (0.0%) have non-empty execution profile
BOLT-INFO: the input contains 597 (dynamic count : 0) opportunities for macro-fusion optimization that are going to be fixed
BOLT-INFO: 36848 instructions were shortened
BOLT-INFO: removed 12783 empty blocks
BOLT-INFO: SCTC: patched 77 tail calls (71 forward) tail calls (6 backward) from a total of 77 while removing 0 double jumps and removing 52 basic blocks totalling 260 bytes of code. CTCs total execution count is 0 and the number of times CTCs are taken is 0
BOLT-INFO: using original .text for new code with 0x1000 alignment
BOLT-INFO: setting _end to 0x66669c
BOLT-INFO: patched build-id (flipped last bit)

I compiled with gcc with -fno-reorder-blocks-and-partition and -Wl,--emit-relocs.

Binaries are here if you want them (this was with the BOLT from LLVM 16).

If you want to reproduce the segfault on x86_64 linux you can download julia from the artifacts here https://buildkite.com/julialang/julia-master/builds/34942#018e6d3a-e24c-4bad-8406-6fbb670d9309. You can then BOLT the original binary or use the one I've already BOLTED here and replace libjulia-internal.so.1.12.0 in lib/julia with it. Then just run bin/julia from the root directory.

@github-actions github-actions bot added the BOLT label Apr 17, 2024
@aaupov
Copy link
Contributor

aaupov commented Apr 17, 2024

Can you please check which function contains the illegal instruction, and dump the function throughout BOLT's optimizations with --print-only=<func> --print-all?

@Zentrik

This comment was marked as outdated.

@Zentrik

This comment was marked as outdated.

@Zentrik
Copy link
Contributor Author

Zentrik commented Apr 17, 2024

Ok I had to use --print-only=fl_file/1
Here's the output: https://gist.github.com/Zentrik/d901bfc144af6348942f00a9558b8bd8

@Zentrik
Copy link
Contributor Author

Zentrik commented Apr 17, 2024

Btw the segfault occurs at 11af52: 48 8b 10 mov (%rax),%rdx.

@Zentrik
Copy link
Contributor Author

Zentrik commented Apr 18, 2024

As far as I can tell in the apply_cl function we have jmp *(%r14, %rax, 8). Without --use-old-text this doesn't jump to fl_file but it does with --use-old-text. %rax is 66 in both binaries.

@Zentrik
Copy link
Contributor Author

Zentrik commented Apr 19, 2024

This replicates on 0dcabba. It seems in apply_cl we use computed go to's which is what that jump is.

@Zentrik
Copy link
Contributor Author

Zentrik commented Apr 19, 2024

Ok so the issue seems to be that BOLT doesn't update the values in https://github.com/JuliaLang/julia/blob/7f8635f11cae5f3f592afcc7b55c8e0e23589c3d/src/flisp/opcodes.h#L39 so we end up jumping to the wrong place. Similarly it doesn't look like vm_apply_labels gets updated.

@aaupov
Copy link
Contributor

aaupov commented Apr 19, 2024

Thank you for finding the root cause. Please skip the function containing computed goto jump table with -skip-funcs=<func>. We don’t fully support them, and there’s a similar issue with CPython (I had a WIP diff to support it but it’s low priority for me at the moment).

@Zentrik
Copy link
Contributor Author

Zentrik commented Apr 19, 2024

Thanks. If I don't use the old text flag is it fine to not skip it or might it still cause problems?

@Zentrik
Copy link
Contributor Author

Zentrik commented Apr 19, 2024

Just to note, neither -skip-funcs=apply_cl or -skip-funcs=apply_cl* seemed to fix it

@maksfb
Copy link
Contributor

maksfb commented Apr 22, 2024

Even with --skip-funcs, listed functions are still rewritten with --use-old-text. We need a proper support for computed goto's with PIC. Meanwhile, you have to drop --use-old-text.

@maksfb
Copy link
Contributor

maksfb commented Apr 23, 2024

I've created #89681 to prevent BOLT from producing corrupted binaries in such cases.

@maksfb
Copy link
Contributor

maksfb commented Nov 7, 2024

Presumably, fixed by #89681.

@maksfb maksfb closed this as completed Nov 7, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

3 participants