ThinLTO breaks kernel with ZFS built-in #1731

Open
yshui opened this issue Oct 6, 2022 · 4 comments
Labels
[ARCH] x86_64 This bug impacts ARCH=x86_64 [BUG] Untriaged Something isn't working [FEATURE] LTO Related to building the kernel with LLVM Link Time Optimization

Comments

@yshui
Member

yshui commented Oct 6, 2022

I reported this in #1440, but after further investigation this looks like a different issue.

The symptom is the same as #1440:

[    0.853948][    T1] jump_label: Fatal kernel bug, unexpected op at swap_writepage+0x1c/0x60 [(____ptrval____)] (eb 1c 48 89 df != 66 90 0f 1f 00)) size:2 type:1
[    0.854952][    T1] ------------[ cut here ]------------
[    0.855258][    T1] kernel BUG at arch/x86/kernel/jump_label.c:73!
[    0.855617][    T1] invalid opcode: 0000 [#1] PREEMPT SMP NOPTI
[    0.855951][    T1] CPU: 0 PID: 1 Comm: swapper/0 Not tainted 5.19.0-local+ #25
[    0.856363][    T1] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS d55cb5a 04/01/2014
[    0.856871][    T1] RIP: 0010:__jump_label_patch+0x190/0x1a0
[    0.857196][    T1] Code: 5e 41 5f 5d e9 d1 66 f7 ff 48 c7 c7 22 63 cf b9 4c 89 fe 4c 89 fa 4c 89 f9 49 89 d8 45 89 e1 41 56 e8 d9 a3 0d 00 48 83 c4 08 <0f> 0b 0f 0b 0f 0b 0f 0b 00 00 cc cc 00 00 cc cc 48 c7 c7 18 e1 25
[    0.858282][    T1] RSP: 0018:ffff8ad701173c18 EFLAGS: 00010286
[    0.858624][    T1] RAX: 000000000000008c RBX: ffffffffb9d62f21 RCX: ffffffffba265fc0
[    0.859063][    T1] RDX: 0000000000000000 RSI: c000000100010ae5 RDI: 0000000000000002
[    0.859508][    T1] RBP: ffffffffba6fe214 R08: 0000000000000000 R09: ffffffffba27e250
[    0.859949][    T1] R10: 00000000ffffffff R11: 0000000100010ae5 R12: 0000000000000002
[    0.860386][    T1] R13: ffffffffb9d62f21 R14: 0000000000000001 R15: ffffffffb7f77fac
[    0.860831][    T1] FS:  0000000000000000(0000) GS:ffff8ad704a00000(0000) knlGS:0000000000000000
[    0.861323][    T1] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[    0.861690][    T1] CR2: ffff8ad707a01000 CR3: 0000000007210000 CR4: 0000000000350ff0
[    0.862130][    T1] Call Trace:
[    0.862311][    T1]  <TASK>
[    0.862472][    T1]  ? swap_writepage+0x1c/0x60
[    0.862734][    T1]  ? swap_writepage+0x2b/0x60
[    0.862990][    T1]  ? swap_writepage+0x1e/0x60
[    0.863247][    T1]  ? arch_jump_label_transform_queue+0x26/0x60
[    0.863592][    T1]  ? __jump_label_update+0x99/0x150
[    0.863880][    T1]  ? static_key_slow_inc_cpuslocked+0x4c/0x80
[    0.864214][    T1]  ? frontswap_register_ops+0x2c/0x40
[    0.864517][    T1]  ? init_zswap+0x19b/0x233
[    0.864770][    T1]  ? init_frontswap+0x9b/0x9b
[    0.865027][    T1]  ? do_one_initcall+0x120/0x2b0
[    0.865320][    T1]  ? do_initcall_level+0x7a/0xd8
[    0.865594][    T1]  ? do_initcalls+0x44/0x6b
[    0.865841][    T1]  ? kernel_init_freeable+0xd8/0x122
[    0.866131][    T1]  ? rest_init+0xc0/0xc0
[    0.866369][    T1]  ? kernel_init+0x11/0x1a0
[    0.866621][    T1]  ? ret_from_fork+0x22/0x30
[    0.866876][    T1]  </TASK>
[    0.867041][    T1] Modules linked in:
[    0.867290][    T1] ---[ end trace 0000000000000000 ]---

However, this only happens if I build the ZFS module into the kernel. (Of course, disabling CONFIG_ZSWAP helps, since that works around the jump label in question.)

Interesting how ZFS can trigger a codegen change, through LTO, in code that is seemingly completely unrelated.

This is probably related to openzfs/zfs#13549
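
For anyone reading the jump_label line in the log above, here is a minimal sketch that decodes it. It assumes the first byte string in the parentheses is what was found at the patch site, the second is what the kernel expected, and only the first `size` bytes are compared; that is my reading of arch/x86/kernel/jump_label.c, not something confirmed in this thread. The helper names `decode` and `classify` are made up for illustration.

```python
import re

# Matches the x86 "unexpected op" line in the form shown in the log above.
LINE_RE = re.compile(
    r"unexpected op at (?P<sym>\S+)\+(?P<off>0x[0-9a-f]+)/(?P<len>0x[0-9a-f]+)"
    r".*?\((?P<found>(?:[0-9a-f]{2} ?)+) != (?P<expect>(?:[0-9a-f]{2} ?)+)\)"
    r".*?size:(?P<size>\d+) type:(?P<type>\d+)"
)


def decode(line: str) -> dict:
    """Split the message into symbol, offset and the compared byte strings."""
    m = LINE_RE.search(line)
    if m is None:
        raise ValueError("not a jump_label 'unexpected op' line")
    size = int(m["size"])
    return {
        "symbol": m["sym"],
        "offset": int(m["off"], 16),
        "size": size,
        "type": int(m["type"]),
        # Assumption: only the first `size` bytes take part in the comparison.
        "found": bytes.fromhex(m["found"])[:size],
        "expected": bytes.fromhex(m["expect"])[:size],
    }


def classify(insn: bytes) -> str:
    """Name the two 2-byte encodings that matter here; anything else is 'unknown'."""
    if len(insn) == 2 and insn[0] == 0xEB:
        rel8 = int.from_bytes(insn[1:2], "little", signed=True)
        return f"short jmp (rel8={rel8:+d})"
    if insn == b"\x66\x90":
        return "2-byte nop"
    return "unknown"


LINE = ("jump_label: Fatal kernel bug, unexpected op at swap_writepage+0x1c/0x60 "
        "[(____ptrval____)] (eb 1c 48 89 df != 66 90 0f 1f 00)) size:2 type:1")

info = decode(LINE)
print(info["symbol"], hex(info["offset"]),
      "found:", classify(info["found"]),
      "expected:", classify(info["expected"]))
```

Under those assumptions, the site at swap_writepage+0x1c already contained a short jmp where the kernel expected to find a 2-byte nop before patching.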

@yshui
Member Author

yshui commented Oct 6, 2022

Hmm, there is another problem. I worked around this issue (by disabling ZSWAP), but the resulting kernel does not boot on real hardware. It does boot in QEMU, however.

@nickdesaulniers nickdesaulniers added [ARCH] x86_64 This bug impacts ARCH=x86_64 [FEATURE] LTO Related to building the kernel with LLVM Link Time Optimization [BUG] Untriaged Something isn't working labels Oct 6, 2022
@0n-s

0n-s commented Nov 24, 2022

Chiming in, as I was also able to reproduce this exact issue earlier, even with an LLVM toolchain built from Git sources that were only a business week old. (Some of the toolchains from that time had other problems, like miscompiling sha512_ssse3, but that is not really relevant here & is already fixed in newer revisions.)

This seems to no longer be reproducible. I've built a kernel with CONFIG_ZFS=y, CONFIG_JUMP_LABEL=y, & CONFIG_ZSWAP=y (let me know if you would like the exact dotconfig) with fairly fresh LLVM & Clang main, specifically bfc812a2f32698ef383d486c25fa6abc001d6466, with both full & thin LTO (I've been able to reproduce the same issue with full LTO as well before) & the kernel boots just fine on QEMU, Cloud Hypervisor, & several pieces of real HW (a smorgasbord of x86_64 stuff).

I'm not really sure which commit fixed it, but the commit I used no longer seems to be affected by this bug, at least in my (admittedly somewhat minimal) testing.

Versions of other things:

  • kernel: c3eb11fbb826879be773c137f281569efce67aa8
  • ZFS: b0657a59abb38659721bf8d973920292c4f4a1a8
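
To check whether a given dotconfig matches the combination described above, a small sketch like the following may help. `CONFIG_ZFS`, `CONFIG_JUMP_LABEL` and `CONFIG_ZSWAP` come from this comment; `CONFIG_LTO_CLANG_THIN` and `CONFIG_LTO_CLANG_FULL` are the upstream Kconfig symbols for the two LTO modes and are my addition, as are the function names.

```python
# Options named in the comment above; all three must be built in (=y).
REQUIRED = ("CONFIG_ZFS", "CONFIG_JUMP_LABEL", "CONFIG_ZSWAP")
# Assumption: upstream Kconfig symbols for thin and full Clang LTO.
LTO_MODES = ("CONFIG_LTO_CLANG_THIN", "CONFIG_LTO_CLANG_FULL")


def parse_config(text: str) -> dict:
    """Parse KEY=value lines of a kernel .config, skipping comments and blanks."""
    opts = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        if sep:
            opts[key] = value
    return opts


def matches_repro(opts: dict) -> tuple:
    """Return (matches, enabled LTO modes) for the combination in this issue."""
    builtin = all(opts.get(key) == "y" for key in REQUIRED)
    lto = [key for key in LTO_MODES if opts.get(key) == "y"]
    return builtin and bool(lto), lto


SAMPLE = """\
# CONFIG_LTO_CLANG_FULL is not set
CONFIG_LTO_CLANG_THIN=y
CONFIG_JUMP_LABEL=y
CONFIG_ZSWAP=y
CONFIG_ZFS=y
"""

print(matches_repro(parse_config(SAMPLE)))
```

To run it against a real build, read the file contents (for example the `.config` in the build directory) and pass them to `parse_config` in place of `SAMPLE`.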

@yshui
Member Author

yshui commented Nov 24, 2022

@0n-s thanks, I will try to reproduce it again later.

@KyunLFA

KyunLFA commented Oct 9, 2023

Hi.
I seem to be hitting this same issue, or an extremely similar one. Everything is the same except that, instead of ZFS, I am using another out-of-tree CoW filesystem, Bcachefs (testing branch).

It is built-in as well, with ZSWAP=y and JUMP_LABEL=y, on 6.6 git master.

I will try using Clang+LLVM 18 main to see if it's still reproducible there.

The main problem for me is that it does not produce any kernel output on boot, but compiling with CFI and LTO off reveals kernel errors very similar to the ones listed above, especially on non-clean unmounts.

FYI, there are parts of ZFS and Bcachefs that overlap, to the extent that there are this many issues with the keyword "bcachefs" (wild guess: might this have something to do with kmem_cache_alloc?) (openzfs/zfs#15143)

Kernel commit: b9ddbb0cde2adcedda26045cc58f31316a492215
LLVM/Clang version: 17.0.2 stable
Bcachefs commit: e1aae900a671cad3ed51c252a0dda0c7e8a89362
OS: Chimera Linux rolling
