-
Notifications
You must be signed in to change notification settings - Fork 6
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
scripts/pahole-flags.sh: Enable parallelization of pahole. #22
Conversation
Master branch: f76d850 |
f75a527
to
c30c5af
Compare
Master branch: 9b6eb04 |
c30c5af
to
e5e0c64
Compare
Pass a -j argument to pahole to parse DWARF and generate BTF with multithreading. v2 checks the version of pahole to apply -j only if >= v1.22. Signed-off-by: Kui-Feng Lee <kuifeng@fb.com> Acked-by: Yonghong Song <yhs@fb.com>
Master branch: 1b8c924 |
e5e0c64
to
6d1e1ee
Compare
Master branch: 9e98ace Pull request is NOT updated. Failed to apply https://patchwork.kernel.org/project/netdevbpf/list/?series=615083
conflict:
|
At least one diff in series https://patchwork.kernel.org/project/netdevbpf/list/?series=615083 irrelevant now. Closing PR. |
When bringing down the netdevice or system shutdown, a panic can be triggered while accessing the sysfs path because the device is already removed. [ 755.549084] mlx5_core 0000:12:00.1: Shutdown was called [ 756.404455] mlx5_core 0000:12:00.0: Shutdown was called ... [ 757.937260] BUG: unable to handle kernel NULL pointer dereference at (null) [ 758.031397] IP: [<ffffffff8ee11acb>] dma_pool_alloc+0x1ab/0x280 crash> bt ... PID: 12649 TASK: ffff8924108f2100 CPU: 1 COMMAND: "amsd" ... #9 [ffff89240e1a38b0] page_fault at ffffffff8f38c778 [exception RIP: dma_pool_alloc+0x1ab] RIP: ffffffff8ee11acb RSP: ffff89240e1a3968 RFLAGS: 00010046 RAX: 0000000000000246 RBX: ffff89243d874100 RCX: 0000000000001000 RDX: 0000000000000000 RSI: 0000000000000246 RDI: ffff89243d874090 RBP: ffff89240e1a39c0 R8: 000000000001f080 R9: ffff8905ffc03c00 R10: ffffffffc04680d4 R11: ffffffff8edde9fd R12: 00000000000080d0 R13: ffff89243d874090 R14: ffff89243d874080 R15: 0000000000000000 ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018 #10 [ffff89240e1a39c8] mlx5_alloc_cmd_msg at ffffffffc04680f3 [mlx5_core] #11 [ffff89240e1a3a18] cmd_exec at ffffffffc046ad62 [mlx5_core] #12 [ffff89240e1a3ab8] mlx5_cmd_exec at ffffffffc046b4fb [mlx5_core] #13 [ffff89240e1a3ae8] mlx5_core_access_reg at ffffffffc0475434 [mlx5_core] #14 [ffff89240e1a3b40] mlx5e_get_fec_caps at ffffffffc04a7348 [mlx5_core] #15 [ffff89240e1a3bb0] get_fec_supported_advertised at ffffffffc04992bf [mlx5_core] #16 [ffff89240e1a3c08] mlx5e_get_link_ksettings at ffffffffc049ab36 [mlx5_core] #17 [ffff89240e1a3ce8] __ethtool_get_link_ksettings at ffffffff8f25db46 #18 [ffff89240e1a3d48] speed_show at ffffffff8f277208 #19 [ffff89240e1a3dd8] dev_attr_show at ffffffff8f0b70e3 #20 [ffff89240e1a3df8] sysfs_kf_seq_show at ffffffff8eedbedf #21 [ffff89240e1a3e18] kernfs_seq_show at ffffffff8eeda596 #22 [ffff89240e1a3e28] seq_read at ffffffff8ee76d10 #23 [ffff89240e1a3e98] kernfs_fop_read at ffffffff8eedaef5 #24 [ffff89240e1a3ed8] vfs_read at ffffffff8ee4e3ff #25 [ffff89240e1a3f08] sys_read at ffffffff8ee4f27f #26 [ffff89240e1a3f50] system_call_fastpath at ffffffff8f395f92 crash> net_device.state ffff89443b0c0000 state = 0x5 (__LINK_STATE_START| __LINK_STATE_NOCARRIER) To prevent this scenario, we also make sure that the netdevice is present. Signed-off-by: suresh kumar <suresh2514@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
After waking up a suspended VM, the kernel prints the following trace for virtio drivers which do not directly call virtio_device_ready() in the .restore: PM: suspend exit irq 22: nobody cared (try booting with the "irqpoll" option) Call Trace: <IRQ> dump_stack_lvl+0x38/0x49 dump_stack+0x10/0x12 __report_bad_irq+0x3a/0xaf note_interrupt.cold+0xb/0x60 handle_irq_event+0x71/0x80 handle_fasteoi_irq+0x95/0x1e0 __common_interrupt+0x6b/0x110 common_interrupt+0x63/0xe0 asm_common_interrupt+0x1e/0x40 ? __do_softirq+0x75/0x2f3 irq_exit_rcu+0x93/0xe0 sysvec_apic_timer_interrupt+0xac/0xd0 </IRQ> <TASK> asm_sysvec_apic_timer_interrupt+0x12/0x20 arch_cpu_idle+0x12/0x20 default_idle_call+0x39/0xf0 do_idle+0x1b5/0x210 cpu_startup_entry+0x20/0x30 start_secondary+0xf3/0x100 secondary_startup_64_no_verify+0xc3/0xcb </TASK> handlers: [<000000008f9bac49>] vp_interrupt [<000000008f9bac49>] vp_interrupt Disabling IRQ #22 This happens because we don't invoke .enable_cbs callback in virtio_device_restore(). That callback is used by some transports (e.g. virtio-pci) to enable interrupts. Let's fix it, by calling virtio_device_ready() as we do in virtio_dev_probe(). This function calls .enable_cts callback and sets DRIVER_OK status bit. This fix also avoids setting DRIVER_OK twice for those drivers that call virtio_device_ready() in the .restore. Fixes: d50497e ("virtio_config: introduce a new .enable_cbs method") Signed-off-by: Stefano Garzarella <sgarzare@redhat.com> Link: https://lore.kernel.org/r/20220322114313.116516-1-sgarzare@redhat.com Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Tests for races between shinfo_cache (de)activation and hypercall+ioctl() processing. KVM has had bugs where activating the shared info cache multiple times and/or with concurrent users results in lock corruption, NULL pointer dereferences, and other fun. For the timer injection testcase (#22), re-arm the timer until the IRQ is successfully injected. If the timer expires while the shared info is deactivated (invalid), KVM will drop the event. Signed-off-by: Michal Luczaj <mhal@rbox.co> Co-developed-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20221013211234.1318131-16-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
When a system with E810 with existing VFs gets rebooted the following hang may be observed. Pid 1 is hung in iavf_remove(), part of a network driver: PID: 1 TASK: ffff965400e5a340 CPU: 24 COMMAND: "systemd-shutdow" #0 [ffffaad04005fa50] __schedule at ffffffff8b3239cb #1 [ffffaad04005fae8] schedule at ffffffff8b323e2d #2 [ffffaad04005fb00] schedule_hrtimeout_range_clock at ffffffff8b32cebc #3 [ffffaad04005fb80] usleep_range_state at ffffffff8b32c930 #4 [ffffaad04005fbb0] iavf_remove at ffffffffc12b9b4c [iavf] #5 [ffffaad04005fbf0] pci_device_remove at ffffffff8add7513 #6 [ffffaad04005fc10] device_release_driver_internal at ffffffff8af08baa #7 [ffffaad04005fc40] pci_stop_bus_device at ffffffff8adcc5fc #8 [ffffaad04005fc60] pci_stop_and_remove_bus_device at ffffffff8adcc81e #9 [ffffaad04005fc70] pci_iov_remove_virtfn at ffffffff8adf9429 #10 [ffffaad04005fca8] sriov_disable at ffffffff8adf98e4 #11 [ffffaad04005fcc8] ice_free_vfs at ffffffffc04bb2c8 [ice] #12 [ffffaad04005fd10] ice_remove at ffffffffc04778fe [ice] #13 [ffffaad04005fd38] ice_shutdown at ffffffffc0477946 [ice] #14 [ffffaad04005fd50] pci_device_shutdown at ffffffff8add58f1 #15 [ffffaad04005fd70] device_shutdown at ffffffff8af05386 #16 [ffffaad04005fd98] kernel_restart at ffffffff8a92a870 #17 [ffffaad04005fda8] __do_sys_reboot at ffffffff8a92abd6 #18 [ffffaad04005fee0] do_syscall_64 at ffffffff8b317159 #19 [ffffaad04005ff08] __context_tracking_enter at ffffffff8b31b6fc #20 [ffffaad04005ff18] syscall_exit_to_user_mode at ffffffff8b31b50d #21 [ffffaad04005ff28] do_syscall_64 at ffffffff8b317169 #22 [ffffaad04005ff50] entry_SYSCALL_64_after_hwframe at ffffffff8b40009b RIP: 00007f1baa5c13d7 RSP: 00007fffbcc55a98 RFLAGS: 00000202 RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f1baa5c13d7 RDX: 0000000001234567 RSI: 0000000028121969 RDI: 00000000fee1dead RBP: 00007fffbcc55ca0 R8: 0000000000000000 R9: 00007fffbcc54e90 R10: 00007fffbcc55050 R11: 0000000000000202 R12: 0000000000000005 R13: 0000000000000000 R14: 00007fffbcc55af0 R15: 0000000000000000 ORIG_RAX: 00000000000000a9 CS: 0033 SS: 002b During reboot all drivers PM shutdown callbacks are invoked. In iavf_shutdown() the adapter state is changed to __IAVF_REMOVE. In ice_shutdown() the call chain above is executed, which at some point calls iavf_remove(). However iavf_remove() expects the VF to be in one of the states __IAVF_RUNNING, __IAVF_DOWN or __IAVF_INIT_FAILED. If that's not the case it sleeps forever. So if iavf_shutdown() gets invoked before iavf_remove() the system will hang indefinitely because the adapter is already in state __IAVF_REMOVE. Fix this by returning from iavf_remove() if the state is __IAVF_REMOVE, as we already went through iavf_shutdown(). Fixes: 9745780 ("iavf: Add waiting so the port is initialized in remove") Fixes: a841733 ("iavf: Fix race condition between iavf_shutdown and iavf_remove") Reported-by: Marius Cornea <mcornea@redhat.com> Signed-off-by: Stefan Assmann <sassmann@kpanic.de> Reviewed-by: Michal Kubiak <michal.kubiak@intel.com> Tested-by: Rafal Romanowski <rafal.romanowski@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Add support precision backtracking in the presence of subprogram frames in jump history. This means supporting a few different kinds of subprogram invocation situations, all requiring a slightly different handling in precision backtracking handling logic: - static subprogram calls; - global subprogram calls; - callback-calling helpers/kfuncs. For each of those we need to handle a few precision propagation cases: - what to do with precision of subprog returns (r0); - what to do with precision of input arguments; - for all of them callee-saved registers in caller function should be propagated ignoring subprog/callback part of jump history. N.B. Async callback-calling helpers (currently only bpf_timer_set_callback()) are transparent to all this because they set a separate async callback environment and thus callback's history is not shared with main program's history. So as far as all the changes in this commit goes, such helper is just a regular helper. Let's look at all these situation in more details. Let's start with static subprogram being called, using an exxerpt of a simple main program and its static subprog, indenting subprog's frame slightly to make everything clear. frame 0 frame 1 precision set ======= ======= ============= 9: r6 = 456; 10: r1 = 123; r6 11: call pc+10; r1, r6 22: r0 = r1; r1 23: exit r0 12: r1 = <map_pointer> r0, r6 13: r1 += r0; r0, r6 14: r1 += r6; r6; 15: exit As can be seen above main function is passing 123 as single argument to an identity (`return x;`) subprog. Returned value is used to adjust map pointer offset, which forces r0 to be marked as precise. Then instruction #14 does the same for callee-saved r6, which will have to be backtracked all the way to instruction #9. For brevity, precision sets for instruction #13 and #14 are combined in the diagram above. First, for subprog calls, r0 returned from subprog (in frame 0) has to go into subprog's frame 1, and should be cleared from frame 0. So we go back into subprog's frame knowing we need to mark r0 precise. We then see that insn #22 sets r0 from r1, so now we care about marking r1 precise. When we pop up from subprog's frame back into caller at insn #11 we keep r1, as it's an argument-passing register, so we eventually find `10: r1 = 123;` and satify precision propagation chain for insn #13. This example demonstrates two sets of rules: - r0 returned after subprog call has to be moved into subprog's r0 set; - *static* subprog arguments (r1-r5) are moved back to caller precision set. Let's look at what happens with callee-saved precision propagation. Insn #14 mark r6 as precise. When we get into subprog's frame, we keep r6 in frame 0's precision set *only*. Subprog itself has its own set of independent r6-r10 registers and is not affected. When we eventually made our way out of subprog frame we keep r6 in precision set until we reach `9: r6 = 456;`, satisfying propagation. r6-r10 propagation is perhaps the simplest aspect, it always stays in its original frame. That's pretty much all we have to do to support precision propagation across *static subprog* invocation. Let's look at what happens when we have global subprog invocation. frame 0 frame 1 precision set ======= ======= ============= 9: r6 = 456; 10: r1 = 123; r6 11: call pc+10; # global subprog r6 12: r1 = <map_pointer> r0, r6 13: r1 += r0; r0, r6 14: r1 += r6; r6; 15: exit Starting from insn #13, r0 has to be precise. We backtrack all the way to insn #11 (call pc+10) and see that subprog is global, so was already validated in isolation. As opposed to static subprog, global subprog always returns unknown scalar r0, so that satisfies precision propagation and we drop r0 from precision set. We are done for insns #13. Now for insn #14. r6 is in precision set, we backtrack to `call pc+10;`. Here we need to recognize that this is effectively both exit and entry to global subprog, which means we stay in caller's frame. So we carry on with r6 still in precision set, until we satisfy it at insn #9. The only hard part with global subprogs is just knowing when it's a global func. Lastly, callback-calling helpers and kfuncs do simulate subprog calls, so jump history will have subprog instructions in between caller program's instructions, but the rules of propagating r0 and r1-r5 differ, because we don't actually directly call callback. We actually call helper/kfunc, which at runtime will call subprog, so the only difference between normal helper/kfunc handling is that we need to make sure to skip callback simulatinog part of jump history. Let's look at an example to make this clearer. frame 0 frame 1 precision set ======= ======= ============= 8: r6 = 456; 9: r1 = 123; r6 10: r2 = &callback; r6 11: call bpf_loop; r6 22: r0 = r1; 23: exit 12: r1 = <map_pointer> r0, r6 13: r1 += r0; r0, r6 14: r1 += r6; r6; 15: exit Again, insn #13 forces r0 to be precise. As soon as we get to `23: exit` we see that this isn't actually a static subprog call (it's `call bpf_loop;` helper call instead). So we clear r0 from precision set. For callee-saved register, there is no difference: it stays in frame 0's precision set, we go through insn #22 and #23, ignoring them until we get back to caller frame 0, eventually satisfying precision backtrack logic at insn #8 (`r6 = 456;`). Assuming callback needed to set r0 as precise at insn #23, we'd backtrack to insn #22, switching from r0 to r1, and then at the point when we pop back to frame 0 at insn #11, we'll clear r1-r5 from precision set, as we don't really do a subprog call directly, so there is no input argument precision propagation. That's pretty much it. With these changes, it seems like the only still unsupported situation for precision backpropagation is the case when program is accessing stack through registers other than r10. This is still left as unsupported (though rare) case for now. As for results. For selftests, few positive changes for bigger programs, cls_redirect in dynptr variant benefitting the most: [vmuser@archvm bpf]$ ./veristat -C ~/subprog-precise-before-results.csv ~/subprog-precise-after-results.csv -f @veristat.cfg -e file,prog,insns -f 'insns_diff!=0' File Program Insns (A) Insns (B) Insns (DIFF) ---------------------------------------- ------------- --------- --------- ---------------- pyperf600_bpf_loop.bpf.linked1.o on_event 2060 2002 -58 (-2.82%) test_cls_redirect_dynptr.bpf.linked1.o cls_redirect 15660 2914 -12746 (-81.39%) test_cls_redirect_subprogs.bpf.linked1.o cls_redirect 61620 59088 -2532 (-4.11%) xdp_synproxy_kern.bpf.linked1.o syncookie_tc 109980 86278 -23702 (-21.55%) xdp_synproxy_kern.bpf.linked1.o syncookie_xdp 97716 85147 -12569 (-12.86%) Cilium progress don't really regress. They don't use subprogs and are mostly unaffected, but some other fixes and improvements could have changed something. This doesn't appear to be the case: [vmuser@archvm bpf]$ ./veristat -C ~/subprog-precise-before-results-cilium.csv ~/subprog-precise-after-results-cilium.csv -e file,prog,insns -f 'insns_diff!=0' File Program Insns (A) Insns (B) Insns (DIFF) ------------- ------------------------------ --------- --------- ------------ bpf_host.o tail_nodeport_nat_ingress_ipv6 4983 5003 +20 (+0.40%) bpf_lxc.o tail_nodeport_nat_ingress_ipv6 4983 5003 +20 (+0.40%) bpf_overlay.o tail_nodeport_nat_ingress_ipv6 4983 5003 +20 (+0.40%) bpf_xdp.o tail_handle_nat_fwd_ipv6 12475 12504 +29 (+0.23%) bpf_xdp.o tail_nodeport_nat_ingress_ipv6 6363 6371 +8 (+0.13%) Looking at (somewhat anonymized) Meta production programs, we see mostly insignificant variation in number of instructions, with one program (syar_bind6_protect6) benefitting the most at -17%. [vmuser@archvm bpf]$ ./veristat -C ~/subprog-precise-before-results-fbcode.csv ~/subprog-precise-after-results-fbcode.csv -e prog,insns -f 'insns_diff!=0' Program Insns (A) Insns (B) Insns (DIFF) ------------------------ --------- --------- ---------------- on_request_context_event 597 585 -12 (-2.01%) read_async_py_stack 43789 43657 -132 (-0.30%) read_sync_py_stack 35041 37599 +2558 (+7.30%) rrm_usdt 946 940 -6 (-0.63%) sysarmor_inet6_bind 28863 28249 -614 (-2.13%) sysarmor_inet_bind 28845 28240 -605 (-2.10%) syar_bind4_protect4 154145 147640 -6505 (-4.22%) syar_bind6_protect6 165242 137088 -28154 (-17.04%) syar_task_exit_setgid 21289 19720 -1569 (-7.37%) syar_task_exit_setuid 21290 19721 -1569 (-7.37%) do_uprobe 19967 19413 -554 (-2.77%) tw_twfw_ingress 215877 204833 -11044 (-5.12%) tw_twfw_tc_in 215877 204833 -11044 (-5.12%) But checking duration (wall clock) differences, that is the actual time taken by verifier to validate programs, we see a sometimes dramatic improvements, all the way to about 16x improvements: [vmuser@archvm bpf]$ ./veristat -C ~/subprog-precise-before-results-meta.csv ~/subprog-precise-after-results-meta.csv -e prog,duration -s duration_diff^ | head -n20 Program Duration (us) (A) Duration (us) (B) Duration (us) (DIFF) ---------------------------------------- ----------------- ----------------- -------------------- tw_twfw_ingress 4488374 272836 -4215538 (-93.92%) tw_twfw_tc_in 4339111 268175 -4070936 (-93.82%) tw_twfw_egress 3521816 270751 -3251065 (-92.31%) tw_twfw_tc_eg 3472878 284294 -3188584 (-91.81%) balancer_ingress 343119 291391 -51728 (-15.08%) syar_bind6_protect6 78992 64782 -14210 (-17.99%) ttls_tc_ingress 11739 8176 -3563 (-30.35%) kprobe__security_inode_link 13864 11341 -2523 (-18.20%) read_sync_py_stack 21927 19442 -2485 (-11.33%) read_async_py_stack 30444 28136 -2308 (-7.58%) syar_task_exit_setuid 10256 8440 -1816 (-17.71%) Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Add support precision backtracking in the presence of subprogram frames in jump history. This means supporting a few different kinds of subprogram invocation situations, all requiring a slightly different handling in precision backtracking handling logic: - static subprogram calls; - global subprogram calls; - callback-calling helpers/kfuncs. For each of those we need to handle a few precision propagation cases: - what to do with precision of subprog returns (r0); - what to do with precision of input arguments; - for all of them callee-saved registers in caller function should be propagated ignoring subprog/callback part of jump history. N.B. Async callback-calling helpers (currently only bpf_timer_set_callback()) are transparent to all this because they set a separate async callback environment and thus callback's history is not shared with main program's history. So as far as all the changes in this commit goes, such helper is just a regular helper. Let's look at all these situation in more details. Let's start with static subprogram being called, using an exxerpt of a simple main program and its static subprog, indenting subprog's frame slightly to make everything clear. frame 0 frame 1 precision set ======= ======= ============= 9: r6 = 456; 10: r1 = 123; r6 11: call pc+10; r1, r6 22: r0 = r1; r1 23: exit r0 12: r1 = <map_pointer> r0, r6 13: r1 += r0; r0, r6 14: r1 += r6; r6; 15: exit As can be seen above main function is passing 123 as single argument to an identity (`return x;`) subprog. Returned value is used to adjust map pointer offset, which forces r0 to be marked as precise. Then instruction #14 does the same for callee-saved r6, which will have to be backtracked all the way to instruction #9. For brevity, precision sets for instruction #13 and #14 are combined in the diagram above. First, for subprog calls, r0 returned from subprog (in frame 0) has to go into subprog's frame 1, and should be cleared from frame 0. So we go back into subprog's frame knowing we need to mark r0 precise. We then see that insn #22 sets r0 from r1, so now we care about marking r1 precise. When we pop up from subprog's frame back into caller at insn #11 we keep r1, as it's an argument-passing register, so we eventually find `10: r1 = 123;` and satify precision propagation chain for insn #13. This example demonstrates two sets of rules: - r0 returned after subprog call has to be moved into subprog's r0 set; - *static* subprog arguments (r1-r5) are moved back to caller precision set. Let's look at what happens with callee-saved precision propagation. Insn #14 mark r6 as precise. When we get into subprog's frame, we keep r6 in frame 0's precision set *only*. Subprog itself has its own set of independent r6-r10 registers and is not affected. When we eventually made our way out of subprog frame we keep r6 in precision set until we reach `9: r6 = 456;`, satisfying propagation. r6-r10 propagation is perhaps the simplest aspect, it always stays in its original frame. That's pretty much all we have to do to support precision propagation across *static subprog* invocation. Let's look at what happens when we have global subprog invocation. frame 0 frame 1 precision set ======= ======= ============= 9: r6 = 456; 10: r1 = 123; r6 11: call pc+10; # global subprog r6 12: r1 = <map_pointer> r0, r6 13: r1 += r0; r0, r6 14: r1 += r6; r6; 15: exit Starting from insn #13, r0 has to be precise. We backtrack all the way to insn #11 (call pc+10) and see that subprog is global, so was already validated in isolation. As opposed to static subprog, global subprog always returns unknown scalar r0, so that satisfies precision propagation and we drop r0 from precision set. We are done for insns #13. Now for insn #14. r6 is in precision set, we backtrack to `call pc+10;`. Here we need to recognize that this is effectively both exit and entry to global subprog, which means we stay in caller's frame. So we carry on with r6 still in precision set, until we satisfy it at insn #9. The only hard part with global subprogs is just knowing when it's a global func. Lastly, callback-calling helpers and kfuncs do simulate subprog calls, so jump history will have subprog instructions in between caller program's instructions, but the rules of propagating r0 and r1-r5 differ, because we don't actually directly call callback. We actually call helper/kfunc, which at runtime will call subprog, so the only difference between normal helper/kfunc handling is that we need to make sure to skip callback simulatinog part of jump history. Let's look at an example to make this clearer. frame 0 frame 1 precision set ======= ======= ============= 8: r6 = 456; 9: r1 = 123; r6 10: r2 = &callback; r6 11: call bpf_loop; r6 22: r0 = r1; 23: exit 12: r1 = <map_pointer> r0, r6 13: r1 += r0; r0, r6 14: r1 += r6; r6; 15: exit Again, insn #13 forces r0 to be precise. As soon as we get to `23: exit` we see that this isn't actually a static subprog call (it's `call bpf_loop;` helper call instead). So we clear r0 from precision set. For callee-saved register, there is no difference: it stays in frame 0's precision set, we go through insn #22 and #23, ignoring them until we get back to caller frame 0, eventually satisfying precision backtrack logic at insn #8 (`r6 = 456;`). Assuming callback needed to set r0 as precise at insn #23, we'd backtrack to insn #22, switching from r0 to r1, and then at the point when we pop back to frame 0 at insn #11, we'll clear r1-r5 from precision set, as we don't really do a subprog call directly, so there is no input argument precision propagation. That's pretty much it. With these changes, it seems like the only still unsupported situation for precision backpropagation is the case when program is accessing stack through registers other than r10. This is still left as unsupported (though rare) case for now. As for results. For selftests, few positive changes for bigger programs, cls_redirect in dynptr variant benefitting the most: [vmuser@archvm bpf]$ ./veristat -C ~/subprog-precise-before-results.csv ~/subprog-precise-after-results.csv -f @veristat.cfg -e file,prog,insns -f 'insns_diff!=0' File Program Insns (A) Insns (B) Insns (DIFF) ---------------------------------------- ------------- --------- --------- ---------------- pyperf600_bpf_loop.bpf.linked1.o on_event 2060 2002 -58 (-2.82%) test_cls_redirect_dynptr.bpf.linked1.o cls_redirect 15660 2914 -12746 (-81.39%) test_cls_redirect_subprogs.bpf.linked1.o cls_redirect 61620 59088 -2532 (-4.11%) xdp_synproxy_kern.bpf.linked1.o syncookie_tc 109980 86278 -23702 (-21.55%) xdp_synproxy_kern.bpf.linked1.o syncookie_xdp 97716 85147 -12569 (-12.86%) Cilium progress don't really regress. They don't use subprogs and are mostly unaffected, but some other fixes and improvements could have changed something. This doesn't appear to be the case: [vmuser@archvm bpf]$ ./veristat -C ~/subprog-precise-before-results-cilium.csv ~/subprog-precise-after-results-cilium.csv -e file,prog,insns -f 'insns_diff!=0' File Program Insns (A) Insns (B) Insns (DIFF) ------------- ------------------------------ --------- --------- ------------ bpf_host.o tail_nodeport_nat_ingress_ipv6 4983 5003 +20 (+0.40%) bpf_lxc.o tail_nodeport_nat_ingress_ipv6 4983 5003 +20 (+0.40%) bpf_overlay.o tail_nodeport_nat_ingress_ipv6 4983 5003 +20 (+0.40%) bpf_xdp.o tail_handle_nat_fwd_ipv6 12475 12504 +29 (+0.23%) bpf_xdp.o tail_nodeport_nat_ingress_ipv6 6363 6371 +8 (+0.13%) Looking at (somewhat anonymized) Meta production programs, we see mostly insignificant variation in number of instructions, with one program (syar_bind6_protect6) benefitting the most at -17%. [vmuser@archvm bpf]$ ./veristat -C ~/subprog-precise-before-results-fbcode.csv ~/subprog-precise-after-results-fbcode.csv -e prog,insns -f 'insns_diff!=0' Program Insns (A) Insns (B) Insns (DIFF) ------------------------ --------- --------- ---------------- on_request_context_event 597 585 -12 (-2.01%) read_async_py_stack 43789 43657 -132 (-0.30%) read_sync_py_stack 35041 37599 +2558 (+7.30%) rrm_usdt 946 940 -6 (-0.63%) sysarmor_inet6_bind 28863 28249 -614 (-2.13%) sysarmor_inet_bind 28845 28240 -605 (-2.10%) syar_bind4_protect4 154145 147640 -6505 (-4.22%) syar_bind6_protect6 165242 137088 -28154 (-17.04%) syar_task_exit_setgid 21289 19720 -1569 (-7.37%) syar_task_exit_setuid 21290 19721 -1569 (-7.37%) do_uprobe 19967 19413 -554 (-2.77%) tw_twfw_ingress 215877 204833 -11044 (-5.12%) tw_twfw_tc_in 215877 204833 -11044 (-5.12%) But checking duration (wall clock) differences, that is the actual time taken by verifier to validate programs, we see a sometimes dramatic improvements, all the way to about 16x improvements: [vmuser@archvm bpf]$ ./veristat -C ~/subprog-precise-before-results-meta.csv ~/subprog-precise-after-results-meta.csv -e prog,duration -s duration_diff^ | head -n20 Program Duration (us) (A) Duration (us) (B) Duration (us) (DIFF) ---------------------------------------- ----------------- ----------------- -------------------- tw_twfw_ingress 4488374 272836 -4215538 (-93.92%) tw_twfw_tc_in 4339111 268175 -4070936 (-93.82%) tw_twfw_egress 3521816 270751 -3251065 (-92.31%) tw_twfw_tc_eg 3472878 284294 -3188584 (-91.81%) balancer_ingress 343119 291391 -51728 (-15.08%) syar_bind6_protect6 78992 64782 -14210 (-17.99%) ttls_tc_ingress 11739 8176 -3563 (-30.35%) kprobe__security_inode_link 13864 11341 -2523 (-18.20%) read_sync_py_stack 21927 19442 -2485 (-11.33%) read_async_py_stack 30444 28136 -2308 (-7.58%) syar_task_exit_setuid 10256 8440 -1816 (-17.71%) Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Add support precision backtracking in the presence of subprogram frames in jump history. This means supporting a few different kinds of subprogram invocation situations, all requiring a slightly different handling in precision backtracking handling logic: - static subprogram calls; - global subprogram calls; - callback-calling helpers/kfuncs. For each of those we need to handle a few precision propagation cases: - what to do with precision of subprog returns (r0); - what to do with precision of input arguments; - for all of them callee-saved registers in caller function should be propagated ignoring subprog/callback part of jump history. N.B. Async callback-calling helpers (currently only bpf_timer_set_callback()) are transparent to all this because they set a separate async callback environment and thus callback's history is not shared with main program's history. So as far as all the changes in this commit goes, such helper is just a regular helper. Let's look at all these situation in more details. Let's start with static subprogram being called, using an exxerpt of a simple main program and its static subprog, indenting subprog's frame slightly to make everything clear. frame 0 frame 1 precision set ======= ======= ============= 9: r6 = 456; 10: r1 = 123; r6 11: call pc+10; r1, r6 22: r0 = r1; r1 23: exit r0 12: r1 = <map_pointer> r0, r6 13: r1 += r0; r0, r6 14: r1 += r6; r6; 15: exit As can be seen above main function is passing 123 as single argument to an identity (`return x;`) subprog. Returned value is used to adjust map pointer offset, which forces r0 to be marked as precise. Then instruction #14 does the same for callee-saved r6, which will have to be backtracked all the way to instruction #9. For brevity, precision sets for instruction #13 and #14 are combined in the diagram above. First, for subprog calls, r0 returned from subprog (in frame 0) has to go into subprog's frame 1, and should be cleared from frame 0. So we go back into subprog's frame knowing we need to mark r0 precise. We then see that insn #22 sets r0 from r1, so now we care about marking r1 precise. When we pop up from subprog's frame back into caller at insn #11 we keep r1, as it's an argument-passing register, so we eventually find `10: r1 = 123;` and satify precision propagation chain for insn #13. This example demonstrates two sets of rules: - r0 returned after subprog call has to be moved into subprog's r0 set; - *static* subprog arguments (r1-r5) are moved back to caller precision set. Let's look at what happens with callee-saved precision propagation. Insn #14 mark r6 as precise. When we get into subprog's frame, we keep r6 in frame 0's precision set *only*. Subprog itself has its own set of independent r6-r10 registers and is not affected. When we eventually made our way out of subprog frame we keep r6 in precision set until we reach `9: r6 = 456;`, satisfying propagation. r6-r10 propagation is perhaps the simplest aspect, it always stays in its original frame. That's pretty much all we have to do to support precision propagation across *static subprog* invocation. Let's look at what happens when we have global subprog invocation. frame 0 frame 1 precision set ======= ======= ============= 9: r6 = 456; 10: r1 = 123; r6 11: call pc+10; # global subprog r6 12: r1 = <map_pointer> r0, r6 13: r1 += r0; r0, r6 14: r1 += r6; r6; 15: exit Starting from insn #13, r0 has to be precise. We backtrack all the way to insn #11 (call pc+10) and see that subprog is global, so was already validated in isolation. As opposed to static subprog, global subprog always returns unknown scalar r0, so that satisfies precision propagation and we drop r0 from precision set. We are done for insns #13. Now for insn #14. r6 is in precision set, we backtrack to `call pc+10;`. Here we need to recognize that this is effectively both exit and entry to global subprog, which means we stay in caller's frame. So we carry on with r6 still in precision set, until we satisfy it at insn #9. The only hard part with global subprogs is just knowing when it's a global func. Lastly, callback-calling helpers and kfuncs do simulate subprog calls, so jump history will have subprog instructions in between caller program's instructions, but the rules of propagating r0 and r1-r5 differ, because we don't actually directly call callback. We actually call helper/kfunc, which at runtime will call subprog, so the only difference between normal helper/kfunc handling is that we need to make sure to skip callback simulatinog part of jump history. Let's look at an example to make this clearer. frame 0 frame 1 precision set ======= ======= ============= 8: r6 = 456; 9: r1 = 123; r6 10: r2 = &callback; r6 11: call bpf_loop; r6 22: r0 = r1; 23: exit 12: r1 = <map_pointer> r0, r6 13: r1 += r0; r0, r6 14: r1 += r6; r6; 15: exit Again, insn #13 forces r0 to be precise. As soon as we get to `23: exit` we see that this isn't actually a static subprog call (it's `call bpf_loop;` helper call instead). So we clear r0 from precision set. For callee-saved register, there is no difference: it stays in frame 0's precision set, we go through insn #22 and #23, ignoring them until we get back to caller frame 0, eventually satisfying precision backtrack logic at insn #8 (`r6 = 456;`). Assuming callback needed to set r0 as precise at insn #23, we'd backtrack to insn #22, switching from r0 to r1, and then at the point when we pop back to frame 0 at insn #11, we'll clear r1-r5 from precision set, as we don't really do a subprog call directly, so there is no input argument precision propagation. That's pretty much it. With these changes, it seems like the only still unsupported situation for precision backpropagation is the case when program is accessing stack through registers other than r10. This is still left as unsupported (though rare) case for now. As for results. For selftests, few positive changes for bigger programs, cls_redirect in dynptr variant benefitting the most: [vmuser@archvm bpf]$ ./veristat -C ~/subprog-precise-before-results.csv ~/subprog-precise-after-results.csv -f @veristat.cfg -e file,prog,insns -f 'insns_diff!=0' File Program Insns (A) Insns (B) Insns (DIFF) ---------------------------------------- ------------- --------- --------- ---------------- pyperf600_bpf_loop.bpf.linked1.o on_event 2060 2002 -58 (-2.82%) test_cls_redirect_dynptr.bpf.linked1.o cls_redirect 15660 2914 -12746 (-81.39%) test_cls_redirect_subprogs.bpf.linked1.o cls_redirect 61620 59088 -2532 (-4.11%) xdp_synproxy_kern.bpf.linked1.o syncookie_tc 109980 86278 -23702 (-21.55%) xdp_synproxy_kern.bpf.linked1.o syncookie_xdp 97716 85147 -12569 (-12.86%) Cilium progress don't really regress. They don't use subprogs and are mostly unaffected, but some other fixes and improvements could have changed something. This doesn't appear to be the case: [vmuser@archvm bpf]$ ./veristat -C ~/subprog-precise-before-results-cilium.csv ~/subprog-precise-after-results-cilium.csv -e file,prog,insns -f 'insns_diff!=0' File Program Insns (A) Insns (B) Insns (DIFF) ------------- ------------------------------ --------- --------- ------------ bpf_host.o tail_nodeport_nat_ingress_ipv6 4983 5003 +20 (+0.40%) bpf_lxc.o tail_nodeport_nat_ingress_ipv6 4983 5003 +20 (+0.40%) bpf_overlay.o tail_nodeport_nat_ingress_ipv6 4983 5003 +20 (+0.40%) bpf_xdp.o tail_handle_nat_fwd_ipv6 12475 12504 +29 (+0.23%) bpf_xdp.o tail_nodeport_nat_ingress_ipv6 6363 6371 +8 (+0.13%) Looking at (somewhat anonymized) Meta production programs, we see mostly insignificant variation in number of instructions, with one program (syar_bind6_protect6) benefitting the most at -17%. [vmuser@archvm bpf]$ ./veristat -C ~/subprog-precise-before-results-fbcode.csv ~/subprog-precise-after-results-fbcode.csv -e prog,insns -f 'insns_diff!=0' Program Insns (A) Insns (B) Insns (DIFF) ------------------------ --------- --------- ---------------- on_request_context_event 597 585 -12 (-2.01%) read_async_py_stack 43789 43657 -132 (-0.30%) read_sync_py_stack 35041 37599 +2558 (+7.30%) rrm_usdt 946 940 -6 (-0.63%) sysarmor_inet6_bind 28863 28249 -614 (-2.13%) sysarmor_inet_bind 28845 28240 -605 (-2.10%) syar_bind4_protect4 154145 147640 -6505 (-4.22%) syar_bind6_protect6 165242 137088 -28154 (-17.04%) syar_task_exit_setgid 21289 19720 -1569 (-7.37%) syar_task_exit_setuid 21290 19721 -1569 (-7.37%) do_uprobe 19967 19413 -554 (-2.77%) tw_twfw_ingress 215877 204833 -11044 (-5.12%) tw_twfw_tc_in 215877 204833 -11044 (-5.12%) But checking duration (wall clock) differences, that is the actual time taken by verifier to validate programs, we see a sometimes dramatic improvements, all the way to about 16x improvements: [vmuser@archvm bpf]$ ./veristat -C ~/subprog-precise-before-results-meta.csv ~/subprog-precise-after-results-meta.csv -e prog,duration -s duration_diff^ | head -n20 Program Duration (us) (A) Duration (us) (B) Duration (us) (DIFF) ---------------------------------------- ----------------- ----------------- -------------------- tw_twfw_ingress 4488374 272836 -4215538 (-93.92%) tw_twfw_tc_in 4339111 268175 -4070936 (-93.82%) tw_twfw_egress 3521816 270751 -3251065 (-92.31%) tw_twfw_tc_eg 3472878 284294 -3188584 (-91.81%) balancer_ingress 343119 291391 -51728 (-15.08%) syar_bind6_protect6 78992 64782 -14210 (-17.99%) ttls_tc_ingress 11739 8176 -3563 (-30.35%) kprobe__security_inode_link 13864 11341 -2523 (-18.20%) read_sync_py_stack 21927 19442 -2485 (-11.33%) read_async_py_stack 30444 28136 -2308 (-7.58%) syar_task_exit_setuid 10256 8440 -1816 (-17.71%) Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Add support precision backtracking in the presence of subprogram frames in jump history. This means supporting a few different kinds of subprogram invocation situations, all requiring a slightly different handling in precision backtracking handling logic: - static subprogram calls; - global subprogram calls; - callback-calling helpers/kfuncs. For each of those we need to handle a few precision propagation cases: - what to do with precision of subprog returns (r0); - what to do with precision of input arguments; - for all of them callee-saved registers in caller function should be propagated ignoring subprog/callback part of jump history. N.B. Async callback-calling helpers (currently only bpf_timer_set_callback()) are transparent to all this because they set a separate async callback environment and thus callback's history is not shared with main program's history. So as far as all the changes in this commit goes, such helper is just a regular helper. Let's look at all these situation in more details. Let's start with static subprogram being called, using an exxerpt of a simple main program and its static subprog, indenting subprog's frame slightly to make everything clear. frame 0 frame 1 precision set ======= ======= ============= 9: r6 = 456; 10: r1 = 123; r6 11: call pc+10; r1, r6 22: r0 = r1; r1 23: exit r0 12: r1 = <map_pointer> r0, r6 13: r1 += r0; r0, r6 14: r1 += r6; r6; 15: exit As can be seen above main function is passing 123 as single argument to an identity (`return x;`) subprog. Returned value is used to adjust map pointer offset, which forces r0 to be marked as precise. Then instruction #14 does the same for callee-saved r6, which will have to be backtracked all the way to instruction #9. For brevity, precision sets for instruction #13 and #14 are combined in the diagram above. First, for subprog calls, r0 returned from subprog (in frame 0) has to go into subprog's frame 1, and should be cleared from frame 0. So we go back into subprog's frame knowing we need to mark r0 precise. We then see that insn #22 sets r0 from r1, so now we care about marking r1 precise. When we pop up from subprog's frame back into caller at insn #11 we keep r1, as it's an argument-passing register, so we eventually find `10: r1 = 123;` and satify precision propagation chain for insn #13. This example demonstrates two sets of rules: - r0 returned after subprog call has to be moved into subprog's r0 set; - *static* subprog arguments (r1-r5) are moved back to caller precision set. Let's look at what happens with callee-saved precision propagation. Insn #14 mark r6 as precise. When we get into subprog's frame, we keep r6 in frame 0's precision set *only*. Subprog itself has its own set of independent r6-r10 registers and is not affected. When we eventually made our way out of subprog frame we keep r6 in precision set until we reach `9: r6 = 456;`, satisfying propagation. r6-r10 propagation is perhaps the simplest aspect, it always stays in its original frame. That's pretty much all we have to do to support precision propagation across *static subprog* invocation. Let's look at what happens when we have global subprog invocation. frame 0 frame 1 precision set ======= ======= ============= 9: r6 = 456; 10: r1 = 123; r6 11: call pc+10; # global subprog r6 12: r1 = <map_pointer> r0, r6 13: r1 += r0; r0, r6 14: r1 += r6; r6; 15: exit Starting from insn #13, r0 has to be precise. We backtrack all the way to insn #11 (call pc+10) and see that subprog is global, so was already validated in isolation. As opposed to static subprog, global subprog always returns unknown scalar r0, so that satisfies precision propagation and we drop r0 from precision set. We are done for insns #13. Now for insn #14. r6 is in precision set, we backtrack to `call pc+10;`. Here we need to recognize that this is effectively both exit and entry to global subprog, which means we stay in caller's frame. So we carry on with r6 still in precision set, until we satisfy it at insn #9. The only hard part with global subprogs is just knowing when it's a global func. Lastly, callback-calling helpers and kfuncs do simulate subprog calls, so jump history will have subprog instructions in between caller program's instructions, but the rules of propagating r0 and r1-r5 differ, because we don't actually directly call callback. We actually call helper/kfunc, which at runtime will call subprog, so the only difference between normal helper/kfunc handling is that we need to make sure to skip callback simulatinog part of jump history. Let's look at an example to make this clearer. frame 0 frame 1 precision set ======= ======= ============= 8: r6 = 456; 9: r1 = 123; r6 10: r2 = &callback; r6 11: call bpf_loop; r6 22: r0 = r1; 23: exit 12: r1 = <map_pointer> r0, r6 13: r1 += r0; r0, r6 14: r1 += r6; r6; 15: exit Again, insn #13 forces r0 to be precise. As soon as we get to `23: exit` we see that this isn't actually a static subprog call (it's `call bpf_loop;` helper call instead). So we clear r0 from precision set. For callee-saved register, there is no difference: it stays in frame 0's precision set, we go through insn #22 and #23, ignoring them until we get back to caller frame 0, eventually satisfying precision backtrack logic at insn #8 (`r6 = 456;`). Assuming callback needed to set r0 as precise at insn #23, we'd backtrack to insn #22, switching from r0 to r1, and then at the point when we pop back to frame 0 at insn #11, we'll clear r1-r5 from precision set, as we don't really do a subprog call directly, so there is no input argument precision propagation. That's pretty much it. With these changes, it seems like the only still unsupported situation for precision backpropagation is the case when program is accessing stack through registers other than r10. This is still left as unsupported (though rare) case for now. As for results. For selftests, few positive changes for bigger programs, cls_redirect in dynptr variant benefitting the most: [vmuser@archvm bpf]$ ./veristat -C ~/subprog-precise-before-results.csv ~/subprog-precise-after-results.csv -f @veristat.cfg -e file,prog,insns -f 'insns_diff!=0' File Program Insns (A) Insns (B) Insns (DIFF) ---------------------------------------- ------------- --------- --------- ---------------- pyperf600_bpf_loop.bpf.linked1.o on_event 2060 2002 -58 (-2.82%) test_cls_redirect_dynptr.bpf.linked1.o cls_redirect 15660 2914 -12746 (-81.39%) test_cls_redirect_subprogs.bpf.linked1.o cls_redirect 61620 59088 -2532 (-4.11%) xdp_synproxy_kern.bpf.linked1.o syncookie_tc 109980 86278 -23702 (-21.55%) xdp_synproxy_kern.bpf.linked1.o syncookie_xdp 97716 85147 -12569 (-12.86%) Cilium progress don't really regress. They don't use subprogs and are mostly unaffected, but some other fixes and improvements could have changed something. This doesn't appear to be the case: [vmuser@archvm bpf]$ ./veristat -C ~/subprog-precise-before-results-cilium.csv ~/subprog-precise-after-results-cilium.csv -e file,prog,insns -f 'insns_diff!=0' File Program Insns (A) Insns (B) Insns (DIFF) ------------- ------------------------------ --------- --------- ------------ bpf_host.o tail_nodeport_nat_ingress_ipv6 4983 5003 +20 (+0.40%) bpf_lxc.o tail_nodeport_nat_ingress_ipv6 4983 5003 +20 (+0.40%) bpf_overlay.o tail_nodeport_nat_ingress_ipv6 4983 5003 +20 (+0.40%) bpf_xdp.o tail_handle_nat_fwd_ipv6 12475 12504 +29 (+0.23%) bpf_xdp.o tail_nodeport_nat_ingress_ipv6 6363 6371 +8 (+0.13%) Looking at (somewhat anonymized) Meta production programs, we see mostly insignificant variation in number of instructions, with one program (syar_bind6_protect6) benefitting the most at -17%. [vmuser@archvm bpf]$ ./veristat -C ~/subprog-precise-before-results-fbcode.csv ~/subprog-precise-after-results-fbcode.csv -e prog,insns -f 'insns_diff!=0' Program Insns (A) Insns (B) Insns (DIFF) ------------------------ --------- --------- ---------------- on_request_context_event 597 585 -12 (-2.01%) read_async_py_stack 43789 43657 -132 (-0.30%) read_sync_py_stack 35041 37599 +2558 (+7.30%) rrm_usdt 946 940 -6 (-0.63%) sysarmor_inet6_bind 28863 28249 -614 (-2.13%) sysarmor_inet_bind 28845 28240 -605 (-2.10%) syar_bind4_protect4 154145 147640 -6505 (-4.22%) syar_bind6_protect6 165242 137088 -28154 (-17.04%) syar_task_exit_setgid 21289 19720 -1569 (-7.37%) syar_task_exit_setuid 21290 19721 -1569 (-7.37%) do_uprobe 19967 19413 -554 (-2.77%) tw_twfw_ingress 215877 204833 -11044 (-5.12%) tw_twfw_tc_in 215877 204833 -11044 (-5.12%) But checking duration (wall clock) differences, that is the actual time taken by verifier to validate programs, we see a sometimes dramatic improvements, all the way to about 16x improvements: [vmuser@archvm bpf]$ ./veristat -C ~/subprog-precise-before-results-meta.csv ~/subprog-precise-after-results-meta.csv -e prog,duration -s duration_diff^ | head -n20 Program Duration (us) (A) Duration (us) (B) Duration (us) (DIFF) ---------------------------------------- ----------------- ----------------- -------------------- tw_twfw_ingress 4488374 272836 -4215538 (-93.92%) tw_twfw_tc_in 4339111 268175 -4070936 (-93.82%) tw_twfw_egress 3521816 270751 -3251065 (-92.31%) tw_twfw_tc_eg 3472878 284294 -3188584 (-91.81%) balancer_ingress 343119 291391 -51728 (-15.08%) syar_bind6_protect6 78992 64782 -14210 (-17.99%) ttls_tc_ingress 11739 8176 -3563 (-30.35%) kprobe__security_inode_link 13864 11341 -2523 (-18.20%) read_sync_py_stack 21927 19442 -2485 (-11.33%) read_async_py_stack 30444 28136 -2308 (-7.58%) syar_task_exit_setuid 10256 8440 -1816 (-17.71%) Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Add support precision backtracking in the presence of subprogram frames in jump history. This means supporting a few different kinds of subprogram invocation situations, all requiring a slightly different handling in precision backtracking handling logic: - static subprogram calls; - global subprogram calls; - callback-calling helpers/kfuncs. For each of those we need to handle a few precision propagation cases: - what to do with precision of subprog returns (r0); - what to do with precision of input arguments; - for all of them callee-saved registers in caller function should be propagated ignoring subprog/callback part of jump history. N.B. Async callback-calling helpers (currently only bpf_timer_set_callback()) are transparent to all this because they set a separate async callback environment and thus callback's history is not shared with main program's history. So as far as all the changes in this commit goes, such helper is just a regular helper. Let's look at all these situation in more details. Let's start with static subprogram being called, using an exxerpt of a simple main program and its static subprog, indenting subprog's frame slightly to make everything clear. frame 0 frame 1 precision set ======= ======= ============= 9: r6 = 456; 10: r1 = 123; r6 11: call pc+10; r1, r6 22: r0 = r1; r1 23: exit r0 12: r1 = <map_pointer> r0, r6 13: r1 += r0; r0, r6 14: r1 += r6; r6; 15: exit As can be seen above main function is passing 123 as single argument to an identity (`return x;`) subprog. Returned value is used to adjust map pointer offset, which forces r0 to be marked as precise. Then instruction #14 does the same for callee-saved r6, which will have to be backtracked all the way to instruction #9. For brevity, precision sets for instruction #13 and #14 are combined in the diagram above. First, for subprog calls, r0 returned from subprog (in frame 0) has to go into subprog's frame 1, and should be cleared from frame 0. So we go back into subprog's frame knowing we need to mark r0 precise. We then see that insn #22 sets r0 from r1, so now we care about marking r1 precise. When we pop up from subprog's frame back into caller at insn #11 we keep r1, as it's an argument-passing register, so we eventually find `10: r1 = 123;` and satify precision propagation chain for insn #13. This example demonstrates two sets of rules: - r0 returned after subprog call has to be moved into subprog's r0 set; - *static* subprog arguments (r1-r5) are moved back to caller precision set. Let's look at what happens with callee-saved precision propagation. Insn #14 mark r6 as precise. When we get into subprog's frame, we keep r6 in frame 0's precision set *only*. Subprog itself has its own set of independent r6-r10 registers and is not affected. When we eventually made our way out of subprog frame we keep r6 in precision set until we reach `9: r6 = 456;`, satisfying propagation. r6-r10 propagation is perhaps the simplest aspect, it always stays in its original frame. That's pretty much all we have to do to support precision propagation across *static subprog* invocation. Let's look at what happens when we have global subprog invocation. frame 0 frame 1 precision set ======= ======= ============= 9: r6 = 456; 10: r1 = 123; r6 11: call pc+10; # global subprog r6 12: r1 = <map_pointer> r0, r6 13: r1 += r0; r0, r6 14: r1 += r6; r6; 15: exit Starting from insn #13, r0 has to be precise. We backtrack all the way to insn #11 (call pc+10) and see that subprog is global, so was already validated in isolation. As opposed to static subprog, global subprog always returns unknown scalar r0, so that satisfies precision propagation and we drop r0 from precision set. We are done for insns #13. Now for insn #14. r6 is in precision set, we backtrack to `call pc+10;`. Here we need to recognize that this is effectively both exit and entry to global subprog, which means we stay in caller's frame. So we carry on with r6 still in precision set, until we satisfy it at insn #9. The only hard part with global subprogs is just knowing when it's a global func. Lastly, callback-calling helpers and kfuncs do simulate subprog calls, so jump history will have subprog instructions in between caller program's instructions, but the rules of propagating r0 and r1-r5 differ, because we don't actually directly call callback. We actually call helper/kfunc, which at runtime will call subprog, so the only difference between normal helper/kfunc handling is that we need to make sure to skip callback simulatinog part of jump history. Let's look at an example to make this clearer. frame 0 frame 1 precision set ======= ======= ============= 8: r6 = 456; 9: r1 = 123; r6 10: r2 = &callback; r6 11: call bpf_loop; r6 22: r0 = r1; 23: exit 12: r1 = <map_pointer> r0, r6 13: r1 += r0; r0, r6 14: r1 += r6; r6; 15: exit Again, insn #13 forces r0 to be precise. As soon as we get to `23: exit` we see that this isn't actually a static subprog call (it's `call bpf_loop;` helper call instead). So we clear r0 from precision set. For callee-saved register, there is no difference: it stays in frame 0's precision set, we go through insn #22 and #23, ignoring them until we get back to caller frame 0, eventually satisfying precision backtrack logic at insn #8 (`r6 = 456;`). Assuming callback needed to set r0 as precise at insn #23, we'd backtrack to insn #22, switching from r0 to r1, and then at the point when we pop back to frame 0 at insn #11, we'll clear r1-r5 from precision set, as we don't really do a subprog call directly, so there is no input argument precision propagation. That's pretty much it. With these changes, it seems like the only still unsupported situation for precision backpropagation is the case when program is accessing stack through registers other than r10. This is still left as unsupported (though rare) case for now. As for results. For selftests, few positive changes for bigger programs, cls_redirect in dynptr variant benefitting the most: [vmuser@archvm bpf]$ ./veristat -C ~/subprog-precise-before-results.csv ~/subprog-precise-after-results.csv -f @veristat.cfg -e file,prog,insns -f 'insns_diff!=0' File Program Insns (A) Insns (B) Insns (DIFF) ---------------------------------------- ------------- --------- --------- ---------------- pyperf600_bpf_loop.bpf.linked1.o on_event 2060 2002 -58 (-2.82%) test_cls_redirect_dynptr.bpf.linked1.o cls_redirect 15660 2914 -12746 (-81.39%) test_cls_redirect_subprogs.bpf.linked1.o cls_redirect 61620 59088 -2532 (-4.11%) xdp_synproxy_kern.bpf.linked1.o syncookie_tc 109980 86278 -23702 (-21.55%) xdp_synproxy_kern.bpf.linked1.o syncookie_xdp 97716 85147 -12569 (-12.86%) Cilium progress don't really regress. They don't use subprogs and are mostly unaffected, but some other fixes and improvements could have changed something. This doesn't appear to be the case: [vmuser@archvm bpf]$ ./veristat -C ~/subprog-precise-before-results-cilium.csv ~/subprog-precise-after-results-cilium.csv -e file,prog,insns -f 'insns_diff!=0' File Program Insns (A) Insns (B) Insns (DIFF) ------------- ------------------------------ --------- --------- ------------ bpf_host.o tail_nodeport_nat_ingress_ipv6 4983 5003 +20 (+0.40%) bpf_lxc.o tail_nodeport_nat_ingress_ipv6 4983 5003 +20 (+0.40%) bpf_overlay.o tail_nodeport_nat_ingress_ipv6 4983 5003 +20 (+0.40%) bpf_xdp.o tail_handle_nat_fwd_ipv6 12475 12504 +29 (+0.23%) bpf_xdp.o tail_nodeport_nat_ingress_ipv6 6363 6371 +8 (+0.13%) Looking at (somewhat anonymized) Meta production programs, we see mostly insignificant variation in number of instructions, with one program (syar_bind6_protect6) benefitting the most at -17%. [vmuser@archvm bpf]$ ./veristat -C ~/subprog-precise-before-results-fbcode.csv ~/subprog-precise-after-results-fbcode.csv -e prog,insns -f 'insns_diff!=0' Program Insns (A) Insns (B) Insns (DIFF) ------------------------ --------- --------- ---------------- on_request_context_event 597 585 -12 (-2.01%) read_async_py_stack 43789 43657 -132 (-0.30%) read_sync_py_stack 35041 37599 +2558 (+7.30%) rrm_usdt 946 940 -6 (-0.63%) sysarmor_inet6_bind 28863 28249 -614 (-2.13%) sysarmor_inet_bind 28845 28240 -605 (-2.10%) syar_bind4_protect4 154145 147640 -6505 (-4.22%) syar_bind6_protect6 165242 137088 -28154 (-17.04%) syar_task_exit_setgid 21289 19720 -1569 (-7.37%) syar_task_exit_setuid 21290 19721 -1569 (-7.37%) do_uprobe 19967 19413 -554 (-2.77%) tw_twfw_ingress 215877 204833 -11044 (-5.12%) tw_twfw_tc_in 215877 204833 -11044 (-5.12%) But checking duration (wall clock) differences, that is the actual time taken by verifier to validate programs, we see a sometimes dramatic improvements, all the way to about 16x improvements: [vmuser@archvm bpf]$ ./veristat -C ~/subprog-precise-before-results-meta.csv ~/subprog-precise-after-results-meta.csv -e prog,duration -s duration_diff^ | head -n20 Program Duration (us) (A) Duration (us) (B) Duration (us) (DIFF) ---------------------------------------- ----------------- ----------------- -------------------- tw_twfw_ingress 4488374 272836 -4215538 (-93.92%) tw_twfw_tc_in 4339111 268175 -4070936 (-93.82%) tw_twfw_egress 3521816 270751 -3251065 (-92.31%) tw_twfw_tc_eg 3472878 284294 -3188584 (-91.81%) balancer_ingress 343119 291391 -51728 (-15.08%) syar_bind6_protect6 78992 64782 -14210 (-17.99%) ttls_tc_ingress 11739 8176 -3563 (-30.35%) kprobe__security_inode_link 13864 11341 -2523 (-18.20%) read_sync_py_stack 21927 19442 -2485 (-11.33%) read_async_py_stack 30444 28136 -2308 (-7.58%) syar_task_exit_setuid 10256 8440 -1816 (-17.71%) Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Add support precision backtracking in the presence of subprogram frames in jump history. This means supporting a few different kinds of subprogram invocation situations, all requiring a slightly different handling in precision backtracking handling logic: - static subprogram calls; - global subprogram calls; - callback-calling helpers/kfuncs. For each of those we need to handle a few precision propagation cases: - what to do with precision of subprog returns (r0); - what to do with precision of input arguments; - for all of them callee-saved registers in caller function should be propagated ignoring subprog/callback part of jump history. N.B. Async callback-calling helpers (currently only bpf_timer_set_callback()) are transparent to all this because they set a separate async callback environment and thus callback's history is not shared with main program's history. So as far as all the changes in this commit goes, such helper is just a regular helper. Let's look at all these situation in more details. Let's start with static subprogram being called, using an exxerpt of a simple main program and its static subprog, indenting subprog's frame slightly to make everything clear. frame 0 frame 1 precision set ======= ======= ============= 9: r6 = 456; 10: r1 = 123; r6 11: call pc+10; r1, r6 22: r0 = r1; r1 23: exit r0 12: r1 = <map_pointer> r0, r6 13: r1 += r0; r0, r6 14: r1 += r6; r6; 15: exit As can be seen above main function is passing 123 as single argument to an identity (`return x;`) subprog. Returned value is used to adjust map pointer offset, which forces r0 to be marked as precise. Then instruction #14 does the same for callee-saved r6, which will have to be backtracked all the way to instruction #9. For brevity, precision sets for instruction #13 and #14 are combined in the diagram above. First, for subprog calls, r0 returned from subprog (in frame 0) has to go into subprog's frame 1, and should be cleared from frame 0. So we go back into subprog's frame knowing we need to mark r0 precise. We then see that insn #22 sets r0 from r1, so now we care about marking r1 precise. When we pop up from subprog's frame back into caller at insn #11 we keep r1, as it's an argument-passing register, so we eventually find `10: r1 = 123;` and satify precision propagation chain for insn #13. This example demonstrates two sets of rules: - r0 returned after subprog call has to be moved into subprog's r0 set; - *static* subprog arguments (r1-r5) are moved back to caller precision set. Let's look at what happens with callee-saved precision propagation. Insn #14 mark r6 as precise. When we get into subprog's frame, we keep r6 in frame 0's precision set *only*. Subprog itself has its own set of independent r6-r10 registers and is not affected. When we eventually made our way out of subprog frame we keep r6 in precision set until we reach `9: r6 = 456;`, satisfying propagation. r6-r10 propagation is perhaps the simplest aspect, it always stays in its original frame. That's pretty much all we have to do to support precision propagation across *static subprog* invocation. Let's look at what happens when we have global subprog invocation. frame 0 frame 1 precision set ======= ======= ============= 9: r6 = 456; 10: r1 = 123; r6 11: call pc+10; # global subprog r6 12: r1 = <map_pointer> r0, r6 13: r1 += r0; r0, r6 14: r1 += r6; r6; 15: exit Starting from insn #13, r0 has to be precise. We backtrack all the way to insn #11 (call pc+10) and see that subprog is global, so was already validated in isolation. As opposed to static subprog, global subprog always returns unknown scalar r0, so that satisfies precision propagation and we drop r0 from precision set. We are done for insns #13. Now for insn #14. r6 is in precision set, we backtrack to `call pc+10;`. Here we need to recognize that this is effectively both exit and entry to global subprog, which means we stay in caller's frame. So we carry on with r6 still in precision set, until we satisfy it at insn #9. The only hard part with global subprogs is just knowing when it's a global func. Lastly, callback-calling helpers and kfuncs do simulate subprog calls, so jump history will have subprog instructions in between caller program's instructions, but the rules of propagating r0 and r1-r5 differ, because we don't actually directly call callback. We actually call helper/kfunc, which at runtime will call subprog, so the only difference between normal helper/kfunc handling is that we need to make sure to skip callback simulatinog part of jump history. Let's look at an example to make this clearer. frame 0 frame 1 precision set ======= ======= ============= 8: r6 = 456; 9: r1 = 123; r6 10: r2 = &callback; r6 11: call bpf_loop; r6 22: r0 = r1; 23: exit 12: r1 = <map_pointer> r0, r6 13: r1 += r0; r0, r6 14: r1 += r6; r6; 15: exit Again, insn #13 forces r0 to be precise. As soon as we get to `23: exit` we see that this isn't actually a static subprog call (it's `call bpf_loop;` helper call instead). So we clear r0 from precision set. For callee-saved register, there is no difference: it stays in frame 0's precision set, we go through insn #22 and #23, ignoring them until we get back to caller frame 0, eventually satisfying precision backtrack logic at insn #8 (`r6 = 456;`). Assuming callback needed to set r0 as precise at insn #23, we'd backtrack to insn #22, switching from r0 to r1, and then at the point when we pop back to frame 0 at insn #11, we'll clear r1-r5 from precision set, as we don't really do a subprog call directly, so there is no input argument precision propagation. That's pretty much it. With these changes, it seems like the only still unsupported situation for precision backpropagation is the case when program is accessing stack through registers other than r10. This is still left as unsupported (though rare) case for now. As for results. For selftests, few positive changes for bigger programs, cls_redirect in dynptr variant benefitting the most: [vmuser@archvm bpf]$ ./veristat -C ~/subprog-precise-before-results.csv ~/subprog-precise-after-results.csv -f @veristat.cfg -e file,prog,insns -f 'insns_diff!=0' File Program Insns (A) Insns (B) Insns (DIFF) ---------------------------------------- ------------- --------- --------- ---------------- pyperf600_bpf_loop.bpf.linked1.o on_event 2060 2002 -58 (-2.82%) test_cls_redirect_dynptr.bpf.linked1.o cls_redirect 15660 2914 -12746 (-81.39%) test_cls_redirect_subprogs.bpf.linked1.o cls_redirect 61620 59088 -2532 (-4.11%) xdp_synproxy_kern.bpf.linked1.o syncookie_tc 109980 86278 -23702 (-21.55%) xdp_synproxy_kern.bpf.linked1.o syncookie_xdp 97716 85147 -12569 (-12.86%) Cilium progress don't really regress. They don't use subprogs and are mostly unaffected, but some other fixes and improvements could have changed something. This doesn't appear to be the case: [vmuser@archvm bpf]$ ./veristat -C ~/subprog-precise-before-results-cilium.csv ~/subprog-precise-after-results-cilium.csv -e file,prog,insns -f 'insns_diff!=0' File Program Insns (A) Insns (B) Insns (DIFF) ------------- ------------------------------ --------- --------- ------------ bpf_host.o tail_nodeport_nat_ingress_ipv6 4983 5003 +20 (+0.40%) bpf_lxc.o tail_nodeport_nat_ingress_ipv6 4983 5003 +20 (+0.40%) bpf_overlay.o tail_nodeport_nat_ingress_ipv6 4983 5003 +20 (+0.40%) bpf_xdp.o tail_handle_nat_fwd_ipv6 12475 12504 +29 (+0.23%) bpf_xdp.o tail_nodeport_nat_ingress_ipv6 6363 6371 +8 (+0.13%) Looking at (somewhat anonymized) Meta production programs, we see mostly insignificant variation in number of instructions, with one program (syar_bind6_protect6) benefitting the most at -17%. [vmuser@archvm bpf]$ ./veristat -C ~/subprog-precise-before-results-fbcode.csv ~/subprog-precise-after-results-fbcode.csv -e prog,insns -f 'insns_diff!=0' Program Insns (A) Insns (B) Insns (DIFF) ------------------------ --------- --------- ---------------- on_request_context_event 597 585 -12 (-2.01%) read_async_py_stack 43789 43657 -132 (-0.30%) read_sync_py_stack 35041 37599 +2558 (+7.30%) rrm_usdt 946 940 -6 (-0.63%) sysarmor_inet6_bind 28863 28249 -614 (-2.13%) sysarmor_inet_bind 28845 28240 -605 (-2.10%) syar_bind4_protect4 154145 147640 -6505 (-4.22%) syar_bind6_protect6 165242 137088 -28154 (-17.04%) syar_task_exit_setgid 21289 19720 -1569 (-7.37%) syar_task_exit_setuid 21290 19721 -1569 (-7.37%) do_uprobe 19967 19413 -554 (-2.77%) tw_twfw_ingress 215877 204833 -11044 (-5.12%) tw_twfw_tc_in 215877 204833 -11044 (-5.12%) But checking duration (wall clock) differences, that is the actual time taken by verifier to validate programs, we see a sometimes dramatic improvements, all the way to about 16x improvements: [vmuser@archvm bpf]$ ./veristat -C ~/subprog-precise-before-results-meta.csv ~/subprog-precise-after-results-meta.csv -e prog,duration -s duration_diff^ | head -n20 Program Duration (us) (A) Duration (us) (B) Duration (us) (DIFF) ---------------------------------------- ----------------- ----------------- -------------------- tw_twfw_ingress 4488374 272836 -4215538 (-93.92%) tw_twfw_tc_in 4339111 268175 -4070936 (-93.82%) tw_twfw_egress 3521816 270751 -3251065 (-92.31%) tw_twfw_tc_eg 3472878 284294 -3188584 (-91.81%) balancer_ingress 343119 291391 -51728 (-15.08%) syar_bind6_protect6 78992 64782 -14210 (-17.99%) ttls_tc_ingress 11739 8176 -3563 (-30.35%) kprobe__security_inode_link 13864 11341 -2523 (-18.20%) read_sync_py_stack 21927 19442 -2485 (-11.33%) read_async_py_stack 30444 28136 -2308 (-7.58%) syar_task_exit_setuid 10256 8440 -1816 (-17.71%) Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Add support precision backtracking in the presence of subprogram frames in jump history. This means supporting a few different kinds of subprogram invocation situations, all requiring a slightly different handling in precision backtracking handling logic: - static subprogram calls; - global subprogram calls; - callback-calling helpers/kfuncs. For each of those we need to handle a few precision propagation cases: - what to do with precision of subprog returns (r0); - what to do with precision of input arguments; - for all of them callee-saved registers in caller function should be propagated ignoring subprog/callback part of jump history. N.B. Async callback-calling helpers (currently only bpf_timer_set_callback()) are transparent to all this because they set a separate async callback environment and thus callback's history is not shared with main program's history. So as far as all the changes in this commit goes, such helper is just a regular helper. Let's look at all these situation in more details. Let's start with static subprogram being called, using an exxerpt of a simple main program and its static subprog, indenting subprog's frame slightly to make everything clear. frame 0 frame 1 precision set ======= ======= ============= 9: r6 = 456; 10: r1 = 123; r6 11: call pc+10; r1, r6 22: r0 = r1; r1 23: exit r0 12: r1 = <map_pointer> r0, r6 13: r1 += r0; r0, r6 14: r1 += r6; r6; 15: exit As can be seen above main function is passing 123 as single argument to an identity (`return x;`) subprog. Returned value is used to adjust map pointer offset, which forces r0 to be marked as precise. Then instruction #14 does the same for callee-saved r6, which will have to be backtracked all the way to instruction #9. For brevity, precision sets for instruction #13 and #14 are combined in the diagram above. First, for subprog calls, r0 returned from subprog (in frame 0) has to go into subprog's frame 1, and should be cleared from frame 0. So we go back into subprog's frame knowing we need to mark r0 precise. We then see that insn #22 sets r0 from r1, so now we care about marking r1 precise. When we pop up from subprog's frame back into caller at insn #11 we keep r1, as it's an argument-passing register, so we eventually find `10: r1 = 123;` and satify precision propagation chain for insn #13. This example demonstrates two sets of rules: - r0 returned after subprog call has to be moved into subprog's r0 set; - *static* subprog arguments (r1-r5) are moved back to caller precision set. Let's look at what happens with callee-saved precision propagation. Insn #14 mark r6 as precise. When we get into subprog's frame, we keep r6 in frame 0's precision set *only*. Subprog itself has its own set of independent r6-r10 registers and is not affected. When we eventually made our way out of subprog frame we keep r6 in precision set until we reach `9: r6 = 456;`, satisfying propagation. r6-r10 propagation is perhaps the simplest aspect, it always stays in its original frame. That's pretty much all we have to do to support precision propagation across *static subprog* invocation. Let's look at what happens when we have global subprog invocation. frame 0 frame 1 precision set ======= ======= ============= 9: r6 = 456; 10: r1 = 123; r6 11: call pc+10; # global subprog r6 12: r1 = <map_pointer> r0, r6 13: r1 += r0; r0, r6 14: r1 += r6; r6; 15: exit Starting from insn #13, r0 has to be precise. We backtrack all the way to insn #11 (call pc+10) and see that subprog is global, so was already validated in isolation. As opposed to static subprog, global subprog always returns unknown scalar r0, so that satisfies precision propagation and we drop r0 from precision set. We are done for insns #13. Now for insn #14. r6 is in precision set, we backtrack to `call pc+10;`. Here we need to recognize that this is effectively both exit and entry to global subprog, which means we stay in caller's frame. So we carry on with r6 still in precision set, until we satisfy it at insn #9. The only hard part with global subprogs is just knowing when it's a global func. Lastly, callback-calling helpers and kfuncs do simulate subprog calls, so jump history will have subprog instructions in between caller program's instructions, but the rules of propagating r0 and r1-r5 differ, because we don't actually directly call callback. We actually call helper/kfunc, which at runtime will call subprog, so the only difference between normal helper/kfunc handling is that we need to make sure to skip callback simulatinog part of jump history. Let's look at an example to make this clearer. frame 0 frame 1 precision set ======= ======= ============= 8: r6 = 456; 9: r1 = 123; r6 10: r2 = &callback; r6 11: call bpf_loop; r6 22: r0 = r1; 23: exit 12: r1 = <map_pointer> r0, r6 13: r1 += r0; r0, r6 14: r1 += r6; r6; 15: exit Again, insn #13 forces r0 to be precise. As soon as we get to `23: exit` we see that this isn't actually a static subprog call (it's `call bpf_loop;` helper call instead). So we clear r0 from precision set. For callee-saved register, there is no difference: it stays in frame 0's precision set, we go through insn #22 and #23, ignoring them until we get back to caller frame 0, eventually satisfying precision backtrack logic at insn #8 (`r6 = 456;`). Assuming callback needed to set r0 as precise at insn #23, we'd backtrack to insn #22, switching from r0 to r1, and then at the point when we pop back to frame 0 at insn #11, we'll clear r1-r5 from precision set, as we don't really do a subprog call directly, so there is no input argument precision propagation. That's pretty much it. With these changes, it seems like the only still unsupported situation for precision backpropagation is the case when program is accessing stack through registers other than r10. This is still left as unsupported (though rare) case for now. As for results. For selftests, few positive changes for bigger programs, cls_redirect in dynptr variant benefitting the most: [vmuser@archvm bpf]$ ./veristat -C ~/subprog-precise-before-results.csv ~/subprog-precise-after-results.csv -f @veristat.cfg -e file,prog,insns -f 'insns_diff!=0' File Program Insns (A) Insns (B) Insns (DIFF) ---------------------------------------- ------------- --------- --------- ---------------- pyperf600_bpf_loop.bpf.linked1.o on_event 2060 2002 -58 (-2.82%) test_cls_redirect_dynptr.bpf.linked1.o cls_redirect 15660 2914 -12746 (-81.39%) test_cls_redirect_subprogs.bpf.linked1.o cls_redirect 61620 59088 -2532 (-4.11%) xdp_synproxy_kern.bpf.linked1.o syncookie_tc 109980 86278 -23702 (-21.55%) xdp_synproxy_kern.bpf.linked1.o syncookie_xdp 97716 85147 -12569 (-12.86%) Cilium progress don't really regress. They don't use subprogs and are mostly unaffected, but some other fixes and improvements could have changed something. This doesn't appear to be the case: [vmuser@archvm bpf]$ ./veristat -C ~/subprog-precise-before-results-cilium.csv ~/subprog-precise-after-results-cilium.csv -e file,prog,insns -f 'insns_diff!=0' File Program Insns (A) Insns (B) Insns (DIFF) ------------- ------------------------------ --------- --------- ------------ bpf_host.o tail_nodeport_nat_ingress_ipv6 4983 5003 +20 (+0.40%) bpf_lxc.o tail_nodeport_nat_ingress_ipv6 4983 5003 +20 (+0.40%) bpf_overlay.o tail_nodeport_nat_ingress_ipv6 4983 5003 +20 (+0.40%) bpf_xdp.o tail_handle_nat_fwd_ipv6 12475 12504 +29 (+0.23%) bpf_xdp.o tail_nodeport_nat_ingress_ipv6 6363 6371 +8 (+0.13%) Looking at (somewhat anonymized) Meta production programs, we see mostly insignificant variation in number of instructions, with one program (syar_bind6_protect6) benefitting the most at -17%. [vmuser@archvm bpf]$ ./veristat -C ~/subprog-precise-before-results-fbcode.csv ~/subprog-precise-after-results-fbcode.csv -e prog,insns -f 'insns_diff!=0' Program Insns (A) Insns (B) Insns (DIFF) ------------------------ --------- --------- ---------------- on_request_context_event 597 585 -12 (-2.01%) read_async_py_stack 43789 43657 -132 (-0.30%) read_sync_py_stack 35041 37599 +2558 (+7.30%) rrm_usdt 946 940 -6 (-0.63%) sysarmor_inet6_bind 28863 28249 -614 (-2.13%) sysarmor_inet_bind 28845 28240 -605 (-2.10%) syar_bind4_protect4 154145 147640 -6505 (-4.22%) syar_bind6_protect6 165242 137088 -28154 (-17.04%) syar_task_exit_setgid 21289 19720 -1569 (-7.37%) syar_task_exit_setuid 21290 19721 -1569 (-7.37%) do_uprobe 19967 19413 -554 (-2.77%) tw_twfw_ingress 215877 204833 -11044 (-5.12%) tw_twfw_tc_in 215877 204833 -11044 (-5.12%) But checking duration (wall clock) differences, that is the actual time taken by verifier to validate programs, we see a sometimes dramatic improvements, all the way to about 16x improvements: [vmuser@archvm bpf]$ ./veristat -C ~/subprog-precise-before-results-meta.csv ~/subprog-precise-after-results-meta.csv -e prog,duration -s duration_diff^ | head -n20 Program Duration (us) (A) Duration (us) (B) Duration (us) (DIFF) ---------------------------------------- ----------------- ----------------- -------------------- tw_twfw_ingress 4488374 272836 -4215538 (-93.92%) tw_twfw_tc_in 4339111 268175 -4070936 (-93.82%) tw_twfw_egress 3521816 270751 -3251065 (-92.31%) tw_twfw_tc_eg 3472878 284294 -3188584 (-91.81%) balancer_ingress 343119 291391 -51728 (-15.08%) syar_bind6_protect6 78992 64782 -14210 (-17.99%) ttls_tc_ingress 11739 8176 -3563 (-30.35%) kprobe__security_inode_link 13864 11341 -2523 (-18.20%) read_sync_py_stack 21927 19442 -2485 (-11.33%) read_async_py_stack 30444 28136 -2308 (-7.58%) syar_task_exit_setuid 10256 8440 -1816 (-17.71%) Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Add support precision backtracking in the presence of subprogram frames in jump history. This means supporting a few different kinds of subprogram invocation situations, all requiring a slightly different handling in precision backtracking handling logic: - static subprogram calls; - global subprogram calls; - callback-calling helpers/kfuncs. For each of those we need to handle a few precision propagation cases: - what to do with precision of subprog returns (r0); - what to do with precision of input arguments; - for all of them callee-saved registers in caller function should be propagated ignoring subprog/callback part of jump history. N.B. Async callback-calling helpers (currently only bpf_timer_set_callback()) are transparent to all this because they set a separate async callback environment and thus callback's history is not shared with main program's history. So as far as all the changes in this commit goes, such helper is just a regular helper. Let's look at all these situation in more details. Let's start with static subprogram being called, using an exxerpt of a simple main program and its static subprog, indenting subprog's frame slightly to make everything clear. frame 0 frame 1 precision set ======= ======= ============= 9: r6 = 456; 10: r1 = 123; r6 11: call pc+10; r1, r6 22: r0 = r1; r1 23: exit r0 12: r1 = <map_pointer> r0, r6 13: r1 += r0; r0, r6 14: r1 += r6; r6; 15: exit As can be seen above main function is passing 123 as single argument to an identity (`return x;`) subprog. Returned value is used to adjust map pointer offset, which forces r0 to be marked as precise. Then instruction #14 does the same for callee-saved r6, which will have to be backtracked all the way to instruction #9. For brevity, precision sets for instruction #13 and #14 are combined in the diagram above. First, for subprog calls, r0 returned from subprog (in frame 0) has to go into subprog's frame 1, and should be cleared from frame 0. So we go back into subprog's frame knowing we need to mark r0 precise. We then see that insn #22 sets r0 from r1, so now we care about marking r1 precise. When we pop up from subprog's frame back into caller at insn #11 we keep r1, as it's an argument-passing register, so we eventually find `10: r1 = 123;` and satify precision propagation chain for insn #13. This example demonstrates two sets of rules: - r0 returned after subprog call has to be moved into subprog's r0 set; - *static* subprog arguments (r1-r5) are moved back to caller precision set. Let's look at what happens with callee-saved precision propagation. Insn #14 mark r6 as precise. When we get into subprog's frame, we keep r6 in frame 0's precision set *only*. Subprog itself has its own set of independent r6-r10 registers and is not affected. When we eventually made our way out of subprog frame we keep r6 in precision set until we reach `9: r6 = 456;`, satisfying propagation. r6-r10 propagation is perhaps the simplest aspect, it always stays in its original frame. That's pretty much all we have to do to support precision propagation across *static subprog* invocation. Let's look at what happens when we have global subprog invocation. frame 0 frame 1 precision set ======= ======= ============= 9: r6 = 456; 10: r1 = 123; r6 11: call pc+10; # global subprog r6 12: r1 = <map_pointer> r0, r6 13: r1 += r0; r0, r6 14: r1 += r6; r6; 15: exit Starting from insn #13, r0 has to be precise. We backtrack all the way to insn #11 (call pc+10) and see that subprog is global, so was already validated in isolation. As opposed to static subprog, global subprog always returns unknown scalar r0, so that satisfies precision propagation and we drop r0 from precision set. We are done for insns #13. Now for insn #14. r6 is in precision set, we backtrack to `call pc+10;`. Here we need to recognize that this is effectively both exit and entry to global subprog, which means we stay in caller's frame. So we carry on with r6 still in precision set, until we satisfy it at insn #9. The only hard part with global subprogs is just knowing when it's a global func. Lastly, callback-calling helpers and kfuncs do simulate subprog calls, so jump history will have subprog instructions in between caller program's instructions, but the rules of propagating r0 and r1-r5 differ, because we don't actually directly call callback. We actually call helper/kfunc, which at runtime will call subprog, so the only difference between normal helper/kfunc handling is that we need to make sure to skip callback simulatinog part of jump history. Let's look at an example to make this clearer. frame 0 frame 1 precision set ======= ======= ============= 8: r6 = 456; 9: r1 = 123; r6 10: r2 = &callback; r6 11: call bpf_loop; r6 22: r0 = r1; 23: exit 12: r1 = <map_pointer> r0, r6 13: r1 += r0; r0, r6 14: r1 += r6; r6; 15: exit Again, insn #13 forces r0 to be precise. As soon as we get to `23: exit` we see that this isn't actually a static subprog call (it's `call bpf_loop;` helper call instead). So we clear r0 from precision set. For callee-saved register, there is no difference: it stays in frame 0's precision set, we go through insn #22 and #23, ignoring them until we get back to caller frame 0, eventually satisfying precision backtrack logic at insn #8 (`r6 = 456;`). Assuming callback needed to set r0 as precise at insn #23, we'd backtrack to insn #22, switching from r0 to r1, and then at the point when we pop back to frame 0 at insn #11, we'll clear r1-r5 from precision set, as we don't really do a subprog call directly, so there is no input argument precision propagation. That's pretty much it. With these changes, it seems like the only still unsupported situation for precision backpropagation is the case when program is accessing stack through registers other than r10. This is still left as unsupported (though rare) case for now. As for results. For selftests, few positive changes for bigger programs, cls_redirect in dynptr variant benefitting the most: [vmuser@archvm bpf]$ ./veristat -C ~/subprog-precise-before-results.csv ~/subprog-precise-after-results.csv -f @veristat.cfg -e file,prog,insns -f 'insns_diff!=0' File Program Insns (A) Insns (B) Insns (DIFF) ---------------------------------------- ------------- --------- --------- ---------------- pyperf600_bpf_loop.bpf.linked1.o on_event 2060 2002 -58 (-2.82%) test_cls_redirect_dynptr.bpf.linked1.o cls_redirect 15660 2914 -12746 (-81.39%) test_cls_redirect_subprogs.bpf.linked1.o cls_redirect 61620 59088 -2532 (-4.11%) xdp_synproxy_kern.bpf.linked1.o syncookie_tc 109980 86278 -23702 (-21.55%) xdp_synproxy_kern.bpf.linked1.o syncookie_xdp 97716 85147 -12569 (-12.86%) Cilium progress don't really regress. They don't use subprogs and are mostly unaffected, but some other fixes and improvements could have changed something. This doesn't appear to be the case: [vmuser@archvm bpf]$ ./veristat -C ~/subprog-precise-before-results-cilium.csv ~/subprog-precise-after-results-cilium.csv -e file,prog,insns -f 'insns_diff!=0' File Program Insns (A) Insns (B) Insns (DIFF) ------------- ------------------------------ --------- --------- ------------ bpf_host.o tail_nodeport_nat_ingress_ipv6 4983 5003 +20 (+0.40%) bpf_lxc.o tail_nodeport_nat_ingress_ipv6 4983 5003 +20 (+0.40%) bpf_overlay.o tail_nodeport_nat_ingress_ipv6 4983 5003 +20 (+0.40%) bpf_xdp.o tail_handle_nat_fwd_ipv6 12475 12504 +29 (+0.23%) bpf_xdp.o tail_nodeport_nat_ingress_ipv6 6363 6371 +8 (+0.13%) Looking at (somewhat anonymized) Meta production programs, we see mostly insignificant variation in number of instructions, with one program (syar_bind6_protect6) benefitting the most at -17%. [vmuser@archvm bpf]$ ./veristat -C ~/subprog-precise-before-results-fbcode.csv ~/subprog-precise-after-results-fbcode.csv -e prog,insns -f 'insns_diff!=0' Program Insns (A) Insns (B) Insns (DIFF) ------------------------ --------- --------- ---------------- on_request_context_event 597 585 -12 (-2.01%) read_async_py_stack 43789 43657 -132 (-0.30%) read_sync_py_stack 35041 37599 +2558 (+7.30%) rrm_usdt 946 940 -6 (-0.63%) sysarmor_inet6_bind 28863 28249 -614 (-2.13%) sysarmor_inet_bind 28845 28240 -605 (-2.10%) syar_bind4_protect4 154145 147640 -6505 (-4.22%) syar_bind6_protect6 165242 137088 -28154 (-17.04%) syar_task_exit_setgid 21289 19720 -1569 (-7.37%) syar_task_exit_setuid 21290 19721 -1569 (-7.37%) do_uprobe 19967 19413 -554 (-2.77%) tw_twfw_ingress 215877 204833 -11044 (-5.12%) tw_twfw_tc_in 215877 204833 -11044 (-5.12%) But checking duration (wall clock) differences, that is the actual time taken by verifier to validate programs, we see a sometimes dramatic improvements, all the way to about 16x improvements: [vmuser@archvm bpf]$ ./veristat -C ~/subprog-precise-before-results-meta.csv ~/subprog-precise-after-results-meta.csv -e prog,duration -s duration_diff^ | head -n20 Program Duration (us) (A) Duration (us) (B) Duration (us) (DIFF) ---------------------------------------- ----------------- ----------------- -------------------- tw_twfw_ingress 4488374 272836 -4215538 (-93.92%) tw_twfw_tc_in 4339111 268175 -4070936 (-93.82%) tw_twfw_egress 3521816 270751 -3251065 (-92.31%) tw_twfw_tc_eg 3472878 284294 -3188584 (-91.81%) balancer_ingress 343119 291391 -51728 (-15.08%) syar_bind6_protect6 78992 64782 -14210 (-17.99%) ttls_tc_ingress 11739 8176 -3563 (-30.35%) kprobe__security_inode_link 13864 11341 -2523 (-18.20%) read_sync_py_stack 21927 19442 -2485 (-11.33%) read_async_py_stack 30444 28136 -2308 (-7.58%) syar_task_exit_setuid 10256 8440 -1816 (-17.71%) Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Add support precision backtracking in the presence of subprogram frames in jump history. This means supporting a few different kinds of subprogram invocation situations, all requiring a slightly different handling in precision backtracking handling logic: - static subprogram calls; - global subprogram calls; - callback-calling helpers/kfuncs. For each of those we need to handle a few precision propagation cases: - what to do with precision of subprog returns (r0); - what to do with precision of input arguments; - for all of them callee-saved registers in caller function should be propagated ignoring subprog/callback part of jump history. N.B. Async callback-calling helpers (currently only bpf_timer_set_callback()) are transparent to all this because they set a separate async callback environment and thus callback's history is not shared with main program's history. So as far as all the changes in this commit goes, such helper is just a regular helper. Let's look at all these situation in more details. Let's start with static subprogram being called, using an exxerpt of a simple main program and its static subprog, indenting subprog's frame slightly to make everything clear. frame 0 frame 1 precision set ======= ======= ============= 9: r6 = 456; 10: r1 = 123; r6 11: call pc+10; r1, r6 22: r0 = r1; r1 23: exit r0 12: r1 = <map_pointer> r0, r6 13: r1 += r0; r0, r6 14: r1 += r6; r6; 15: exit As can be seen above main function is passing 123 as single argument to an identity (`return x;`) subprog. Returned value is used to adjust map pointer offset, which forces r0 to be marked as precise. Then instruction #14 does the same for callee-saved r6, which will have to be backtracked all the way to instruction #9. For brevity, precision sets for instruction #13 and #14 are combined in the diagram above. First, for subprog calls, r0 returned from subprog (in frame 0) has to go into subprog's frame 1, and should be cleared from frame 0. So we go back into subprog's frame knowing we need to mark r0 precise. We then see that insn #22 sets r0 from r1, so now we care about marking r1 precise. When we pop up from subprog's frame back into caller at insn #11 we keep r1, as it's an argument-passing register, so we eventually find `10: r1 = 123;` and satify precision propagation chain for insn #13. This example demonstrates two sets of rules: - r0 returned after subprog call has to be moved into subprog's r0 set; - *static* subprog arguments (r1-r5) are moved back to caller precision set. Let's look at what happens with callee-saved precision propagation. Insn #14 mark r6 as precise. When we get into subprog's frame, we keep r6 in frame 0's precision set *only*. Subprog itself has its own set of independent r6-r10 registers and is not affected. When we eventually made our way out of subprog frame we keep r6 in precision set until we reach `9: r6 = 456;`, satisfying propagation. r6-r10 propagation is perhaps the simplest aspect, it always stays in its original frame. That's pretty much all we have to do to support precision propagation across *static subprog* invocation. Let's look at what happens when we have global subprog invocation. frame 0 frame 1 precision set ======= ======= ============= 9: r6 = 456; 10: r1 = 123; r6 11: call pc+10; # global subprog r6 12: r1 = <map_pointer> r0, r6 13: r1 += r0; r0, r6 14: r1 += r6; r6; 15: exit Starting from insn #13, r0 has to be precise. We backtrack all the way to insn #11 (call pc+10) and see that subprog is global, so was already validated in isolation. As opposed to static subprog, global subprog always returns unknown scalar r0, so that satisfies precision propagation and we drop r0 from precision set. We are done for insns #13. Now for insn #14. r6 is in precision set, we backtrack to `call pc+10;`. Here we need to recognize that this is effectively both exit and entry to global subprog, which means we stay in caller's frame. So we carry on with r6 still in precision set, until we satisfy it at insn #9. The only hard part with global subprogs is just knowing when it's a global func. Lastly, callback-calling helpers and kfuncs do simulate subprog calls, so jump history will have subprog instructions in between caller program's instructions, but the rules of propagating r0 and r1-r5 differ, because we don't actually directly call callback. We actually call helper/kfunc, which at runtime will call subprog, so the only difference between normal helper/kfunc handling is that we need to make sure to skip callback simulatinog part of jump history. Let's look at an example to make this clearer. frame 0 frame 1 precision set ======= ======= ============= 8: r6 = 456; 9: r1 = 123; r6 10: r2 = &callback; r6 11: call bpf_loop; r6 22: r0 = r1; 23: exit 12: r1 = <map_pointer> r0, r6 13: r1 += r0; r0, r6 14: r1 += r6; r6; 15: exit Again, insn #13 forces r0 to be precise. As soon as we get to `23: exit` we see that this isn't actually a static subprog call (it's `call bpf_loop;` helper call instead). So we clear r0 from precision set. For callee-saved register, there is no difference: it stays in frame 0's precision set, we go through insn #22 and #23, ignoring them until we get back to caller frame 0, eventually satisfying precision backtrack logic at insn #8 (`r6 = 456;`). Assuming callback needed to set r0 as precise at insn #23, we'd backtrack to insn #22, switching from r0 to r1, and then at the point when we pop back to frame 0 at insn #11, we'll clear r1-r5 from precision set, as we don't really do a subprog call directly, so there is no input argument precision propagation. That's pretty much it. With these changes, it seems like the only still unsupported situation for precision backpropagation is the case when program is accessing stack through registers other than r10. This is still left as unsupported (though rare) case for now. As for results. For selftests, few positive changes for bigger programs, cls_redirect in dynptr variant benefitting the most: [vmuser@archvm bpf]$ ./veristat -C ~/subprog-precise-before-results.csv ~/subprog-precise-after-results.csv -f @veristat.cfg -e file,prog,insns -f 'insns_diff!=0' File Program Insns (A) Insns (B) Insns (DIFF) ---------------------------------------- ------------- --------- --------- ---------------- pyperf600_bpf_loop.bpf.linked1.o on_event 2060 2002 -58 (-2.82%) test_cls_redirect_dynptr.bpf.linked1.o cls_redirect 15660 2914 -12746 (-81.39%) test_cls_redirect_subprogs.bpf.linked1.o cls_redirect 61620 59088 -2532 (-4.11%) xdp_synproxy_kern.bpf.linked1.o syncookie_tc 109980 86278 -23702 (-21.55%) xdp_synproxy_kern.bpf.linked1.o syncookie_xdp 97716 85147 -12569 (-12.86%) Cilium progress don't really regress. They don't use subprogs and are mostly unaffected, but some other fixes and improvements could have changed something. This doesn't appear to be the case: [vmuser@archvm bpf]$ ./veristat -C ~/subprog-precise-before-results-cilium.csv ~/subprog-precise-after-results-cilium.csv -e file,prog,insns -f 'insns_diff!=0' File Program Insns (A) Insns (B) Insns (DIFF) ------------- ------------------------------ --------- --------- ------------ bpf_host.o tail_nodeport_nat_ingress_ipv6 4983 5003 +20 (+0.40%) bpf_lxc.o tail_nodeport_nat_ingress_ipv6 4983 5003 +20 (+0.40%) bpf_overlay.o tail_nodeport_nat_ingress_ipv6 4983 5003 +20 (+0.40%) bpf_xdp.o tail_handle_nat_fwd_ipv6 12475 12504 +29 (+0.23%) bpf_xdp.o tail_nodeport_nat_ingress_ipv6 6363 6371 +8 (+0.13%) Looking at (somewhat anonymized) Meta production programs, we see mostly insignificant variation in number of instructions, with one program (syar_bind6_protect6) benefitting the most at -17%. [vmuser@archvm bpf]$ ./veristat -C ~/subprog-precise-before-results-fbcode.csv ~/subprog-precise-after-results-fbcode.csv -e prog,insns -f 'insns_diff!=0' Program Insns (A) Insns (B) Insns (DIFF) ------------------------ --------- --------- ---------------- on_request_context_event 597 585 -12 (-2.01%) read_async_py_stack 43789 43657 -132 (-0.30%) read_sync_py_stack 35041 37599 +2558 (+7.30%) rrm_usdt 946 940 -6 (-0.63%) sysarmor_inet6_bind 28863 28249 -614 (-2.13%) sysarmor_inet_bind 28845 28240 -605 (-2.10%) syar_bind4_protect4 154145 147640 -6505 (-4.22%) syar_bind6_protect6 165242 137088 -28154 (-17.04%) syar_task_exit_setgid 21289 19720 -1569 (-7.37%) syar_task_exit_setuid 21290 19721 -1569 (-7.37%) do_uprobe 19967 19413 -554 (-2.77%) tw_twfw_ingress 215877 204833 -11044 (-5.12%) tw_twfw_tc_in 215877 204833 -11044 (-5.12%) But checking duration (wall clock) differences, that is the actual time taken by verifier to validate programs, we see a sometimes dramatic improvements, all the way to about 16x improvements: [vmuser@archvm bpf]$ ./veristat -C ~/subprog-precise-before-results-meta.csv ~/subprog-precise-after-results-meta.csv -e prog,duration -s duration_diff^ | head -n20 Program Duration (us) (A) Duration (us) (B) Duration (us) (DIFF) ---------------------------------------- ----------------- ----------------- -------------------- tw_twfw_ingress 4488374 272836 -4215538 (-93.92%) tw_twfw_tc_in 4339111 268175 -4070936 (-93.82%) tw_twfw_egress 3521816 270751 -3251065 (-92.31%) tw_twfw_tc_eg 3472878 284294 -3188584 (-91.81%) balancer_ingress 343119 291391 -51728 (-15.08%) syar_bind6_protect6 78992 64782 -14210 (-17.99%) ttls_tc_ingress 11739 8176 -3563 (-30.35%) kprobe__security_inode_link 13864 11341 -2523 (-18.20%) read_sync_py_stack 21927 19442 -2485 (-11.33%) read_async_py_stack 30444 28136 -2308 (-7.58%) syar_task_exit_setuid 10256 8440 -1816 (-17.71%) Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Add support precision backtracking in the presence of subprogram frames in jump history. This means supporting a few different kinds of subprogram invocation situations, all requiring a slightly different handling in precision backtracking handling logic: - static subprogram calls; - global subprogram calls; - callback-calling helpers/kfuncs. For each of those we need to handle a few precision propagation cases: - what to do with precision of subprog returns (r0); - what to do with precision of input arguments; - for all of them callee-saved registers in caller function should be propagated ignoring subprog/callback part of jump history. N.B. Async callback-calling helpers (currently only bpf_timer_set_callback()) are transparent to all this because they set a separate async callback environment and thus callback's history is not shared with main program's history. So as far as all the changes in this commit goes, such helper is just a regular helper. Let's look at all these situation in more details. Let's start with static subprogram being called, using an exxerpt of a simple main program and its static subprog, indenting subprog's frame slightly to make everything clear. frame 0 frame 1 precision set ======= ======= ============= 9: r6 = 456; 10: r1 = 123; r6 11: call pc+10; r1, r6 22: r0 = r1; r1 23: exit r0 12: r1 = <map_pointer> r0, r6 13: r1 += r0; r0, r6 14: r1 += r6; r6; 15: exit As can be seen above main function is passing 123 as single argument to an identity (`return x;`) subprog. Returned value is used to adjust map pointer offset, which forces r0 to be marked as precise. Then instruction #14 does the same for callee-saved r6, which will have to be backtracked all the way to instruction #9. For brevity, precision sets for instruction #13 and #14 are combined in the diagram above. First, for subprog calls, r0 returned from subprog (in frame 0) has to go into subprog's frame 1, and should be cleared from frame 0. So we go back into subprog's frame knowing we need to mark r0 precise. We then see that insn #22 sets r0 from r1, so now we care about marking r1 precise. When we pop up from subprog's frame back into caller at insn #11 we keep r1, as it's an argument-passing register, so we eventually find `10: r1 = 123;` and satify precision propagation chain for insn #13. This example demonstrates two sets of rules: - r0 returned after subprog call has to be moved into subprog's r0 set; - *static* subprog arguments (r1-r5) are moved back to caller precision set. Let's look at what happens with callee-saved precision propagation. Insn #14 mark r6 as precise. When we get into subprog's frame, we keep r6 in frame 0's precision set *only*. Subprog itself has its own set of independent r6-r10 registers and is not affected. When we eventually made our way out of subprog frame we keep r6 in precision set until we reach `9: r6 = 456;`, satisfying propagation. r6-r10 propagation is perhaps the simplest aspect, it always stays in its original frame. That's pretty much all we have to do to support precision propagation across *static subprog* invocation. Let's look at what happens when we have global subprog invocation. frame 0 frame 1 precision set ======= ======= ============= 9: r6 = 456; 10: r1 = 123; r6 11: call pc+10; # global subprog r6 12: r1 = <map_pointer> r0, r6 13: r1 += r0; r0, r6 14: r1 += r6; r6; 15: exit Starting from insn #13, r0 has to be precise. We backtrack all the way to insn #11 (call pc+10) and see that subprog is global, so was already validated in isolation. As opposed to static subprog, global subprog always returns unknown scalar r0, so that satisfies precision propagation and we drop r0 from precision set. We are done for insns #13. Now for insn #14. r6 is in precision set, we backtrack to `call pc+10;`. Here we need to recognize that this is effectively both exit and entry to global subprog, which means we stay in caller's frame. So we carry on with r6 still in precision set, until we satisfy it at insn #9. The only hard part with global subprogs is just knowing when it's a global func. Lastly, callback-calling helpers and kfuncs do simulate subprog calls, so jump history will have subprog instructions in between caller program's instructions, but the rules of propagating r0 and r1-r5 differ, because we don't actually directly call callback. We actually call helper/kfunc, which at runtime will call subprog, so the only difference between normal helper/kfunc handling is that we need to make sure to skip callback simulatinog part of jump history. Let's look at an example to make this clearer. frame 0 frame 1 precision set ======= ======= ============= 8: r6 = 456; 9: r1 = 123; r6 10: r2 = &callback; r6 11: call bpf_loop; r6 22: r0 = r1; 23: exit 12: r1 = <map_pointer> r0, r6 13: r1 += r0; r0, r6 14: r1 += r6; r6; 15: exit Again, insn #13 forces r0 to be precise. As soon as we get to `23: exit` we see that this isn't actually a static subprog call (it's `call bpf_loop;` helper call instead). So we clear r0 from precision set. For callee-saved register, there is no difference: it stays in frame 0's precision set, we go through insn #22 and #23, ignoring them until we get back to caller frame 0, eventually satisfying precision backtrack logic at insn #8 (`r6 = 456;`). Assuming callback needed to set r0 as precise at insn #23, we'd backtrack to insn #22, switching from r0 to r1, and then at the point when we pop back to frame 0 at insn #11, we'll clear r1-r5 from precision set, as we don't really do a subprog call directly, so there is no input argument precision propagation. That's pretty much it. With these changes, it seems like the only still unsupported situation for precision backpropagation is the case when program is accessing stack through registers other than r10. This is still left as unsupported (though rare) case for now. As for results. For selftests, few positive changes for bigger programs, cls_redirect in dynptr variant benefitting the most: [vmuser@archvm bpf]$ ./veristat -C ~/subprog-precise-before-results.csv ~/subprog-precise-after-results.csv -f @veristat.cfg -e file,prog,insns -f 'insns_diff!=0' File Program Insns (A) Insns (B) Insns (DIFF) ---------------------------------------- ------------- --------- --------- ---------------- pyperf600_bpf_loop.bpf.linked1.o on_event 2060 2002 -58 (-2.82%) test_cls_redirect_dynptr.bpf.linked1.o cls_redirect 15660 2914 -12746 (-81.39%) test_cls_redirect_subprogs.bpf.linked1.o cls_redirect 61620 59088 -2532 (-4.11%) xdp_synproxy_kern.bpf.linked1.o syncookie_tc 109980 86278 -23702 (-21.55%) xdp_synproxy_kern.bpf.linked1.o syncookie_xdp 97716 85147 -12569 (-12.86%) Cilium progress don't really regress. They don't use subprogs and are mostly unaffected, but some other fixes and improvements could have changed something. This doesn't appear to be the case: [vmuser@archvm bpf]$ ./veristat -C ~/subprog-precise-before-results-cilium.csv ~/subprog-precise-after-results-cilium.csv -e file,prog,insns -f 'insns_diff!=0' File Program Insns (A) Insns (B) Insns (DIFF) ------------- ------------------------------ --------- --------- ------------ bpf_host.o tail_nodeport_nat_ingress_ipv6 4983 5003 +20 (+0.40%) bpf_lxc.o tail_nodeport_nat_ingress_ipv6 4983 5003 +20 (+0.40%) bpf_overlay.o tail_nodeport_nat_ingress_ipv6 4983 5003 +20 (+0.40%) bpf_xdp.o tail_handle_nat_fwd_ipv6 12475 12504 +29 (+0.23%) bpf_xdp.o tail_nodeport_nat_ingress_ipv6 6363 6371 +8 (+0.13%) Looking at (somewhat anonymized) Meta production programs, we see mostly insignificant variation in number of instructions, with one program (syar_bind6_protect6) benefitting the most at -17%. [vmuser@archvm bpf]$ ./veristat -C ~/subprog-precise-before-results-fbcode.csv ~/subprog-precise-after-results-fbcode.csv -e prog,insns -f 'insns_diff!=0' Program Insns (A) Insns (B) Insns (DIFF) ------------------------ --------- --------- ---------------- on_request_context_event 597 585 -12 (-2.01%) read_async_py_stack 43789 43657 -132 (-0.30%) read_sync_py_stack 35041 37599 +2558 (+7.30%) rrm_usdt 946 940 -6 (-0.63%) sysarmor_inet6_bind 28863 28249 -614 (-2.13%) sysarmor_inet_bind 28845 28240 -605 (-2.10%) syar_bind4_protect4 154145 147640 -6505 (-4.22%) syar_bind6_protect6 165242 137088 -28154 (-17.04%) syar_task_exit_setgid 21289 19720 -1569 (-7.37%) syar_task_exit_setuid 21290 19721 -1569 (-7.37%) do_uprobe 19967 19413 -554 (-2.77%) tw_twfw_ingress 215877 204833 -11044 (-5.12%) tw_twfw_tc_in 215877 204833 -11044 (-5.12%) But checking duration (wall clock) differences, that is the actual time taken by verifier to validate programs, we see a sometimes dramatic improvements, all the way to about 16x improvements: [vmuser@archvm bpf]$ ./veristat -C ~/subprog-precise-before-results-meta.csv ~/subprog-precise-after-results-meta.csv -e prog,duration -s duration_diff^ | head -n20 Program Duration (us) (A) Duration (us) (B) Duration (us) (DIFF) ---------------------------------------- ----------------- ----------------- -------------------- tw_twfw_ingress 4488374 272836 -4215538 (-93.92%) tw_twfw_tc_in 4339111 268175 -4070936 (-93.82%) tw_twfw_egress 3521816 270751 -3251065 (-92.31%) tw_twfw_tc_eg 3472878 284294 -3188584 (-91.81%) balancer_ingress 343119 291391 -51728 (-15.08%) syar_bind6_protect6 78992 64782 -14210 (-17.99%) ttls_tc_ingress 11739 8176 -3563 (-30.35%) kprobe__security_inode_link 13864 11341 -2523 (-18.20%) read_sync_py_stack 21927 19442 -2485 (-11.33%) read_async_py_stack 30444 28136 -2308 (-7.58%) syar_task_exit_setuid 10256 8440 -1816 (-17.71%) Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Add support precision backtracking in the presence of subprogram frames in jump history. This means supporting a few different kinds of subprogram invocation situations, all requiring a slightly different handling in precision backtracking handling logic: - static subprogram calls; - global subprogram calls; - callback-calling helpers/kfuncs. For each of those we need to handle a few precision propagation cases: - what to do with precision of subprog returns (r0); - what to do with precision of input arguments; - for all of them callee-saved registers in caller function should be propagated ignoring subprog/callback part of jump history. N.B. Async callback-calling helpers (currently only bpf_timer_set_callback()) are transparent to all this because they set a separate async callback environment and thus callback's history is not shared with main program's history. So as far as all the changes in this commit goes, such helper is just a regular helper. Let's look at all these situation in more details. Let's start with static subprogram being called, using an exxerpt of a simple main program and its static subprog, indenting subprog's frame slightly to make everything clear. frame 0 frame 1 precision set ======= ======= ============= 9: r6 = 456; 10: r1 = 123; r6 11: call pc+10; r1, r6 22: r0 = r1; r1 23: exit r0 12: r1 = <map_pointer> r0, r6 13: r1 += r0; r0, r6 14: r1 += r6; r6; 15: exit As can be seen above main function is passing 123 as single argument to an identity (`return x;`) subprog. Returned value is used to adjust map pointer offset, which forces r0 to be marked as precise. Then instruction #14 does the same for callee-saved r6, which will have to be backtracked all the way to instruction #9. For brevity, precision sets for instruction #13 and #14 are combined in the diagram above. First, for subprog calls, r0 returned from subprog (in frame 0) has to go into subprog's frame 1, and should be cleared from frame 0. So we go back into subprog's frame knowing we need to mark r0 precise. We then see that insn #22 sets r0 from r1, so now we care about marking r1 precise. When we pop up from subprog's frame back into caller at insn #11 we keep r1, as it's an argument-passing register, so we eventually find `10: r1 = 123;` and satify precision propagation chain for insn #13. This example demonstrates two sets of rules: - r0 returned after subprog call has to be moved into subprog's r0 set; - *static* subprog arguments (r1-r5) are moved back to caller precision set. Let's look at what happens with callee-saved precision propagation. Insn #14 mark r6 as precise. When we get into subprog's frame, we keep r6 in frame 0's precision set *only*. Subprog itself has its own set of independent r6-r10 registers and is not affected. When we eventually made our way out of subprog frame we keep r6 in precision set until we reach `9: r6 = 456;`, satisfying propagation. r6-r10 propagation is perhaps the simplest aspect, it always stays in its original frame. That's pretty much all we have to do to support precision propagation across *static subprog* invocation. Let's look at what happens when we have global subprog invocation. frame 0 frame 1 precision set ======= ======= ============= 9: r6 = 456; 10: r1 = 123; r6 11: call pc+10; # global subprog r6 12: r1 = <map_pointer> r0, r6 13: r1 += r0; r0, r6 14: r1 += r6; r6; 15: exit Starting from insn #13, r0 has to be precise. We backtrack all the way to insn #11 (call pc+10) and see that subprog is global, so was already validated in isolation. As opposed to static subprog, global subprog always returns unknown scalar r0, so that satisfies precision propagation and we drop r0 from precision set. We are done for insns #13. Now for insn #14. r6 is in precision set, we backtrack to `call pc+10;`. Here we need to recognize that this is effectively both exit and entry to global subprog, which means we stay in caller's frame. So we carry on with r6 still in precision set, until we satisfy it at insn #9. The only hard part with global subprogs is just knowing when it's a global func. Lastly, callback-calling helpers and kfuncs do simulate subprog calls, so jump history will have subprog instructions in between caller program's instructions, but the rules of propagating r0 and r1-r5 differ, because we don't actually directly call callback. We actually call helper/kfunc, which at runtime will call subprog, so the only difference between normal helper/kfunc handling is that we need to make sure to skip callback simulatinog part of jump history. Let's look at an example to make this clearer. frame 0 frame 1 precision set ======= ======= ============= 8: r6 = 456; 9: r1 = 123; r6 10: r2 = &callback; r6 11: call bpf_loop; r6 22: r0 = r1; 23: exit 12: r1 = <map_pointer> r0, r6 13: r1 += r0; r0, r6 14: r1 += r6; r6; 15: exit Again, insn #13 forces r0 to be precise. As soon as we get to `23: exit` we see that this isn't actually a static subprog call (it's `call bpf_loop;` helper call instead). So we clear r0 from precision set. For callee-saved register, there is no difference: it stays in frame 0's precision set, we go through insn #22 and #23, ignoring them until we get back to caller frame 0, eventually satisfying precision backtrack logic at insn #8 (`r6 = 456;`). Assuming callback needed to set r0 as precise at insn #23, we'd backtrack to insn #22, switching from r0 to r1, and then at the point when we pop back to frame 0 at insn #11, we'll clear r1-r5 from precision set, as we don't really do a subprog call directly, so there is no input argument precision propagation. That's pretty much it. With these changes, it seems like the only still unsupported situation for precision backpropagation is the case when program is accessing stack through registers other than r10. This is still left as unsupported (though rare) case for now. As for results. For selftests, few positive changes for bigger programs, cls_redirect in dynptr variant benefitting the most: [vmuser@archvm bpf]$ ./veristat -C ~/subprog-precise-before-results.csv ~/subprog-precise-after-results.csv -f @veristat.cfg -e file,prog,insns -f 'insns_diff!=0' File Program Insns (A) Insns (B) Insns (DIFF) ---------------------------------------- ------------- --------- --------- ---------------- pyperf600_bpf_loop.bpf.linked1.o on_event 2060 2002 -58 (-2.82%) test_cls_redirect_dynptr.bpf.linked1.o cls_redirect 15660 2914 -12746 (-81.39%) test_cls_redirect_subprogs.bpf.linked1.o cls_redirect 61620 59088 -2532 (-4.11%) xdp_synproxy_kern.bpf.linked1.o syncookie_tc 109980 86278 -23702 (-21.55%) xdp_synproxy_kern.bpf.linked1.o syncookie_xdp 97716 85147 -12569 (-12.86%) Cilium progress don't really regress. They don't use subprogs and are mostly unaffected, but some other fixes and improvements could have changed something. This doesn't appear to be the case: [vmuser@archvm bpf]$ ./veristat -C ~/subprog-precise-before-results-cilium.csv ~/subprog-precise-after-results-cilium.csv -e file,prog,insns -f 'insns_diff!=0' File Program Insns (A) Insns (B) Insns (DIFF) ------------- ------------------------------ --------- --------- ------------ bpf_host.o tail_nodeport_nat_ingress_ipv6 4983 5003 +20 (+0.40%) bpf_lxc.o tail_nodeport_nat_ingress_ipv6 4983 5003 +20 (+0.40%) bpf_overlay.o tail_nodeport_nat_ingress_ipv6 4983 5003 +20 (+0.40%) bpf_xdp.o tail_handle_nat_fwd_ipv6 12475 12504 +29 (+0.23%) bpf_xdp.o tail_nodeport_nat_ingress_ipv6 6363 6371 +8 (+0.13%) Looking at (somewhat anonymized) Meta production programs, we see mostly insignificant variation in number of instructions, with one program (syar_bind6_protect6) benefitting the most at -17%. [vmuser@archvm bpf]$ ./veristat -C ~/subprog-precise-before-results-fbcode.csv ~/subprog-precise-after-results-fbcode.csv -e prog,insns -f 'insns_diff!=0' Program Insns (A) Insns (B) Insns (DIFF) ------------------------ --------- --------- ---------------- on_request_context_event 597 585 -12 (-2.01%) read_async_py_stack 43789 43657 -132 (-0.30%) read_sync_py_stack 35041 37599 +2558 (+7.30%) rrm_usdt 946 940 -6 (-0.63%) sysarmor_inet6_bind 28863 28249 -614 (-2.13%) sysarmor_inet_bind 28845 28240 -605 (-2.10%) syar_bind4_protect4 154145 147640 -6505 (-4.22%) syar_bind6_protect6 165242 137088 -28154 (-17.04%) syar_task_exit_setgid 21289 19720 -1569 (-7.37%) syar_task_exit_setuid 21290 19721 -1569 (-7.37%) do_uprobe 19967 19413 -554 (-2.77%) tw_twfw_ingress 215877 204833 -11044 (-5.12%) tw_twfw_tc_in 215877 204833 -11044 (-5.12%) But checking duration (wall clock) differences, that is the actual time taken by verifier to validate programs, we see a sometimes dramatic improvements, all the way to about 16x improvements: [vmuser@archvm bpf]$ ./veristat -C ~/subprog-precise-before-results-meta.csv ~/subprog-precise-after-results-meta.csv -e prog,duration -s duration_diff^ | head -n20 Program Duration (us) (A) Duration (us) (B) Duration (us) (DIFF) ---------------------------------------- ----------------- ----------------- -------------------- tw_twfw_ingress 4488374 272836 -4215538 (-93.92%) tw_twfw_tc_in 4339111 268175 -4070936 (-93.82%) tw_twfw_egress 3521816 270751 -3251065 (-92.31%) tw_twfw_tc_eg 3472878 284294 -3188584 (-91.81%) balancer_ingress 343119 291391 -51728 (-15.08%) syar_bind6_protect6 78992 64782 -14210 (-17.99%) ttls_tc_ingress 11739 8176 -3563 (-30.35%) kprobe__security_inode_link 13864 11341 -2523 (-18.20%) read_sync_py_stack 21927 19442 -2485 (-11.33%) read_async_py_stack 30444 28136 -2308 (-7.58%) syar_task_exit_setuid 10256 8440 -1816 (-17.71%) Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Add support precision backtracking in the presence of subprogram frames in jump history. This means supporting a few different kinds of subprogram invocation situations, all requiring a slightly different handling in precision backtracking handling logic: - static subprogram calls; - global subprogram calls; - callback-calling helpers/kfuncs. For each of those we need to handle a few precision propagation cases: - what to do with precision of subprog returns (r0); - what to do with precision of input arguments; - for all of them callee-saved registers in caller function should be propagated ignoring subprog/callback part of jump history. N.B. Async callback-calling helpers (currently only bpf_timer_set_callback()) are transparent to all this because they set a separate async callback environment and thus callback's history is not shared with main program's history. So as far as all the changes in this commit goes, such helper is just a regular helper. Let's look at all these situation in more details. Let's start with static subprogram being called, using an exxerpt of a simple main program and its static subprog, indenting subprog's frame slightly to make everything clear. frame 0 frame 1 precision set ======= ======= ============= 9: r6 = 456; 10: r1 = 123; r6 11: call pc+10; r1, r6 22: r0 = r1; r1 23: exit r0 12: r1 = <map_pointer> r0, r6 13: r1 += r0; r0, r6 14: r1 += r6; r6; 15: exit As can be seen above main function is passing 123 as single argument to an identity (`return x;`) subprog. Returned value is used to adjust map pointer offset, which forces r0 to be marked as precise. Then instruction #14 does the same for callee-saved r6, which will have to be backtracked all the way to instruction #9. For brevity, precision sets for instruction #13 and #14 are combined in the diagram above. First, for subprog calls, r0 returned from subprog (in frame 0) has to go into subprog's frame 1, and should be cleared from frame 0. So we go back into subprog's frame knowing we need to mark r0 precise. We then see that insn #22 sets r0 from r1, so now we care about marking r1 precise. When we pop up from subprog's frame back into caller at insn #11 we keep r1, as it's an argument-passing register, so we eventually find `10: r1 = 123;` and satify precision propagation chain for insn #13. This example demonstrates two sets of rules: - r0 returned after subprog call has to be moved into subprog's r0 set; - *static* subprog arguments (r1-r5) are moved back to caller precision set. Let's look at what happens with callee-saved precision propagation. Insn #14 mark r6 as precise. When we get into subprog's frame, we keep r6 in frame 0's precision set *only*. Subprog itself has its own set of independent r6-r10 registers and is not affected. When we eventually made our way out of subprog frame we keep r6 in precision set until we reach `9: r6 = 456;`, satisfying propagation. r6-r10 propagation is perhaps the simplest aspect, it always stays in its original frame. That's pretty much all we have to do to support precision propagation across *static subprog* invocation. Let's look at what happens when we have global subprog invocation. frame 0 frame 1 precision set ======= ======= ============= 9: r6 = 456; 10: r1 = 123; r6 11: call pc+10; # global subprog r6 12: r1 = <map_pointer> r0, r6 13: r1 += r0; r0, r6 14: r1 += r6; r6; 15: exit Starting from insn #13, r0 has to be precise. We backtrack all the way to insn #11 (call pc+10) and see that subprog is global, so was already validated in isolation. As opposed to static subprog, global subprog always returns unknown scalar r0, so that satisfies precision propagation and we drop r0 from precision set. We are done for insns #13. Now for insn #14. r6 is in precision set, we backtrack to `call pc+10;`. Here we need to recognize that this is effectively both exit and entry to global subprog, which means we stay in caller's frame. So we carry on with r6 still in precision set, until we satisfy it at insn #9. The only hard part with global subprogs is just knowing when it's a global func. Lastly, callback-calling helpers and kfuncs do simulate subprog calls, so jump history will have subprog instructions in between caller program's instructions, but the rules of propagating r0 and r1-r5 differ, because we don't actually directly call callback. We actually call helper/kfunc, which at runtime will call subprog, so the only difference between normal helper/kfunc handling is that we need to make sure to skip callback simulatinog part of jump history. Let's look at an example to make this clearer. frame 0 frame 1 precision set ======= ======= ============= 8: r6 = 456; 9: r1 = 123; r6 10: r2 = &callback; r6 11: call bpf_loop; r6 22: r0 = r1; 23: exit 12: r1 = <map_pointer> r0, r6 13: r1 += r0; r0, r6 14: r1 += r6; r6; 15: exit Again, insn #13 forces r0 to be precise. As soon as we get to `23: exit` we see that this isn't actually a static subprog call (it's `call bpf_loop;` helper call instead). So we clear r0 from precision set. For callee-saved register, there is no difference: it stays in frame 0's precision set, we go through insn #22 and #23, ignoring them until we get back to caller frame 0, eventually satisfying precision backtrack logic at insn #8 (`r6 = 456;`). Assuming callback needed to set r0 as precise at insn #23, we'd backtrack to insn #22, switching from r0 to r1, and then at the point when we pop back to frame 0 at insn #11, we'll clear r1-r5 from precision set, as we don't really do a subprog call directly, so there is no input argument precision propagation. That's pretty much it. With these changes, it seems like the only still unsupported situation for precision backpropagation is the case when program is accessing stack through registers other than r10. This is still left as unsupported (though rare) case for now. As for results. For selftests, few positive changes for bigger programs, cls_redirect in dynptr variant benefitting the most: [vmuser@archvm bpf]$ ./veristat -C ~/subprog-precise-before-results.csv ~/subprog-precise-after-results.csv -f @veristat.cfg -e file,prog,insns -f 'insns_diff!=0' File Program Insns (A) Insns (B) Insns (DIFF) ---------------------------------------- ------------- --------- --------- ---------------- pyperf600_bpf_loop.bpf.linked1.o on_event 2060 2002 -58 (-2.82%) test_cls_redirect_dynptr.bpf.linked1.o cls_redirect 15660 2914 -12746 (-81.39%) test_cls_redirect_subprogs.bpf.linked1.o cls_redirect 61620 59088 -2532 (-4.11%) xdp_synproxy_kern.bpf.linked1.o syncookie_tc 109980 86278 -23702 (-21.55%) xdp_synproxy_kern.bpf.linked1.o syncookie_xdp 97716 85147 -12569 (-12.86%) Cilium progress don't really regress. They don't use subprogs and are mostly unaffected, but some other fixes and improvements could have changed something. This doesn't appear to be the case: [vmuser@archvm bpf]$ ./veristat -C ~/subprog-precise-before-results-cilium.csv ~/subprog-precise-after-results-cilium.csv -e file,prog,insns -f 'insns_diff!=0' File Program Insns (A) Insns (B) Insns (DIFF) ------------- ------------------------------ --------- --------- ------------ bpf_host.o tail_nodeport_nat_ingress_ipv6 4983 5003 +20 (+0.40%) bpf_lxc.o tail_nodeport_nat_ingress_ipv6 4983 5003 +20 (+0.40%) bpf_overlay.o tail_nodeport_nat_ingress_ipv6 4983 5003 +20 (+0.40%) bpf_xdp.o tail_handle_nat_fwd_ipv6 12475 12504 +29 (+0.23%) bpf_xdp.o tail_nodeport_nat_ingress_ipv6 6363 6371 +8 (+0.13%) Looking at (somewhat anonymized) Meta production programs, we see mostly insignificant variation in number of instructions, with one program (syar_bind6_protect6) benefitting the most at -17%. [vmuser@archvm bpf]$ ./veristat -C ~/subprog-precise-before-results-fbcode.csv ~/subprog-precise-after-results-fbcode.csv -e prog,insns -f 'insns_diff!=0' Program Insns (A) Insns (B) Insns (DIFF) ------------------------ --------- --------- ---------------- on_request_context_event 597 585 -12 (-2.01%) read_async_py_stack 43789 43657 -132 (-0.30%) read_sync_py_stack 35041 37599 +2558 (+7.30%) rrm_usdt 946 940 -6 (-0.63%) sysarmor_inet6_bind 28863 28249 -614 (-2.13%) sysarmor_inet_bind 28845 28240 -605 (-2.10%) syar_bind4_protect4 154145 147640 -6505 (-4.22%) syar_bind6_protect6 165242 137088 -28154 (-17.04%) syar_task_exit_setgid 21289 19720 -1569 (-7.37%) syar_task_exit_setuid 21290 19721 -1569 (-7.37%) do_uprobe 19967 19413 -554 (-2.77%) tw_twfw_ingress 215877 204833 -11044 (-5.12%) tw_twfw_tc_in 215877 204833 -11044 (-5.12%) But checking duration (wall clock) differences, that is the actual time taken by verifier to validate programs, we see a sometimes dramatic improvements, all the way to about 16x improvements: [vmuser@archvm bpf]$ ./veristat -C ~/subprog-precise-before-results-meta.csv ~/subprog-precise-after-results-meta.csv -e prog,duration -s duration_diff^ | head -n20 Program Duration (us) (A) Duration (us) (B) Duration (us) (DIFF) ---------------------------------------- ----------------- ----------------- -------------------- tw_twfw_ingress 4488374 272836 -4215538 (-93.92%) tw_twfw_tc_in 4339111 268175 -4070936 (-93.82%) tw_twfw_egress 3521816 270751 -3251065 (-92.31%) tw_twfw_tc_eg 3472878 284294 -3188584 (-91.81%) balancer_ingress 343119 291391 -51728 (-15.08%) syar_bind6_protect6 78992 64782 -14210 (-17.99%) ttls_tc_ingress 11739 8176 -3563 (-30.35%) kprobe__security_inode_link 13864 11341 -2523 (-18.20%) read_sync_py_stack 21927 19442 -2485 (-11.33%) read_async_py_stack 30444 28136 -2308 (-7.58%) syar_task_exit_setuid 10256 8440 -1816 (-17.71%) Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Add support precision backtracking in the presence of subprogram frames in jump history. This means supporting a few different kinds of subprogram invocation situations, all requiring a slightly different handling in precision backtracking handling logic: - static subprogram calls; - global subprogram calls; - callback-calling helpers/kfuncs. For each of those we need to handle a few precision propagation cases: - what to do with precision of subprog returns (r0); - what to do with precision of input arguments; - for all of them callee-saved registers in caller function should be propagated ignoring subprog/callback part of jump history. N.B. Async callback-calling helpers (currently only bpf_timer_set_callback()) are transparent to all this because they set a separate async callback environment and thus callback's history is not shared with main program's history. So as far as all the changes in this commit goes, such helper is just a regular helper. Let's look at all these situation in more details. Let's start with static subprogram being called, using an exxerpt of a simple main program and its static subprog, indenting subprog's frame slightly to make everything clear. frame 0 frame 1 precision set ======= ======= ============= 9: r6 = 456; 10: r1 = 123; r6 11: call pc+10; r1, r6 22: r0 = r1; r1 23: exit r0 12: r1 = <map_pointer> r0, r6 13: r1 += r0; r0, r6 14: r1 += r6; r6; 15: exit As can be seen above main function is passing 123 as single argument to an identity (`return x;`) subprog. Returned value is used to adjust map pointer offset, which forces r0 to be marked as precise. Then instruction #14 does the same for callee-saved r6, which will have to be backtracked all the way to instruction #9. For brevity, precision sets for instruction #13 and #14 are combined in the diagram above. First, for subprog calls, r0 returned from subprog (in frame 0) has to go into subprog's frame 1, and should be cleared from frame 0. So we go back into subprog's frame knowing we need to mark r0 precise. We then see that insn #22 sets r0 from r1, so now we care about marking r1 precise. When we pop up from subprog's frame back into caller at insn #11 we keep r1, as it's an argument-passing register, so we eventually find `10: r1 = 123;` and satify precision propagation chain for insn #13. This example demonstrates two sets of rules: - r0 returned after subprog call has to be moved into subprog's r0 set; - *static* subprog arguments (r1-r5) are moved back to caller precision set. Let's look at what happens with callee-saved precision propagation. Insn #14 mark r6 as precise. When we get into subprog's frame, we keep r6 in frame 0's precision set *only*. Subprog itself has its own set of independent r6-r10 registers and is not affected. When we eventually made our way out of subprog frame we keep r6 in precision set until we reach `9: r6 = 456;`, satisfying propagation. r6-r10 propagation is perhaps the simplest aspect, it always stays in its original frame. That's pretty much all we have to do to support precision propagation across *static subprog* invocation. Let's look at what happens when we have global subprog invocation. frame 0 frame 1 precision set ======= ======= ============= 9: r6 = 456; 10: r1 = 123; r6 11: call pc+10; # global subprog r6 12: r1 = <map_pointer> r0, r6 13: r1 += r0; r0, r6 14: r1 += r6; r6; 15: exit Starting from insn #13, r0 has to be precise. We backtrack all the way to insn #11 (call pc+10) and see that subprog is global, so was already validated in isolation. As opposed to static subprog, global subprog always returns unknown scalar r0, so that satisfies precision propagation and we drop r0 from precision set. We are done for insns #13. Now for insn #14. r6 is in precision set, we backtrack to `call pc+10;`. Here we need to recognize that this is effectively both exit and entry to global subprog, which means we stay in caller's frame. So we carry on with r6 still in precision set, until we satisfy it at insn #9. The only hard part with global subprogs is just knowing when it's a global func. Lastly, callback-calling helpers and kfuncs do simulate subprog calls, so jump history will have subprog instructions in between caller program's instructions, but the rules of propagating r0 and r1-r5 differ, because we don't actually directly call callback. We actually call helper/kfunc, which at runtime will call subprog, so the only difference between normal helper/kfunc handling is that we need to make sure to skip callback simulatinog part of jump history. Let's look at an example to make this clearer. frame 0 frame 1 precision set ======= ======= ============= 8: r6 = 456; 9: r1 = 123; r6 10: r2 = &callback; r6 11: call bpf_loop; r6 22: r0 = r1; 23: exit 12: r1 = <map_pointer> r0, r6 13: r1 += r0; r0, r6 14: r1 += r6; r6; 15: exit Again, insn #13 forces r0 to be precise. As soon as we get to `23: exit` we see that this isn't actually a static subprog call (it's `call bpf_loop;` helper call instead). So we clear r0 from precision set. For callee-saved register, there is no difference: it stays in frame 0's precision set, we go through insn #22 and #23, ignoring them until we get back to caller frame 0, eventually satisfying precision backtrack logic at insn #8 (`r6 = 456;`). Assuming callback needed to set r0 as precise at insn #23, we'd backtrack to insn #22, switching from r0 to r1, and then at the point when we pop back to frame 0 at insn #11, we'll clear r1-r5 from precision set, as we don't really do a subprog call directly, so there is no input argument precision propagation. That's pretty much it. With these changes, it seems like the only still unsupported situation for precision backpropagation is the case when program is accessing stack through registers other than r10. This is still left as unsupported (though rare) case for now. As for results. For selftests, few positive changes for bigger programs, cls_redirect in dynptr variant benefitting the most: [vmuser@archvm bpf]$ ./veristat -C ~/subprog-precise-before-results.csv ~/subprog-precise-after-results.csv -f @veristat.cfg -e file,prog,insns -f 'insns_diff!=0' File Program Insns (A) Insns (B) Insns (DIFF) ---------------------------------------- ------------- --------- --------- ---------------- pyperf600_bpf_loop.bpf.linked1.o on_event 2060 2002 -58 (-2.82%) test_cls_redirect_dynptr.bpf.linked1.o cls_redirect 15660 2914 -12746 (-81.39%) test_cls_redirect_subprogs.bpf.linked1.o cls_redirect 61620 59088 -2532 (-4.11%) xdp_synproxy_kern.bpf.linked1.o syncookie_tc 109980 86278 -23702 (-21.55%) xdp_synproxy_kern.bpf.linked1.o syncookie_xdp 97716 85147 -12569 (-12.86%) Cilium progress don't really regress. They don't use subprogs and are mostly unaffected, but some other fixes and improvements could have changed something. This doesn't appear to be the case: [vmuser@archvm bpf]$ ./veristat -C ~/subprog-precise-before-results-cilium.csv ~/subprog-precise-after-results-cilium.csv -e file,prog,insns -f 'insns_diff!=0' File Program Insns (A) Insns (B) Insns (DIFF) ------------- ------------------------------ --------- --------- ------------ bpf_host.o tail_nodeport_nat_ingress_ipv6 4983 5003 +20 (+0.40%) bpf_lxc.o tail_nodeport_nat_ingress_ipv6 4983 5003 +20 (+0.40%) bpf_overlay.o tail_nodeport_nat_ingress_ipv6 4983 5003 +20 (+0.40%) bpf_xdp.o tail_handle_nat_fwd_ipv6 12475 12504 +29 (+0.23%) bpf_xdp.o tail_nodeport_nat_ingress_ipv6 6363 6371 +8 (+0.13%) Looking at (somewhat anonymized) Meta production programs, we see mostly insignificant variation in number of instructions, with one program (syar_bind6_protect6) benefitting the most at -17%. [vmuser@archvm bpf]$ ./veristat -C ~/subprog-precise-before-results-fbcode.csv ~/subprog-precise-after-results-fbcode.csv -e prog,insns -f 'insns_diff!=0' Program Insns (A) Insns (B) Insns (DIFF) ------------------------ --------- --------- ---------------- on_request_context_event 597 585 -12 (-2.01%) read_async_py_stack 43789 43657 -132 (-0.30%) read_sync_py_stack 35041 37599 +2558 (+7.30%) rrm_usdt 946 940 -6 (-0.63%) sysarmor_inet6_bind 28863 28249 -614 (-2.13%) sysarmor_inet_bind 28845 28240 -605 (-2.10%) syar_bind4_protect4 154145 147640 -6505 (-4.22%) syar_bind6_protect6 165242 137088 -28154 (-17.04%) syar_task_exit_setgid 21289 19720 -1569 (-7.37%) syar_task_exit_setuid 21290 19721 -1569 (-7.37%) do_uprobe 19967 19413 -554 (-2.77%) tw_twfw_ingress 215877 204833 -11044 (-5.12%) tw_twfw_tc_in 215877 204833 -11044 (-5.12%) But checking duration (wall clock) differences, that is the actual time taken by verifier to validate programs, we see a sometimes dramatic improvements, all the way to about 16x improvements: [vmuser@archvm bpf]$ ./veristat -C ~/subprog-precise-before-results-meta.csv ~/subprog-precise-after-results-meta.csv -e prog,duration -s duration_diff^ | head -n20 Program Duration (us) (A) Duration (us) (B) Duration (us) (DIFF) ---------------------------------------- ----------------- ----------------- -------------------- tw_twfw_ingress 4488374 272836 -4215538 (-93.92%) tw_twfw_tc_in 4339111 268175 -4070936 (-93.82%) tw_twfw_egress 3521816 270751 -3251065 (-92.31%) tw_twfw_tc_eg 3472878 284294 -3188584 (-91.81%) balancer_ingress 343119 291391 -51728 (-15.08%) syar_bind6_protect6 78992 64782 -14210 (-17.99%) ttls_tc_ingress 11739 8176 -3563 (-30.35%) kprobe__security_inode_link 13864 11341 -2523 (-18.20%) read_sync_py_stack 21927 19442 -2485 (-11.33%) read_async_py_stack 30444 28136 -2308 (-7.58%) syar_task_exit_setuid 10256 8440 -1816 (-17.71%) Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Add support precision backtracking in the presence of subprogram frames in jump history. This means supporting a few different kinds of subprogram invocation situations, all requiring a slightly different handling in precision backtracking handling logic: - static subprogram calls; - global subprogram calls; - callback-calling helpers/kfuncs. For each of those we need to handle a few precision propagation cases: - what to do with precision of subprog returns (r0); - what to do with precision of input arguments; - for all of them callee-saved registers in caller function should be propagated ignoring subprog/callback part of jump history. N.B. Async callback-calling helpers (currently only bpf_timer_set_callback()) are transparent to all this because they set a separate async callback environment and thus callback's history is not shared with main program's history. So as far as all the changes in this commit goes, such helper is just a regular helper. Let's look at all these situation in more details. Let's start with static subprogram being called, using an exxerpt of a simple main program and its static subprog, indenting subprog's frame slightly to make everything clear. frame 0 frame 1 precision set ======= ======= ============= 9: r6 = 456; 10: r1 = 123; r6 11: call pc+10; r1, r6 22: r0 = r1; r1 23: exit r0 12: r1 = <map_pointer> r0, r6 13: r1 += r0; r0, r6 14: r1 += r6; r6; 15: exit As can be seen above main function is passing 123 as single argument to an identity (`return x;`) subprog. Returned value is used to adjust map pointer offset, which forces r0 to be marked as precise. Then instruction #14 does the same for callee-saved r6, which will have to be backtracked all the way to instruction #9. For brevity, precision sets for instruction #13 and #14 are combined in the diagram above. First, for subprog calls, r0 returned from subprog (in frame 0) has to go into subprog's frame 1, and should be cleared from frame 0. So we go back into subprog's frame knowing we need to mark r0 precise. We then see that insn #22 sets r0 from r1, so now we care about marking r1 precise. When we pop up from subprog's frame back into caller at insn #11 we keep r1, as it's an argument-passing register, so we eventually find `10: r1 = 123;` and satify precision propagation chain for insn #13. This example demonstrates two sets of rules: - r0 returned after subprog call has to be moved into subprog's r0 set; - *static* subprog arguments (r1-r5) are moved back to caller precision set. Let's look at what happens with callee-saved precision propagation. Insn #14 mark r6 as precise. When we get into subprog's frame, we keep r6 in frame 0's precision set *only*. Subprog itself has its own set of independent r6-r10 registers and is not affected. When we eventually made our way out of subprog frame we keep r6 in precision set until we reach `9: r6 = 456;`, satisfying propagation. r6-r10 propagation is perhaps the simplest aspect, it always stays in its original frame. That's pretty much all we have to do to support precision propagation across *static subprog* invocation. Let's look at what happens when we have global subprog invocation. frame 0 frame 1 precision set ======= ======= ============= 9: r6 = 456; 10: r1 = 123; r6 11: call pc+10; # global subprog r6 12: r1 = <map_pointer> r0, r6 13: r1 += r0; r0, r6 14: r1 += r6; r6; 15: exit Starting from insn #13, r0 has to be precise. We backtrack all the way to insn #11 (call pc+10) and see that subprog is global, so was already validated in isolation. As opposed to static subprog, global subprog always returns unknown scalar r0, so that satisfies precision propagation and we drop r0 from precision set. We are done for insns #13. Now for insn #14. r6 is in precision set, we backtrack to `call pc+10;`. Here we need to recognize that this is effectively both exit and entry to global subprog, which means we stay in caller's frame. So we carry on with r6 still in precision set, until we satisfy it at insn #9. The only hard part with global subprogs is just knowing when it's a global func. Lastly, callback-calling helpers and kfuncs do simulate subprog calls, so jump history will have subprog instructions in between caller program's instructions, but the rules of propagating r0 and r1-r5 differ, because we don't actually directly call callback. We actually call helper/kfunc, which at runtime will call subprog, so the only difference between normal helper/kfunc handling is that we need to make sure to skip callback simulatinog part of jump history. Let's look at an example to make this clearer. frame 0 frame 1 precision set ======= ======= ============= 8: r6 = 456; 9: r1 = 123; r6 10: r2 = &callback; r6 11: call bpf_loop; r6 22: r0 = r1; 23: exit 12: r1 = <map_pointer> r0, r6 13: r1 += r0; r0, r6 14: r1 += r6; r6; 15: exit Again, insn #13 forces r0 to be precise. As soon as we get to `23: exit` we see that this isn't actually a static subprog call (it's `call bpf_loop;` helper call instead). So we clear r0 from precision set. For callee-saved register, there is no difference: it stays in frame 0's precision set, we go through insn #22 and #23, ignoring them until we get back to caller frame 0, eventually satisfying precision backtrack logic at insn #8 (`r6 = 456;`). Assuming callback needed to set r0 as precise at insn #23, we'd backtrack to insn #22, switching from r0 to r1, and then at the point when we pop back to frame 0 at insn #11, we'll clear r1-r5 from precision set, as we don't really do a subprog call directly, so there is no input argument precision propagation. That's pretty much it. With these changes, it seems like the only still unsupported situation for precision backpropagation is the case when program is accessing stack through registers other than r10. This is still left as unsupported (though rare) case for now. As for results. For selftests, few positive changes for bigger programs, cls_redirect in dynptr variant benefitting the most: [vmuser@archvm bpf]$ ./veristat -C ~/subprog-precise-before-results.csv ~/subprog-precise-after-results.csv -f @veristat.cfg -e file,prog,insns -f 'insns_diff!=0' File Program Insns (A) Insns (B) Insns (DIFF) ---------------------------------------- ------------- --------- --------- ---------------- pyperf600_bpf_loop.bpf.linked1.o on_event 2060 2002 -58 (-2.82%) test_cls_redirect_dynptr.bpf.linked1.o cls_redirect 15660 2914 -12746 (-81.39%) test_cls_redirect_subprogs.bpf.linked1.o cls_redirect 61620 59088 -2532 (-4.11%) xdp_synproxy_kern.bpf.linked1.o syncookie_tc 109980 86278 -23702 (-21.55%) xdp_synproxy_kern.bpf.linked1.o syncookie_xdp 97716 85147 -12569 (-12.86%) Cilium progress don't really regress. They don't use subprogs and are mostly unaffected, but some other fixes and improvements could have changed something. This doesn't appear to be the case: [vmuser@archvm bpf]$ ./veristat -C ~/subprog-precise-before-results-cilium.csv ~/subprog-precise-after-results-cilium.csv -e file,prog,insns -f 'insns_diff!=0' File Program Insns (A) Insns (B) Insns (DIFF) ------------- ------------------------------ --------- --------- ------------ bpf_host.o tail_nodeport_nat_ingress_ipv6 4983 5003 +20 (+0.40%) bpf_lxc.o tail_nodeport_nat_ingress_ipv6 4983 5003 +20 (+0.40%) bpf_overlay.o tail_nodeport_nat_ingress_ipv6 4983 5003 +20 (+0.40%) bpf_xdp.o tail_handle_nat_fwd_ipv6 12475 12504 +29 (+0.23%) bpf_xdp.o tail_nodeport_nat_ingress_ipv6 6363 6371 +8 (+0.13%) Looking at (somewhat anonymized) Meta production programs, we see mostly insignificant variation in number of instructions, with one program (syar_bind6_protect6) benefitting the most at -17%. [vmuser@archvm bpf]$ ./veristat -C ~/subprog-precise-before-results-fbcode.csv ~/subprog-precise-after-results-fbcode.csv -e prog,insns -f 'insns_diff!=0' Program Insns (A) Insns (B) Insns (DIFF) ------------------------ --------- --------- ---------------- on_request_context_event 597 585 -12 (-2.01%) read_async_py_stack 43789 43657 -132 (-0.30%) read_sync_py_stack 35041 37599 +2558 (+7.30%) rrm_usdt 946 940 -6 (-0.63%) sysarmor_inet6_bind 28863 28249 -614 (-2.13%) sysarmor_inet_bind 28845 28240 -605 (-2.10%) syar_bind4_protect4 154145 147640 -6505 (-4.22%) syar_bind6_protect6 165242 137088 -28154 (-17.04%) syar_task_exit_setgid 21289 19720 -1569 (-7.37%) syar_task_exit_setuid 21290 19721 -1569 (-7.37%) do_uprobe 19967 19413 -554 (-2.77%) tw_twfw_ingress 215877 204833 -11044 (-5.12%) tw_twfw_tc_in 215877 204833 -11044 (-5.12%) But checking duration (wall clock) differences, that is the actual time taken by verifier to validate programs, we see a sometimes dramatic improvements, all the way to about 16x improvements: [vmuser@archvm bpf]$ ./veristat -C ~/subprog-precise-before-results-meta.csv ~/subprog-precise-after-results-meta.csv -e prog,duration -s duration_diff^ | head -n20 Program Duration (us) (A) Duration (us) (B) Duration (us) (DIFF) ---------------------------------------- ----------------- ----------------- -------------------- tw_twfw_ingress 4488374 272836 -4215538 (-93.92%) tw_twfw_tc_in 4339111 268175 -4070936 (-93.82%) tw_twfw_egress 3521816 270751 -3251065 (-92.31%) tw_twfw_tc_eg 3472878 284294 -3188584 (-91.81%) balancer_ingress 343119 291391 -51728 (-15.08%) syar_bind6_protect6 78992 64782 -14210 (-17.99%) ttls_tc_ingress 11739 8176 -3563 (-30.35%) kprobe__security_inode_link 13864 11341 -2523 (-18.20%) read_sync_py_stack 21927 19442 -2485 (-11.33%) read_async_py_stack 30444 28136 -2308 (-7.58%) syar_task_exit_setuid 10256 8440 -1816 (-17.71%) Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Add support precision backtracking in the presence of subprogram frames in jump history. This means supporting a few different kinds of subprogram invocation situations, all requiring a slightly different handling in precision backtracking handling logic: - static subprogram calls; - global subprogram calls; - callback-calling helpers/kfuncs. For each of those we need to handle a few precision propagation cases: - what to do with precision of subprog returns (r0); - what to do with precision of input arguments; - for all of them callee-saved registers in caller function should be propagated ignoring subprog/callback part of jump history. N.B. Async callback-calling helpers (currently only bpf_timer_set_callback()) are transparent to all this because they set a separate async callback environment and thus callback's history is not shared with main program's history. So as far as all the changes in this commit goes, such helper is just a regular helper. Let's look at all these situation in more details. Let's start with static subprogram being called, using an exxerpt of a simple main program and its static subprog, indenting subprog's frame slightly to make everything clear. frame 0 frame 1 precision set ======= ======= ============= 9: r6 = 456; 10: r1 = 123; r6 11: call pc+10; r1, r6 22: r0 = r1; r1 23: exit r0 12: r1 = <map_pointer> r0, r6 13: r1 += r0; r0, r6 14: r1 += r6; r6; 15: exit As can be seen above main function is passing 123 as single argument to an identity (`return x;`) subprog. Returned value is used to adjust map pointer offset, which forces r0 to be marked as precise. Then instruction #14 does the same for callee-saved r6, which will have to be backtracked all the way to instruction #9. For brevity, precision sets for instruction #13 and #14 are combined in the diagram above. First, for subprog calls, r0 returned from subprog (in frame 0) has to go into subprog's frame 1, and should be cleared from frame 0. So we go back into subprog's frame knowing we need to mark r0 precise. We then see that insn #22 sets r0 from r1, so now we care about marking r1 precise. When we pop up from subprog's frame back into caller at insn #11 we keep r1, as it's an argument-passing register, so we eventually find `10: r1 = 123;` and satify precision propagation chain for insn #13. This example demonstrates two sets of rules: - r0 returned after subprog call has to be moved into subprog's r0 set; - *static* subprog arguments (r1-r5) are moved back to caller precision set. Let's look at what happens with callee-saved precision propagation. Insn #14 mark r6 as precise. When we get into subprog's frame, we keep r6 in frame 0's precision set *only*. Subprog itself has its own set of independent r6-r10 registers and is not affected. When we eventually made our way out of subprog frame we keep r6 in precision set until we reach `9: r6 = 456;`, satisfying propagation. r6-r10 propagation is perhaps the simplest aspect, it always stays in its original frame. That's pretty much all we have to do to support precision propagation across *static subprog* invocation. Let's look at what happens when we have global subprog invocation. frame 0 frame 1 precision set ======= ======= ============= 9: r6 = 456; 10: r1 = 123; r6 11: call pc+10; # global subprog r6 12: r1 = <map_pointer> r0, r6 13: r1 += r0; r0, r6 14: r1 += r6; r6; 15: exit Starting from insn #13, r0 has to be precise. We backtrack all the way to insn #11 (call pc+10) and see that subprog is global, so was already validated in isolation. As opposed to static subprog, global subprog always returns unknown scalar r0, so that satisfies precision propagation and we drop r0 from precision set. We are done for insns #13. Now for insn #14. r6 is in precision set, we backtrack to `call pc+10;`. Here we need to recognize that this is effectively both exit and entry to global subprog, which means we stay in caller's frame. So we carry on with r6 still in precision set, until we satisfy it at insn #9. The only hard part with global subprogs is just knowing when it's a global func. Lastly, callback-calling helpers and kfuncs do simulate subprog calls, so jump history will have subprog instructions in between caller program's instructions, but the rules of propagating r0 and r1-r5 differ, because we don't actually directly call callback. We actually call helper/kfunc, which at runtime will call subprog, so the only difference between normal helper/kfunc handling is that we need to make sure to skip callback simulatinog part of jump history. Let's look at an example to make this clearer. frame 0 frame 1 precision set ======= ======= ============= 8: r6 = 456; 9: r1 = 123; r6 10: r2 = &callback; r6 11: call bpf_loop; r6 22: r0 = r1; 23: exit 12: r1 = <map_pointer> r0, r6 13: r1 += r0; r0, r6 14: r1 += r6; r6; 15: exit Again, insn #13 forces r0 to be precise. As soon as we get to `23: exit` we see that this isn't actually a static subprog call (it's `call bpf_loop;` helper call instead). So we clear r0 from precision set. For callee-saved register, there is no difference: it stays in frame 0's precision set, we go through insn #22 and #23, ignoring them until we get back to caller frame 0, eventually satisfying precision backtrack logic at insn #8 (`r6 = 456;`). Assuming callback needed to set r0 as precise at insn #23, we'd backtrack to insn #22, switching from r0 to r1, and then at the point when we pop back to frame 0 at insn #11, we'll clear r1-r5 from precision set, as we don't really do a subprog call directly, so there is no input argument precision propagation. That's pretty much it. With these changes, it seems like the only still unsupported situation for precision backpropagation is the case when program is accessing stack through registers other than r10. This is still left as unsupported (though rare) case for now. As for results. For selftests, few positive changes for bigger programs, cls_redirect in dynptr variant benefitting the most: [vmuser@archvm bpf]$ ./veristat -C ~/subprog-precise-before-results.csv ~/subprog-precise-after-results.csv -f @veristat.cfg -e file,prog,insns -f 'insns_diff!=0' File Program Insns (A) Insns (B) Insns (DIFF) ---------------------------------------- ------------- --------- --------- ---------------- pyperf600_bpf_loop.bpf.linked1.o on_event 2060 2002 -58 (-2.82%) test_cls_redirect_dynptr.bpf.linked1.o cls_redirect 15660 2914 -12746 (-81.39%) test_cls_redirect_subprogs.bpf.linked1.o cls_redirect 61620 59088 -2532 (-4.11%) xdp_synproxy_kern.bpf.linked1.o syncookie_tc 109980 86278 -23702 (-21.55%) xdp_synproxy_kern.bpf.linked1.o syncookie_xdp 97716 85147 -12569 (-12.86%) Cilium progress don't really regress. They don't use subprogs and are mostly unaffected, but some other fixes and improvements could have changed something. This doesn't appear to be the case: [vmuser@archvm bpf]$ ./veristat -C ~/subprog-precise-before-results-cilium.csv ~/subprog-precise-after-results-cilium.csv -e file,prog,insns -f 'insns_diff!=0' File Program Insns (A) Insns (B) Insns (DIFF) ------------- ------------------------------ --------- --------- ------------ bpf_host.o tail_nodeport_nat_ingress_ipv6 4983 5003 +20 (+0.40%) bpf_lxc.o tail_nodeport_nat_ingress_ipv6 4983 5003 +20 (+0.40%) bpf_overlay.o tail_nodeport_nat_ingress_ipv6 4983 5003 +20 (+0.40%) bpf_xdp.o tail_handle_nat_fwd_ipv6 12475 12504 +29 (+0.23%) bpf_xdp.o tail_nodeport_nat_ingress_ipv6 6363 6371 +8 (+0.13%) Looking at (somewhat anonymized) Meta production programs, we see mostly insignificant variation in number of instructions, with one program (syar_bind6_protect6) benefitting the most at -17%. [vmuser@archvm bpf]$ ./veristat -C ~/subprog-precise-before-results-fbcode.csv ~/subprog-precise-after-results-fbcode.csv -e prog,insns -f 'insns_diff!=0' Program Insns (A) Insns (B) Insns (DIFF) ------------------------ --------- --------- ---------------- on_request_context_event 597 585 -12 (-2.01%) read_async_py_stack 43789 43657 -132 (-0.30%) read_sync_py_stack 35041 37599 +2558 (+7.30%) rrm_usdt 946 940 -6 (-0.63%) sysarmor_inet6_bind 28863 28249 -614 (-2.13%) sysarmor_inet_bind 28845 28240 -605 (-2.10%) syar_bind4_protect4 154145 147640 -6505 (-4.22%) syar_bind6_protect6 165242 137088 -28154 (-17.04%) syar_task_exit_setgid 21289 19720 -1569 (-7.37%) syar_task_exit_setuid 21290 19721 -1569 (-7.37%) do_uprobe 19967 19413 -554 (-2.77%) tw_twfw_ingress 215877 204833 -11044 (-5.12%) tw_twfw_tc_in 215877 204833 -11044 (-5.12%) But checking duration (wall clock) differences, that is the actual time taken by verifier to validate programs, we see a sometimes dramatic improvements, all the way to about 16x improvements: [vmuser@archvm bpf]$ ./veristat -C ~/subprog-precise-before-results-meta.csv ~/subprog-precise-after-results-meta.csv -e prog,duration -s duration_diff^ | head -n20 Program Duration (us) (A) Duration (us) (B) Duration (us) (DIFF) ---------------------------------------- ----------------- ----------------- -------------------- tw_twfw_ingress 4488374 272836 -4215538 (-93.92%) tw_twfw_tc_in 4339111 268175 -4070936 (-93.82%) tw_twfw_egress 3521816 270751 -3251065 (-92.31%) tw_twfw_tc_eg 3472878 284294 -3188584 (-91.81%) balancer_ingress 343119 291391 -51728 (-15.08%) syar_bind6_protect6 78992 64782 -14210 (-17.99%) ttls_tc_ingress 11739 8176 -3563 (-30.35%) kprobe__security_inode_link 13864 11341 -2523 (-18.20%) read_sync_py_stack 21927 19442 -2485 (-11.33%) read_async_py_stack 30444 28136 -2308 (-7.58%) syar_task_exit_setuid 10256 8440 -1816 (-17.71%) Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Add support precision backtracking in the presence of subprogram frames in jump history. This means supporting a few different kinds of subprogram invocation situations, all requiring a slightly different handling in precision backtracking handling logic: - static subprogram calls; - global subprogram calls; - callback-calling helpers/kfuncs. For each of those we need to handle a few precision propagation cases: - what to do with precision of subprog returns (r0); - what to do with precision of input arguments; - for all of them callee-saved registers in caller function should be propagated ignoring subprog/callback part of jump history. N.B. Async callback-calling helpers (currently only bpf_timer_set_callback()) are transparent to all this because they set a separate async callback environment and thus callback's history is not shared with main program's history. So as far as all the changes in this commit goes, such helper is just a regular helper. Let's look at all these situation in more details. Let's start with static subprogram being called, using an exxerpt of a simple main program and its static subprog, indenting subprog's frame slightly to make everything clear. frame 0 frame 1 precision set ======= ======= ============= 9: r6 = 456; 10: r1 = 123; r6 11: call pc+10; r1, r6 22: r0 = r1; r1 23: exit r0 12: r1 = <map_pointer> r0, r6 13: r1 += r0; r0, r6 14: r1 += r6; r6; 15: exit As can be seen above main function is passing 123 as single argument to an identity (`return x;`) subprog. Returned value is used to adjust map pointer offset, which forces r0 to be marked as precise. Then instruction #14 does the same for callee-saved r6, which will have to be backtracked all the way to instruction #9. For brevity, precision sets for instruction #13 and #14 are combined in the diagram above. First, for subprog calls, r0 returned from subprog (in frame 0) has to go into subprog's frame 1, and should be cleared from frame 0. So we go back into subprog's frame knowing we need to mark r0 precise. We then see that insn #22 sets r0 from r1, so now we care about marking r1 precise. When we pop up from subprog's frame back into caller at insn #11 we keep r1, as it's an argument-passing register, so we eventually find `10: r1 = 123;` and satify precision propagation chain for insn #13. This example demonstrates two sets of rules: - r0 returned after subprog call has to be moved into subprog's r0 set; - *static* subprog arguments (r1-r5) are moved back to caller precision set. Let's look at what happens with callee-saved precision propagation. Insn #14 mark r6 as precise. When we get into subprog's frame, we keep r6 in frame 0's precision set *only*. Subprog itself has its own set of independent r6-r10 registers and is not affected. When we eventually made our way out of subprog frame we keep r6 in precision set until we reach `9: r6 = 456;`, satisfying propagation. r6-r10 propagation is perhaps the simplest aspect, it always stays in its original frame. That's pretty much all we have to do to support precision propagation across *static subprog* invocation. Let's look at what happens when we have global subprog invocation. frame 0 frame 1 precision set ======= ======= ============= 9: r6 = 456; 10: r1 = 123; r6 11: call pc+10; # global subprog r6 12: r1 = <map_pointer> r0, r6 13: r1 += r0; r0, r6 14: r1 += r6; r6; 15: exit Starting from insn #13, r0 has to be precise. We backtrack all the way to insn #11 (call pc+10) and see that subprog is global, so was already validated in isolation. As opposed to static subprog, global subprog always returns unknown scalar r0, so that satisfies precision propagation and we drop r0 from precision set. We are done for insns #13. Now for insn #14. r6 is in precision set, we backtrack to `call pc+10;`. Here we need to recognize that this is effectively both exit and entry to global subprog, which means we stay in caller's frame. So we carry on with r6 still in precision set, until we satisfy it at insn #9. The only hard part with global subprogs is just knowing when it's a global func. Lastly, callback-calling helpers and kfuncs do simulate subprog calls, so jump history will have subprog instructions in between caller program's instructions, but the rules of propagating r0 and r1-r5 differ, because we don't actually directly call callback. We actually call helper/kfunc, which at runtime will call subprog, so the only difference between normal helper/kfunc handling is that we need to make sure to skip callback simulatinog part of jump history. Let's look at an example to make this clearer. frame 0 frame 1 precision set ======= ======= ============= 8: r6 = 456; 9: r1 = 123; r6 10: r2 = &callback; r6 11: call bpf_loop; r6 22: r0 = r1; 23: exit 12: r1 = <map_pointer> r0, r6 13: r1 += r0; r0, r6 14: r1 += r6; r6; 15: exit Again, insn #13 forces r0 to be precise. As soon as we get to `23: exit` we see that this isn't actually a static subprog call (it's `call bpf_loop;` helper call instead). So we clear r0 from precision set. For callee-saved register, there is no difference: it stays in frame 0's precision set, we go through insn #22 and #23, ignoring them until we get back to caller frame 0, eventually satisfying precision backtrack logic at insn #8 (`r6 = 456;`). Assuming callback needed to set r0 as precise at insn #23, we'd backtrack to insn #22, switching from r0 to r1, and then at the point when we pop back to frame 0 at insn #11, we'll clear r1-r5 from precision set, as we don't really do a subprog call directly, so there is no input argument precision propagation. That's pretty much it. With these changes, it seems like the only still unsupported situation for precision backpropagation is the case when program is accessing stack through registers other than r10. This is still left as unsupported (though rare) case for now. As for results. For selftests, few positive changes for bigger programs, cls_redirect in dynptr variant benefitting the most: [vmuser@archvm bpf]$ ./veristat -C ~/subprog-precise-before-results.csv ~/subprog-precise-after-results.csv -f @veristat.cfg -e file,prog,insns -f 'insns_diff!=0' File Program Insns (A) Insns (B) Insns (DIFF) ---------------------------------------- ------------- --------- --------- ---------------- pyperf600_bpf_loop.bpf.linked1.o on_event 2060 2002 -58 (-2.82%) test_cls_redirect_dynptr.bpf.linked1.o cls_redirect 15660 2914 -12746 (-81.39%) test_cls_redirect_subprogs.bpf.linked1.o cls_redirect 61620 59088 -2532 (-4.11%) xdp_synproxy_kern.bpf.linked1.o syncookie_tc 109980 86278 -23702 (-21.55%) xdp_synproxy_kern.bpf.linked1.o syncookie_xdp 97716 85147 -12569 (-12.86%) Cilium progress don't really regress. They don't use subprogs and are mostly unaffected, but some other fixes and improvements could have changed something. This doesn't appear to be the case: [vmuser@archvm bpf]$ ./veristat -C ~/subprog-precise-before-results-cilium.csv ~/subprog-precise-after-results-cilium.csv -e file,prog,insns -f 'insns_diff!=0' File Program Insns (A) Insns (B) Insns (DIFF) ------------- ------------------------------ --------- --------- ------------ bpf_host.o tail_nodeport_nat_ingress_ipv6 4983 5003 +20 (+0.40%) bpf_lxc.o tail_nodeport_nat_ingress_ipv6 4983 5003 +20 (+0.40%) bpf_overlay.o tail_nodeport_nat_ingress_ipv6 4983 5003 +20 (+0.40%) bpf_xdp.o tail_handle_nat_fwd_ipv6 12475 12504 +29 (+0.23%) bpf_xdp.o tail_nodeport_nat_ingress_ipv6 6363 6371 +8 (+0.13%) Looking at (somewhat anonymized) Meta production programs, we see mostly insignificant variation in number of instructions, with one program (syar_bind6_protect6) benefitting the most at -17%. [vmuser@archvm bpf]$ ./veristat -C ~/subprog-precise-before-results-fbcode.csv ~/subprog-precise-after-results-fbcode.csv -e prog,insns -f 'insns_diff!=0' Program Insns (A) Insns (B) Insns (DIFF) ------------------------ --------- --------- ---------------- on_request_context_event 597 585 -12 (-2.01%) read_async_py_stack 43789 43657 -132 (-0.30%) read_sync_py_stack 35041 37599 +2558 (+7.30%) rrm_usdt 946 940 -6 (-0.63%) sysarmor_inet6_bind 28863 28249 -614 (-2.13%) sysarmor_inet_bind 28845 28240 -605 (-2.10%) syar_bind4_protect4 154145 147640 -6505 (-4.22%) syar_bind6_protect6 165242 137088 -28154 (-17.04%) syar_task_exit_setgid 21289 19720 -1569 (-7.37%) syar_task_exit_setuid 21290 19721 -1569 (-7.37%) do_uprobe 19967 19413 -554 (-2.77%) tw_twfw_ingress 215877 204833 -11044 (-5.12%) tw_twfw_tc_in 215877 204833 -11044 (-5.12%) But checking duration (wall clock) differences, that is the actual time taken by verifier to validate programs, we see a sometimes dramatic improvements, all the way to about 16x improvements: [vmuser@archvm bpf]$ ./veristat -C ~/subprog-precise-before-results-meta.csv ~/subprog-precise-after-results-meta.csv -e prog,duration -s duration_diff^ | head -n20 Program Duration (us) (A) Duration (us) (B) Duration (us) (DIFF) ---------------------------------------- ----------------- ----------------- -------------------- tw_twfw_ingress 4488374 272836 -4215538 (-93.92%) tw_twfw_tc_in 4339111 268175 -4070936 (-93.82%) tw_twfw_egress 3521816 270751 -3251065 (-92.31%) tw_twfw_tc_eg 3472878 284294 -3188584 (-91.81%) balancer_ingress 343119 291391 -51728 (-15.08%) syar_bind6_protect6 78992 64782 -14210 (-17.99%) ttls_tc_ingress 11739 8176 -3563 (-30.35%) kprobe__security_inode_link 13864 11341 -2523 (-18.20%) read_sync_py_stack 21927 19442 -2485 (-11.33%) read_async_py_stack 30444 28136 -2308 (-7.58%) syar_task_exit_setuid 10256 8440 -1816 (-17.71%) Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Add support precision backtracking in the presence of subprogram frames in jump history. This means supporting a few different kinds of subprogram invocation situations, all requiring a slightly different handling in precision backtracking handling logic: - static subprogram calls; - global subprogram calls; - callback-calling helpers/kfuncs. For each of those we need to handle a few precision propagation cases: - what to do with precision of subprog returns (r0); - what to do with precision of input arguments; - for all of them callee-saved registers in caller function should be propagated ignoring subprog/callback part of jump history. N.B. Async callback-calling helpers (currently only bpf_timer_set_callback()) are transparent to all this because they set a separate async callback environment and thus callback's history is not shared with main program's history. So as far as all the changes in this commit goes, such helper is just a regular helper. Let's look at all these situation in more details. Let's start with static subprogram being called, using an exxerpt of a simple main program and its static subprog, indenting subprog's frame slightly to make everything clear. frame 0 frame 1 precision set ======= ======= ============= 9: r6 = 456; 10: r1 = 123; r6 11: call pc+10; r1, r6 22: r0 = r1; r1 23: exit r0 12: r1 = <map_pointer> r0, r6 13: r1 += r0; r0, r6 14: r1 += r6; r6; 15: exit As can be seen above main function is passing 123 as single argument to an identity (`return x;`) subprog. Returned value is used to adjust map pointer offset, which forces r0 to be marked as precise. Then instruction #14 does the same for callee-saved r6, which will have to be backtracked all the way to instruction #9. For brevity, precision sets for instruction #13 and #14 are combined in the diagram above. First, for subprog calls, r0 returned from subprog (in frame 0) has to go into subprog's frame 1, and should be cleared from frame 0. So we go back into subprog's frame knowing we need to mark r0 precise. We then see that insn #22 sets r0 from r1, so now we care about marking r1 precise. When we pop up from subprog's frame back into caller at insn #11 we keep r1, as it's an argument-passing register, so we eventually find `10: r1 = 123;` and satify precision propagation chain for insn #13. This example demonstrates two sets of rules: - r0 returned after subprog call has to be moved into subprog's r0 set; - *static* subprog arguments (r1-r5) are moved back to caller precision set. Let's look at what happens with callee-saved precision propagation. Insn #14 mark r6 as precise. When we get into subprog's frame, we keep r6 in frame 0's precision set *only*. Subprog itself has its own set of independent r6-r10 registers and is not affected. When we eventually made our way out of subprog frame we keep r6 in precision set until we reach `9: r6 = 456;`, satisfying propagation. r6-r10 propagation is perhaps the simplest aspect, it always stays in its original frame. That's pretty much all we have to do to support precision propagation across *static subprog* invocation. Let's look at what happens when we have global subprog invocation. frame 0 frame 1 precision set ======= ======= ============= 9: r6 = 456; 10: r1 = 123; r6 11: call pc+10; # global subprog r6 12: r1 = <map_pointer> r0, r6 13: r1 += r0; r0, r6 14: r1 += r6; r6; 15: exit Starting from insn #13, r0 has to be precise. We backtrack all the way to insn #11 (call pc+10) and see that subprog is global, so was already validated in isolation. As opposed to static subprog, global subprog always returns unknown scalar r0, so that satisfies precision propagation and we drop r0 from precision set. We are done for insns #13. Now for insn #14. r6 is in precision set, we backtrack to `call pc+10;`. Here we need to recognize that this is effectively both exit and entry to global subprog, which means we stay in caller's frame. So we carry on with r6 still in precision set, until we satisfy it at insn #9. The only hard part with global subprogs is just knowing when it's a global func. Lastly, callback-calling helpers and kfuncs do simulate subprog calls, so jump history will have subprog instructions in between caller program's instructions, but the rules of propagating r0 and r1-r5 differ, because we don't actually directly call callback. We actually call helper/kfunc, which at runtime will call subprog, so the only difference between normal helper/kfunc handling is that we need to make sure to skip callback simulatinog part of jump history. Let's look at an example to make this clearer. frame 0 frame 1 precision set ======= ======= ============= 8: r6 = 456; 9: r1 = 123; r6 10: r2 = &callback; r6 11: call bpf_loop; r6 22: r0 = r1; 23: exit 12: r1 = <map_pointer> r0, r6 13: r1 += r0; r0, r6 14: r1 += r6; r6; 15: exit Again, insn #13 forces r0 to be precise. As soon as we get to `23: exit` we see that this isn't actually a static subprog call (it's `call bpf_loop;` helper call instead). So we clear r0 from precision set. For callee-saved register, there is no difference: it stays in frame 0's precision set, we go through insn #22 and #23, ignoring them until we get back to caller frame 0, eventually satisfying precision backtrack logic at insn #8 (`r6 = 456;`). Assuming callback needed to set r0 as precise at insn #23, we'd backtrack to insn #22, switching from r0 to r1, and then at the point when we pop back to frame 0 at insn #11, we'll clear r1-r5 from precision set, as we don't really do a subprog call directly, so there is no input argument precision propagation. That's pretty much it. With these changes, it seems like the only still unsupported situation for precision backpropagation is the case when program is accessing stack through registers other than r10. This is still left as unsupported (though rare) case for now. As for results. For selftests, few positive changes for bigger programs, cls_redirect in dynptr variant benefitting the most: [vmuser@archvm bpf]$ ./veristat -C ~/subprog-precise-before-results.csv ~/subprog-precise-after-results.csv -f @veristat.cfg -e file,prog,insns -f 'insns_diff!=0' File Program Insns (A) Insns (B) Insns (DIFF) ---------------------------------------- ------------- --------- --------- ---------------- pyperf600_bpf_loop.bpf.linked1.o on_event 2060 2002 -58 (-2.82%) test_cls_redirect_dynptr.bpf.linked1.o cls_redirect 15660 2914 -12746 (-81.39%) test_cls_redirect_subprogs.bpf.linked1.o cls_redirect 61620 59088 -2532 (-4.11%) xdp_synproxy_kern.bpf.linked1.o syncookie_tc 109980 86278 -23702 (-21.55%) xdp_synproxy_kern.bpf.linked1.o syncookie_xdp 97716 85147 -12569 (-12.86%) Cilium progress don't really regress. They don't use subprogs and are mostly unaffected, but some other fixes and improvements could have changed something. This doesn't appear to be the case: [vmuser@archvm bpf]$ ./veristat -C ~/subprog-precise-before-results-cilium.csv ~/subprog-precise-after-results-cilium.csv -e file,prog,insns -f 'insns_diff!=0' File Program Insns (A) Insns (B) Insns (DIFF) ------------- ------------------------------ --------- --------- ------------ bpf_host.o tail_nodeport_nat_ingress_ipv6 4983 5003 +20 (+0.40%) bpf_lxc.o tail_nodeport_nat_ingress_ipv6 4983 5003 +20 (+0.40%) bpf_overlay.o tail_nodeport_nat_ingress_ipv6 4983 5003 +20 (+0.40%) bpf_xdp.o tail_handle_nat_fwd_ipv6 12475 12504 +29 (+0.23%) bpf_xdp.o tail_nodeport_nat_ingress_ipv6 6363 6371 +8 (+0.13%) Looking at (somewhat anonymized) Meta production programs, we see mostly insignificant variation in number of instructions, with one program (syar_bind6_protect6) benefitting the most at -17%. [vmuser@archvm bpf]$ ./veristat -C ~/subprog-precise-before-results-fbcode.csv ~/subprog-precise-after-results-fbcode.csv -e prog,insns -f 'insns_diff!=0' Program Insns (A) Insns (B) Insns (DIFF) ------------------------ --------- --------- ---------------- on_request_context_event 597 585 -12 (-2.01%) read_async_py_stack 43789 43657 -132 (-0.30%) read_sync_py_stack 35041 37599 +2558 (+7.30%) rrm_usdt 946 940 -6 (-0.63%) sysarmor_inet6_bind 28863 28249 -614 (-2.13%) sysarmor_inet_bind 28845 28240 -605 (-2.10%) syar_bind4_protect4 154145 147640 -6505 (-4.22%) syar_bind6_protect6 165242 137088 -28154 (-17.04%) syar_task_exit_setgid 21289 19720 -1569 (-7.37%) syar_task_exit_setuid 21290 19721 -1569 (-7.37%) do_uprobe 19967 19413 -554 (-2.77%) tw_twfw_ingress 215877 204833 -11044 (-5.12%) tw_twfw_tc_in 215877 204833 -11044 (-5.12%) But checking duration (wall clock) differences, that is the actual time taken by verifier to validate programs, we see a sometimes dramatic improvements, all the way to about 16x improvements: [vmuser@archvm bpf]$ ./veristat -C ~/subprog-precise-before-results-meta.csv ~/subprog-precise-after-results-meta.csv -e prog,duration -s duration_diff^ | head -n20 Program Duration (us) (A) Duration (us) (B) Duration (us) (DIFF) ---------------------------------------- ----------------- ----------------- -------------------- tw_twfw_ingress 4488374 272836 -4215538 (-93.92%) tw_twfw_tc_in 4339111 268175 -4070936 (-93.82%) tw_twfw_egress 3521816 270751 -3251065 (-92.31%) tw_twfw_tc_eg 3472878 284294 -3188584 (-91.81%) balancer_ingress 343119 291391 -51728 (-15.08%) syar_bind6_protect6 78992 64782 -14210 (-17.99%) ttls_tc_ingress 11739 8176 -3563 (-30.35%) kprobe__security_inode_link 13864 11341 -2523 (-18.20%) read_sync_py_stack 21927 19442 -2485 (-11.33%) read_async_py_stack 30444 28136 -2308 (-7.58%) syar_task_exit_setuid 10256 8440 -1816 (-17.71%) Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Add support precision backtracking in the presence of subprogram frames in jump history. This means supporting a few different kinds of subprogram invocation situations, all requiring a slightly different handling in precision backtracking handling logic: - static subprogram calls; - global subprogram calls; - callback-calling helpers/kfuncs. For each of those we need to handle a few precision propagation cases: - what to do with precision of subprog returns (r0); - what to do with precision of input arguments; - for all of them callee-saved registers in caller function should be propagated ignoring subprog/callback part of jump history. N.B. Async callback-calling helpers (currently only bpf_timer_set_callback()) are transparent to all this because they set a separate async callback environment and thus callback's history is not shared with main program's history. So as far as all the changes in this commit goes, such helper is just a regular helper. Let's look at all these situation in more details. Let's start with static subprogram being called, using an exxerpt of a simple main program and its static subprog, indenting subprog's frame slightly to make everything clear. frame 0 frame 1 precision set ======= ======= ============= 9: r6 = 456; 10: r1 = 123; r6 11: call pc+10; r1, r6 22: r0 = r1; r1 23: exit r0 12: r1 = <map_pointer> r0, r6 13: r1 += r0; r0, r6 14: r1 += r6; r6; 15: exit As can be seen above main function is passing 123 as single argument to an identity (`return x;`) subprog. Returned value is used to adjust map pointer offset, which forces r0 to be marked as precise. Then instruction #14 does the same for callee-saved r6, which will have to be backtracked all the way to instruction #9. For brevity, precision sets for instruction #13 and #14 are combined in the diagram above. First, for subprog calls, r0 returned from subprog (in frame 0) has to go into subprog's frame 1, and should be cleared from frame 0. So we go back into subprog's frame knowing we need to mark r0 precise. We then see that insn #22 sets r0 from r1, so now we care about marking r1 precise. When we pop up from subprog's frame back into caller at insn #11 we keep r1, as it's an argument-passing register, so we eventually find `10: r1 = 123;` and satify precision propagation chain for insn #13. This example demonstrates two sets of rules: - r0 returned after subprog call has to be moved into subprog's r0 set; - *static* subprog arguments (r1-r5) are moved back to caller precision set. Let's look at what happens with callee-saved precision propagation. Insn #14 mark r6 as precise. When we get into subprog's frame, we keep r6 in frame 0's precision set *only*. Subprog itself has its own set of independent r6-r10 registers and is not affected. When we eventually made our way out of subprog frame we keep r6 in precision set until we reach `9: r6 = 456;`, satisfying propagation. r6-r10 propagation is perhaps the simplest aspect, it always stays in its original frame. That's pretty much all we have to do to support precision propagation across *static subprog* invocation. Let's look at what happens when we have global subprog invocation. frame 0 frame 1 precision set ======= ======= ============= 9: r6 = 456; 10: r1 = 123; r6 11: call pc+10; # global subprog r6 12: r1 = <map_pointer> r0, r6 13: r1 += r0; r0, r6 14: r1 += r6; r6; 15: exit Starting from insn #13, r0 has to be precise. We backtrack all the way to insn #11 (call pc+10) and see that subprog is global, so was already validated in isolation. As opposed to static subprog, global subprog always returns unknown scalar r0, so that satisfies precision propagation and we drop r0 from precision set. We are done for insns #13. Now for insn #14. r6 is in precision set, we backtrack to `call pc+10;`. Here we need to recognize that this is effectively both exit and entry to global subprog, which means we stay in caller's frame. So we carry on with r6 still in precision set, until we satisfy it at insn #9. The only hard part with global subprogs is just knowing when it's a global func. Lastly, callback-calling helpers and kfuncs do simulate subprog calls, so jump history will have subprog instructions in between caller program's instructions, but the rules of propagating r0 and r1-r5 differ, because we don't actually directly call callback. We actually call helper/kfunc, which at runtime will call subprog, so the only difference between normal helper/kfunc handling is that we need to make sure to skip callback simulatinog part of jump history. Let's look at an example to make this clearer. frame 0 frame 1 precision set ======= ======= ============= 8: r6 = 456; 9: r1 = 123; r6 10: r2 = &callback; r6 11: call bpf_loop; r6 22: r0 = r1; 23: exit 12: r1 = <map_pointer> r0, r6 13: r1 += r0; r0, r6 14: r1 += r6; r6; 15: exit Again, insn #13 forces r0 to be precise. As soon as we get to `23: exit` we see that this isn't actually a static subprog call (it's `call bpf_loop;` helper call instead). So we clear r0 from precision set. For callee-saved register, there is no difference: it stays in frame 0's precision set, we go through insn #22 and #23, ignoring them until we get back to caller frame 0, eventually satisfying precision backtrack logic at insn #8 (`r6 = 456;`). Assuming callback needed to set r0 as precise at insn #23, we'd backtrack to insn #22, switching from r0 to r1, and then at the point when we pop back to frame 0 at insn #11, we'll clear r1-r5 from precision set, as we don't really do a subprog call directly, so there is no input argument precision propagation. That's pretty much it. With these changes, it seems like the only still unsupported situation for precision backpropagation is the case when program is accessing stack through registers other than r10. This is still left as unsupported (though rare) case for now. As for results. For selftests, few positive changes for bigger programs, cls_redirect in dynptr variant benefitting the most: [vmuser@archvm bpf]$ ./veristat -C ~/subprog-precise-before-results.csv ~/subprog-precise-after-results.csv -f @veristat.cfg -e file,prog,insns -f 'insns_diff!=0' File Program Insns (A) Insns (B) Insns (DIFF) ---------------------------------------- ------------- --------- --------- ---------------- pyperf600_bpf_loop.bpf.linked1.o on_event 2060 2002 -58 (-2.82%) test_cls_redirect_dynptr.bpf.linked1.o cls_redirect 15660 2914 -12746 (-81.39%) test_cls_redirect_subprogs.bpf.linked1.o cls_redirect 61620 59088 -2532 (-4.11%) xdp_synproxy_kern.bpf.linked1.o syncookie_tc 109980 86278 -23702 (-21.55%) xdp_synproxy_kern.bpf.linked1.o syncookie_xdp 97716 85147 -12569 (-12.86%) Cilium progress don't really regress. They don't use subprogs and are mostly unaffected, but some other fixes and improvements could have changed something. This doesn't appear to be the case: [vmuser@archvm bpf]$ ./veristat -C ~/subprog-precise-before-results-cilium.csv ~/subprog-precise-after-results-cilium.csv -e file,prog,insns -f 'insns_diff!=0' File Program Insns (A) Insns (B) Insns (DIFF) ------------- ------------------------------ --------- --------- ------------ bpf_host.o tail_nodeport_nat_ingress_ipv6 4983 5003 +20 (+0.40%) bpf_lxc.o tail_nodeport_nat_ingress_ipv6 4983 5003 +20 (+0.40%) bpf_overlay.o tail_nodeport_nat_ingress_ipv6 4983 5003 +20 (+0.40%) bpf_xdp.o tail_handle_nat_fwd_ipv6 12475 12504 +29 (+0.23%) bpf_xdp.o tail_nodeport_nat_ingress_ipv6 6363 6371 +8 (+0.13%) Looking at (somewhat anonymized) Meta production programs, we see mostly insignificant variation in number of instructions, with one program (syar_bind6_protect6) benefitting the most at -17%. [vmuser@archvm bpf]$ ./veristat -C ~/subprog-precise-before-results-fbcode.csv ~/subprog-precise-after-results-fbcode.csv -e prog,insns -f 'insns_diff!=0' Program Insns (A) Insns (B) Insns (DIFF) ------------------------ --------- --------- ---------------- on_request_context_event 597 585 -12 (-2.01%) read_async_py_stack 43789 43657 -132 (-0.30%) read_sync_py_stack 35041 37599 +2558 (+7.30%) rrm_usdt 946 940 -6 (-0.63%) sysarmor_inet6_bind 28863 28249 -614 (-2.13%) sysarmor_inet_bind 28845 28240 -605 (-2.10%) syar_bind4_protect4 154145 147640 -6505 (-4.22%) syar_bind6_protect6 165242 137088 -28154 (-17.04%) syar_task_exit_setgid 21289 19720 -1569 (-7.37%) syar_task_exit_setuid 21290 19721 -1569 (-7.37%) do_uprobe 19967 19413 -554 (-2.77%) tw_twfw_ingress 215877 204833 -11044 (-5.12%) tw_twfw_tc_in 215877 204833 -11044 (-5.12%) But checking duration (wall clock) differences, that is the actual time taken by verifier to validate programs, we see a sometimes dramatic improvements, all the way to about 16x improvements: [vmuser@archvm bpf]$ ./veristat -C ~/subprog-precise-before-results-meta.csv ~/subprog-precise-after-results-meta.csv -e prog,duration -s duration_diff^ | head -n20 Program Duration (us) (A) Duration (us) (B) Duration (us) (DIFF) ---------------------------------------- ----------------- ----------------- -------------------- tw_twfw_ingress 4488374 272836 -4215538 (-93.92%) tw_twfw_tc_in 4339111 268175 -4070936 (-93.82%) tw_twfw_egress 3521816 270751 -3251065 (-92.31%) tw_twfw_tc_eg 3472878 284294 -3188584 (-91.81%) balancer_ingress 343119 291391 -51728 (-15.08%) syar_bind6_protect6 78992 64782 -14210 (-17.99%) ttls_tc_ingress 11739 8176 -3563 (-30.35%) kprobe__security_inode_link 13864 11341 -2523 (-18.20%) read_sync_py_stack 21927 19442 -2485 (-11.33%) read_async_py_stack 30444 28136 -2308 (-7.58%) syar_task_exit_setuid 10256 8440 -1816 (-17.71%) Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Add support precision backtracking in the presence of subprogram frames in jump history. This means supporting a few different kinds of subprogram invocation situations, all requiring a slightly different handling in precision backtracking handling logic: - static subprogram calls; - global subprogram calls; - callback-calling helpers/kfuncs. For each of those we need to handle a few precision propagation cases: - what to do with precision of subprog returns (r0); - what to do with precision of input arguments; - for all of them callee-saved registers in caller function should be propagated ignoring subprog/callback part of jump history. N.B. Async callback-calling helpers (currently only bpf_timer_set_callback()) are transparent to all this because they set a separate async callback environment and thus callback's history is not shared with main program's history. So as far as all the changes in this commit goes, such helper is just a regular helper. Let's look at all these situation in more details. Let's start with static subprogram being called, using an exxerpt of a simple main program and its static subprog, indenting subprog's frame slightly to make everything clear. frame 0 frame 1 precision set ======= ======= ============= 9: r6 = 456; 10: r1 = 123; fr0: r6 11: call pc+10; fr0: r1, r6 22: r0 = r1; fr0: r6; fr1: r1 23: exit fr0: r6; fr1: r0 12: r1 = <map_pointer> fr0: r0, r6 13: r1 += r0; fr0: r0, r6 14: r1 += r6; fr0: r6 15: exit As can be seen above main function is passing 123 as single argument to an identity (`return x;`) subprog. Returned value is used to adjust map pointer offset, which forces r0 to be marked as precise. Then instruction #14 does the same for callee-saved r6, which will have to be backtracked all the way to instruction #9. For brevity, precision sets for instruction #13 and #14 are combined in the diagram above. First, for subprog calls, r0 returned from subprog (in frame 0) has to go into subprog's frame 1, and should be cleared from frame 0. So we go back into subprog's frame knowing we need to mark r0 precise. We then see that insn #22 sets r0 from r1, so now we care about marking r1 precise. When we pop up from subprog's frame back into caller at insn #11 we keep r1, as it's an argument-passing register, so we eventually find `10: r1 = 123;` and satify precision propagation chain for insn #13. This example demonstrates two sets of rules: - r0 returned after subprog call has to be moved into subprog's r0 set; - *static* subprog arguments (r1-r5) are moved back to caller precision set. Let's look at what happens with callee-saved precision propagation. Insn #14 mark r6 as precise. When we get into subprog's frame, we keep r6 in frame 0's precision set *only*. Subprog itself has its own set of independent r6-r10 registers and is not affected. When we eventually made our way out of subprog frame we keep r6 in precision set until we reach `9: r6 = 456;`, satisfying propagation. r6-r10 propagation is perhaps the simplest aspect, it always stays in its original frame. That's pretty much all we have to do to support precision propagation across *static subprog* invocation. Let's look at what happens when we have global subprog invocation. frame 0 frame 1 precision set ======= ======= ============= 9: r6 = 456; 10: r1 = 123; fr0: r6 11: call pc+10; # global subprog fr0: r6 12: r1 = <map_pointer> fr0: r0, r6 13: r1 += r0; fr0: r0, r6 14: r1 += r6; fr0: r6; 15: exit Starting from insn #13, r0 has to be precise. We backtrack all the way to insn #11 (call pc+10) and see that subprog is global, so was already validated in isolation. As opposed to static subprog, global subprog always returns unknown scalar r0, so that satisfies precision propagation and we drop r0 from precision set. We are done for insns #13. Now for insn #14. r6 is in precision set, we backtrack to `call pc+10;`. Here we need to recognize that this is effectively both exit and entry to global subprog, which means we stay in caller's frame. So we carry on with r6 still in precision set, until we satisfy it at insn #9. The only hard part with global subprogs is just knowing when it's a global func. Lastly, callback-calling helpers and kfuncs do simulate subprog calls, so jump history will have subprog instructions in between caller program's instructions, but the rules of propagating r0 and r1-r5 differ, because we don't actually directly call callback. We actually call helper/kfunc, which at runtime will call subprog, so the only difference between normal helper/kfunc handling is that we need to make sure to skip callback simulatinog part of jump history. Let's look at an example to make this clearer. frame 0 frame 1 precision set ======= ======= ============= 8: r6 = 456; 9: r1 = 123; fr0: r6 10: r2 = &callback; fr0: r6 11: call bpf_loop; fr0: r6 22: r0 = r1; fr0: r6 fr1: 23: exit fr0: r6 fr1: 12: r1 = <map_pointer> fr0: r0, r6 13: r1 += r0; fr0: r0, r6 14: r1 += r6; fr0: r6; 15: exit Again, insn #13 forces r0 to be precise. As soon as we get to `23: exit` we see that this isn't actually a static subprog call (it's `call bpf_loop;` helper call instead). So we clear r0 from precision set. For callee-saved register, there is no difference: it stays in frame 0's precision set, we go through insn #22 and #23, ignoring them until we get back to caller frame 0, eventually satisfying precision backtrack logic at insn #8 (`r6 = 456;`). Assuming callback needed to set r0 as precise at insn #23, we'd backtrack to insn #22, switching from r0 to r1, and then at the point when we pop back to frame 0 at insn #11, we'll clear r1-r5 from precision set, as we don't really do a subprog call directly, so there is no input argument precision propagation. That's pretty much it. With these changes, it seems like the only still unsupported situation for precision backpropagation is the case when program is accessing stack through registers other than r10. This is still left as unsupported (though rare) case for now. As for results. For selftests, few positive changes for bigger programs, cls_redirect in dynptr variant benefitting the most: [vmuser@archvm bpf]$ ./veristat -C ~/subprog-precise-before-results.csv ~/subprog-precise-after-results.csv -f @veristat.cfg -e file,prog,insns -f 'insns_diff!=0' File Program Insns (A) Insns (B) Insns (DIFF) ---------------------------------------- ------------- --------- --------- ---------------- pyperf600_bpf_loop.bpf.linked1.o on_event 2060 2002 -58 (-2.82%) test_cls_redirect_dynptr.bpf.linked1.o cls_redirect 15660 2914 -12746 (-81.39%) test_cls_redirect_subprogs.bpf.linked1.o cls_redirect 61620 59088 -2532 (-4.11%) xdp_synproxy_kern.bpf.linked1.o syncookie_tc 109980 86278 -23702 (-21.55%) xdp_synproxy_kern.bpf.linked1.o syncookie_xdp 97716 85147 -12569 (-12.86%) Cilium progress don't really regress. They don't use subprogs and are mostly unaffected, but some other fixes and improvements could have changed something. This doesn't appear to be the case: [vmuser@archvm bpf]$ ./veristat -C ~/subprog-precise-before-results-cilium.csv ~/subprog-precise-after-results-cilium.csv -e file,prog,insns -f 'insns_diff!=0' File Program Insns (A) Insns (B) Insns (DIFF) ------------- ------------------------------ --------- --------- ------------ bpf_host.o tail_nodeport_nat_ingress_ipv6 4983 5003 +20 (+0.40%) bpf_lxc.o tail_nodeport_nat_ingress_ipv6 4983 5003 +20 (+0.40%) bpf_overlay.o tail_nodeport_nat_ingress_ipv6 4983 5003 +20 (+0.40%) bpf_xdp.o tail_handle_nat_fwd_ipv6 12475 12504 +29 (+0.23%) bpf_xdp.o tail_nodeport_nat_ingress_ipv6 6363 6371 +8 (+0.13%) Looking at (somewhat anonymized) Meta production programs, we see mostly insignificant variation in number of instructions, with one program (syar_bind6_protect6) benefitting the most at -17%. [vmuser@archvm bpf]$ ./veristat -C ~/subprog-precise-before-results-fbcode.csv ~/subprog-precise-after-results-fbcode.csv -e prog,insns -f 'insns_diff!=0' Program Insns (A) Insns (B) Insns (DIFF) ------------------------ --------- --------- ---------------- on_request_context_event 597 585 -12 (-2.01%) read_async_py_stack 43789 43657 -132 (-0.30%) read_sync_py_stack 35041 37599 +2558 (+7.30%) rrm_usdt 946 940 -6 (-0.63%) sysarmor_inet6_bind 28863 28249 -614 (-2.13%) sysarmor_inet_bind 28845 28240 -605 (-2.10%) syar_bind4_protect4 154145 147640 -6505 (-4.22%) syar_bind6_protect6 165242 137088 -28154 (-17.04%) syar_task_exit_setgid 21289 19720 -1569 (-7.37%) syar_task_exit_setuid 21290 19721 -1569 (-7.37%) do_uprobe 19967 19413 -554 (-2.77%) tw_twfw_ingress 215877 204833 -11044 (-5.12%) tw_twfw_tc_in 215877 204833 -11044 (-5.12%) But checking duration (wall clock) differences, that is the actual time taken by verifier to validate programs, we see a sometimes dramatic improvements, all the way to about 16x improvements: [vmuser@archvm bpf]$ ./veristat -C ~/subprog-precise-before-results-meta.csv ~/subprog-precise-after-results-meta.csv -e prog,duration -s duration_diff^ | head -n20 Program Duration (us) (A) Duration (us) (B) Duration (us) (DIFF) ---------------------------------------- ----------------- ----------------- -------------------- tw_twfw_ingress 4488374 272836 -4215538 (-93.92%) tw_twfw_tc_in 4339111 268175 -4070936 (-93.82%) tw_twfw_egress 3521816 270751 -3251065 (-92.31%) tw_twfw_tc_eg 3472878 284294 -3188584 (-91.81%) balancer_ingress 343119 291391 -51728 (-15.08%) syar_bind6_protect6 78992 64782 -14210 (-17.99%) ttls_tc_ingress 11739 8176 -3563 (-30.35%) kprobe__security_inode_link 13864 11341 -2523 (-18.20%) read_sync_py_stack 21927 19442 -2485 (-11.33%) read_async_py_stack 30444 28136 -2308 (-7.58%) syar_task_exit_setuid 10256 8440 -1816 (-17.71%) Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Add support precision backtracking in the presence of subprogram frames in jump history. This means supporting a few different kinds of subprogram invocation situations, all requiring a slightly different handling in precision backtracking handling logic: - static subprogram calls; - global subprogram calls; - callback-calling helpers/kfuncs. For each of those we need to handle a few precision propagation cases: - what to do with precision of subprog returns (r0); - what to do with precision of input arguments; - for all of them callee-saved registers in caller function should be propagated ignoring subprog/callback part of jump history. N.B. Async callback-calling helpers (currently only bpf_timer_set_callback()) are transparent to all this because they set a separate async callback environment and thus callback's history is not shared with main program's history. So as far as all the changes in this commit goes, such helper is just a regular helper. Let's look at all these situation in more details. Let's start with static subprogram being called, using an exxerpt of a simple main program and its static subprog, indenting subprog's frame slightly to make everything clear. frame 0 frame 1 precision set ======= ======= ============= 9: r6 = 456; 10: r1 = 123; fr0: r6 11: call pc+10; fr0: r1, r6 22: r0 = r1; fr0: r6; fr1: r1 23: exit fr0: r6; fr1: r0 12: r1 = <map_pointer> fr0: r0, r6 13: r1 += r0; fr0: r0, r6 14: r1 += r6; fr0: r6 15: exit As can be seen above main function is passing 123 as single argument to an identity (`return x;`) subprog. Returned value is used to adjust map pointer offset, which forces r0 to be marked as precise. Then instruction #14 does the same for callee-saved r6, which will have to be backtracked all the way to instruction #9. For brevity, precision sets for instruction #13 and #14 are combined in the diagram above. First, for subprog calls, r0 returned from subprog (in frame 0) has to go into subprog's frame 1, and should be cleared from frame 0. So we go back into subprog's frame knowing we need to mark r0 precise. We then see that insn #22 sets r0 from r1, so now we care about marking r1 precise. When we pop up from subprog's frame back into caller at insn #11 we keep r1, as it's an argument-passing register, so we eventually find `10: r1 = 123;` and satify precision propagation chain for insn #13. This example demonstrates two sets of rules: - r0 returned after subprog call has to be moved into subprog's r0 set; - *static* subprog arguments (r1-r5) are moved back to caller precision set. Let's look at what happens with callee-saved precision propagation. Insn #14 mark r6 as precise. When we get into subprog's frame, we keep r6 in frame 0's precision set *only*. Subprog itself has its own set of independent r6-r10 registers and is not affected. When we eventually made our way out of subprog frame we keep r6 in precision set until we reach `9: r6 = 456;`, satisfying propagation. r6-r10 propagation is perhaps the simplest aspect, it always stays in its original frame. That's pretty much all we have to do to support precision propagation across *static subprog* invocation. Let's look at what happens when we have global subprog invocation. frame 0 frame 1 precision set ======= ======= ============= 9: r6 = 456; 10: r1 = 123; fr0: r6 11: call pc+10; # global subprog fr0: r6 12: r1 = <map_pointer> fr0: r0, r6 13: r1 += r0; fr0: r0, r6 14: r1 += r6; fr0: r6; 15: exit Starting from insn #13, r0 has to be precise. We backtrack all the way to insn #11 (call pc+10) and see that subprog is global, so was already validated in isolation. As opposed to static subprog, global subprog always returns unknown scalar r0, so that satisfies precision propagation and we drop r0 from precision set. We are done for insns #13. Now for insn #14. r6 is in precision set, we backtrack to `call pc+10;`. Here we need to recognize that this is effectively both exit and entry to global subprog, which means we stay in caller's frame. So we carry on with r6 still in precision set, until we satisfy it at insn #9. The only hard part with global subprogs is just knowing when it's a global func. Lastly, callback-calling helpers and kfuncs do simulate subprog calls, so jump history will have subprog instructions in between caller program's instructions, but the rules of propagating r0 and r1-r5 differ, because we don't actually directly call callback. We actually call helper/kfunc, which at runtime will call subprog, so the only difference between normal helper/kfunc handling is that we need to make sure to skip callback simulatinog part of jump history. Let's look at an example to make this clearer. frame 0 frame 1 precision set ======= ======= ============= 8: r6 = 456; 9: r1 = 123; fr0: r6 10: r2 = &callback; fr0: r6 11: call bpf_loop; fr0: r6 22: r0 = r1; fr0: r6 fr1: 23: exit fr0: r6 fr1: 12: r1 = <map_pointer> fr0: r0, r6 13: r1 += r0; fr0: r0, r6 14: r1 += r6; fr0: r6; 15: exit Again, insn #13 forces r0 to be precise. As soon as we get to `23: exit` we see that this isn't actually a static subprog call (it's `call bpf_loop;` helper call instead). So we clear r0 from precision set. For callee-saved register, there is no difference: it stays in frame 0's precision set, we go through insn #22 and #23, ignoring them until we get back to caller frame 0, eventually satisfying precision backtrack logic at insn #8 (`r6 = 456;`). Assuming callback needed to set r0 as precise at insn #23, we'd backtrack to insn #22, switching from r0 to r1, and then at the point when we pop back to frame 0 at insn #11, we'll clear r1-r5 from precision set, as we don't really do a subprog call directly, so there is no input argument precision propagation. That's pretty much it. With these changes, it seems like the only still unsupported situation for precision backpropagation is the case when program is accessing stack through registers other than r10. This is still left as unsupported (though rare) case for now. As for results. For selftests, few positive changes for bigger programs, cls_redirect in dynptr variant benefitting the most: [vmuser@archvm bpf]$ ./veristat -C ~/subprog-precise-before-results.csv ~/subprog-precise-after-results.csv -f @veristat.cfg -e file,prog,insns -f 'insns_diff!=0' File Program Insns (A) Insns (B) Insns (DIFF) ---------------------------------------- ------------- --------- --------- ---------------- pyperf600_bpf_loop.bpf.linked1.o on_event 2060 2002 -58 (-2.82%) test_cls_redirect_dynptr.bpf.linked1.o cls_redirect 15660 2914 -12746 (-81.39%) test_cls_redirect_subprogs.bpf.linked1.o cls_redirect 61620 59088 -2532 (-4.11%) xdp_synproxy_kern.bpf.linked1.o syncookie_tc 109980 86278 -23702 (-21.55%) xdp_synproxy_kern.bpf.linked1.o syncookie_xdp 97716 85147 -12569 (-12.86%) Cilium progress don't really regress. They don't use subprogs and are mostly unaffected, but some other fixes and improvements could have changed something. This doesn't appear to be the case: [vmuser@archvm bpf]$ ./veristat -C ~/subprog-precise-before-results-cilium.csv ~/subprog-precise-after-results-cilium.csv -e file,prog,insns -f 'insns_diff!=0' File Program Insns (A) Insns (B) Insns (DIFF) ------------- ------------------------------ --------- --------- ------------ bpf_host.o tail_nodeport_nat_ingress_ipv6 4983 5003 +20 (+0.40%) bpf_lxc.o tail_nodeport_nat_ingress_ipv6 4983 5003 +20 (+0.40%) bpf_overlay.o tail_nodeport_nat_ingress_ipv6 4983 5003 +20 (+0.40%) bpf_xdp.o tail_handle_nat_fwd_ipv6 12475 12504 +29 (+0.23%) bpf_xdp.o tail_nodeport_nat_ingress_ipv6 6363 6371 +8 (+0.13%) Looking at (somewhat anonymized) Meta production programs, we see mostly insignificant variation in number of instructions, with one program (syar_bind6_protect6) benefitting the most at -17%. [vmuser@archvm bpf]$ ./veristat -C ~/subprog-precise-before-results-fbcode.csv ~/subprog-precise-after-results-fbcode.csv -e prog,insns -f 'insns_diff!=0' Program Insns (A) Insns (B) Insns (DIFF) ------------------------ --------- --------- ---------------- on_request_context_event 597 585 -12 (-2.01%) read_async_py_stack 43789 43657 -132 (-0.30%) read_sync_py_stack 35041 37599 +2558 (+7.30%) rrm_usdt 946 940 -6 (-0.63%) sysarmor_inet6_bind 28863 28249 -614 (-2.13%) sysarmor_inet_bind 28845 28240 -605 (-2.10%) syar_bind4_protect4 154145 147640 -6505 (-4.22%) syar_bind6_protect6 165242 137088 -28154 (-17.04%) syar_task_exit_setgid 21289 19720 -1569 (-7.37%) syar_task_exit_setuid 21290 19721 -1569 (-7.37%) do_uprobe 19967 19413 -554 (-2.77%) tw_twfw_ingress 215877 204833 -11044 (-5.12%) tw_twfw_tc_in 215877 204833 -11044 (-5.12%) But checking duration (wall clock) differences, that is the actual time taken by verifier to validate programs, we see a sometimes dramatic improvements, all the way to about 16x improvements: [vmuser@archvm bpf]$ ./veristat -C ~/subprog-precise-before-results-meta.csv ~/subprog-precise-after-results-meta.csv -e prog,duration -s duration_diff^ | head -n20 Program Duration (us) (A) Duration (us) (B) Duration (us) (DIFF) ---------------------------------------- ----------------- ----------------- -------------------- tw_twfw_ingress 4488374 272836 -4215538 (-93.92%) tw_twfw_tc_in 4339111 268175 -4070936 (-93.82%) tw_twfw_egress 3521816 270751 -3251065 (-92.31%) tw_twfw_tc_eg 3472878 284294 -3188584 (-91.81%) balancer_ingress 343119 291391 -51728 (-15.08%) syar_bind6_protect6 78992 64782 -14210 (-17.99%) ttls_tc_ingress 11739 8176 -3563 (-30.35%) kprobe__security_inode_link 13864 11341 -2523 (-18.20%) read_sync_py_stack 21927 19442 -2485 (-11.33%) read_async_py_stack 30444 28136 -2308 (-7.58%) syar_task_exit_setuid 10256 8440 -1816 (-17.71%) Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Add support precision backtracking in the presence of subprogram frames in jump history. This means supporting a few different kinds of subprogram invocation situations, all requiring a slightly different handling in precision backtracking handling logic: - static subprogram calls; - global subprogram calls; - callback-calling helpers/kfuncs. For each of those we need to handle a few precision propagation cases: - what to do with precision of subprog returns (r0); - what to do with precision of input arguments; - for all of them callee-saved registers in caller function should be propagated ignoring subprog/callback part of jump history. N.B. Async callback-calling helpers (currently only bpf_timer_set_callback()) are transparent to all this because they set a separate async callback environment and thus callback's history is not shared with main program's history. So as far as all the changes in this commit goes, such helper is just a regular helper. Let's look at all these situation in more details. Let's start with static subprogram being called, using an exxerpt of a simple main program and its static subprog, indenting subprog's frame slightly to make everything clear. frame 0 frame 1 precision set ======= ======= ============= 9: r6 = 456; 10: r1 = 123; fr0: r6 11: call pc+10; fr0: r1, r6 22: r0 = r1; fr0: r6; fr1: r1 23: exit fr0: r6; fr1: r0 12: r1 = <map_pointer> fr0: r0, r6 13: r1 += r0; fr0: r0, r6 14: r1 += r6; fr0: r6 15: exit As can be seen above main function is passing 123 as single argument to an identity (`return x;`) subprog. Returned value is used to adjust map pointer offset, which forces r0 to be marked as precise. Then instruction #14 does the same for callee-saved r6, which will have to be backtracked all the way to instruction #9. For brevity, precision sets for instruction #13 and #14 are combined in the diagram above. First, for subprog calls, r0 returned from subprog (in frame 0) has to go into subprog's frame 1, and should be cleared from frame 0. So we go back into subprog's frame knowing we need to mark r0 precise. We then see that insn #22 sets r0 from r1, so now we care about marking r1 precise. When we pop up from subprog's frame back into caller at insn #11 we keep r1, as it's an argument-passing register, so we eventually find `10: r1 = 123;` and satify precision propagation chain for insn #13. This example demonstrates two sets of rules: - r0 returned after subprog call has to be moved into subprog's r0 set; - *static* subprog arguments (r1-r5) are moved back to caller precision set. Let's look at what happens with callee-saved precision propagation. Insn #14 mark r6 as precise. When we get into subprog's frame, we keep r6 in frame 0's precision set *only*. Subprog itself has its own set of independent r6-r10 registers and is not affected. When we eventually made our way out of subprog frame we keep r6 in precision set until we reach `9: r6 = 456;`, satisfying propagation. r6-r10 propagation is perhaps the simplest aspect, it always stays in its original frame. That's pretty much all we have to do to support precision propagation across *static subprog* invocation. Let's look at what happens when we have global subprog invocation. frame 0 frame 1 precision set ======= ======= ============= 9: r6 = 456; 10: r1 = 123; fr0: r6 11: call pc+10; # global subprog fr0: r6 12: r1 = <map_pointer> fr0: r0, r6 13: r1 += r0; fr0: r0, r6 14: r1 += r6; fr0: r6; 15: exit Starting from insn #13, r0 has to be precise. We backtrack all the way to insn #11 (call pc+10) and see that subprog is global, so was already validated in isolation. As opposed to static subprog, global subprog always returns unknown scalar r0, so that satisfies precision propagation and we drop r0 from precision set. We are done for insns #13. Now for insn #14. r6 is in precision set, we backtrack to `call pc+10;`. Here we need to recognize that this is effectively both exit and entry to global subprog, which means we stay in caller's frame. So we carry on with r6 still in precision set, until we satisfy it at insn #9. The only hard part with global subprogs is just knowing when it's a global func. Lastly, callback-calling helpers and kfuncs do simulate subprog calls, so jump history will have subprog instructions in between caller program's instructions, but the rules of propagating r0 and r1-r5 differ, because we don't actually directly call callback. We actually call helper/kfunc, which at runtime will call subprog, so the only difference between normal helper/kfunc handling is that we need to make sure to skip callback simulatinog part of jump history. Let's look at an example to make this clearer. frame 0 frame 1 precision set ======= ======= ============= 8: r6 = 456; 9: r1 = 123; fr0: r6 10: r2 = &callback; fr0: r6 11: call bpf_loop; fr0: r6 22: r0 = r1; fr0: r6 fr1: 23: exit fr0: r6 fr1: 12: r1 = <map_pointer> fr0: r0, r6 13: r1 += r0; fr0: r0, r6 14: r1 += r6; fr0: r6; 15: exit Again, insn #13 forces r0 to be precise. As soon as we get to `23: exit` we see that this isn't actually a static subprog call (it's `call bpf_loop;` helper call instead). So we clear r0 from precision set. For callee-saved register, there is no difference: it stays in frame 0's precision set, we go through insn #22 and #23, ignoring them until we get back to caller frame 0, eventually satisfying precision backtrack logic at insn #8 (`r6 = 456;`). Assuming callback needed to set r0 as precise at insn #23, we'd backtrack to insn #22, switching from r0 to r1, and then at the point when we pop back to frame 0 at insn #11, we'll clear r1-r5 from precision set, as we don't really do a subprog call directly, so there is no input argument precision propagation. That's pretty much it. With these changes, it seems like the only still unsupported situation for precision backpropagation is the case when program is accessing stack through registers other than r10. This is still left as unsupported (though rare) case for now. As for results. For selftests, few positive changes for bigger programs, cls_redirect in dynptr variant benefitting the most: [vmuser@archvm bpf]$ ./veristat -C ~/subprog-precise-before-results.csv ~/subprog-precise-after-results.csv -f @veristat.cfg -e file,prog,insns -f 'insns_diff!=0' File Program Insns (A) Insns (B) Insns (DIFF) ---------------------------------------- ------------- --------- --------- ---------------- pyperf600_bpf_loop.bpf.linked1.o on_event 2060 2002 -58 (-2.82%) test_cls_redirect_dynptr.bpf.linked1.o cls_redirect 15660 2914 -12746 (-81.39%) test_cls_redirect_subprogs.bpf.linked1.o cls_redirect 61620 59088 -2532 (-4.11%) xdp_synproxy_kern.bpf.linked1.o syncookie_tc 109980 86278 -23702 (-21.55%) xdp_synproxy_kern.bpf.linked1.o syncookie_xdp 97716 85147 -12569 (-12.86%) Cilium progress don't really regress. They don't use subprogs and are mostly unaffected, but some other fixes and improvements could have changed something. This doesn't appear to be the case: [vmuser@archvm bpf]$ ./veristat -C ~/subprog-precise-before-results-cilium.csv ~/subprog-precise-after-results-cilium.csv -e file,prog,insns -f 'insns_diff!=0' File Program Insns (A) Insns (B) Insns (DIFF) ------------- ------------------------------ --------- --------- ------------ bpf_host.o tail_nodeport_nat_ingress_ipv6 4983 5003 +20 (+0.40%) bpf_lxc.o tail_nodeport_nat_ingress_ipv6 4983 5003 +20 (+0.40%) bpf_overlay.o tail_nodeport_nat_ingress_ipv6 4983 5003 +20 (+0.40%) bpf_xdp.o tail_handle_nat_fwd_ipv6 12475 12504 +29 (+0.23%) bpf_xdp.o tail_nodeport_nat_ingress_ipv6 6363 6371 +8 (+0.13%) Looking at (somewhat anonymized) Meta production programs, we see mostly insignificant variation in number of instructions, with one program (syar_bind6_protect6) benefitting the most at -17%. [vmuser@archvm bpf]$ ./veristat -C ~/subprog-precise-before-results-fbcode.csv ~/subprog-precise-after-results-fbcode.csv -e prog,insns -f 'insns_diff!=0' Program Insns (A) Insns (B) Insns (DIFF) ------------------------ --------- --------- ---------------- on_request_context_event 597 585 -12 (-2.01%) read_async_py_stack 43789 43657 -132 (-0.30%) read_sync_py_stack 35041 37599 +2558 (+7.30%) rrm_usdt 946 940 -6 (-0.63%) sysarmor_inet6_bind 28863 28249 -614 (-2.13%) sysarmor_inet_bind 28845 28240 -605 (-2.10%) syar_bind4_protect4 154145 147640 -6505 (-4.22%) syar_bind6_protect6 165242 137088 -28154 (-17.04%) syar_task_exit_setgid 21289 19720 -1569 (-7.37%) syar_task_exit_setuid 21290 19721 -1569 (-7.37%) do_uprobe 19967 19413 -554 (-2.77%) tw_twfw_ingress 215877 204833 -11044 (-5.12%) tw_twfw_tc_in 215877 204833 -11044 (-5.12%) But checking duration (wall clock) differences, that is the actual time taken by verifier to validate programs, we see a sometimes dramatic improvements, all the way to about 16x improvements: [vmuser@archvm bpf]$ ./veristat -C ~/subprog-precise-before-results-meta.csv ~/subprog-precise-after-results-meta.csv -e prog,duration -s duration_diff^ | head -n20 Program Duration (us) (A) Duration (us) (B) Duration (us) (DIFF) ---------------------------------------- ----------------- ----------------- -------------------- tw_twfw_ingress 4488374 272836 -4215538 (-93.92%) tw_twfw_tc_in 4339111 268175 -4070936 (-93.82%) tw_twfw_egress 3521816 270751 -3251065 (-92.31%) tw_twfw_tc_eg 3472878 284294 -3188584 (-91.81%) balancer_ingress 343119 291391 -51728 (-15.08%) syar_bind6_protect6 78992 64782 -14210 (-17.99%) ttls_tc_ingress 11739 8176 -3563 (-30.35%) kprobe__security_inode_link 13864 11341 -2523 (-18.20%) read_sync_py_stack 21927 19442 -2485 (-11.33%) read_async_py_stack 30444 28136 -2308 (-7.58%) syar_task_exit_setuid 10256 8440 -1816 (-17.71%) Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Add support precision backtracking in the presence of subprogram frames in jump history. This means supporting a few different kinds of subprogram invocation situations, all requiring a slightly different handling in precision backtracking handling logic: - static subprogram calls; - global subprogram calls; - callback-calling helpers/kfuncs. For each of those we need to handle a few precision propagation cases: - what to do with precision of subprog returns (r0); - what to do with precision of input arguments; - for all of them callee-saved registers in caller function should be propagated ignoring subprog/callback part of jump history. N.B. Async callback-calling helpers (currently only bpf_timer_set_callback()) are transparent to all this because they set a separate async callback environment and thus callback's history is not shared with main program's history. So as far as all the changes in this commit goes, such helper is just a regular helper. Let's look at all these situation in more details. Let's start with static subprogram being called, using an exxerpt of a simple main program and its static subprog, indenting subprog's frame slightly to make everything clear. frame 0 frame 1 precision set ======= ======= ============= 9: r6 = 456; 10: r1 = 123; fr0: r6 11: call pc+10; fr0: r1, r6 22: r0 = r1; fr0: r6; fr1: r1 23: exit fr0: r6; fr1: r0 12: r1 = <map_pointer> fr0: r0, r6 13: r1 += r0; fr0: r0, r6 14: r1 += r6; fr0: r6 15: exit As can be seen above main function is passing 123 as single argument to an identity (`return x;`) subprog. Returned value is used to adjust map pointer offset, which forces r0 to be marked as precise. Then instruction #14 does the same for callee-saved r6, which will have to be backtracked all the way to instruction #9. For brevity, precision sets for instruction #13 and #14 are combined in the diagram above. First, for subprog calls, r0 returned from subprog (in frame 0) has to go into subprog's frame 1, and should be cleared from frame 0. So we go back into subprog's frame knowing we need to mark r0 precise. We then see that insn #22 sets r0 from r1, so now we care about marking r1 precise. When we pop up from subprog's frame back into caller at insn #11 we keep r1, as it's an argument-passing register, so we eventually find `10: r1 = 123;` and satify precision propagation chain for insn #13. This example demonstrates two sets of rules: - r0 returned after subprog call has to be moved into subprog's r0 set; - *static* subprog arguments (r1-r5) are moved back to caller precision set. Let's look at what happens with callee-saved precision propagation. Insn #14 mark r6 as precise. When we get into subprog's frame, we keep r6 in frame 0's precision set *only*. Subprog itself has its own set of independent r6-r10 registers and is not affected. When we eventually made our way out of subprog frame we keep r6 in precision set until we reach `9: r6 = 456;`, satisfying propagation. r6-r10 propagation is perhaps the simplest aspect, it always stays in its original frame. That's pretty much all we have to do to support precision propagation across *static subprog* invocation. Let's look at what happens when we have global subprog invocation. frame 0 frame 1 precision set ======= ======= ============= 9: r6 = 456; 10: r1 = 123; fr0: r6 11: call pc+10; # global subprog fr0: r6 12: r1 = <map_pointer> fr0: r0, r6 13: r1 += r0; fr0: r0, r6 14: r1 += r6; fr0: r6; 15: exit Starting from insn #13, r0 has to be precise. We backtrack all the way to insn #11 (call pc+10) and see that subprog is global, so was already validated in isolation. As opposed to static subprog, global subprog always returns unknown scalar r0, so that satisfies precision propagation and we drop r0 from precision set. We are done for insns #13. Now for insn #14. r6 is in precision set, we backtrack to `call pc+10;`. Here we need to recognize that this is effectively both exit and entry to global subprog, which means we stay in caller's frame. So we carry on with r6 still in precision set, until we satisfy it at insn #9. The only hard part with global subprogs is just knowing when it's a global func. Lastly, callback-calling helpers and kfuncs do simulate subprog calls, so jump history will have subprog instructions in between caller program's instructions, but the rules of propagating r0 and r1-r5 differ, because we don't actually directly call callback. We actually call helper/kfunc, which at runtime will call subprog, so the only difference between normal helper/kfunc handling is that we need to make sure to skip callback simulatinog part of jump history. Let's look at an example to make this clearer. frame 0 frame 1 precision set ======= ======= ============= 8: r6 = 456; 9: r1 = 123; fr0: r6 10: r2 = &callback; fr0: r6 11: call bpf_loop; fr0: r6 22: r0 = r1; fr0: r6 fr1: 23: exit fr0: r6 fr1: 12: r1 = <map_pointer> fr0: r0, r6 13: r1 += r0; fr0: r0, r6 14: r1 += r6; fr0: r6; 15: exit Again, insn #13 forces r0 to be precise. As soon as we get to `23: exit` we see that this isn't actually a static subprog call (it's `call bpf_loop;` helper call instead). So we clear r0 from precision set. For callee-saved register, there is no difference: it stays in frame 0's precision set, we go through insn #22 and #23, ignoring them until we get back to caller frame 0, eventually satisfying precision backtrack logic at insn #8 (`r6 = 456;`). Assuming callback needed to set r0 as precise at insn #23, we'd backtrack to insn #22, switching from r0 to r1, and then at the point when we pop back to frame 0 at insn #11, we'll clear r1-r5 from precision set, as we don't really do a subprog call directly, so there is no input argument precision propagation. That's pretty much it. With these changes, it seems like the only still unsupported situation for precision backpropagation is the case when program is accessing stack through registers other than r10. This is still left as unsupported (though rare) case for now. As for results. For selftests, few positive changes for bigger programs, cls_redirect in dynptr variant benefitting the most: [vmuser@archvm bpf]$ ./veristat -C ~/subprog-precise-before-results.csv ~/subprog-precise-after-results.csv -f @veristat.cfg -e file,prog,insns -f 'insns_diff!=0' File Program Insns (A) Insns (B) Insns (DIFF) ---------------------------------------- ------------- --------- --------- ---------------- pyperf600_bpf_loop.bpf.linked1.o on_event 2060 2002 -58 (-2.82%) test_cls_redirect_dynptr.bpf.linked1.o cls_redirect 15660 2914 -12746 (-81.39%) test_cls_redirect_subprogs.bpf.linked1.o cls_redirect 61620 59088 -2532 (-4.11%) xdp_synproxy_kern.bpf.linked1.o syncookie_tc 109980 86278 -23702 (-21.55%) xdp_synproxy_kern.bpf.linked1.o syncookie_xdp 97716 85147 -12569 (-12.86%) Cilium progress don't really regress. They don't use subprogs and are mostly unaffected, but some other fixes and improvements could have changed something. This doesn't appear to be the case: [vmuser@archvm bpf]$ ./veristat -C ~/subprog-precise-before-results-cilium.csv ~/subprog-precise-after-results-cilium.csv -e file,prog,insns -f 'insns_diff!=0' File Program Insns (A) Insns (B) Insns (DIFF) ------------- ------------------------------ --------- --------- ------------ bpf_host.o tail_nodeport_nat_ingress_ipv6 4983 5003 +20 (+0.40%) bpf_lxc.o tail_nodeport_nat_ingress_ipv6 4983 5003 +20 (+0.40%) bpf_overlay.o tail_nodeport_nat_ingress_ipv6 4983 5003 +20 (+0.40%) bpf_xdp.o tail_handle_nat_fwd_ipv6 12475 12504 +29 (+0.23%) bpf_xdp.o tail_nodeport_nat_ingress_ipv6 6363 6371 +8 (+0.13%) Looking at (somewhat anonymized) Meta production programs, we see mostly insignificant variation in number of instructions, with one program (syar_bind6_protect6) benefitting the most at -17%. [vmuser@archvm bpf]$ ./veristat -C ~/subprog-precise-before-results-fbcode.csv ~/subprog-precise-after-results-fbcode.csv -e prog,insns -f 'insns_diff!=0' Program Insns (A) Insns (B) Insns (DIFF) ------------------------ --------- --------- ---------------- on_request_context_event 597 585 -12 (-2.01%) read_async_py_stack 43789 43657 -132 (-0.30%) read_sync_py_stack 35041 37599 +2558 (+7.30%) rrm_usdt 946 940 -6 (-0.63%) sysarmor_inet6_bind 28863 28249 -614 (-2.13%) sysarmor_inet_bind 28845 28240 -605 (-2.10%) syar_bind4_protect4 154145 147640 -6505 (-4.22%) syar_bind6_protect6 165242 137088 -28154 (-17.04%) syar_task_exit_setgid 21289 19720 -1569 (-7.37%) syar_task_exit_setuid 21290 19721 -1569 (-7.37%) do_uprobe 19967 19413 -554 (-2.77%) tw_twfw_ingress 215877 204833 -11044 (-5.12%) tw_twfw_tc_in 215877 204833 -11044 (-5.12%) But checking duration (wall clock) differences, that is the actual time taken by verifier to validate programs, we see a sometimes dramatic improvements, all the way to about 16x improvements: [vmuser@archvm bpf]$ ./veristat -C ~/subprog-precise-before-results-meta.csv ~/subprog-precise-after-results-meta.csv -e prog,duration -s duration_diff^ | head -n20 Program Duration (us) (A) Duration (us) (B) Duration (us) (DIFF) ---------------------------------------- ----------------- ----------------- -------------------- tw_twfw_ingress 4488374 272836 -4215538 (-93.92%) tw_twfw_tc_in 4339111 268175 -4070936 (-93.82%) tw_twfw_egress 3521816 270751 -3251065 (-92.31%) tw_twfw_tc_eg 3472878 284294 -3188584 (-91.81%) balancer_ingress 343119 291391 -51728 (-15.08%) syar_bind6_protect6 78992 64782 -14210 (-17.99%) ttls_tc_ingress 11739 8176 -3563 (-30.35%) kprobe__security_inode_link 13864 11341 -2523 (-18.20%) read_sync_py_stack 21927 19442 -2485 (-11.33%) read_async_py_stack 30444 28136 -2308 (-7.58%) syar_task_exit_setuid 10256 8440 -1816 (-17.71%) Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Add support precision backtracking in the presence of subprogram frames in jump history. This means supporting a few different kinds of subprogram invocation situations, all requiring a slightly different handling in precision backtracking handling logic: - static subprogram calls; - global subprogram calls; - callback-calling helpers/kfuncs. For each of those we need to handle a few precision propagation cases: - what to do with precision of subprog returns (r0); - what to do with precision of input arguments; - for all of them callee-saved registers in caller function should be propagated ignoring subprog/callback part of jump history. N.B. Async callback-calling helpers (currently only bpf_timer_set_callback()) are transparent to all this because they set a separate async callback environment and thus callback's history is not shared with main program's history. So as far as all the changes in this commit goes, such helper is just a regular helper. Let's look at all these situation in more details. Let's start with static subprogram being called, using an exxerpt of a simple main program and its static subprog, indenting subprog's frame slightly to make everything clear. frame 0 frame 1 precision set ======= ======= ============= 9: r6 = 456; 10: r1 = 123; fr0: r6 11: call pc+10; fr0: r1, r6 22: r0 = r1; fr0: r6; fr1: r1 23: exit fr0: r6; fr1: r0 12: r1 = <map_pointer> fr0: r0, r6 13: r1 += r0; fr0: r0, r6 14: r1 += r6; fr0: r6 15: exit As can be seen above main function is passing 123 as single argument to an identity (`return x;`) subprog. Returned value is used to adjust map pointer offset, which forces r0 to be marked as precise. Then instruction #14 does the same for callee-saved r6, which will have to be backtracked all the way to instruction #9. For brevity, precision sets for instruction #13 and #14 are combined in the diagram above. First, for subprog calls, r0 returned from subprog (in frame 0) has to go into subprog's frame 1, and should be cleared from frame 0. So we go back into subprog's frame knowing we need to mark r0 precise. We then see that insn #22 sets r0 from r1, so now we care about marking r1 precise. When we pop up from subprog's frame back into caller at insn #11 we keep r1, as it's an argument-passing register, so we eventually find `10: r1 = 123;` and satify precision propagation chain for insn #13. This example demonstrates two sets of rules: - r0 returned after subprog call has to be moved into subprog's r0 set; - *static* subprog arguments (r1-r5) are moved back to caller precision set. Let's look at what happens with callee-saved precision propagation. Insn #14 mark r6 as precise. When we get into subprog's frame, we keep r6 in frame 0's precision set *only*. Subprog itself has its own set of independent r6-r10 registers and is not affected. When we eventually made our way out of subprog frame we keep r6 in precision set until we reach `9: r6 = 456;`, satisfying propagation. r6-r10 propagation is perhaps the simplest aspect, it always stays in its original frame. That's pretty much all we have to do to support precision propagation across *static subprog* invocation. Let's look at what happens when we have global subprog invocation. frame 0 frame 1 precision set ======= ======= ============= 9: r6 = 456; 10: r1 = 123; fr0: r6 11: call pc+10; # global subprog fr0: r6 12: r1 = <map_pointer> fr0: r0, r6 13: r1 += r0; fr0: r0, r6 14: r1 += r6; fr0: r6; 15: exit Starting from insn #13, r0 has to be precise. We backtrack all the way to insn #11 (call pc+10) and see that subprog is global, so was already validated in isolation. As opposed to static subprog, global subprog always returns unknown scalar r0, so that satisfies precision propagation and we drop r0 from precision set. We are done for insns #13. Now for insn #14. r6 is in precision set, we backtrack to `call pc+10;`. Here we need to recognize that this is effectively both exit and entry to global subprog, which means we stay in caller's frame. So we carry on with r6 still in precision set, until we satisfy it at insn #9. The only hard part with global subprogs is just knowing when it's a global func. Lastly, callback-calling helpers and kfuncs do simulate subprog calls, so jump history will have subprog instructions in between caller program's instructions, but the rules of propagating r0 and r1-r5 differ, because we don't actually directly call callback. We actually call helper/kfunc, which at runtime will call subprog, so the only difference between normal helper/kfunc handling is that we need to make sure to skip callback simulatinog part of jump history. Let's look at an example to make this clearer. frame 0 frame 1 precision set ======= ======= ============= 8: r6 = 456; 9: r1 = 123; fr0: r6 10: r2 = &callback; fr0: r6 11: call bpf_loop; fr0: r6 22: r0 = r1; fr0: r6 fr1: 23: exit fr0: r6 fr1: 12: r1 = <map_pointer> fr0: r0, r6 13: r1 += r0; fr0: r0, r6 14: r1 += r6; fr0: r6; 15: exit Again, insn #13 forces r0 to be precise. As soon as we get to `23: exit` we see that this isn't actually a static subprog call (it's `call bpf_loop;` helper call instead). So we clear r0 from precision set. For callee-saved register, there is no difference: it stays in frame 0's precision set, we go through insn #22 and #23, ignoring them until we get back to caller frame 0, eventually satisfying precision backtrack logic at insn #8 (`r6 = 456;`). Assuming callback needed to set r0 as precise at insn #23, we'd backtrack to insn #22, switching from r0 to r1, and then at the point when we pop back to frame 0 at insn #11, we'll clear r1-r5 from precision set, as we don't really do a subprog call directly, so there is no input argument precision propagation. That's pretty much it. With these changes, it seems like the only still unsupported situation for precision backpropagation is the case when program is accessing stack through registers other than r10. This is still left as unsupported (though rare) case for now. As for results. For selftests, few positive changes for bigger programs, cls_redirect in dynptr variant benefitting the most: [vmuser@archvm bpf]$ ./veristat -C ~/subprog-precise-before-results.csv ~/subprog-precise-after-results.csv -f @veristat.cfg -e file,prog,insns -f 'insns_diff!=0' File Program Insns (A) Insns (B) Insns (DIFF) ---------------------------------------- ------------- --------- --------- ---------------- pyperf600_bpf_loop.bpf.linked1.o on_event 2060 2002 -58 (-2.82%) test_cls_redirect_dynptr.bpf.linked1.o cls_redirect 15660 2914 -12746 (-81.39%) test_cls_redirect_subprogs.bpf.linked1.o cls_redirect 61620 59088 -2532 (-4.11%) xdp_synproxy_kern.bpf.linked1.o syncookie_tc 109980 86278 -23702 (-21.55%) xdp_synproxy_kern.bpf.linked1.o syncookie_xdp 97716 85147 -12569 (-12.86%) Cilium progress don't really regress. They don't use subprogs and are mostly unaffected, but some other fixes and improvements could have changed something. This doesn't appear to be the case: [vmuser@archvm bpf]$ ./veristat -C ~/subprog-precise-before-results-cilium.csv ~/subprog-precise-after-results-cilium.csv -e file,prog,insns -f 'insns_diff!=0' File Program Insns (A) Insns (B) Insns (DIFF) ------------- ------------------------------ --------- --------- ------------ bpf_host.o tail_nodeport_nat_ingress_ipv6 4983 5003 +20 (+0.40%) bpf_lxc.o tail_nodeport_nat_ingress_ipv6 4983 5003 +20 (+0.40%) bpf_overlay.o tail_nodeport_nat_ingress_ipv6 4983 5003 +20 (+0.40%) bpf_xdp.o tail_handle_nat_fwd_ipv6 12475 12504 +29 (+0.23%) bpf_xdp.o tail_nodeport_nat_ingress_ipv6 6363 6371 +8 (+0.13%) Looking at (somewhat anonymized) Meta production programs, we see mostly insignificant variation in number of instructions, with one program (syar_bind6_protect6) benefitting the most at -17%. [vmuser@archvm bpf]$ ./veristat -C ~/subprog-precise-before-results-fbcode.csv ~/subprog-precise-after-results-fbcode.csv -e prog,insns -f 'insns_diff!=0' Program Insns (A) Insns (B) Insns (DIFF) ------------------------ --------- --------- ---------------- on_request_context_event 597 585 -12 (-2.01%) read_async_py_stack 43789 43657 -132 (-0.30%) read_sync_py_stack 35041 37599 +2558 (+7.30%) rrm_usdt 946 940 -6 (-0.63%) sysarmor_inet6_bind 28863 28249 -614 (-2.13%) sysarmor_inet_bind 28845 28240 -605 (-2.10%) syar_bind4_protect4 154145 147640 -6505 (-4.22%) syar_bind6_protect6 165242 137088 -28154 (-17.04%) syar_task_exit_setgid 21289 19720 -1569 (-7.37%) syar_task_exit_setuid 21290 19721 -1569 (-7.37%) do_uprobe 19967 19413 -554 (-2.77%) tw_twfw_ingress 215877 204833 -11044 (-5.12%) tw_twfw_tc_in 215877 204833 -11044 (-5.12%) But checking duration (wall clock) differences, that is the actual time taken by verifier to validate programs, we see a sometimes dramatic improvements, all the way to about 16x improvements: [vmuser@archvm bpf]$ ./veristat -C ~/subprog-precise-before-results-meta.csv ~/subprog-precise-after-results-meta.csv -e prog,duration -s duration_diff^ | head -n20 Program Duration (us) (A) Duration (us) (B) Duration (us) (DIFF) ---------------------------------------- ----------------- ----------------- -------------------- tw_twfw_ingress 4488374 272836 -4215538 (-93.92%) tw_twfw_tc_in 4339111 268175 -4070936 (-93.82%) tw_twfw_egress 3521816 270751 -3251065 (-92.31%) tw_twfw_tc_eg 3472878 284294 -3188584 (-91.81%) balancer_ingress 343119 291391 -51728 (-15.08%) syar_bind6_protect6 78992 64782 -14210 (-17.99%) ttls_tc_ingress 11739 8176 -3563 (-30.35%) kprobe__security_inode_link 13864 11341 -2523 (-18.20%) read_sync_py_stack 21927 19442 -2485 (-11.33%) read_async_py_stack 30444 28136 -2308 (-7.58%) syar_task_exit_setuid 10256 8440 -1816 (-17.71%) Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/r/20230505043317.3629845-9-andrii@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
The cited commit adds a compeletion to remove dependency on rtnl lock. But it causes a deadlock for multiple encapsulations: crash> bt ffff8aece8a64000 PID: 1514557 TASK: ffff8aece8a64000 CPU: 3 COMMAND: "tc" #0 [ffffa6d14183f368] __schedule at ffffffffb8ba7f45 #1 [ffffa6d14183f3f8] schedule at ffffffffb8ba8418 #2 [ffffa6d14183f418] schedule_preempt_disabled at ffffffffb8ba8898 #3 [ffffa6d14183f428] __mutex_lock at ffffffffb8baa7f8 #4 [ffffa6d14183f4d0] mutex_lock_nested at ffffffffb8baabeb #5 [ffffa6d14183f4e0] mlx5e_attach_encap at ffffffffc0f48c17 [mlx5_core] #6 [ffffa6d14183f628] mlx5e_tc_add_fdb_flow at ffffffffc0f39680 [mlx5_core] #7 [ffffa6d14183f688] __mlx5e_add_fdb_flow at ffffffffc0f3b636 [mlx5_core] #8 [ffffa6d14183f6f0] mlx5e_tc_add_flow at ffffffffc0f3bcdf [mlx5_core] #9 [ffffa6d14183f728] mlx5e_configure_flower at ffffffffc0f3c1d1 [mlx5_core] #10 [ffffa6d14183f790] mlx5e_rep_setup_tc_cls_flower at ffffffffc0f3d529 [mlx5_core] #11 [ffffa6d14183f7a0] mlx5e_rep_setup_tc_cb at ffffffffc0f3d714 [mlx5_core] #12 [ffffa6d14183f7b0] tc_setup_cb_add at ffffffffb8931bb8 #13 [ffffa6d14183f810] fl_hw_replace_filter at ffffffffc0dae901 [cls_flower] #14 [ffffa6d14183f8d8] fl_change at ffffffffc0db5c57 [cls_flower] #15 [ffffa6d14183f970] tc_new_tfilter at ffffffffb8936047 #16 [ffffa6d14183fac8] rtnetlink_rcv_msg at ffffffffb88c7c31 #17 [ffffa6d14183fb50] netlink_rcv_skb at ffffffffb8942853 #18 [ffffa6d14183fbc0] rtnetlink_rcv at ffffffffb88c1835 #19 [ffffa6d14183fbd0] netlink_unicast at ffffffffb8941f27 #20 [ffffa6d14183fc18] netlink_sendmsg at ffffffffb8942245 #21 [ffffa6d14183fc98] sock_sendmsg at ffffffffb887d482 #22 [ffffa6d14183fcb8] ____sys_sendmsg at ffffffffb887d81a #23 [ffffa6d14183fd38] ___sys_sendmsg at ffffffffb88806e2 #24 [ffffa6d14183fe90] __sys_sendmsg at ffffffffb88807a2 #25 [ffffa6d14183ff28] __x64_sys_sendmsg at ffffffffb888080f #26 [ffffa6d14183ff38] do_syscall_64 at ffffffffb8b9b6a8 #27 [ffffa6d14183ff50] entry_SYSCALL_64_after_hwframe at ffffffffb8c0007c crash> bt 0xffff8aeb07544000 PID: 1110766 TASK: ffff8aeb07544000 CPU: 0 COMMAND: "kworker/u20:9" #0 [ffffa6d14e6b7bd8] __schedule at ffffffffb8ba7f45 #1 [ffffa6d14e6b7c68] schedule at ffffffffb8ba8418 #2 [ffffa6d14e6b7c88] schedule_timeout at ffffffffb8baef88 #3 [ffffa6d14e6b7d10] wait_for_completion at ffffffffb8ba968b #4 [ffffa6d14e6b7d60] mlx5e_take_all_encap_flows at ffffffffc0f47ec4 [mlx5_core] #5 [ffffa6d14e6b7da0] mlx5e_rep_update_flows at ffffffffc0f3e734 [mlx5_core] #6 [ffffa6d14e6b7df8] mlx5e_rep_neigh_update at ffffffffc0f400bb [mlx5_core] #7 [ffffa6d14e6b7e50] process_one_work at ffffffffb80acc9c #8 [ffffa6d14e6b7ed0] worker_thread at ffffffffb80ad012 #9 [ffffa6d14e6b7f10] kthread at ffffffffb80b615d #10 [ffffa6d14e6b7f50] ret_from_fork at ffffffffb8001b2f After the first encap is attached, flow will be added to encap entry's flows list. If neigh update is running at this time, the following encaps of the flow can't hold the encap_tbl_lock and sleep. If neigh update thread is waiting for that flow's init_done, deadlock happens. Fix it by holding lock outside of the for loop. If neigh update is running, prevent encap flows from offloading. Since the lock is held outside of the for loop, concurrent creation of encap entries is not allowed. So remove unnecessary wait_for_completion call for res_ready. Fixes: 95435ad ("net/mlx5e: Only access fully initialized flows in neigh update") Signed-off-by: Chris Mi <cmi@nvidia.com> Reviewed-by: Roi Dayan <roid@nvidia.com> Reviewed-by: Vlad Buslov <vladbu@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
The following processes run into a deadlock. CPU 41 was waiting for CPU 29 to handle a CSD request while holding spinlock "crashdump_lock", but CPU 29 was hung by that spinlock with IRQs disabled. PID: 17360 TASK: ffff95c1090c5c40 CPU: 41 COMMAND: "mrdiagd" !# 0 [ffffb80edbf37b58] __read_once_size at ffffffff9b871a40 include/linux/compiler.h:185:0 !# 1 [ffffb80edbf37b58] atomic_read at ffffffff9b871a40 arch/x86/include/asm/atomic.h:27:0 !# 2 [ffffb80edbf37b58] dump_stack at ffffffff9b871a40 lib/dump_stack.c:54:0 # 3 [ffffb80edbf37b78] csd_lock_wait_toolong at ffffffff9b131ad5 kernel/smp.c:364:0 # 4 [ffffb80edbf37b78] __csd_lock_wait at ffffffff9b131ad5 kernel/smp.c:384:0 # 5 [ffffb80edbf37bf8] csd_lock_wait at ffffffff9b13267a kernel/smp.c:394:0 # 6 [ffffb80edbf37bf8] smp_call_function_many at ffffffff9b13267a kernel/smp.c:843:0 # 7 [ffffb80edbf37c50] smp_call_function at ffffffff9b13279d kernel/smp.c:867:0 # 8 [ffffb80edbf37c50] on_each_cpu at ffffffff9b13279d kernel/smp.c:976:0 # 9 [ffffb80edbf37c78] flush_tlb_kernel_range at ffffffff9b085c4b arch/x86/mm/tlb.c:742:0 #10 [ffffb80edbf37cb8] __purge_vmap_area_lazy at ffffffff9b23a1e0 mm/vmalloc.c:701:0 #11 [ffffb80edbf37ce0] try_purge_vmap_area_lazy at ffffffff9b23a2cc mm/vmalloc.c:722:0 #12 [ffffb80edbf37ce0] free_vmap_area_noflush at ffffffff9b23a2cc mm/vmalloc.c:754:0 #13 [ffffb80edbf37cf8] free_unmap_vmap_area at ffffffff9b23bb3b mm/vmalloc.c:764:0 #14 [ffffb80edbf37cf8] remove_vm_area at ffffffff9b23bb3b mm/vmalloc.c:1509:0 #15 [ffffb80edbf37d18] __vunmap at ffffffff9b23bb8a mm/vmalloc.c:1537:0 #16 [ffffb80edbf37d40] vfree at ffffffff9b23bc85 mm/vmalloc.c:1612:0 #17 [ffffb80edbf37d58] megasas_free_host_crash_buffer [megaraid_sas] at ffffffffc020b7f2 drivers/scsi/megaraid/megaraid_sas_fusion.c:3932:0 #18 [ffffb80edbf37d80] fw_crash_state_store [megaraid_sas] at ffffffffc01f804d drivers/scsi/megaraid/megaraid_sas_base.c:3291:0 #19 [ffffb80edbf37dc0] dev_attr_store at ffffffff9b56dd7b drivers/base/core.c:758:0 #20 [ffffb80edbf37dd0] sysfs_kf_write at ffffffff9b326acf fs/sysfs/file.c:144:0 #21 [ffffb80edbf37de0] kernfs_fop_write at ffffffff9b325fd4 fs/kernfs/file.c:316:0 #22 [ffffb80edbf37e20] __vfs_write at ffffffff9b29418a fs/read_write.c:480:0 #23 [ffffb80edbf37ea8] vfs_write at ffffffff9b294462 fs/read_write.c:544:0 #24 [ffffb80edbf37ee8] SYSC_write at ffffffff9b2946ec fs/read_write.c:590:0 #25 [ffffb80edbf37ee8] SyS_write at ffffffff9b2946ec fs/read_write.c:582:0 #26 [ffffb80edbf37f30] do_syscall_64 at ffffffff9b003ca9 arch/x86/entry/common.c:298:0 #27 [ffffb80edbf37f58] entry_SYSCALL_64 at ffffffff9ba001b1 arch/x86/entry/entry_64.S:238:0 PID: 17355 TASK: ffff95c1090c3d80 CPU: 29 COMMAND: "mrdiagd" !# 0 [ffffb80f2d3c7d30] __read_once_size at ffffffff9b0f2ab0 include/linux/compiler.h:185:0 !# 1 [ffffb80f2d3c7d30] native_queued_spin_lock_slowpath at ffffffff9b0f2ab0 kernel/locking/qspinlock.c:368:0 # 2 [ffffb80f2d3c7d58] pv_queued_spin_lock_slowpath at ffffffff9b0f244b arch/x86/include/asm/paravirt.h:674:0 # 3 [ffffb80f2d3c7d58] queued_spin_lock_slowpath at ffffffff9b0f244b arch/x86/include/asm/qspinlock.h:53:0 # 4 [ffffb80f2d3c7d68] queued_spin_lock at ffffffff9b8961a6 include/asm-generic/qspinlock.h:90:0 # 5 [ffffb80f2d3c7d68] do_raw_spin_lock_flags at ffffffff9b8961a6 include/linux/spinlock.h:173:0 # 6 [ffffb80f2d3c7d68] __raw_spin_lock_irqsave at ffffffff9b8961a6 include/linux/spinlock_api_smp.h:122:0 # 7 [ffffb80f2d3c7d68] _raw_spin_lock_irqsave at ffffffff9b8961a6 kernel/locking/spinlock.c:160:0 # 8 [ffffb80f2d3c7d88] fw_crash_buffer_store [megaraid_sas] at ffffffffc01f8129 drivers/scsi/megaraid/megaraid_sas_base.c:3205:0 # 9 [ffffb80f2d3c7dc0] dev_attr_store at ffffffff9b56dd7b drivers/base/core.c:758:0 #10 [ffffb80f2d3c7dd0] sysfs_kf_write at ffffffff9b326acf fs/sysfs/file.c:144:0 #11 [ffffb80f2d3c7de0] kernfs_fop_write at ffffffff9b325fd4 fs/kernfs/file.c:316:0 #12 [ffffb80f2d3c7e20] __vfs_write at ffffffff9b29418a fs/read_write.c:480:0 #13 [ffffb80f2d3c7ea8] vfs_write at ffffffff9b294462 fs/read_write.c:544:0 #14 [ffffb80f2d3c7ee8] SYSC_write at ffffffff9b2946ec fs/read_write.c:590:0 #15 [ffffb80f2d3c7ee8] SyS_write at ffffffff9b2946ec fs/read_write.c:582:0 #16 [ffffb80f2d3c7f30] do_syscall_64 at ffffffff9b003ca9 arch/x86/entry/common.c:298:0 #17 [ffffb80f2d3c7f58] entry_SYSCALL_64 at ffffffff9ba001b1 arch/x86/entry/entry_64.S:238:0 The lock is used to synchronize different sysfs operations, it doesn't protect any resource that will be touched by an interrupt. Consequently it's not required to disable IRQs. Replace the spinlock with a mutex to fix the deadlock. Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com> Link: https://lore.kernel.org/r/20230828221018.19471-1-junxiao.bi@oracle.com Reviewed-by: Mike Christie <michael.christie@oracle.com> Cc: stable@vger.kernel.org Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
We call bnxt_half_open_nic() to setup the chip partially to run loopback tests. The rings and buffers are initialized normally so that we can transmit and receive packets in loopback mode. That means page pool buffers are allocated for the aggregation ring just like the normal case. NAPI is not needed because we are just polling for the loopback packets. When we're done with the loopback tests, we call bnxt_half_close_nic() to clean up. When freeing the page pools, we hit a WARN_ON() in page_pool_unlink_napi() because the NAPI state linked to the page pool is uninitialized. The simplest way to avoid this warning is just to initialize the NAPIs during half open and delete the NAPIs during half close. Trying to skip the page pool initialization or skip linking of NAPI during half open will be more complicated. This fix avoids this warning: WARNING: CPU: 4 PID: 46967 at net/core/page_pool.c:946 page_pool_unlink_napi+0x1f/0x30 CPU: 4 PID: 46967 Comm: ethtool Tainted: G S W 6.7.0-rc5+ #22 Hardware name: Dell Inc. PowerEdge R750/06V45N, BIOS 1.3.8 08/31/2021 RIP: 0010:page_pool_unlink_napi+0x1f/0x30 Code: 90 90 90 90 90 90 90 90 90 90 90 0f 1f 44 00 00 48 8b 47 18 48 85 c0 74 1b 48 8b 50 10 83 e2 01 74 08 8b 40 34 83 f8 ff 74 02 <0f> 0b 48 c7 47 18 00 00 00 00 c3 cc cc cc cc 66 90 90 90 90 90 90 RSP: 0018:ffa000003d0dfbe8 EFLAGS: 00010246 RAX: ff110003607ce640 RBX: ff110010baf5d000 RCX: 0000000000000008 RDX: 0000000000000000 RSI: ff110001e5e522c0 RDI: ff110010baf5d000 RBP: ff11000145539b40 R08: 0000000000000001 R09: ffffffffc063f641 R10: ff110001361eddb8 R11: 000000000040000f R12: 0000000000000001 R13: 000000000000001c R14: ff1100014553a080 R15: 0000000000003fc0 FS: 00007f9301c4f740(0000) GS:ff1100103fd00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007f91344fa8f0 CR3: 00000003527cc005 CR4: 0000000000771ef0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 PKRU: 55555554 Call Trace: <TASK> ? __warn+0x81/0x140 ? page_pool_unlink_napi+0x1f/0x30 ? report_bug+0x102/0x200 ? handle_bug+0x44/0x70 ? exc_invalid_op+0x13/0x60 ? asm_exc_invalid_op+0x16/0x20 ? bnxt_free_ring.isra.123+0xb1/0xd0 [bnxt_en] ? page_pool_unlink_napi+0x1f/0x30 page_pool_destroy+0x3e/0x150 bnxt_free_mem+0x441/0x5e0 [bnxt_en] bnxt_half_close_nic+0x2a/0x40 [bnxt_en] bnxt_self_test+0x21d/0x450 [bnxt_en] __dev_ethtool+0xeda/0x2e30 ? native_queued_spin_lock_slowpath+0x17f/0x2b0 ? __link_object+0xa1/0x160 ? _raw_spin_unlock_irqrestore+0x23/0x40 ? __create_object+0x5f/0x90 ? __kmem_cache_alloc_node+0x317/0x3c0 ? dev_ethtool+0x59/0x170 dev_ethtool+0xa7/0x170 dev_ioctl+0xc3/0x530 sock_do_ioctl+0xa8/0xf0 sock_ioctl+0x270/0x310 __x64_sys_ioctl+0x8c/0xc0 do_syscall_64+0x3e/0xf0 entry_SYSCALL_64_after_hwframe+0x6e/0x76 Fixes: 294e39e ("bnxt: hook NAPIs to page pools") Reviewed-by: Andy Gospodarek <andrew.gospodarek@broadcom.com> Reviewed-by: Ajit Khaparde <ajit.khaparde@broadcom.com> Signed-off-by: Michael Chan <michael.chan@broadcom.com> Link: https://lore.kernel.org/r/20240117234515.226944-5-michael.chan@broadcom.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
When configuring a hugetlb filesystem via the fsconfig() syscall, there is a possible NULL dereference in hugetlbfs_fill_super() caused by assigning NULL to ctx->hstate in hugetlbfs_parse_param() when the requested pagesize is non valid. E.g: Taking the following steps: fd = fsopen("hugetlbfs", FSOPEN_CLOEXEC); fsconfig(fd, FSCONFIG_SET_STRING, "pagesize", "1024", 0); fsconfig(fd, FSCONFIG_CMD_CREATE, NULL, NULL, 0); Given that the requested "pagesize" is invalid, ctxt->hstate will be replaced with NULL, losing its previous value, and we will print an error: ... ... case Opt_pagesize: ps = memparse(param->string, &rest); ctx->hstate = h; if (!ctx->hstate) { pr_err("Unsupported page size %lu MB\n", ps / SZ_1M); return -EINVAL; } return 0; ... ... This is a problem because later on, we will dereference ctxt->hstate in hugetlbfs_fill_super() ... ... sb->s_blocksize = huge_page_size(ctx->hstate); ... ... Causing below Oops. Fix this by replacing cxt->hstate value only when then pagesize is known to be valid. kernel: hugetlbfs: Unsupported page size 0 MB kernel: BUG: kernel NULL pointer dereference, address: 0000000000000028 kernel: #PF: supervisor read access in kernel mode kernel: #PF: error_code(0x0000) - not-present page kernel: PGD 800000010f66c067 P4D 800000010f66c067 PUD 1b22f8067 PMD 0 kernel: Oops: 0000 [#1] PREEMPT SMP PTI kernel: CPU: 4 PID: 5659 Comm: syscall Tainted: G E 6.8.0-rc2-default+ #22 5a47c3fef76212addcc6eb71344aabc35190ae8f kernel: Hardware name: Intel Corp. GROVEPORT/GROVEPORT, BIOS GVPRCRB1.86B.0016.D04.1705030402 05/03/2017 kernel: RIP: 0010:hugetlbfs_fill_super+0xb4/0x1a0 kernel: Code: 48 8b 3b e8 3e c6 ed ff 48 85 c0 48 89 45 20 0f 84 d6 00 00 00 48 b8 ff ff ff ff ff ff ff 7f 4c 89 e7 49 89 44 24 20 48 8b 03 <8b> 48 28 b8 00 10 00 00 48 d3 e0 49 89 44 24 18 48 8b 03 8b 40 28 kernel: RSP: 0018:ffffbe9960fcbd48 EFLAGS: 00010246 kernel: RAX: 0000000000000000 RBX: ffff9af5272ae780 RCX: 0000000000372004 kernel: RDX: ffffffffffffffff RSI: ffffffffffffffff RDI: ffff9af555e9b000 kernel: RBP: ffff9af52ee66b00 R08: 0000000000000040 R09: 0000000000370004 kernel: R10: ffffbe9960fcbd48 R11: 0000000000000040 R12: ffff9af555e9b000 kernel: R13: ffffffffa66b86c0 R14: ffff9af507d2f400 R15: ffff9af507d2f400 kernel: FS: 00007ffbc0ba4740(0000) GS:ffff9b0bd7000000(0000) knlGS:0000000000000000 kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 kernel: CR2: 0000000000000028 CR3: 00000001b1ee0000 CR4: 00000000001506f0 kernel: Call Trace: kernel: <TASK> kernel: ? __die_body+0x1a/0x60 kernel: ? page_fault_oops+0x16f/0x4a0 kernel: ? search_bpf_extables+0x65/0x70 kernel: ? fixup_exception+0x22/0x310 kernel: ? exc_page_fault+0x69/0x150 kernel: ? asm_exc_page_fault+0x22/0x30 kernel: ? __pfx_hugetlbfs_fill_super+0x10/0x10 kernel: ? hugetlbfs_fill_super+0xb4/0x1a0 kernel: ? hugetlbfs_fill_super+0x28/0x1a0 kernel: ? __pfx_hugetlbfs_fill_super+0x10/0x10 kernel: vfs_get_super+0x40/0xa0 kernel: ? __pfx_bpf_lsm_capable+0x10/0x10 kernel: vfs_get_tree+0x25/0xd0 kernel: vfs_cmd_create+0x64/0xe0 kernel: __x64_sys_fsconfig+0x395/0x410 kernel: do_syscall_64+0x80/0x160 kernel: ? syscall_exit_to_user_mode+0x82/0x240 kernel: ? do_syscall_64+0x8d/0x160 kernel: ? syscall_exit_to_user_mode+0x82/0x240 kernel: ? do_syscall_64+0x8d/0x160 kernel: ? exc_page_fault+0x69/0x150 kernel: entry_SYSCALL_64_after_hwframe+0x6e/0x76 kernel: RIP: 0033:0x7ffbc0cb87c9 kernel: Code: 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 66 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 97 96 0d 00 f7 d8 64 89 01 48 kernel: RSP: 002b:00007ffc29d2f388 EFLAGS: 00000206 ORIG_RAX: 00000000000001af kernel: RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007ffbc0cb87c9 kernel: RDX: 0000000000000000 RSI: 0000000000000006 RDI: 0000000000000003 kernel: RBP: 00007ffc29d2f3b0 R08: 0000000000000000 R09: 0000000000000000 kernel: R10: 0000000000000000 R11: 0000000000000206 R12: 0000000000000000 kernel: R13: 00007ffc29d2f4c0 R14: 0000000000000000 R15: 0000000000000000 kernel: </TASK> kernel: Modules linked in: rpcsec_gss_krb5(E) auth_rpcgss(E) nfsv4(E) dns_resolver(E) nfs(E) lockd(E) grace(E) sunrpc(E) netfs(E) af_packet(E) bridge(E) stp(E) llc(E) iscsi_ibft(E) iscsi_boot_sysfs(E) intel_rapl_msr(E) intel_rapl_common(E) iTCO_wdt(E) intel_pmc_bxt(E) sb_edac(E) iTCO_vendor_support(E) x86_pkg_temp_thermal(E) intel_powerclamp(E) coretemp(E) kvm_intel(E) rfkill(E) ipmi_ssif(E) kvm(E) acpi_ipmi(E) irqbypass(E) pcspkr(E) igb(E) ipmi_si(E) mei_me(E) i2c_i801(E) joydev(E) intel_pch_thermal(E) i2c_smbus(E) dca(E) lpc_ich(E) mei(E) ipmi_devintf(E) ipmi_msghandler(E) acpi_pad(E) tiny_power_button(E) button(E) fuse(E) efi_pstore(E) configfs(E) ip_tables(E) x_tables(E) ext4(E) mbcache(E) jbd2(E) hid_generic(E) usbhid(E) sd_mod(E) t10_pi(E) crct10dif_pclmul(E) crc32_pclmul(E) crc32c_intel(E) polyval_clmulni(E) ahci(E) xhci_pci(E) polyval_generic(E) gf128mul(E) ghash_clmulni_intel(E) sha512_ssse3(E) sha256_ssse3(E) xhci_pci_renesas(E) libahci(E) ehci_pci(E) sha1_ssse3(E) xhci_hcd(E) ehci_hcd(E) libata(E) kernel: mgag200(E) i2c_algo_bit(E) usbcore(E) wmi(E) sg(E) dm_multipath(E) dm_mod(E) scsi_dh_rdac(E) scsi_dh_emc(E) scsi_dh_alua(E) scsi_mod(E) scsi_common(E) aesni_intel(E) crypto_simd(E) cryptd(E) kernel: Unloaded tainted modules: acpi_cpufreq(E):1 fjes(E):1 kernel: CR2: 0000000000000028 kernel: ---[ end trace 0000000000000000 ]--- kernel: RIP: 0010:hugetlbfs_fill_super+0xb4/0x1a0 kernel: Code: 48 8b 3b e8 3e c6 ed ff 48 85 c0 48 89 45 20 0f 84 d6 00 00 00 48 b8 ff ff ff ff ff ff ff 7f 4c 89 e7 49 89 44 24 20 48 8b 03 <8b> 48 28 b8 00 10 00 00 48 d3 e0 49 89 44 24 18 48 8b 03 8b 40 28 kernel: RSP: 0018:ffffbe9960fcbd48 EFLAGS: 00010246 kernel: RAX: 0000000000000000 RBX: ffff9af5272ae780 RCX: 0000000000372004 kernel: RDX: ffffffffffffffff RSI: ffffffffffffffff RDI: ffff9af555e9b000 kernel: RBP: ffff9af52ee66b00 R08: 0000000000000040 R09: 0000000000370004 kernel: R10: ffffbe9960fcbd48 R11: 0000000000000040 R12: ffff9af555e9b000 kernel: R13: ffffffffa66b86c0 R14: ffff9af507d2f400 R15: ffff9af507d2f400 kernel: FS: 00007ffbc0ba4740(0000) GS:ffff9b0bd7000000(0000) knlGS:0000000000000000 kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 kernel: CR2: 0000000000000028 CR3: 00000001b1ee0000 CR4: 00000000001506f0 Link: https://lkml.kernel.org/r/20240130210418.3771-1-osalvador@suse.de Fixes: 3202198 ("hugetlbfs: Convert to fs_context") Signed-off-by: Michal Hocko <mhocko@suse.com> Signed-off-by: Oscar Salvador <osalvador@suse.de> Acked-by: Muchun Song <muchun.song@linux.dev> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
When the skb is reorganized during esp_output (!esp->inline), the pages coming from the original skb fragments are supposed to be released back to the system through put_page. But if the skb fragment pages are originating from a page_pool, calling put_page on them will trigger a page_pool leak which will eventually result in a crash. This leak can be easily observed when using CONFIG_DEBUG_VM and doing ipsec + gre (non offloaded) forwarding: BUG: Bad page state in process ksoftirqd/16 pfn:1451b6 page:00000000de2b8d32 refcount:0 mapcount:0 mapping:0000000000000000 index:0x1451b6000 pfn:0x1451b6 flags: 0x200000000000000(node=0|zone=2) page_type: 0xffffffff() raw: 0200000000000000 dead000000000040 ffff88810d23c000 0000000000000000 raw: 00000001451b6000 0000000000000001 00000000ffffffff 0000000000000000 page dumped because: page_pool leak Modules linked in: ip_gre gre mlx5_ib mlx5_core xt_conntrack xt_MASQUERADE nf_conntrack_netlink nfnetlink iptable_nat nf_nat xt_addrtype br_netfilter rpcrdma rdma_ucm ib_iser libiscsi scsi_transport_iscsi ib_umad rdma_cm ib_ipoib iw_cm ib_cm ib_uverbs ib_core overlay zram zsmalloc fuse [last unloaded: mlx5_core] CPU: 16 PID: 96 Comm: ksoftirqd/16 Not tainted 6.8.0-rc4+ #22 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014 Call Trace: <TASK> dump_stack_lvl+0x36/0x50 bad_page+0x70/0xf0 free_unref_page_prepare+0x27a/0x460 free_unref_page+0x38/0x120 esp_ssg_unref.isra.0+0x15f/0x200 esp_output_tail+0x66d/0x780 esp_xmit+0x2c5/0x360 validate_xmit_xfrm+0x313/0x370 ? validate_xmit_skb+0x1d/0x330 validate_xmit_skb_list+0x4c/0x70 sch_direct_xmit+0x23e/0x350 __dev_queue_xmit+0x337/0xba0 ? nf_hook_slow+0x3f/0xd0 ip_finish_output2+0x25e/0x580 iptunnel_xmit+0x19b/0x240 ip_tunnel_xmit+0x5fb/0xb60 ipgre_xmit+0x14d/0x280 [ip_gre] dev_hard_start_xmit+0xc3/0x1c0 __dev_queue_xmit+0x208/0xba0 ? nf_hook_slow+0x3f/0xd0 ip_finish_output2+0x1ca/0x580 ip_sublist_rcv_finish+0x32/0x40 ip_sublist_rcv+0x1b2/0x1f0 ? ip_rcv_finish_core.constprop.0+0x460/0x460 ip_list_rcv+0x103/0x130 __netif_receive_skb_list_core+0x181/0x1e0 netif_receive_skb_list_internal+0x1b3/0x2c0 napi_gro_receive+0xc8/0x200 gro_cell_poll+0x52/0x90 __napi_poll+0x25/0x1a0 net_rx_action+0x28e/0x300 __do_softirq+0xc3/0x276 ? sort_range+0x20/0x20 run_ksoftirqd+0x1e/0x30 smpboot_thread_fn+0xa6/0x130 kthread+0xcd/0x100 ? kthread_complete_and_exit+0x20/0x20 ret_from_fork+0x31/0x50 ? kthread_complete_and_exit+0x20/0x20 ret_from_fork_asm+0x11/0x20 </TASK> The suggested fix is to introduce a new wrapper (skb_page_unref) that covers page refcounting for page_pool pages as well. Cc: stable@vger.kernel.org Fixes: 6a5bcd8 ("page_pool: Allow drivers to hint on SKB recycling") Reported-and-tested-by: Anatoli N.Chechelnickiy <Anatoli.Chechelnickiy@m.interpipe.biz> Reported-by: Ian Kumlien <ian.kumlien@gmail.com> Link: https://lore.kernel.org/netdev/CAA85sZvvHtrpTQRqdaOx6gd55zPAVsqMYk_Lwh4Md5knTq7AyA@mail.gmail.com Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com> Reviewed-by: Mina Almasry <almasrymina@google.com> Reviewed-by: Jakub Kicinski <kuba@kernel.org> Acked-by: Ilias Apalodimas <ilias.apalodimas@linaro.org> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
ui_browser__show() is capturing the input title that is stack allocated memory in hist_browser__run(). Avoid a use after return by strdup-ing the string. Committer notes: Further explanation from Ian Rogers: My command line using tui is: $ sudo bash -c 'rm /tmp/asan.log*; export ASAN_OPTIONS="log_path=/tmp/asan.log"; /tmp/perf/perf mem record -a sleep 1; /tmp/perf/perf mem report' I then go to the perf annotate view and quit. This triggers the asan error (from the log file): ``` ==1254591==ERROR: AddressSanitizer: stack-use-after-return on address 0x7f2813331920 at pc 0x7f28180 65991 bp 0x7fff0a21c750 sp 0x7fff0a21bf10 READ of size 80 at 0x7f2813331920 thread T0 #0 0x7f2818065990 in __interceptor_strlen ../../../../src/libsanitizer/sanitizer_common/sanitizer_common_interceptors.inc:461 #1 0x7f2817698251 in SLsmg_write_wrapped_string (/lib/x86_64-linux-gnu/libslang.so.2+0x98251) #2 0x7f28176984b9 in SLsmg_write_nstring (/lib/x86_64-linux-gnu/libslang.so.2+0x984b9) #3 0x55c94045b365 in ui_browser__write_nstring ui/browser.c:60 #4 0x55c94045c558 in __ui_browser__show_title ui/browser.c:266 #5 0x55c94045c776 in ui_browser__show ui/browser.c:288 #6 0x55c94045c06d in ui_browser__handle_resize ui/browser.c:206 #7 0x55c94047979b in do_annotate ui/browsers/hists.c:2458 #8 0x55c94047fb17 in evsel__hists_browse ui/browsers/hists.c:3412 #9 0x55c940480a0c in perf_evsel_menu__run ui/browsers/hists.c:3527 #10 0x55c940481108 in __evlist__tui_browse_hists ui/browsers/hists.c:3613 #11 0x55c9404813f7 in evlist__tui_browse_hists ui/browsers/hists.c:3661 #12 0x55c93ffa253f in report__browse_hists tools/perf/builtin-report.c:671 #13 0x55c93ffa58ca in __cmd_report tools/perf/builtin-report.c:1141 #14 0x55c93ffaf159 in cmd_report tools/perf/builtin-report.c:1805 #15 0x55c94000c05c in report_events tools/perf/builtin-mem.c:374 #16 0x55c94000d96d in cmd_mem tools/perf/builtin-mem.c:516 #17 0x55c9400e44ee in run_builtin tools/perf/perf.c:350 #18 0x55c9400e4a5a in handle_internal_command tools/perf/perf.c:403 #19 0x55c9400e4e22 in run_argv tools/perf/perf.c:447 #20 0x55c9400e53ad in main tools/perf/perf.c:561 #21 0x7f28170456c9 in __libc_start_call_main ../sysdeps/nptl/libc_start_call_main.h:58 #22 0x7f2817045784 in __libc_start_main_impl ../csu/libc-start.c:360 #23 0x55c93ff544c0 in _start (/tmp/perf/perf+0x19a4c0) (BuildId: 84899b0e8c7d3a3eaa67b2eb35e3d8b2f8cd4c93) Address 0x7f2813331920 is located in stack of thread T0 at offset 32 in frame #0 0x55c94046e85e in hist_browser__run ui/browsers/hists.c:746 This frame has 1 object(s): [32, 192) 'title' (line 747) <== Memory access at offset 32 is inside this variable HINT: this may be a false positive if your program uses some custom stack unwind mechanism, swapcontext or vfork ``` hist_browser__run isn't on the stack so the asan error looks legit. There's no clean init/exit on struct ui_browser so I may be trading a use-after-return for a memory leak, but that seems look a good trade anyway. Fixes: 05e8b08 ("perf ui browser: Stop using 'self'") Signed-off-by: Ian Rogers <irogers@google.com> Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Andi Kleen <ak@linux.intel.com> Cc: Athira Rajeev <atrajeev@linux.vnet.ibm.com> Cc: Ben Gainey <ben.gainey@arm.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: James Clark <james.clark@arm.com> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Kajol Jain <kjain@linux.ibm.com> Cc: Kan Liang <kan.liang@linux.intel.com> Cc: K Prateek Nayak <kprateek.nayak@amd.com> Cc: Li Dong <lidong@vivo.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Oliver Upton <oliver.upton@linux.dev> Cc: Paran Lee <p4ranlee@gmail.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Ravi Bangoria <ravi.bangoria@amd.com> Cc: Sun Haiyong <sunhaiyong@loongson.cn> Cc: Tim Chen <tim.c.chen@linux.intel.com> Cc: Yanteng Si <siyanteng@loongson.cn> Cc: Yicong Yang <yangyicong@hisilicon.com> Link: https://lore.kernel.org/r/20240507183545.1236093-2-irogers@google.com Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
…uddy pages When I did memory failure tests recently, below panic occurs: page: refcount:0 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x8cee00 flags: 0x6fffe0000000000(node=1|zone=2|lastcpupid=0x7fff) raw: 06fffe0000000000 dead000000000100 dead000000000122 0000000000000000 raw: 0000000000000000 0000000000000009 00000000ffffffff 0000000000000000 page dumped because: VM_BUG_ON_PAGE(!PageBuddy(page)) ------------[ cut here ]------------ kernel BUG at include/linux/page-flags.h:1009! invalid opcode: 0000 [#1] PREEMPT SMP NOPTI RIP: 0010:__del_page_from_free_list+0x151/0x180 RSP: 0018:ffffa49c90437998 EFLAGS: 00000046 RAX: 0000000000000035 RBX: 0000000000000009 RCX: ffff8dd8dfd1c9c8 RDX: 0000000000000000 RSI: 0000000000000027 RDI: ffff8dd8dfd1c9c0 RBP: ffffd901233b8000 R08: ffffffffab5511f8 R09: 0000000000008c69 R10: 0000000000003c15 R11: ffffffffab5511f8 R12: ffff8dd8fffc0c80 R13: 0000000000000001 R14: ffff8dd8fffc0c80 R15: 0000000000000009 FS: 00007ff916304740(0000) GS:ffff8dd8dfd00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 000055eae50124c8 CR3: 00000008479e0000 CR4: 00000000000006f0 Call Trace: <TASK> __rmqueue_pcplist+0x23b/0x520 get_page_from_freelist+0x26b/0xe40 __alloc_pages_noprof+0x113/0x1120 __folio_alloc_noprof+0x11/0xb0 alloc_buddy_hugetlb_folio.isra.0+0x5a/0x130 __alloc_fresh_hugetlb_folio+0xe7/0x140 alloc_pool_huge_folio+0x68/0x100 set_max_huge_pages+0x13d/0x340 hugetlb_sysctl_handler_common+0xe8/0x110 proc_sys_call_handler+0x194/0x280 vfs_write+0x387/0x550 ksys_write+0x64/0xe0 do_syscall_64+0xc2/0x1d0 entry_SYSCALL_64_after_hwframe+0x77/0x7f RIP: 0033:0x7ff916114887 RSP: 002b:00007ffec8a2fd78 EFLAGS: 00000246 ORIG_RAX: 0000000000000001 RAX: ffffffffffffffda RBX: 000055eae500e350 RCX: 00007ff916114887 RDX: 0000000000000004 RSI: 000055eae500e390 RDI: 0000000000000003 RBP: 000055eae50104c0 R08: 0000000000000000 R09: 000055eae50104c0 R10: 0000000000000077 R11: 0000000000000246 R12: 0000000000000004 R13: 0000000000000004 R14: 00007ff916216b80 R15: 00007ff916216a00 </TASK> Modules linked in: mce_inject hwpoison_inject ---[ end trace 0000000000000000 ]--- And before the panic, there had an warning about bad page state: BUG: Bad page state in process page-types pfn:8cee00 page: refcount:0 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x8cee00 flags: 0x6fffe0000000000(node=1|zone=2|lastcpupid=0x7fff) page_type: 0xffffff7f(buddy) raw: 06fffe0000000000 ffffd901241c0008 ffffd901240f8008 0000000000000000 raw: 0000000000000000 0000000000000009 00000000ffffff7f 0000000000000000 page dumped because: nonzero mapcount Modules linked in: mce_inject hwpoison_inject CPU: 8 PID: 154211 Comm: page-types Not tainted 6.9.0-rc4-00499-g5544ec3178e2-dirty #22 Call Trace: <TASK> dump_stack_lvl+0x83/0xa0 bad_page+0x63/0xf0 free_unref_page+0x36e/0x5c0 unpoison_memory+0x50b/0x630 simple_attr_write_xsigned.constprop.0.isra.0+0xb3/0x110 debugfs_attr_write+0x42/0x60 full_proxy_write+0x5b/0x80 vfs_write+0xcd/0x550 ksys_write+0x64/0xe0 do_syscall_64+0xc2/0x1d0 entry_SYSCALL_64_after_hwframe+0x77/0x7f RIP: 0033:0x7f189a514887 RSP: 002b:00007ffdcd899718 EFLAGS: 00000246 ORIG_RAX: 0000000000000001 RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f189a514887 RDX: 0000000000000009 RSI: 00007ffdcd899730 RDI: 0000000000000003 RBP: 00007ffdcd8997a0 R08: 0000000000000000 R09: 00007ffdcd8994b2 R10: 0000000000000000 R11: 0000000000000246 R12: 00007ffdcda199a8 R13: 0000000000404af1 R14: 000000000040ad78 R15: 00007f189a7a5040 </TASK> The root cause should be the below race: memory_failure try_memory_failure_hugetlb me_huge_page __page_handle_poison dissolve_free_hugetlb_folio drain_all_pages -- Buddy page can be isolated e.g. for compaction. take_page_off_buddy -- Failed as page is not in the buddy list. -- Page can be putback into buddy after compaction. page_ref_inc -- Leads to buddy page with refcnt = 1. Then unpoison_memory() can unpoison the page and send the buddy page back into buddy list again leading to the above bad page state warning. And bad_page() will call page_mapcount_reset() to remove PageBuddy from buddy page leading to later VM_BUG_ON_PAGE(!PageBuddy(page)) when trying to allocate this page. Fix this issue by only treating __page_handle_poison() as successful when it returns 1. Link: https://lkml.kernel.org/r/20240523071217.1696196-1-linmiaohe@huawei.com Fixes: ceaf8fb ("mm, hwpoison: skip raw hwpoison page in freeing 1GB hugepage") Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Cc: Naoya Horiguchi <nao.horiguchi@gmail.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
This updates iso_sock_accept to use nested locking for the parent socket, to avoid lockdep warnings caused because the parent and child sockets are locked by the same thread: [ 41.585683] ============================================ [ 41.585688] WARNING: possible recursive locking detected [ 41.585694] 6.12.0-rc6+ #22 Not tainted [ 41.585701] -------------------------------------------- [ 41.585705] iso-tester/3139 is trying to acquire lock: [ 41.585711] ffff988b29530a58 (sk_lock-AF_BLUETOOTH) at: bt_accept_dequeue+0xe3/0x280 [bluetooth] [ 41.585905] but task is already holding lock: [ 41.585909] ffff988b29533a58 (sk_lock-AF_BLUETOOTH) at: iso_sock_accept+0x61/0x2d0 [bluetooth] [ 41.586064] other info that might help us debug this: [ 41.586069] Possible unsafe locking scenario: [ 41.586072] CPU0 [ 41.586076] ---- [ 41.586079] lock(sk_lock-AF_BLUETOOTH); [ 41.586086] lock(sk_lock-AF_BLUETOOTH); [ 41.586093] *** DEADLOCK *** [ 41.586097] May be due to missing lock nesting notation [ 41.586101] 1 lock held by iso-tester/3139: [ 41.586107] #0: ffff988b29533a58 (sk_lock-AF_BLUETOOTH) at: iso_sock_accept+0x61/0x2d0 [bluetooth] Fixes: ccf74f2 ("Bluetooth: Add BTPROTO_ISO socket type") Signed-off-by: Iulia Tanasescu <iulia.tanasescu@nxp.com> Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
This fixes the circular locking dependency warning below, by releasing the socket lock before enterning iso_listen_bis, to avoid any potential deadlock with hdev lock. [ 75.307983] ====================================================== [ 75.307984] WARNING: possible circular locking dependency detected [ 75.307985] 6.12.0-rc6+ #22 Not tainted [ 75.307987] ------------------------------------------------------ [ 75.307987] kworker/u81:2/2623 is trying to acquire lock: [ 75.307988] ffff8fde1769da58 (sk_lock-AF_BLUETOOTH-BTPROTO_ISO) at: iso_connect_cfm+0x253/0x840 [bluetooth] [ 75.308021] but task is already holding lock: [ 75.308022] ffff8fdd61a10078 (&hdev->lock) at: hci_le_per_adv_report_evt+0x47/0x2f0 [bluetooth] [ 75.308053] which lock already depends on the new lock. [ 75.308054] the existing dependency chain (in reverse order) is: [ 75.308055] -> #1 (&hdev->lock){+.+.}-{3:3}: [ 75.308057] __mutex_lock+0xad/0xc50 [ 75.308061] mutex_lock_nested+0x1b/0x30 [ 75.308063] iso_sock_listen+0x143/0x5c0 [bluetooth] [ 75.308085] __sys_listen_socket+0x49/0x60 [ 75.308088] __x64_sys_listen+0x4c/0x90 [ 75.308090] x64_sys_call+0x2517/0x25f0 [ 75.308092] do_syscall_64+0x87/0x150 [ 75.308095] entry_SYSCALL_64_after_hwframe+0x76/0x7e [ 75.308098] -> #0 (sk_lock-AF_BLUETOOTH-BTPROTO_ISO){+.+.}-{0:0}: [ 75.308100] __lock_acquire+0x155e/0x25f0 [ 75.308103] lock_acquire+0xc9/0x300 [ 75.308105] lock_sock_nested+0x32/0x90 [ 75.308107] iso_connect_cfm+0x253/0x840 [bluetooth] [ 75.308128] hci_connect_cfm+0x6c/0x190 [bluetooth] [ 75.308155] hci_le_per_adv_report_evt+0x27b/0x2f0 [bluetooth] [ 75.308180] hci_le_meta_evt+0xe7/0x200 [bluetooth] [ 75.308206] hci_event_packet+0x21f/0x5c0 [bluetooth] [ 75.308230] hci_rx_work+0x3ae/0xb10 [bluetooth] [ 75.308254] process_one_work+0x212/0x740 [ 75.308256] worker_thread+0x1bd/0x3a0 [ 75.308258] kthread+0xe4/0x120 [ 75.308259] ret_from_fork+0x44/0x70 [ 75.308261] ret_from_fork_asm+0x1a/0x30 [ 75.308263] other info that might help us debug this: [ 75.308264] Possible unsafe locking scenario: [ 75.308264] CPU0 CPU1 [ 75.308265] ---- ---- [ 75.308265] lock(&hdev->lock); [ 75.308267] lock(sk_lock- AF_BLUETOOTH-BTPROTO_ISO); [ 75.308268] lock(&hdev->lock); [ 75.308269] lock(sk_lock-AF_BLUETOOTH-BTPROTO_ISO); [ 75.308270] *** DEADLOCK *** [ 75.308271] 4 locks held by kworker/u81:2/2623: [ 75.308272] #0: ffff8fdd66e52148 ((wq_completion)hci0#2){+.+.}-{0:0}, at: process_one_work+0x443/0x740 [ 75.308276] #1: ffffafb488b7fe48 ((work_completion)(&hdev->rx_work)), at: process_one_work+0x1ce/0x740 [ 75.308280] #2: ffff8fdd61a10078 (&hdev->lock){+.+.}-{3:3} at: hci_le_per_adv_report_evt+0x47/0x2f0 [bluetooth] [ 75.308304] #3: ffffffffb6ba4900 (rcu_read_lock){....}-{1:2}, at: hci_connect_cfm+0x29/0x190 [bluetooth] Fixes: 02171da ("Bluetooth: ISO: Add hcon for listening bis sk") Signed-off-by: Iulia Tanasescu <iulia.tanasescu@nxp.com> Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
Pull request for series with
subject: scripts/pahole-flags.sh: Enable parallelization of pahole.
version: 2
url: https://patchwork.kernel.org/project/netdevbpf/list/?series=615083