-
Notifications
You must be signed in to change notification settings - Fork 6
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
bpf, arm64: support more atomic ops #24
Conversation
Master branch: 9b6eb04 |
Master branch: 1b8c924 |
620c991
to
3c5b6af
Compare
Master branch: 9e98ace |
3c5b6af
to
b057704
Compare
9e98ace
to
1b8c924
Compare
Master branch: 1b8c924 |
b057704
to
ea4dbf4
Compare
Master branch: b38101c |
ea4dbf4
to
7880cd5
Compare
Master branch: b75daca |
7880cd5
to
dbde359
Compare
Master branch: d24d2a2 |
dbde359
to
a16a4f4
Compare
Master branch: 086d490 |
a16a4f4
to
2d2135a
Compare
When bringing down the netdevice or system shutdown, a panic can be triggered while accessing the sysfs path because the device is already removed. [ 755.549084] mlx5_core 0000:12:00.1: Shutdown was called [ 756.404455] mlx5_core 0000:12:00.0: Shutdown was called ... [ 757.937260] BUG: unable to handle kernel NULL pointer dereference at (null) [ 758.031397] IP: [<ffffffff8ee11acb>] dma_pool_alloc+0x1ab/0x280 crash> bt ... PID: 12649 TASK: ffff8924108f2100 CPU: 1 COMMAND: "amsd" ... #9 [ffff89240e1a38b0] page_fault at ffffffff8f38c778 [exception RIP: dma_pool_alloc+0x1ab] RIP: ffffffff8ee11acb RSP: ffff89240e1a3968 RFLAGS: 00010046 RAX: 0000000000000246 RBX: ffff89243d874100 RCX: 0000000000001000 RDX: 0000000000000000 RSI: 0000000000000246 RDI: ffff89243d874090 RBP: ffff89240e1a39c0 R8: 000000000001f080 R9: ffff8905ffc03c00 R10: ffffffffc04680d4 R11: ffffffff8edde9fd R12: 00000000000080d0 R13: ffff89243d874090 R14: ffff89243d874080 R15: 0000000000000000 ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018 #10 [ffff89240e1a39c8] mlx5_alloc_cmd_msg at ffffffffc04680f3 [mlx5_core] #11 [ffff89240e1a3a18] cmd_exec at ffffffffc046ad62 [mlx5_core] #12 [ffff89240e1a3ab8] mlx5_cmd_exec at ffffffffc046b4fb [mlx5_core] #13 [ffff89240e1a3ae8] mlx5_core_access_reg at ffffffffc0475434 [mlx5_core] #14 [ffff89240e1a3b40] mlx5e_get_fec_caps at ffffffffc04a7348 [mlx5_core] #15 [ffff89240e1a3bb0] get_fec_supported_advertised at ffffffffc04992bf [mlx5_core] #16 [ffff89240e1a3c08] mlx5e_get_link_ksettings at ffffffffc049ab36 [mlx5_core] #17 [ffff89240e1a3ce8] __ethtool_get_link_ksettings at ffffffff8f25db46 #18 [ffff89240e1a3d48] speed_show at ffffffff8f277208 #19 [ffff89240e1a3dd8] dev_attr_show at ffffffff8f0b70e3 #20 [ffff89240e1a3df8] sysfs_kf_seq_show at ffffffff8eedbedf #21 [ffff89240e1a3e18] kernfs_seq_show at ffffffff8eeda596 #22 [ffff89240e1a3e28] seq_read at ffffffff8ee76d10 #23 [ffff89240e1a3e98] kernfs_fop_read at ffffffff8eedaef5 #24 [ffff89240e1a3ed8] vfs_read at ffffffff8ee4e3ff #25 [ffff89240e1a3f08] sys_read at ffffffff8ee4f27f #26 [ffff89240e1a3f50] system_call_fastpath at ffffffff8f395f92 crash> net_device.state ffff89443b0c0000 state = 0x5 (__LINK_STATE_START| __LINK_STATE_NOCARRIER) To prevent this scenario, we also make sure that the netdevice is present. Signed-off-by: suresh kumar <suresh2514@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Master branch: 9087c6f |
2d2135a
to
eda3c5f
Compare
Master branch: a33c0c7 |
eda3c5f
to
52cf388
Compare
Master branch: d0b3822 |
52cf388
to
e8e1b5b
Compare
Master branch: b03e194 |
e8e1b5b
to
988fa92
Compare
Master branch: 13c6a37 |
988fa92
to
b580501
Compare
The current implementation of the mov instruction with sign extension has the following problems: 1. It clobbers the source register if it is not stacked because it sign extends the source and then moves it to the destination. 2. If the dst_reg is stacked, the current code doesn't write the value back in case of 64-bit mov. 3. There is room for improvement by emitting fewer instructions. The steps for fixing this and the instructions emitted by the JIT are explained below with examples in all combinations: Case A: offset == 32: ===================== Case A.1: src and dst are stacked registers: -------------------------------------------- 1. Load src_lo into tmp_lo 2. Store tmp_lo into dst_lo 3. Sign extend tmp_lo into tmp_hi 4. Store tmp_hi to dst_hi Example: r3 = (s32)r3 r3 is a stacked register ldr r6, [r11, #-16] // Load r3_lo into tmp_lo // str to dst_lo is not emitted because src_lo == dst_lo asr r7, r6, #31 // Sign extend tmp_lo into tmp_hi str r7, [r11, #-12] // Store tmp_hi into r3_hi Case A.2: src is stacked but dst is not: ---------------------------------------- 1. Load src_lo into dst_lo 2. Sign extend dst_lo into dst_hi Example: r6 = (s32)r3 r6 maps to {ARM_R5, ARM_R4} and r3 is stacked ldr r4, [r11, #-16] // Load r3_lo into r6_lo asr r5, r4, #31 // Sign extend r6_lo into r6_hi Case A.3: src is not stacked but dst is stacked: ------------------------------------------------ 1. Store src_lo into dst_lo 2. Sign extend src_lo into tmp_hi 3. Store tmp_hi to dst_hi Example: r3 = (s32)r6 r3 is stacked and r6 maps to {ARM_R5, ARM_R4} str r4, [r11, #-16] // Store r6_lo to r3_lo asr r7, r4, #31 // Sign extend r6_lo into tmp_hi str r7, [r11, #-12] // Store tmp_hi to dest_hi Case A.4: Both src and dst are not stacked: ------------------------------------------- 1. Mov src_lo into dst_lo 2. Sign extend src_lo into dst_hi Example: (bf) r6 = (s32)r6 r6 maps to {ARM_R5, ARM_R4} // Mov not emitted because dst == src asr r5, r4, #31 // Sign extend r6_lo into r6_hi Case B: offset != 32: ===================== Case B.1: src and dst are stacked registers: -------------------------------------------- 1. Load src_lo into tmp_lo 2. Sign extend tmp_lo according to offset. 3. Store tmp_lo into dst_lo 4. Sign extend tmp_lo into tmp_hi 5. Store tmp_hi to dst_hi Example: r9 = (s8)r3 r9 and r3 are both stacked registers ldr r6, [r11, #-16] // Load r3_lo into tmp_lo lsl r6, r6, #24 // Sign extend tmp_lo asr r6, r6, #24 // .. str r6, [r11, #-56] // Store tmp_lo to r9_lo asr r7, r6, #31 // Sign extend tmp_lo to tmp_hi str r7, [r11, #-52] // Store tmp_hi to r9_hi Case B.2: src is stacked but dst is not: ---------------------------------------- 1. Load src_lo into dst_lo 2. Sign extend dst_lo according to offset. 3. Sign extend tmp_lo into dst_hi Example: r6 = (s8)r3 r6 maps to {ARM_R5, ARM_R4} and r3 is stacked ldr r4, [r11, #-16] // Load r3_lo to r6_lo lsl r4, r4, #24 // Sign extend r6_lo asr r4, r4, #24 // .. asr r5, r4, #31 // Sign extend r6_lo into r6_hi Case B.3: src is not stacked but dst is stacked: ------------------------------------------------ 1. Sign extend src_lo into tmp_lo according to offset. 2. Store tmp_lo into dst_lo. 3. Sign extend src_lo into tmp_hi. 4. Store tmp_hi to dst_hi. Example: r3 = (s8)r1 r3 is stacked and r1 maps to {ARM_R3, ARM_R2} lsl r6, r2, #24 // Sign extend r1_lo to tmp_lo asr r6, r6, #24 // .. str r6, [r11, #-16] // Store tmp_lo to r3_lo asr r7, r6, #31 // Sign extend tmp_lo to tmp_hi str r7, [r11, #-12] // Store tmp_hi to r3_hi Case B.4: Both src and dst are not stacked: ------------------------------------------- 1. Sign extend src_lo into dst_lo according to offset. 2. Sign extend dst_lo into dst_hi. Example: r6 = (s8)r1 r6 maps to {ARM_R5, ARM_R4} and r1 maps to {ARM_R3, ARM_R2} lsl r4, r2, #24 // Sign extend r1_lo to r6_lo asr r4, r4, #24 // .. asr r5, r4, #31 // Sign extend r6_lo to r6_hi Fixes: fc83265 ("arm32, bpf: add support for sign-extension mov instruction") Reported-by: syzbot+186522670e6722692d86@syzkaller.appspotmail.com Signed-off-by: Puranjay Mohan <puranjay@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Reviewed-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> Closes: https://lore.kernel.org/all/000000000000e9a8d80615163f2a@google.com Link: https://lore.kernel.org/bpf/20240419182832.27707-1-puranjay@kernel.org
Inline calls to bpf_get_smp_processor_id() helper in the JIT by emitting a read from struct thread_info. The SP_EL0 system register holds the pointer to the task_struct and thread_info is the first member of this struct. We can read the cpu number from the thread_info. Here is how the ARM64 JITed assembly changes after this commit: ARM64 JIT =========== BEFORE AFTER -------- ------- int cpu = bpf_get_smp_processor_id(); int cpu = bpf_get_smp_processor_id(); mov x10, #0xfffffffffffff4d0 mrs x10, sp_el0 movk x10, #0x802b, lsl #16 ldr w7, [x10, #24] movk x10, #0x8000, lsl #32 blr x10 add x7, x0, #0x0 Performance improvement using benchmark[1] ./benchs/run_bench_trigger.sh glob-arr-inc arr-inc hash-inc +---------------+-------------------+-------------------+--------------+ | Name | Before | After | % change | |---------------+-------------------+-------------------+--------------| | glob-arr-inc | 23.380 ± 1.675M/s | 25.893 ± 0.026M/s | + 10.74% | | arr-inc | 23.928 ± 0.034M/s | 25.213 ± 0.063M/s | + 5.37% | | hash-inc | 12.352 ± 0.005M/s | 12.609 ± 0.013M/s | + 2.08% | +---------------+-------------------+-------------------+--------------+ [1] anakryiko/linux@8dec900975ef Signed-off-by: Puranjay Mohan <puranjay@kernel.org>
Inline calls to bpf_get_smp_processor_id() helper in the JIT by emitting a read from struct thread_info. The SP_EL0 system register holds the pointer to the task_struct and thread_info is the first member of this struct. We can read the cpu number from the thread_info. Here is how the ARM64 JITed assembly changes after this commit: ARM64 JIT =========== BEFORE AFTER -------- ------- int cpu = bpf_get_smp_processor_id(); int cpu = bpf_get_smp_processor_id(); mov x10, #0xfffffffffffff4d0 mrs x10, sp_el0 movk x10, #0x802b, lsl #16 ldr w7, [x10, #24] movk x10, #0x8000, lsl #32 blr x10 add x7, x0, #0x0 Performance improvement using benchmark[1] ./benchs/run_bench_trigger.sh glob-arr-inc arr-inc hash-inc +---------------+-------------------+-------------------+--------------+ | Name | Before | After | % change | |---------------+-------------------+-------------------+--------------| | glob-arr-inc | 23.380 ± 1.675M/s | 25.893 ± 0.026M/s | + 10.74% | | arr-inc | 23.928 ± 0.034M/s | 25.213 ± 0.063M/s | + 5.37% | | hash-inc | 12.352 ± 0.005M/s | 12.609 ± 0.013M/s | + 2.08% | +---------------+-------------------+-------------------+--------------+ [1] anakryiko/linux@8dec900975ef Signed-off-by: Puranjay Mohan <puranjay@kernel.org>
Inline calls to bpf_get_smp_processor_id() helper in the JIT by emitting a read from struct thread_info. The SP_EL0 system register holds the pointer to the task_struct and thread_info is the first member of this struct. We can read the cpu number from the thread_info. Here is how the ARM64 JITed assembly changes after this commit: ARM64 JIT =========== BEFORE AFTER -------- ------- int cpu = bpf_get_smp_processor_id(); int cpu = bpf_get_smp_processor_id(); mov x10, #0xfffffffffffff4d0 mrs x10, sp_el0 movk x10, #0x802b, lsl #16 ldr w7, [x10, #24] movk x10, #0x8000, lsl #32 blr x10 add x7, x0, #0x0 Performance improvement using benchmark[1] ./benchs/run_bench_trigger.sh glob-arr-inc arr-inc hash-inc +---------------+-------------------+-------------------+--------------+ | Name | Before | After | % change | |---------------+-------------------+-------------------+--------------| | glob-arr-inc | 23.380 ± 1.675M/s | 25.893 ± 0.026M/s | + 10.74% | | arr-inc | 23.928 ± 0.034M/s | 25.213 ± 0.063M/s | + 5.37% | | hash-inc | 12.352 ± 0.005M/s | 12.609 ± 0.013M/s | + 2.08% | +---------------+-------------------+-------------------+--------------+ [1] anakryiko/linux@8dec900975ef Signed-off-by: Puranjay Mohan <puranjay@kernel.org> Acked-by: Andrii Nakryiko <andrii@kernel.org>
Inline calls to bpf_get_smp_processor_id() helper in the JIT by emitting a read from struct thread_info. The SP_EL0 system register holds the pointer to the task_struct and thread_info is the first member of this struct. We can read the cpu number from the thread_info. Here is how the ARM64 JITed assembly changes after this commit: ARM64 JIT =========== BEFORE AFTER -------- ------- int cpu = bpf_get_smp_processor_id(); int cpu = bpf_get_smp_processor_id(); mov x10, #0xfffffffffffff4d0 mrs x10, sp_el0 movk x10, #0x802b, lsl #16 ldr w7, [x10, #24] movk x10, #0x8000, lsl #32 blr x10 add x7, x0, #0x0 Performance improvement using benchmark[1] ./benchs/run_bench_trigger.sh glob-arr-inc arr-inc hash-inc +---------------+-------------------+-------------------+--------------+ | Name | Before | After | % change | |---------------+-------------------+-------------------+--------------| | glob-arr-inc | 23.380 ± 1.675M/s | 25.893 ± 0.026M/s | + 10.74% | | arr-inc | 23.928 ± 0.034M/s | 25.213 ± 0.063M/s | + 5.37% | | hash-inc | 12.352 ± 0.005M/s | 12.609 ± 0.013M/s | + 2.08% | +---------------+-------------------+-------------------+--------------+ [1] anakryiko/linux@8dec900975ef Signed-off-by: Puranjay Mohan <puranjay@kernel.org> Acked-by: Andrii Nakryiko <andrii@kernel.org>
Inline calls to bpf_get_smp_processor_id() helper in the JIT by emitting a read from struct thread_info. The SP_EL0 system register holds the pointer to the task_struct and thread_info is the first member of this struct. We can read the cpu number from the thread_info. Here is how the ARM64 JITed assembly changes after this commit: ARM64 JIT =========== BEFORE AFTER -------- ------- int cpu = bpf_get_smp_processor_id(); int cpu = bpf_get_smp_processor_id(); mov x10, #0xfffffffffffff4d0 mrs x10, sp_el0 movk x10, #0x802b, lsl #16 ldr w7, [x10, #24] movk x10, #0x8000, lsl #32 blr x10 add x7, x0, #0x0 Performance improvement using benchmark[1] ./benchs/run_bench_trigger.sh glob-arr-inc arr-inc hash-inc +---------------+-------------------+-------------------+--------------+ | Name | Before | After | % change | |---------------+-------------------+-------------------+--------------| | glob-arr-inc | 23.380 ± 1.675M/s | 25.893 ± 0.026M/s | + 10.74% | | arr-inc | 23.928 ± 0.034M/s | 25.213 ± 0.063M/s | + 5.37% | | hash-inc | 12.352 ± 0.005M/s | 12.609 ± 0.013M/s | + 2.08% | +---------------+-------------------+-------------------+--------------+ [1] anakryiko/linux@8dec900975ef Signed-off-by: Puranjay Mohan <puranjay@kernel.org> Acked-by: Andrii Nakryiko <andrii@kernel.org>
Inline calls to bpf_get_smp_processor_id() helper in the JIT by emitting a read from struct thread_info. The SP_EL0 system register holds the pointer to the task_struct and thread_info is the first member of this struct. We can read the cpu number from the thread_info. Here is how the ARM64 JITed assembly changes after this commit: ARM64 JIT =========== BEFORE AFTER -------- ------- int cpu = bpf_get_smp_processor_id(); int cpu = bpf_get_smp_processor_id(); mov x10, #0xfffffffffffff4d0 mrs x10, sp_el0 movk x10, #0x802b, lsl #16 ldr w7, [x10, #24] movk x10, #0x8000, lsl #32 blr x10 add x7, x0, #0x0 Performance improvement using benchmark[1] ./benchs/run_bench_trigger.sh glob-arr-inc arr-inc hash-inc +---------------+-------------------+-------------------+--------------+ | Name | Before | After | % change | |---------------+-------------------+-------------------+--------------| | glob-arr-inc | 23.380 ± 1.675M/s | 25.893 ± 0.026M/s | + 10.74% | | arr-inc | 23.928 ± 0.034M/s | 25.213 ± 0.063M/s | + 5.37% | | hash-inc | 12.352 ± 0.005M/s | 12.609 ± 0.013M/s | + 2.08% | +---------------+-------------------+-------------------+--------------+ [1] anakryiko/linux@8dec900975ef Signed-off-by: Puranjay Mohan <puranjay@kernel.org> Acked-by: Andrii Nakryiko <andrii@kernel.org>
Inline calls to bpf_get_smp_processor_id() helper in the JIT by emitting a read from struct thread_info. The SP_EL0 system register holds the pointer to the task_struct and thread_info is the first member of this struct. We can read the cpu number from the thread_info. Here is how the ARM64 JITed assembly changes after this commit: ARM64 JIT =========== BEFORE AFTER -------- ------- int cpu = bpf_get_smp_processor_id(); int cpu = bpf_get_smp_processor_id(); mov x10, #0xfffffffffffff4d0 mrs x10, sp_el0 movk x10, #0x802b, lsl #16 ldr w7, [x10, #24] movk x10, #0x8000, lsl #32 blr x10 add x7, x0, #0x0 Performance improvement using benchmark[1] ./benchs/run_bench_trigger.sh glob-arr-inc arr-inc hash-inc +---------------+-------------------+-------------------+--------------+ | Name | Before | After | % change | |---------------+-------------------+-------------------+--------------| | glob-arr-inc | 23.380 ± 1.675M/s | 25.893 ± 0.026M/s | + 10.74% | | arr-inc | 23.928 ± 0.034M/s | 25.213 ± 0.063M/s | + 5.37% | | hash-inc | 12.352 ± 0.005M/s | 12.609 ± 0.013M/s | + 2.08% | +---------------+-------------------+-------------------+--------------+ [1] anakryiko/linux@8dec900975ef Signed-off-by: Puranjay Mohan <puranjay@kernel.org> Acked-by: Andrii Nakryiko <andrii@kernel.org>
Inline calls to bpf_get_smp_processor_id() helper in the JIT by emitting a read from struct thread_info. The SP_EL0 system register holds the pointer to the task_struct and thread_info is the first member of this struct. We can read the cpu number from the thread_info. Here is how the ARM64 JITed assembly changes after this commit: ARM64 JIT =========== BEFORE AFTER -------- ------- int cpu = bpf_get_smp_processor_id(); int cpu = bpf_get_smp_processor_id(); mov x10, #0xfffffffffffff4d0 mrs x10, sp_el0 movk x10, #0x802b, lsl #16 ldr w7, [x10, #24] movk x10, #0x8000, lsl #32 blr x10 add x7, x0, #0x0 Performance improvement using benchmark[1] ./benchs/run_bench_trigger.sh glob-arr-inc arr-inc hash-inc +---------------+-------------------+-------------------+--------------+ | Name | Before | After | % change | |---------------+-------------------+-------------------+--------------| | glob-arr-inc | 23.380 ± 1.675M/s | 25.893 ± 0.026M/s | + 10.74% | | arr-inc | 23.928 ± 0.034M/s | 25.213 ± 0.063M/s | + 5.37% | | hash-inc | 12.352 ± 0.005M/s | 12.609 ± 0.013M/s | + 2.08% | +---------------+-------------------+-------------------+--------------+ [1] anakryiko/linux@8dec900975ef Signed-off-by: Puranjay Mohan <puranjay@kernel.org> Acked-by: Andrii Nakryiko <andrii@kernel.org>
Inline calls to bpf_get_smp_processor_id() helper in the JIT by emitting a read from struct thread_info. The SP_EL0 system register holds the pointer to the task_struct and thread_info is the first member of this struct. We can read the cpu number from the thread_info. Here is how the ARM64 JITed assembly changes after this commit: ARM64 JIT =========== BEFORE AFTER -------- ------- int cpu = bpf_get_smp_processor_id(); int cpu = bpf_get_smp_processor_id(); mov x10, #0xfffffffffffff4d0 mrs x10, sp_el0 movk x10, #0x802b, lsl #16 ldr w7, [x10, #24] movk x10, #0x8000, lsl #32 blr x10 add x7, x0, #0x0 Performance improvement using benchmark[1] ./benchs/run_bench_trigger.sh glob-arr-inc arr-inc hash-inc +---------------+-------------------+-------------------+--------------+ | Name | Before | After | % change | |---------------+-------------------+-------------------+--------------| | glob-arr-inc | 23.380 ± 1.675M/s | 25.893 ± 0.026M/s | + 10.74% | | arr-inc | 23.928 ± 0.034M/s | 25.213 ± 0.063M/s | + 5.37% | | hash-inc | 12.352 ± 0.005M/s | 12.609 ± 0.013M/s | + 2.08% | +---------------+-------------------+-------------------+--------------+ [1] anakryiko/linux@8dec900975ef Signed-off-by: Puranjay Mohan <puranjay@kernel.org> Acked-by: Andrii Nakryiko <andrii@kernel.org>
Inline calls to bpf_get_smp_processor_id() helper in the JIT by emitting a read from struct thread_info. The SP_EL0 system register holds the pointer to the task_struct and thread_info is the first member of this struct. We can read the cpu number from the thread_info. Here is how the ARM64 JITed assembly changes after this commit: ARM64 JIT =========== BEFORE AFTER -------- ------- int cpu = bpf_get_smp_processor_id(); int cpu = bpf_get_smp_processor_id(); mov x10, #0xfffffffffffff4d0 mrs x10, sp_el0 movk x10, #0x802b, lsl #16 ldr w7, [x10, #24] movk x10, #0x8000, lsl #32 blr x10 add x7, x0, #0x0 Performance improvement using benchmark[1] ./benchs/run_bench_trigger.sh glob-arr-inc arr-inc hash-inc +---------------+-------------------+-------------------+--------------+ | Name | Before | After | % change | |---------------+-------------------+-------------------+--------------| | glob-arr-inc | 23.380 ± 1.675M/s | 25.893 ± 0.026M/s | + 10.74% | | arr-inc | 23.928 ± 0.034M/s | 25.213 ± 0.063M/s | + 5.37% | | hash-inc | 12.352 ± 0.005M/s | 12.609 ± 0.013M/s | + 2.08% | +---------------+-------------------+-------------------+--------------+ [1] anakryiko/linux@8dec900975ef Signed-off-by: Puranjay Mohan <puranjay@kernel.org> Acked-by: Andrii Nakryiko <andrii@kernel.org>
Inline calls to bpf_get_smp_processor_id() helper in the JIT by emitting a read from struct thread_info. The SP_EL0 system register holds the pointer to the task_struct and thread_info is the first member of this struct. We can read the cpu number from the thread_info. Here is how the ARM64 JITed assembly changes after this commit: ARM64 JIT =========== BEFORE AFTER -------- ------- int cpu = bpf_get_smp_processor_id(); int cpu = bpf_get_smp_processor_id(); mov x10, #0xfffffffffffff4d0 mrs x10, sp_el0 movk x10, #0x802b, lsl #16 ldr w7, [x10, #24] movk x10, #0x8000, lsl #32 blr x10 add x7, x0, #0x0 Performance improvement using benchmark[1] ./benchs/run_bench_trigger.sh glob-arr-inc arr-inc hash-inc +---------------+-------------------+-------------------+--------------+ | Name | Before | After | % change | |---------------+-------------------+-------------------+--------------| | glob-arr-inc | 23.380 ± 1.675M/s | 25.893 ± 0.026M/s | + 10.74% | | arr-inc | 23.928 ± 0.034M/s | 25.213 ± 0.063M/s | + 5.37% | | hash-inc | 12.352 ± 0.005M/s | 12.609 ± 0.013M/s | + 2.08% | +---------------+-------------------+-------------------+--------------+ [1] anakryiko/linux@8dec900975ef Signed-off-by: Puranjay Mohan <puranjay@kernel.org> Acked-by: Andrii Nakryiko <andrii@kernel.org>
Inline calls to bpf_get_smp_processor_id() helper in the JIT by emitting a read from struct thread_info. The SP_EL0 system register holds the pointer to the task_struct and thread_info is the first member of this struct. We can read the cpu number from the thread_info. Here is how the ARM64 JITed assembly changes after this commit: ARM64 JIT =========== BEFORE AFTER -------- ------- int cpu = bpf_get_smp_processor_id(); int cpu = bpf_get_smp_processor_id(); mov x10, #0xfffffffffffff4d0 mrs x10, sp_el0 movk x10, #0x802b, lsl #16 ldr w7, [x10, #24] movk x10, #0x8000, lsl #32 blr x10 add x7, x0, #0x0 Performance improvement using benchmark[1] ./benchs/run_bench_trigger.sh glob-arr-inc arr-inc hash-inc +---------------+-------------------+-------------------+--------------+ | Name | Before | After | % change | |---------------+-------------------+-------------------+--------------| | glob-arr-inc | 23.380 ± 1.675M/s | 25.893 ± 0.026M/s | + 10.74% | | arr-inc | 23.928 ± 0.034M/s | 25.213 ± 0.063M/s | + 5.37% | | hash-inc | 12.352 ± 0.005M/s | 12.609 ± 0.013M/s | + 2.08% | +---------------+-------------------+-------------------+--------------+ [1] anakryiko/linux@8dec900975ef Signed-off-by: Puranjay Mohan <puranjay@kernel.org> Acked-by: Andrii Nakryiko <andrii@kernel.org>
Inline calls to bpf_get_smp_processor_id() helper in the JIT by emitting a read from struct thread_info. The SP_EL0 system register holds the pointer to the task_struct and thread_info is the first member of this struct. We can read the cpu number from the thread_info. Here is how the ARM64 JITed assembly changes after this commit: ARM64 JIT =========== BEFORE AFTER -------- ------- int cpu = bpf_get_smp_processor_id(); int cpu = bpf_get_smp_processor_id(); mov x10, #0xfffffffffffff4d0 mrs x10, sp_el0 movk x10, #0x802b, lsl #16 ldr w7, [x10, #24] movk x10, #0x8000, lsl #32 blr x10 add x7, x0, #0x0 Performance improvement using benchmark[1] ./benchs/run_bench_trigger.sh glob-arr-inc arr-inc hash-inc +---------------+-------------------+-------------------+--------------+ | Name | Before | After | % change | |---------------+-------------------+-------------------+--------------| | glob-arr-inc | 23.380 ± 1.675M/s | 25.893 ± 0.026M/s | + 10.74% | | arr-inc | 23.928 ± 0.034M/s | 25.213 ± 0.063M/s | + 5.37% | | hash-inc | 12.352 ± 0.005M/s | 12.609 ± 0.013M/s | + 2.08% | +---------------+-------------------+-------------------+--------------+ [1] anakryiko/linux@8dec900975ef Signed-off-by: Puranjay Mohan <puranjay@kernel.org> Acked-by: Andrii Nakryiko <andrii@kernel.org>
Inline calls to bpf_get_smp_processor_id() helper in the JIT by emitting a read from struct thread_info. The SP_EL0 system register holds the pointer to the task_struct and thread_info is the first member of this struct. We can read the cpu number from the thread_info. Here is how the ARM64 JITed assembly changes after this commit: ARM64 JIT =========== BEFORE AFTER -------- ------- int cpu = bpf_get_smp_processor_id(); int cpu = bpf_get_smp_processor_id(); mov x10, #0xfffffffffffff4d0 mrs x10, sp_el0 movk x10, #0x802b, lsl #16 ldr w7, [x10, #24] movk x10, #0x8000, lsl #32 blr x10 add x7, x0, #0x0 Performance improvement using benchmark[1] ./benchs/run_bench_trigger.sh glob-arr-inc arr-inc hash-inc +---------------+-------------------+-------------------+--------------+ | Name | Before | After | % change | |---------------+-------------------+-------------------+--------------| | glob-arr-inc | 23.380 ± 1.675M/s | 25.893 ± 0.026M/s | + 10.74% | | arr-inc | 23.928 ± 0.034M/s | 25.213 ± 0.063M/s | + 5.37% | | hash-inc | 12.352 ± 0.005M/s | 12.609 ± 0.013M/s | + 2.08% | +---------------+-------------------+-------------------+--------------+ [1] anakryiko/linux@8dec900975ef Signed-off-by: Puranjay Mohan <puranjay@kernel.org> Acked-by: Andrii Nakryiko <andrii@kernel.org>
Inline calls to bpf_get_smp_processor_id() helper in the JIT by emitting a read from struct thread_info. The SP_EL0 system register holds the pointer to the task_struct and thread_info is the first member of this struct. We can read the cpu number from the thread_info. Here is how the ARM64 JITed assembly changes after this commit: ARM64 JIT =========== BEFORE AFTER -------- ------- int cpu = bpf_get_smp_processor_id(); int cpu = bpf_get_smp_processor_id(); mov x10, #0xfffffffffffff4d0 mrs x10, sp_el0 movk x10, #0x802b, lsl #16 ldr w7, [x10, #24] movk x10, #0x8000, lsl #32 blr x10 add x7, x0, #0x0 Performance improvement using benchmark[1] ./benchs/run_bench_trigger.sh glob-arr-inc arr-inc hash-inc +---------------+-------------------+-------------------+--------------+ | Name | Before | After | % change | |---------------+-------------------+-------------------+--------------| | glob-arr-inc | 23.380 ± 1.675M/s | 25.893 ± 0.026M/s | + 10.74% | | arr-inc | 23.928 ± 0.034M/s | 25.213 ± 0.063M/s | + 5.37% | | hash-inc | 12.352 ± 0.005M/s | 12.609 ± 0.013M/s | + 2.08% | +---------------+-------------------+-------------------+--------------+ [1] anakryiko/linux@8dec900975ef Signed-off-by: Puranjay Mohan <puranjay@kernel.org> Acked-by: Andrii Nakryiko <andrii@kernel.org>
Inline calls to bpf_get_smp_processor_id() helper in the JIT by emitting a read from struct thread_info. The SP_EL0 system register holds the pointer to the task_struct and thread_info is the first member of this struct. We can read the cpu number from the thread_info. Here is how the ARM64 JITed assembly changes after this commit: ARM64 JIT =========== BEFORE AFTER -------- ------- int cpu = bpf_get_smp_processor_id(); int cpu = bpf_get_smp_processor_id(); mov x10, #0xfffffffffffff4d0 mrs x10, sp_el0 movk x10, #0x802b, lsl #16 ldr w7, [x10, #24] movk x10, #0x8000, lsl #32 blr x10 add x7, x0, #0x0 Performance improvement using benchmark[1] ./benchs/run_bench_trigger.sh glob-arr-inc arr-inc hash-inc +---------------+-------------------+-------------------+--------------+ | Name | Before | After | % change | |---------------+-------------------+-------------------+--------------| | glob-arr-inc | 23.380 ± 1.675M/s | 25.893 ± 0.026M/s | + 10.74% | | arr-inc | 23.928 ± 0.034M/s | 25.213 ± 0.063M/s | + 5.37% | | hash-inc | 12.352 ± 0.005M/s | 12.609 ± 0.013M/s | + 2.08% | +---------------+-------------------+-------------------+--------------+ [1] anakryiko/linux@8dec900975ef Signed-off-by: Puranjay Mohan <puranjay@kernel.org> Acked-by: Andrii Nakryiko <andrii@kernel.org>
Inline calls to bpf_get_smp_processor_id() helper in the JIT by emitting a read from struct thread_info. The SP_EL0 system register holds the pointer to the task_struct and thread_info is the first member of this struct. We can read the cpu number from the thread_info. Here is how the ARM64 JITed assembly changes after this commit: ARM64 JIT =========== BEFORE AFTER -------- ------- int cpu = bpf_get_smp_processor_id(); int cpu = bpf_get_smp_processor_id(); mov x10, #0xfffffffffffff4d0 mrs x10, sp_el0 movk x10, #0x802b, lsl #16 ldr w7, [x10, #24] movk x10, #0x8000, lsl #32 blr x10 add x7, x0, #0x0 Performance improvement using benchmark[1] ./benchs/run_bench_trigger.sh glob-arr-inc arr-inc hash-inc +---------------+-------------------+-------------------+--------------+ | Name | Before | After | % change | |---------------+-------------------+-------------------+--------------| | glob-arr-inc | 23.380 ± 1.675M/s | 25.893 ± 0.026M/s | + 10.74% | | arr-inc | 23.928 ± 0.034M/s | 25.213 ± 0.063M/s | + 5.37% | | hash-inc | 12.352 ± 0.005M/s | 12.609 ± 0.013M/s | + 2.08% | +---------------+-------------------+-------------------+--------------+ [1] anakryiko/linux@8dec900975ef Signed-off-by: Puranjay Mohan <puranjay@kernel.org> Acked-by: Andrii Nakryiko <andrii@kernel.org>
Inline calls to bpf_get_smp_processor_id() helper in the JIT by emitting a read from struct thread_info. The SP_EL0 system register holds the pointer to the task_struct and thread_info is the first member of this struct. We can read the cpu number from the thread_info. Here is how the ARM64 JITed assembly changes after this commit: ARM64 JIT =========== BEFORE AFTER -------- ------- int cpu = bpf_get_smp_processor_id(); int cpu = bpf_get_smp_processor_id(); mov x10, #0xfffffffffffff4d0 mrs x10, sp_el0 movk x10, #0x802b, lsl #16 ldr w7, [x10, #24] movk x10, #0x8000, lsl #32 blr x10 add x7, x0, #0x0 Performance improvement using benchmark[1] ./benchs/run_bench_trigger.sh glob-arr-inc arr-inc hash-inc +---------------+-------------------+-------------------+--------------+ | Name | Before | After | % change | |---------------+-------------------+-------------------+--------------| | glob-arr-inc | 23.380 ± 1.675M/s | 25.893 ± 0.026M/s | + 10.74% | | arr-inc | 23.928 ± 0.034M/s | 25.213 ± 0.063M/s | + 5.37% | | hash-inc | 12.352 ± 0.005M/s | 12.609 ± 0.013M/s | + 2.08% | +---------------+-------------------+-------------------+--------------+ [1] anakryiko/linux@8dec900975ef Signed-off-by: Puranjay Mohan <puranjay@kernel.org> Acked-by: Andrii Nakryiko <andrii@kernel.org>
Inline calls to bpf_get_smp_processor_id() helper in the JIT by emitting a read from struct thread_info. The SP_EL0 system register holds the pointer to the task_struct and thread_info is the first member of this struct. We can read the cpu number from the thread_info. Here is how the ARM64 JITed assembly changes after this commit: ARM64 JIT =========== BEFORE AFTER -------- ------- int cpu = bpf_get_smp_processor_id(); int cpu = bpf_get_smp_processor_id(); mov x10, #0xfffffffffffff4d0 mrs x10, sp_el0 movk x10, #0x802b, lsl #16 ldr w7, [x10, #24] movk x10, #0x8000, lsl #32 blr x10 add x7, x0, #0x0 Performance improvement using benchmark[1] ./benchs/run_bench_trigger.sh glob-arr-inc arr-inc hash-inc +---------------+-------------------+-------------------+--------------+ | Name | Before | After | % change | |---------------+-------------------+-------------------+--------------| | glob-arr-inc | 23.380 ± 1.675M/s | 25.893 ± 0.026M/s | + 10.74% | | arr-inc | 23.928 ± 0.034M/s | 25.213 ± 0.063M/s | + 5.37% | | hash-inc | 12.352 ± 0.005M/s | 12.609 ± 0.013M/s | + 2.08% | +---------------+-------------------+-------------------+--------------+ [1] anakryiko/linux@8dec900975ef Signed-off-by: Puranjay Mohan <puranjay@kernel.org> Acked-by: Andrii Nakryiko <andrii@kernel.org>
Inline calls to bpf_get_smp_processor_id() helper in the JIT by emitting a read from struct thread_info. The SP_EL0 system register holds the pointer to the task_struct and thread_info is the first member of this struct. We can read the cpu number from the thread_info. Here is how the ARM64 JITed assembly changes after this commit: ARM64 JIT =========== BEFORE AFTER -------- ------- int cpu = bpf_get_smp_processor_id(); int cpu = bpf_get_smp_processor_id(); mov x10, #0xfffffffffffff4d0 mrs x10, sp_el0 movk x10, #0x802b, lsl #16 ldr w7, [x10, #24] movk x10, #0x8000, lsl #32 blr x10 add x7, x0, #0x0 Performance improvement using benchmark[1] ./benchs/run_bench_trigger.sh glob-arr-inc arr-inc hash-inc +---------------+-------------------+-------------------+--------------+ | Name | Before | After | % change | |---------------+-------------------+-------------------+--------------| | glob-arr-inc | 23.380 ± 1.675M/s | 25.893 ± 0.026M/s | + 10.74% | | arr-inc | 23.928 ± 0.034M/s | 25.213 ± 0.063M/s | + 5.37% | | hash-inc | 12.352 ± 0.005M/s | 12.609 ± 0.013M/s | + 2.08% | +---------------+-------------------+-------------------+--------------+ [1] anakryiko/linux@8dec900975ef Signed-off-by: Puranjay Mohan <puranjay@kernel.org> Acked-by: Andrii Nakryiko <andrii@kernel.org>
Inline calls to bpf_get_smp_processor_id() helper in the JIT by emitting a read from struct thread_info. The SP_EL0 system register holds the pointer to the task_struct and thread_info is the first member of this struct. We can read the cpu number from the thread_info. Here is how the ARM64 JITed assembly changes after this commit: ARM64 JIT =========== BEFORE AFTER -------- ------- int cpu = bpf_get_smp_processor_id(); int cpu = bpf_get_smp_processor_id(); mov x10, #0xfffffffffffff4d0 mrs x10, sp_el0 movk x10, #0x802b, lsl #16 ldr w7, [x10, #24] movk x10, #0x8000, lsl #32 blr x10 add x7, x0, #0x0 Performance improvement using benchmark[1] ./benchs/run_bench_trigger.sh glob-arr-inc arr-inc hash-inc +---------------+-------------------+-------------------+--------------+ | Name | Before | After | % change | |---------------+-------------------+-------------------+--------------| | glob-arr-inc | 23.380 ± 1.675M/s | 25.893 ± 0.026M/s | + 10.74% | | arr-inc | 23.928 ± 0.034M/s | 25.213 ± 0.063M/s | + 5.37% | | hash-inc | 12.352 ± 0.005M/s | 12.609 ± 0.013M/s | + 2.08% | +---------------+-------------------+-------------------+--------------+ [1] anakryiko/linux@8dec900975ef Signed-off-by: Puranjay Mohan <puranjay@kernel.org> Acked-by: Andrii Nakryiko <andrii@kernel.org>
Inline calls to bpf_get_smp_processor_id() helper in the JIT by emitting a read from struct thread_info. The SP_EL0 system register holds the pointer to the task_struct and thread_info is the first member of this struct. We can read the cpu number from the thread_info. Here is how the ARM64 JITed assembly changes after this commit: ARM64 JIT =========== BEFORE AFTER -------- ------- int cpu = bpf_get_smp_processor_id(); int cpu = bpf_get_smp_processor_id(); mov x10, #0xfffffffffffff4d0 mrs x10, sp_el0 movk x10, #0x802b, lsl #16 ldr w7, [x10, #24] movk x10, #0x8000, lsl #32 blr x10 add x7, x0, #0x0 Performance improvement using benchmark[1] ./benchs/run_bench_trigger.sh glob-arr-inc arr-inc hash-inc +---------------+-------------------+-------------------+--------------+ | Name | Before | After | % change | |---------------+-------------------+-------------------+--------------| | glob-arr-inc | 23.380 ± 1.675M/s | 25.893 ± 0.026M/s | + 10.74% | | arr-inc | 23.928 ± 0.034M/s | 25.213 ± 0.063M/s | + 5.37% | | hash-inc | 12.352 ± 0.005M/s | 12.609 ± 0.013M/s | + 2.08% | +---------------+-------------------+-------------------+--------------+ [1] anakryiko/linux@8dec900975ef Signed-off-by: Puranjay Mohan <puranjay@kernel.org> Acked-by: Andrii Nakryiko <andrii@kernel.org>
Inline calls to bpf_get_smp_processor_id() helper in the JIT by emitting a read from struct thread_info. The SP_EL0 system register holds the pointer to the task_struct and thread_info is the first member of this struct. We can read the cpu number from the thread_info. Here is how the ARM64 JITed assembly changes after this commit: ARM64 JIT =========== BEFORE AFTER -------- ------- int cpu = bpf_get_smp_processor_id(); int cpu = bpf_get_smp_processor_id(); mov x10, #0xfffffffffffff4d0 mrs x10, sp_el0 movk x10, #0x802b, lsl #16 ldr w7, [x10, #24] movk x10, #0x8000, lsl #32 blr x10 add x7, x0, #0x0 Performance improvement using benchmark[1] ./benchs/run_bench_trigger.sh glob-arr-inc arr-inc hash-inc +---------------+-------------------+-------------------+--------------+ | Name | Before | After | % change | |---------------+-------------------+-------------------+--------------| | glob-arr-inc | 23.380 ± 1.675M/s | 25.893 ± 0.026M/s | + 10.74% | | arr-inc | 23.928 ± 0.034M/s | 25.213 ± 0.063M/s | + 5.37% | | hash-inc | 12.352 ± 0.005M/s | 12.609 ± 0.013M/s | + 2.08% | +---------------+-------------------+-------------------+--------------+ [1] anakryiko/linux@8dec900975ef Signed-off-by: Puranjay Mohan <puranjay@kernel.org> Acked-by: Andrii Nakryiko <andrii@kernel.org>
Inline calls to bpf_get_smp_processor_id() helper in the JIT by emitting a read from struct thread_info. The SP_EL0 system register holds the pointer to the task_struct and thread_info is the first member of this struct. We can read the cpu number from the thread_info. Here is how the ARM64 JITed assembly changes after this commit: ARM64 JIT =========== BEFORE AFTER -------- ------- int cpu = bpf_get_smp_processor_id(); int cpu = bpf_get_smp_processor_id(); mov x10, #0xfffffffffffff4d0 mrs x10, sp_el0 movk x10, #0x802b, lsl #16 ldr w7, [x10, #24] movk x10, #0x8000, lsl #32 blr x10 add x7, x0, #0x0 Performance improvement using benchmark[1] ./benchs/run_bench_trigger.sh glob-arr-inc arr-inc hash-inc +---------------+-------------------+-------------------+--------------+ | Name | Before | After | % change | |---------------+-------------------+-------------------+--------------| | glob-arr-inc | 23.380 ± 1.675M/s | 25.893 ± 0.026M/s | + 10.74% | | arr-inc | 23.928 ± 0.034M/s | 25.213 ± 0.063M/s | + 5.37% | | hash-inc | 12.352 ± 0.005M/s | 12.609 ± 0.013M/s | + 2.08% | +---------------+-------------------+-------------------+--------------+ [1] anakryiko/linux@8dec900975ef Signed-off-by: Puranjay Mohan <puranjay@kernel.org> Acked-by: Andrii Nakryiko <andrii@kernel.org>
Inline calls to bpf_get_smp_processor_id() helper in the JIT by emitting a read from struct thread_info. The SP_EL0 system register holds the pointer to the task_struct and thread_info is the first member of this struct. We can read the cpu number from the thread_info. Here is how the ARM64 JITed assembly changes after this commit: ARM64 JIT =========== BEFORE AFTER -------- ------- int cpu = bpf_get_smp_processor_id(); int cpu = bpf_get_smp_processor_id(); mov x10, #0xfffffffffffff4d0 mrs x10, sp_el0 movk x10, #0x802b, lsl #16 ldr w7, [x10, #24] movk x10, #0x8000, lsl #32 blr x10 add x7, x0, #0x0 Performance improvement using benchmark[1] ./benchs/run_bench_trigger.sh glob-arr-inc arr-inc hash-inc +---------------+-------------------+-------------------+--------------+ | Name | Before | After | % change | |---------------+-------------------+-------------------+--------------| | glob-arr-inc | 23.380 ± 1.675M/s | 25.893 ± 0.026M/s | + 10.74% | | arr-inc | 23.928 ± 0.034M/s | 25.213 ± 0.063M/s | + 5.37% | | hash-inc | 12.352 ± 0.005M/s | 12.609 ± 0.013M/s | + 2.08% | +---------------+-------------------+-------------------+--------------+ [1] anakryiko/linux@8dec900975ef Signed-off-by: Puranjay Mohan <puranjay@kernel.org> Acked-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/r/20240502151854.9810-5-puranjay@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Puranjay Mohan says: ==================== bpf: Inline helpers in arm64 and riscv JITs Changes in v5 -> v6: arm64 v5: https://lore.kernel.org/all/20240430234739.79185-1-puranjay@kernel.org/ riscv v2: https://lore.kernel.org/all/20240430175834.33152-1-puranjay@kernel.org/ - Combine riscv and arm64 changes in single series - Some coding style fixes Changes in v4 -> v5: v4: https://lore.kernel.org/all/20240429131647.50165-1-puranjay@kernel.org/ - Implement the inlining of the bpf_get_smp_processor_id() in the JIT. NOTE: This needs to be based on: https://lore.kernel.org/all/20240430175834.33152-1-puranjay@kernel.org/ to be built. Manual run of bpf-ci with this series rebased on above: kernel-patches/bpf#6929 Changes in v3 -> v4: v3: https://lore.kernel.org/all/20240426121349.97651-1-puranjay@kernel.org/ - Fix coding style issue related to C89 standards. Changes in v2 -> v3: v2: https://lore.kernel.org/all/20240424173550.16359-1-puranjay@kernel.org/ - Fixed the xlated dump of percpu mov to "r0 = &(void __percpu *)(r0)" - Made ARM64 and x86-64 use the same code for inlining. The only difference that remains is the per-cpu address of the cpu_number. Changes in v1 -> v2: v1: https://lore.kernel.org/all/20240405091707.66675-1-puranjay12@gmail.com/ - Add a patch to inline bpf_get_smp_processor_id() - Fix an issue in MRS instruction encoding as pointed out by Will - Remove CONFIG_SMP check because arm64 kernel always compiles with CONFIG_SMP This series adds the support of internal only per-CPU instructions and inlines the bpf_get_smp_processor_id() helper call for ARM64 and RISC-V BPF JITs. Here is an example of calls to bpf_get_smp_processor_id() and percpu_array_map_lookup_elem() before and after this series on ARM64. BPF ===== BEFORE AFTER -------- ------- int cpu = bpf_get_smp_processor_id(); int cpu = bpf_get_smp_processor_id(); (85) call bpf_get_smp_processor_id#229032 (85) call bpf_get_smp_processor_id#8 p = bpf_map_lookup_elem(map, &zero); p = bpf_map_lookup_elem(map, &zero); (18) r1 = map[id:78] (18) r1 = map[id:153] (18) r2 = map[id:82][0]+65536 (18) r2 = map[id:157][0]+65536 (85) call percpu_array_map_lookup_elem#313512 (07) r1 += 496 (61) r0 = *(u32 *)(r2 +0) (35) if r0 >= 0x1 goto pc+5 (67) r0 <<= 3 (0f) r0 += r1 (79) r0 = *(u64 *)(r0 +0) (bf) r0 = &(void __percpu *)(r0) (05) goto pc+1 (b7) r0 = 0 ARM64 JIT =========== BEFORE AFTER -------- ------- int cpu = bpf_get_smp_processor_id(); int cpu = bpf_get_smp_processor_id(); mov x10, #0xfffffffffffff4d0 mrs x10, sp_el0 movk x10, #0x802b, lsl #16 ldr w7, [x10, #24] movk x10, #0x8000, lsl #32 blr x10 add x7, x0, #0x0 p = bpf_map_lookup_elem(map, &zero); p = bpf_map_lookup_elem(map, &zero); mov x0, #0xffff0003ffffffff mov x0, #0xffff0003ffffffff movk x0, #0xce5c, lsl #16 movk x0, #0xe0f3, lsl #16 movk x0, #0xca00 movk x0, #0x7c00 mov x1, #0xffff8000ffffffff mov x1, #0xffff8000ffffffff movk x1, #0x8bdb, lsl #16 movk x1, #0xb0c7, lsl #16 movk x1, #0x6000 movk x1, #0xe000 mov x10, #0xffffffffffff3ed0 add x0, x0, #0x1f0 movk x10, #0x802d, lsl #16 ldr w7, [x1] movk x10, #0x8000, lsl #32 cmp x7, #0x1 blr x10 b.cs 0x0000000000000090 add x7, x0, #0x0 lsl x7, x7, #3 add x7, x7, x0 ldr x7, [x7] mrs x10, tpidr_el1 add x7, x7, x10 b 0x0000000000000094 mov x7, #0x0 Performance improvement found using benchmark[1] ./benchs/run_bench_trigger.sh glob-arr-inc arr-inc hash-inc +---------------+-------------------+-------------------+--------------+ | Name | Before | After | % change | |---------------+-------------------+-------------------+--------------| | glob-arr-inc | 23.380 ± 1.675M/s | 25.893 ± 0.026M/s | + 10.74% | | arr-inc | 23.928 ± 0.034M/s | 25.213 ± 0.063M/s | + 5.37% | | hash-inc | 12.352 ± 0.005M/s | 12.609 ± 0.013M/s | + 2.08% | +---------------+-------------------+-------------------+--------------+ [1] anakryiko/linux@8dec900975ef RISCV64 JIT output for `call bpf_get_smp_processor_id` ======================================================= Before After -------- ------- auipc t1,0x848c ld a5,32(tp) jalr 604(t1) mv a5,a0 Benchmark using [1] on Qemu. ./benchs/run_bench_trigger.sh glob-arr-inc arr-inc hash-inc +---------------+------------------+------------------+--------------+ | Name | Before | After | % change | |---------------+------------------+------------------+--------------| | glob-arr-inc | 1.077 ± 0.006M/s | 1.336 ± 0.010M/s | + 24.04% | | arr-inc | 1.078 ± 0.002M/s | 1.332 ± 0.015M/s | + 23.56% | | hash-inc | 0.494 ± 0.004M/s | 0.653 ± 0.001M/s | + 32.18% | +---------------+------------------+------------------+--------------+ ==================== Link: https://lore.kernel.org/r/20240502151854.9810-1-puranjay@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
profile->parent->dents[AAFS_PROF_DIR] could be NULL only if its parent is made from __create_missing_ancestors(..) and 'ent->old' is NULL in aa_replace_profiles(..). In that case, it must return an error code and the code, -ENOENT represents its state that the path of its parent is not existed yet. BUG: kernel NULL pointer dereference, address: 0000000000000030 PGD 0 P4D 0 PREEMPT SMP PTI CPU: 4 PID: 3362 Comm: apparmor_parser Not tainted 6.8.0-24-generic #24 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.15.0-1 04/01/2014 RIP: 0010:aafs_create.constprop.0+0x7f/0x130 Code: 4c 63 e0 48 83 c4 18 4c 89 e0 5b 41 5c 41 5d 41 5e 41 5f 5d 31 d2 31 c9 31 f6 31 ff 45 31 c0 45 31 c9 45 31 d2 c3 cc cc cc cc <4d> 8b 55 30 4d 8d ba a0 00 00 00 4c 89 55 c0 4c 89 ff e8 7a 6a ae RSP: 0018:ffffc9000b2c7c98 EFLAGS: 00010246 RAX: 0000000000000000 RBX: 00000000000041ed RCX: 0000000000000000 RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000 RBP: ffffc9000b2c7cd8 R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000000 R12: ffffffff82baac10 R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000 FS: 00007be9f22cf740(0000) GS:ffff88817bc00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000000000030 CR3: 0000000134b08000 CR4: 00000000000006f0 Call Trace: <TASK> ? show_regs+0x6d/0x80 ? __die+0x24/0x80 ? page_fault_oops+0x99/0x1b0 ? kernelmode_fixup_or_oops+0xb2/0x140 ? __bad_area_nosemaphore+0x1a5/0x2c0 ? find_vma+0x34/0x60 ? bad_area_nosemaphore+0x16/0x30 ? do_user_addr_fault+0x2a2/0x6b0 ? exc_page_fault+0x83/0x1b0 ? asm_exc_page_fault+0x27/0x30 ? aafs_create.constprop.0+0x7f/0x130 ? aafs_create.constprop.0+0x51/0x130 __aafs_profile_mkdir+0x3d6/0x480 aa_replace_profiles+0x83f/0x1270 policy_update+0xe3/0x180 profile_load+0xbc/0x150 ? rw_verify_area+0x47/0x140 vfs_write+0x100/0x480 ? __x64_sys_openat+0x55/0xa0 ? syscall_exit_to_user_mode+0x86/0x260 ksys_write+0x73/0x100 __x64_sys_write+0x19/0x30 x64_sys_call+0x7e/0x25c0 do_syscall_64+0x7f/0x180 entry_SYSCALL_64_after_hwframe+0x78/0x80 RIP: 0033:0x7be9f211c574 Code: c7 00 16 00 00 00 b8 ff ff ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 f3 0f 1e fa 80 3d d5 ea 0e 00 00 74 13 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 54 c3 0f 1f 00 55 48 89 e5 48 83 ec 20 48 89 RSP: 002b:00007ffd26f2b8c8 EFLAGS: 00000202 ORIG_RAX: 0000000000000001 RAX: ffffffffffffffda RBX: 00005d504415e200 RCX: 00007be9f211c574 RDX: 0000000000001fc1 RSI: 00005d504418bc80 RDI: 0000000000000004 RBP: 0000000000001fc1 R08: 0000000000001fc1 R09: 0000000080000000 R10: 0000000000000000 R11: 0000000000000202 R12: 00005d504418bc80 R13: 0000000000000004 R14: 00007ffd26f2b9b0 R15: 00007ffd26f2ba30 </TASK> Modules linked in: snd_seq_dummy snd_hrtimer qrtr snd_hda_codec_generic snd_hda_intel snd_intel_dspcfg snd_intel_sdw_acpi snd_hda_codec snd_hda_core snd_hwdep snd_pcm snd_seq_midi snd_seq_midi_event snd_rawmidi snd_seq snd_seq_device i2c_i801 snd_timer i2c_smbus qxl snd soundcore drm_ttm_helper lpc_ich ttm joydev input_leds serio_raw mac_hid binfmt_misc msr parport_pc ppdev lp parport efi_pstore nfnetlink dmi_sysfs qemu_fw_cfg ip_tables x_tables autofs4 hid_generic usbhid hid ahci libahci psmouse virtio_rng xhci_pci xhci_pci_renesas CR2: 0000000000000030 ---[ end trace 0000000000000000 ]--- RIP: 0010:aafs_create.constprop.0+0x7f/0x130 Code: 4c 63 e0 48 83 c4 18 4c 89 e0 5b 41 5c 41 5d 41 5e 41 5f 5d 31 d2 31 c9 31 f6 31 ff 45 31 c0 45 31 c9 45 31 d2 c3 cc cc cc cc <4d> 8b 55 30 4d 8d ba a0 00 00 00 4c 89 55 c0 4c 89 ff e8 7a 6a ae RSP: 0018:ffffc9000b2c7c98 EFLAGS: 00010246 RAX: 0000000000000000 RBX: 00000000000041ed RCX: 0000000000000000 RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000 RBP: ffffc9000b2c7cd8 R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000000 R12: ffffffff82baac10 R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000 FS: 00007be9f22cf740(0000) GS:ffff88817bc00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000000000030 CR3: 0000000134b08000 CR4: 00000000000006f0 Signed-off-by: Leesoo Ahn <lsahn@ooseel.net> Signed-off-by: John Johansen <john.johansen@canonical.com>
scx_ops_bypass() can currently race on the ops enable / disable path as follows: 1. scx_ops_bypass(true) called on enable path, bypass depth is set to 1 2. An op on the init path exits, which schedules scx_ops_disable_workfn() 3. scx_ops_bypass(false) is called on the disable path, and bypass depth is decremented to 0 4. kthread is scheduled to execute scx_ops_disable_workfn() 5. scx_ops_bypass(true) called, bypass depth set to 1 6. scx_ops_bypass() races when iterating over CPUs While it's not safe to take any blocking locks on the bypass path, it is safe to take a raw spinlock which cannot be preempted. This patch therefore updates scx_ops_bypass() to use a raw spinlock to synchronize, and changes scx_ops_bypass_depth to be a regular int. Without this change, we observe the following warnings when running the 'exit' sched_ext selftest (sometimes requires a couple of runs): .[root@virtme-ng sched_ext]# ./runner -t exit ===== START ===== TEST: exit ... [ 14.935078] WARNING: CPU: 2 PID: 360 at kernel/sched/ext.c:4332 scx_ops_bypass+0x1ca/0x280 [ 14.935126] Modules linked in: [ 14.935150] CPU: 2 UID: 0 PID: 360 Comm: sched_ext_ops_h Not tainted 6.11.0-virtme #24 [ 14.935192] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Arch Linux 1.16.3-1-1 04/01/2014 [ 14.935242] Sched_ext: exit (enabling+all) [ 14.935244] RIP: 0010:scx_ops_bypass+0x1ca/0x280 [ 14.935300] Code: ff ff ff e8 48 96 10 00 fb e9 08 ff ff ff c6 05 7b 34 e8 01 01 90 48 c7 c7 89 86 88 87 e8 be 1d f8 ff 90 0f 0b 90 90 eb 95 90 <0f> 0b 90 41 8b 84 24 24 0a 00 00 eb 97 90 0f 0b 90 41 8b 84 24 24 [ 14.935394] RSP: 0018:ffffb706c0957ce0 EFLAGS: 00010002 [ 14.935424] RAX: 0000000000000009 RBX: 0000000000000001 RCX: 00000000e3fb8b2a [ 14.935465] RDX: 0000000000000001 RSI: 0000000000000004 RDI: ffffffff88a4c080 [ 14.935512] RBP: 0000000000009b56 R08: 0000000000000004 R09: 00000003f12e520a [ 14.935555] R10: ffffffff863a9795 R11: 0000000000000000 R12: ffff8fc5fec31300 [ 14.935598] R13: ffff8fc5fec31318 R14: 0000000000000286 R15: 0000000000000018 [ 14.935642] FS: 0000000000000000(0000) GS:ffff8fc5fe680000(0000) knlGS:0000000000000000 [ 14.935684] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 14.935721] CR2: 0000557d92890b88 CR3: 000000002464a000 CR4: 0000000000750ef0 [ 14.935765] PKRU: 55555554 [ 14.935782] Call Trace: [ 14.935802] <TASK> [ 14.935823] ? __warn+0xce/0x220 [ 14.935850] ? scx_ops_bypass+0x1ca/0x280 [ 14.935881] ? report_bug+0xc1/0x160 [ 14.935909] ? handle_bug+0x61/0x90 [ 14.935934] ? exc_invalid_op+0x1a/0x50 [ 14.935959] ? asm_exc_invalid_op+0x1a/0x20 [ 14.935984] ? raw_spin_rq_lock_nested+0x15/0x30 [ 14.936019] ? scx_ops_bypass+0x1ca/0x280 [ 14.936046] ? srso_alias_return_thunk+0x5/0xfbef5 [ 14.936081] ? __pfx_scx_ops_disable_workfn+0x10/0x10 [ 14.936111] scx_ops_disable_workfn+0x146/0xac0 [ 14.936142] ? finish_task_switch+0xa9/0x2c0 [ 14.936172] ? srso_alias_return_thunk+0x5/0xfbef5 [ 14.936211] ? __pfx_scx_ops_disable_workfn+0x10/0x10 [ 14.936244] kthread_worker_fn+0x101/0x2c0 [ 14.936268] ? __pfx_kthread_worker_fn+0x10/0x10 [ 14.936299] kthread+0xec/0x110 [ 14.936327] ? __pfx_kthread+0x10/0x10 [ 14.936351] ret_from_fork+0x37/0x50 [ 14.936374] ? __pfx_kthread+0x10/0x10 [ 14.936400] ret_from_fork_asm+0x1a/0x30 [ 14.936427] </TASK> [ 14.936443] irq event stamp: 21002 [ 14.936467] hardirqs last enabled at (21001): [<ffffffff863aa35f>] resched_cpu+0x9f/0xd0 [ 14.936521] hardirqs last disabled at (21002): [<ffffffff863dd0ba>] scx_ops_bypass+0x11a/0x280 [ 14.936571] softirqs last enabled at (20642): [<ffffffff863683d7>] __irq_exit_rcu+0x67/0xd0 [ 14.936622] softirqs last disabled at (20637): [<ffffffff863683d7>] __irq_exit_rcu+0x67/0xd0 [ 14.936672] ---[ end trace 0000000000000000 ]--- [ 14.953282] sched_ext: BPF scheduler "exit" disabled (unregistered from BPF) [ 14.953352] ------------[ cut here ]------------ [ 14.953383] WARNING: CPU: 2 PID: 360 at kernel/sched/ext.c:4335 scx_ops_bypass+0x1d8/0x280 [ 14.953428] Modules linked in: [ 14.953453] CPU: 2 UID: 0 PID: 360 Comm: sched_ext_ops_h Tainted: G W 6.11.0-virtme #24 [ 14.953505] Tainted: [W]=WARN [ 14.953527] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Arch Linux 1.16.3-1-1 04/01/2014 [ 14.953574] RIP: 0010:scx_ops_bypass+0x1d8/0x280 [ 14.953603] Code: c6 05 7b 34 e8 01 01 90 48 c7 c7 89 86 88 87 e8 be 1d f8 ff 90 0f 0b 90 90 eb 95 90 0f 0b 90 41 8b 84 24 24 0a 00 00 eb 97 90 <0f> 0b 90 41 8b 84 24 24 0a 00 00 eb 92 f3 0f 1e fa 49 8d 84 24 f0 [ 14.953693] RSP: 0018:ffffb706c0957ce0 EFLAGS: 00010046 [ 14.953722] RAX: 0000000000000001 RBX: 0000000000000000 RCX: 0000000000000001 [ 14.953763] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff8fc5fec31318 [ 14.953804] RBP: 0000000000000000 R08: 0000000000000001 R09: 0000000000000000 [ 14.953845] R10: ffffffff863a9795 R11: 0000000000000000 R12: ffff8fc5fec31300 [ 14.953888] R13: ffff8fc5fec31318 R14: 0000000000000286 R15: 0000000000000018 [ 14.953934] FS: 0000000000000000(0000) GS:ffff8fc5fe680000(0000) knlGS:0000000000000000 [ 14.953974] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 14.954009] CR2: 0000557d92890b88 CR3: 000000002464a000 CR4: 0000000000750ef0 [ 14.954052] PKRU: 55555554 [ 14.954068] Call Trace: [ 14.954085] <TASK> [ 14.954102] ? __warn+0xce/0x220 [ 14.954126] ? scx_ops_bypass+0x1d8/0x280 [ 14.954150] ? report_bug+0xc1/0x160 [ 14.954178] ? handle_bug+0x61/0x90 [ 14.954203] ? exc_invalid_op+0x1a/0x50 [ 14.954226] ? asm_exc_invalid_op+0x1a/0x20 [ 14.954250] ? raw_spin_rq_lock_nested+0x15/0x30 [ 14.954285] ? scx_ops_bypass+0x1d8/0x280 [ 14.954311] ? __mutex_unlock_slowpath+0x3a/0x260 [ 14.954343] scx_ops_disable_workfn+0xa3e/0xac0 [ 14.954381] ? __pfx_scx_ops_disable_workfn+0x10/0x10 [ 14.954413] kthread_worker_fn+0x101/0x2c0 [ 14.954442] ? __pfx_kthread_worker_fn+0x10/0x10 [ 14.954479] kthread+0xec/0x110 [ 14.954507] ? __pfx_kthread+0x10/0x10 [ 14.954530] ret_from_fork+0x37/0x50 [ 14.954553] ? __pfx_kthread+0x10/0x10 [ 14.954576] ret_from_fork_asm+0x1a/0x30 [ 14.954603] </TASK> [ 14.954621] irq event stamp: 21002 [ 14.954644] hardirqs last enabled at (21001): [<ffffffff863aa35f>] resched_cpu+0x9f/0xd0 [ 14.954686] hardirqs last disabled at (21002): [<ffffffff863dd0ba>] scx_ops_bypass+0x11a/0x280 [ 14.954735] softirqs last enabled at (20642): [<ffffffff863683d7>] __irq_exit_rcu+0x67/0xd0 [ 14.954782] softirqs last disabled at (20637): [<ffffffff863683d7>] __irq_exit_rcu+0x67/0xd0 [ 14.954829] ---[ end trace 0000000000000000 ]--- [ 15.022283] sched_ext: BPF scheduler "exit" disabled (unregistered from BPF) [ 15.092282] sched_ext: BPF scheduler "exit" disabled (unregistered from BPF) [ 15.149282] sched_ext: BPF scheduler "exit" disabled (unregistered from BPF) ok 1 exit # ===== END ===== And with it, the test passes without issue after 1000s of runs: .[root@virtme-ng sched_ext]# ./runner -t exit ===== START ===== TEST: exit DESCRIPTION: Verify we can cleanly exit a scheduler in multiple places OUTPUT: [ 7.412856] sched_ext: BPF scheduler "exit" enabled [ 7.427924] sched_ext: BPF scheduler "exit" disabled (unregistered from BPF) [ 7.466677] sched_ext: BPF scheduler "exit" enabled [ 7.475923] sched_ext: BPF scheduler "exit" disabled (unregistered from BPF) [ 7.512803] sched_ext: BPF scheduler "exit" enabled [ 7.532924] sched_ext: BPF scheduler "exit" disabled (unregistered from BPF) [ 7.586809] sched_ext: BPF scheduler "exit" enabled [ 7.595926] sched_ext: BPF scheduler "exit" disabled (unregistered from BPF) [ 7.661923] sched_ext: BPF scheduler "exit" disabled (unregistered from BPF) [ 7.723923] sched_ext: BPF scheduler "exit" disabled (unregistered from BPF) ok 1 exit # ===== END ===== ============================= RESULTS: PASSED: 1 SKIPPED: 0 FAILED: 0 Fixes: f0e1a06 ("sched_ext: Implement BPF extensible scheduler class") Signed-off-by: David Vernet <void@manifault.com> Signed-off-by: Tejun Heo <tj@kernel.org>
Under certain kernel configurations when building with Clang/LLVM, the compiler does not generate a return or jump as the terminator instruction for ip_vs_protocol_init(), triggering the following objtool warning during build time: vmlinux.o: warning: objtool: ip_vs_protocol_init() falls through to next function __initstub__kmod_ip_vs_rr__935_123_ip_vs_rr_init6() At runtime, this either causes an oops when trying to load the ipvs module or a boot-time panic if ipvs is built-in. This same issue has been reported by the Intel kernel test robot previously. Digging deeper into both LLVM and the kernel code reveals this to be a undefined behavior problem. ip_vs_protocol_init() uses a on-stack buffer of 64 chars to store the registered protocol names and leaves it uninitialized after definition. The function calls strnlen() when concatenating protocol names into the buffer. With CONFIG_FORTIFY_SOURCE strnlen() performs an extra step to check whether the last byte of the input char buffer is a null character (commit 3009f89 ("fortify: Allow strlen() and strnlen() to pass compile-time known lengths")). This, together with possibly other configurations, cause the following IR to be generated: define hidden i32 @ip_vs_protocol_init() local_unnamed_addr #5 section ".init.text" align 16 !kcfi_type !29 { %1 = alloca [64 x i8], align 16 ... 14: ; preds = %11 %15 = getelementptr inbounds i8, ptr %1, i64 63 %16 = load i8, ptr %15, align 1 %17 = tail call i1 @llvm.is.constant.i8(i8 %16) %18 = icmp eq i8 %16, 0 %19 = select i1 %17, i1 %18, i1 false br i1 %19, label %20, label %23 20: ; preds = %14 %21 = call i64 @strlen(ptr noundef nonnull dereferenceable(1) %1) #23 ... 23: ; preds = %14, %11, %20 %24 = call i64 @strnlen(ptr noundef nonnull dereferenceable(1) %1, i64 noundef 64) #24 ... } The above code calculates the address of the last char in the buffer (value %15) and then loads from it (value %16). Because the buffer is never initialized, the LLVM GVN pass marks value %16 as undefined: %13 = getelementptr inbounds i8, ptr %1, i64 63 br i1 undef, label %14, label %17 This gives later passes (SCCP, in particular) more DCE opportunities by propagating the undef value further, and eventually removes everything after the load on the uninitialized stack location: define hidden i32 @ip_vs_protocol_init() local_unnamed_addr #0 section ".init.text" align 16 !kcfi_type !11 { %1 = alloca [64 x i8], align 16 ... 12: ; preds = %11 %13 = getelementptr inbounds i8, ptr %1, i64 63 unreachable } In this way, the generated native code will just fall through to the next function, as LLVM does not generate any code for the unreachable IR instruction and leaves the function without a terminator. Zero the on-stack buffer to avoid this possible UB. Fixes: 1da177e ("Linux-2.6.12-rc2") Reported-by: kernel test robot <lkp@intel.com> Closes: https://lore.kernel.org/oe-kbuild-all/202402100205.PWXIz1ZK-lkp@intel.com/ Co-developed-by: Ruowen Qin <ruqin@redhat.com> Signed-off-by: Ruowen Qin <ruqin@redhat.com> Signed-off-by: Jinghao Jia <jinghao7@illinois.edu> Acked-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Pull request for series with
subject: bpf, arm64: support more atomic ops
version: 4
url: https://patchwork.kernel.org/project/netdevbpf/list/?series=615259