…pipeline stall
This patch dramatically reduces CPU usage in kernel space, especially
for applications that issue syscalls with large buffer sizes, such as
network applications. The main reason is that every unaligned memory
access raises an exception and triggers a switch between S-mode and
M-mode, causing large overhead.
First, copy single bytes until the destination address reaches a
word-aligned boundary. This is the preparation before the bulk aligned
word copy.
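This head copy can be sketched in C as follows; a minimal sketch, not the
patch's assembly, and the function name is hypothetical:

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical sketch: copy single bytes until dst reaches a word-aligned
 * boundary (or len is exhausted).  Returns the number of bytes copied, so
 * the caller can continue with the bulk aligned word copy at dst + n. */
size_t copy_head_bytes(unsigned char *dst, const unsigned char *src, size_t len)
{
    /* Distance from dst to the next word boundary. */
    size_t head = (-(uintptr_t)dst) & (sizeof(long) - 1);
    size_t n = 0;

    if (head > len)
        head = len;
    while (n < head) {
        dst[n] = src[n];
        n++;
    }
    return n;
}
```

After this returns, dst + n is word-aligned and the word-sized loop can
store without crossing alignment boundaries on the destination side.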
The destination address is now aligned, but the source address often is
not. To reduce unaligned memory accesses, read the source only at
aligned boundaries; the data then carries an offset, which is fixed by
shifting and combining with the data read in the next iteration before
writing to the destination. The majority of the copy-speed improvement
comes from this shift copy.
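The shift copy can be sketched in C as below; a simplified sketch assuming a
little-endian machine (as RISC-V is), with a hypothetical function name, not
the patch's actual assembly:

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical sketch of the shift copy: dst is word-aligned, src may not
 * be.  Loads happen only at word-aligned src addresses; each destination
 * word is assembled from two neighbouring aligned source words by shifting.
 * Assumes the caller may read up to one aligned word past src + n*ws. */
void copy_words_shifted(unsigned long *dst, const unsigned char *src,
                        size_t nwords)
{
    const size_t ws = sizeof(unsigned long);
    size_t off = (uintptr_t)src & (ws - 1);   /* src misalignment in bytes */
    const unsigned long *asrc =
        (const unsigned long *)((uintptr_t)src & ~(uintptr_t)(ws - 1));

    if (off == 0) {                 /* fully aligned: plain word copy */
        for (size_t i = 0; i < nwords; i++)
            dst[i] = asrc[i];
        return;
    }

    unsigned int rshift = off * 8;          /* bits taken from current word */
    unsigned int lshift = (ws - off) * 8;   /* bits taken from next word */
    unsigned long cur = *asrc++;

    for (size_t i = 0; i < nwords; i++) {
        unsigned long next = *asrc++;
        /* Little-endian combine: low bytes from cur, high bytes from next. */
        dst[i] = (cur >> rshift) | (next << lshift);
        cur = next;
    }
}
```

Each iteration issues exactly one aligned load, so no unaligned-access
exception is taken regardless of the source offset.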
In the lucky case where both the source and destination addresses are
aligned, perform register-sized loads and stores in an unrolled loop.
Without unrolling, each store immediately reusing the register just
loaded would stall the pipeline and reduce speed. If the copy size is
too small for the unrolled loop, perform a regular word copy.
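The unrolled aligned copy could look like this in C (a sketch with a
hypothetical name and an assumed unroll factor of 4; the patch itself works
in assembly):

```c
#include <stddef.h>

/* Hypothetical sketch of the unrolled aligned copy: both src and dst are
 * word-aligned.  Loading several words before storing any of them lets
 * later loads proceed while earlier ones complete, avoiding the
 * load-to-store stall of back-to-back "load x; store x" pairs. */
void copy_words_unrolled(unsigned long *dst, const unsigned long *src,
                         size_t nwords)
{
    size_t i = 0;

    for (; i + 4 <= nwords; i += 4) {
        unsigned long a = src[i];
        unsigned long b = src[i + 1];
        unsigned long c = src[i + 2];
        unsigned long d = src[i + 3];
        dst[i]     = a;
        dst[i + 1] = b;
        dst[i + 2] = c;
        dst[i + 3] = d;
    }
    for (; i < nwords; i++)   /* too small for the unrolled loop */
        dst[i] = src[i];
}
```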
Finally, copy the remainder one byte at a time.
The motivation for this patch was to improve network performance on the
BeagleV beta board. Profiling with perf showed heavy CPU usage in memcpy
and __asm_copy_to_user, and network speed was limited to around 680 Mbps
on a 1 Gbps LAN.
Typical network applications use system calls with a large buffer on
send()/recv() and sendto()/recvfrom() as an optimization.
The benchmark below patches only copy_user. memcpy is without Matteo's
patches, but both are listed since they are the two largest overheads.
All results are from the same base kernel, the same rootfs and the same
BeagleV beta board.
Results of iperf3 show a speedup on UDP with the copy_user patch alone
(left column: before, right column: with patch).
--- UDP send ---
306 Mbits/sec      362 Mbits/sec
305 Mbits/sec      362 Mbits/sec
--- UDP recv ---
772 Mbits/sec      787 Mbits/sec
773 Mbits/sec      784 Mbits/sec
Comparison by "perf top -Ue task-clock" while running iperf3.
--- TCP recv ---
* Before
40.40% [kernel] [k] memcpy
33.09% [kernel] [k] __asm_copy_to_user
* With patch
50.35% [kernel] [k] memcpy
13.76% [kernel] [k] __asm_copy_to_user
--- TCP send ---
* Before
19.96% [kernel] [k] memcpy
9.84% [kernel] [k] __asm_copy_to_user
* With patch
14.27% [kernel] [k] memcpy
7.37% [kernel] [k] __asm_copy_to_user
--- UDP recv ---
* Before
44.45% [kernel] [k] memcpy
31.04% [kernel] [k] __asm_copy_to_user
* With patch
55.62% [kernel] [k] memcpy
11.22% [kernel] [k] __asm_copy_to_user
--- UDP send ---
* Before
25.18% [kernel] [k] memcpy
22.50% [kernel] [k] __asm_copy_to_user
* With patch
28.90% [kernel] [k] memcpy
9.49% [kernel] [k] __asm_copy_to_user
Signed-off-by: Akira Tsukamoto <akira.tsukamoto@gmail.com>