…pipeline stall
This patch dramatically reduces CPU usage in kernel space, especially
for applications that issue syscalls with large buffer sizes, such as
network applications. The main reason is that every unaligned memory
access raises an exception and triggers a switch between S-mode and
M-mode, causing large overhead.
First, copy single bytes until the destination address reaches a
word-aligned boundary. This is the preparation before the bulk aligned
word copy.
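This head copy can be sketched in C as follows; a minimal sketch, not the
patch's assembly, and the function name is hypothetical:

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical sketch: copy single bytes until dst reaches a word-aligned
 * boundary (or len is exhausted).  Returns the number of bytes copied, so
 * the caller can continue with the bulk aligned word copy at dst + n. */
size_t copy_head_bytes(unsigned char *dst, const unsigned char *src, size_t len)
{
    /* Distance from dst to the next word boundary. */
    size_t head = (-(uintptr_t)dst) & (sizeof(long) - 1);
    size_t n = 0;

    if (head > len)
        head = len;
    while (n < head) {
        dst[n] = src[n];
        n++;
    }
    return n;
}
```

After this returns, dst + n is word-aligned and the word-sized loop can
store without crossing alignment boundaries on the destination side.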
The destination address is now aligned, but the source address often is
not. To reduce unaligned memory accesses, read the source only at
aligned boundaries; the data then carries an offset, which is fixed by
shifting and combining with the data read in the next iteration before
writing to the destination. The majority of the copy-speed improvement
comes from this shift copy.
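The shift copy can be sketched in C as below; a simplified sketch assuming a
little-endian machine (as RISC-V is), with a hypothetical function name, not
the patch's actual assembly:

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical sketch of the shift copy: dst is word-aligned, src may not
 * be.  Loads happen only at word-aligned src addresses; each destination
 * word is assembled from two neighbouring aligned source words by shifting.
 * Assumes the caller may read up to one aligned word past src + n*ws. */
void copy_words_shifted(unsigned long *dst, const unsigned char *src,
                        size_t nwords)
{
    const size_t ws = sizeof(unsigned long);
    size_t off = (uintptr_t)src & (ws - 1);   /* src misalignment in bytes */
    const unsigned long *asrc =
        (const unsigned long *)((uintptr_t)src & ~(uintptr_t)(ws - 1));

    if (off == 0) {                 /* fully aligned: plain word copy */
        for (size_t i = 0; i < nwords; i++)
            dst[i] = asrc[i];
        return;
    }

    unsigned int rshift = off * 8;          /* bits taken from current word */
    unsigned int lshift = (ws - off) * 8;   /* bits taken from next word */
    unsigned long cur = *asrc++;

    for (size_t i = 0; i < nwords; i++) {
        unsigned long next = *asrc++;
        /* Little-endian combine: low bytes from cur, high bytes from next. */
        dst[i] = (cur >> rshift) | (next << lshift);
        cur = next;
    }
}
```

Each iteration issues exactly one aligned load, so no unaligned-access
exception is taken regardless of the source offset.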
In the lucky case where both the source and destination addresses are
aligned, perform register-sized loads and stores in an unrolled loop.
Without unrolling, each store immediately reusing the register just
loaded would stall the pipeline and reduce speed. If the copy size is
too small for the unrolled loop, perform a regular word copy.
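The unrolled aligned copy could look like this in C (a sketch with a
hypothetical name and an assumed unroll factor of 4; the patch itself works
in assembly):

```c
#include <stddef.h>

/* Hypothetical sketch of the unrolled aligned copy: both src and dst are
 * word-aligned.  Loading several words before storing any of them lets
 * later loads proceed while earlier ones complete, avoiding the
 * load-to-store stall of back-to-back "load x; store x" pairs. */
void copy_words_unrolled(unsigned long *dst, const unsigned long *src,
                         size_t nwords)
{
    size_t i = 0;

    for (; i + 4 <= nwords; i += 4) {
        unsigned long a = src[i];
        unsigned long b = src[i + 1];
        unsigned long c = src[i + 2];
        unsigned long d = src[i + 3];
        dst[i]     = a;
        dst[i + 1] = b;
        dst[i + 2] = c;
        dst[i + 3] = d;
    }
    for (; i < nwords; i++)   /* too small for the unrolled loop */
        dst[i] = src[i];
}
```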
Finally, copy the remainder one byte at a time.
The motivation for this patch was to improve network performance on the
BeagleV beta board. Profiling with perf showed heavy CPU usage in memcpy
and __asm_copy_to_user, and network speed was limited to around 680 Mbps
on a 1 Gbps LAN.
Typical network applications use system calls with a large buffer on
send()/recv() and sendto()/recvfrom() as an optimization.
The benchmark below patches only copy_user. memcpy is without Matteo's
patches, but both are listed since they are the two largest overheads.
All results are from the same base kernel, the same rootfs and the same
BeagleV beta board.
Results of iperf3 show a speedup on UDP with the copy_user patch alone
(left column: before, right column: with patch).
--- UDP send ---
306 Mbits/sec      362 Mbits/sec
305 Mbits/sec      362 Mbits/sec
--- UDP recv ---
772 Mbits/sec      787 Mbits/sec
773 Mbits/sec      784 Mbits/sec
Comparison by "perf top -Ue task-clock" while running iperf3.
--- TCP recv ---
* Before
40.40% [kernel] [k] memcpy
33.09% [kernel] [k] __asm_copy_to_user
* With patch
50.35% [kernel] [k] memcpy
13.76% [kernel] [k] __asm_copy_to_user
--- TCP send ---
* Before
19.96% [kernel] [k] memcpy
9.84% [kernel] [k] __asm_copy_to_user
* With patch
14.27% [kernel] [k] memcpy
7.37% [kernel] [k] __asm_copy_to_user
--- UDP recv ---
* Before
44.45% [kernel] [k] memcpy
31.04% [kernel] [k] __asm_copy_to_user
* With patch
55.62% [kernel] [k] memcpy
11.22% [kernel] [k] __asm_copy_to_user
--- UDP send ---
* Before
25.18% [kernel] [k] memcpy
22.50% [kernel] [k] __asm_copy_to_user
* With patch
28.90% [kernel] [k] memcpy
9.49% [kernel] [k] __asm_copy_to_user
Signed-off-by: Akira Tsukamoto <akira.tsukamoto@gmail.com>