
[libc] Optimize memcpy size thresholds #70049

Merged 1 commit into llvm:main on Nov 7, 2023

Conversation

@dvyukov (Collaborator) commented on Oct 24, 2023

Adjust boundary conditions for sizes = 16/32/64.
See the added comment for explanations.

Results on a machine with AVX2, so sizes 64/128 affected:

                │   baseline   │               adjusted               │
                │    sec/op    │   sec/op     vs base                 │
memcpy/Google_A   5.701n ±  0%   5.551n ± 1%   -2.63% (n=100)
memcpy/Google_B   3.817n ±  0%   3.776n ± 0%   -1.07% (p=0.000 n=100)
memcpy/Google_D   11.35n ±  1%   11.32n ± 0%        ~ (p=0.066 n=100)
memcpy/Google_U   3.874n ±  1%   3.821n ± 1%   -1.37% (p=0.001 n=100)
memcpy/64         3.843n ±  0%   3.105n ± 3%  -19.22% (n=50)
memcpy/128        4.842n ±  0%   3.818n ± 0%  -21.15% (p=0.000 n=50)
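
For context, builtin::Memcpy&lt;Size&gt;::head_tail copies the first Size bytes and the last Size bytes of the range, so it covers any count in [Size, 2*Size]. Below is a minimal standalone sketch of that pattern (the head_tail helper is a plain-memcpy stand-in written here for illustration, not the libc implementation), showing why a power-of-two count such as 64 can be served by either the smaller or the larger block size:

#include <cstddef>
#include <cstring>

// Illustrative stand-in for builtin::Memcpy<Size>::head_tail: copy the first
// Size bytes and the last Size bytes; together they cover any count in
// [Size, 2 * Size].
template <std::size_t Size>
void head_tail(char *dst, const char *src, std::size_t count) {
  std::memcpy(dst, src, Size);                               // head: [0, Size)
  std::memcpy(dst + count - Size, src + count - Size, Size); // tail: [count - Size, count)
}

// For count == 64 both choices copy exactly the right bytes:
//   head_tail<32>(dst, src, 64);  // two adjacent, non-overlapping 32-byte copies
//   head_tail<64>(dst, src, 64);  // two fully overlapping ("collapsed") 64-byte copies
// The patch moves each boundary so the larger block is used only when it maps
// onto a single native vector move; otherwise the power-of-two count falls
// into the smaller, non-overlapping branch.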

@dvyukov requested a review from gchatelet on October 24, 2023 14:27
@llvmbot added the libc label on Oct 24, 2023
@llvmbot (Member) commented on Oct 24, 2023

@llvm/pr-subscribers-libc

Author: Dmitry Vyukov (dvyukov)

Changes

Adjust boundary conditions for sizes = 16/32/64.
See the added comment for explanations.

Results on a machine with AVX2, so sizes 64/128 affected:

                │  baseline   │              pow2-sizes              │
                │   sec/op    │   sec/op     vs base                 │
memcpy/Google_B   3.775n ± 1%   3.739n ± 1%   -0.96% (p=0.000 n=100)
memcpy/Google_D   11.32n ± 1%   11.21n ± 1%   -0.97% (p=0.000 n=100)
memcpy/Google_A   5.450n ± 1%   5.428n ± 1%        ~ (p=0.058 n=100)
memcpy/Google_U   3.856n ± 1%   3.826n ± 1%   -0.79% (p=0.015 n=100)
memcpy/64         3.544n ± 0%   3.258n ± 0%   -8.09% (n=100)
memcpy/128        4.747n ± 0%   3.556n ± 0%  -25.10% (n=100)

Full diff: https://github.com/llvm/llvm-project/pull/70049.diff

1 file affected:

  • (modified) libc/src/string/memory_utils/x86_64/inline_memcpy.h (+25-7)
diff --git a/libc/src/string/memory_utils/x86_64/inline_memcpy.h b/libc/src/string/memory_utils/x86_64/inline_memcpy.h
index f43230ffd8ad125..81c995fff51d6d6 100644
--- a/libc/src/string/memory_utils/x86_64/inline_memcpy.h
+++ b/libc/src/string/memory_utils/x86_64/inline_memcpy.h
@@ -55,7 +55,7 @@ LIBC_INLINE_VAR constexpr size_t kRepMovsbThreshold =
 [[maybe_unused]] LIBC_INLINE void
 inline_memcpy_x86_sse2_ge64(Ptr __restrict dst, CPtr __restrict src,
                             size_t count) {
-  if (count < 128)
+  if (count <= 128)
     return builtin::Memcpy<64>::head_tail(dst, src, count);
   builtin::Memcpy<32>::block(dst, src);
   align_to_next_boundary<32, Arg::Dst>(dst, src, count);
@@ -65,7 +65,7 @@ inline_memcpy_x86_sse2_ge64(Ptr __restrict dst, CPtr __restrict src,
 [[maybe_unused]] LIBC_INLINE void
 inline_memcpy_x86_avx_ge64(Ptr __restrict dst, CPtr __restrict src,
                            size_t count) {
-  if (count < 128)
+  if (count <= 128)
     return builtin::Memcpy<64>::head_tail(dst, src, count);
   if (count < 256)
     return builtin::Memcpy<128>::head_tail(dst, src, count);
@@ -79,7 +79,7 @@ inline_memcpy_x86_sse2_ge64_sw_prefetching(Ptr __restrict dst,
                                            CPtr __restrict src, size_t count) {
   using namespace LIBC_NAMESPACE::x86;
   prefetch_to_local_cache(src + kOneCacheline);
-  if (count < 128)
+  if (count <= 128)
     return builtin::Memcpy<64>::head_tail(dst, src, count);
   prefetch_to_local_cache(src + kTwoCachelines);
   // Aligning 'dst' on a 32B boundary.
@@ -120,7 +120,7 @@ inline_memcpy_x86_avx_ge64_sw_prefetching(Ptr __restrict dst,
                                           CPtr __restrict src, size_t count) {
   using namespace LIBC_NAMESPACE::x86;
   prefetch_to_local_cache(src + kOneCacheline);
-  if (count < 128)
+  if (count <= 128)
     return builtin::Memcpy<64>::head_tail(dst, src, count);
   prefetch_to_local_cache(src + kTwoCachelines);
   prefetch_to_local_cache(src + kThreeCachelines);
@@ -149,6 +149,15 @@ inline_memcpy_x86_avx_ge64_sw_prefetching(Ptr __restrict dst,
 
 [[maybe_unused]] LIBC_INLINE void
 inline_memcpy_x86(Ptr __restrict dst, CPtr __restrict src, size_t count) {
+#if defined(__AVX512F__)
+  constexpr size_t vector_size = 64;
+#elif defined(__AVX__)
+  constexpr size_t vector_size = 32;
+#elif defined(__SSE2__)
+  constexpr size_t vector_size = 16;
+#else
+  constexpr size_t vector_size = 8;
+#endif
   if (count == 0)
     return;
   if (count == 1)
@@ -161,11 +170,20 @@ inline_memcpy_x86(Ptr __restrict dst, CPtr __restrict src, size_t count) {
     return builtin::Memcpy<4>::block(dst, src);
   if (count < 8)
     return builtin::Memcpy<4>::head_tail(dst, src, count);
-  if (count < 16)
+  // If count is equal to a power of 2, we can handle it as head-tail
+  // of both smaller size and larger size (head-tail are either
+  // non-overlapping for smaller size, or completely collapsed
+  // for larger size). It seems to be more profitable to do the copy
+  // with the larger size, if it's natively supported (e.g. doing
+  // 2 collapsed 32-byte moves for count=64 if AVX2 is supported).
+  // But it's not profitable to use larger size if it's not natively
+  // supported: we will both use more instructions and handle fewer
+  // sizes in earlier branches.
+  if (count < 16 + (vector_size <= 8))
     return builtin::Memcpy<8>::head_tail(dst, src, count);
-  if (count < 32)
+  if (count < 32 + (vector_size <= 16))
     return builtin::Memcpy<16>::head_tail(dst, src, count);
-  if (count < 64)
+  if (count < 64 + (vector_size <= 32))
     return builtin::Memcpy<32>::head_tail(dst, src, count);
   if constexpr (x86::kAvx) {
     if constexpr (x86::kUseSoftwarePrefetching) {

@dvyukov force-pushed the dvyukov-memcpy-sizes branch from 0029ba7 to 4092097 on October 28, 2023 05:35
@dvyukov (Collaborator, Author) commented on Oct 28, 2023

I've updated the change to use the same approach as memmove.
PTAL

@gchatelet (Contributor) left a comment

LGTM

@dvyukov merged commit d275277 into llvm:main on Nov 7, 2023