
build : enable link-time optimizations #3859

Closed · wants to merge 5 commits

Conversation

@ggerganov (Owner)

ref #3858

Try to restore the performance to what it was before the refactoring in #3833.
It seems the ggml_fp16_to_fp32 and ggml_fp32_to_fp16 calls slow down processing significantly, at least with ARM_NEON; this hasn't been confirmed for x86 architectures.

@ggerganov (Owner, Author)

@wro52 Could you test this branch and see if it fixes the performance in your environment?

@slaren (Collaborator) commented Oct 30, 2023

For me, with gcc 12.3.0 under Linux, it doesn't seem to change anything either way, but neither did #3833. It does, however, increase full build time (with make) by ~20% and add a lot of warnings from gcc.

| model | size | params | backend | threads | test | t/s |
| --- | --- | --- | --- | --- | --- | --- |
| llama 7B mostly Q4_0 | 3.56 GiB | 6.74 B | CPU | 16 | tg 128 | 13.39 ± 0.37 |
| llama 7B mostly F16 | 12.55 GiB | 6.74 B | CPU | 8 | tg 32 | 4.25 ± 0.04 |

build: 1206b5f (1446) build time: 27.87 secs

| model | size | params | backend | threads | test | t/s |
| --- | --- | --- | --- | --- | --- | --- |
| llama 7B mostly Q4_0 | 3.56 GiB | 6.74 B | CPU | 16 | tg 128 | 13.30 ± 0.36 |
| llama 7B mostly F16 | 12.55 GiB | 6.74 B | CPU | 8 | tg 32 | 4.23 ± 0.03 |

build: 6e08281 (1445) build time: 23.15 secs

| model | size | params | backend | threads | test | t/s |
| --- | --- | --- | --- | --- | --- | --- |
| llama 7B mostly Q4_0 | 3.56 GiB | 6.74 B | CPU | 16 | tg 128 | 13.43 ± 0.05 |
| llama 7B mostly F16 | 12.55 GiB | 6.74 B | CPU | 8 | tg 32 | 4.20 ± 0.04 |

build: 82a6646 (1440) build time: 22.41 secs

@Green-Sky (Collaborator)

> adds a lot of warnings from gcc.

Mostly duplicate symbols, right? I think I saw those when I activated LTO locally. Having the same non-private symbols, or the same symbol names, across multiple TUs is bad practice anyway, so those warnings are legitimate.
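
For illustration, a minimal sketch (hypothetical file and function names) of the pattern in question: two translation units each carrying a copy of a small helper. With external linkage the two definitions collide at link/LTO time; marking them `static` keeps each copy private to its TU.

```c
/* util_a.c */
static float clampf(float v, float lo, float hi) { /* internal linkage: private to this TU */
    return v < lo ? lo : (v > hi ? hi : v);
}

/* util_b.c */
static float clampf(float v, float lo, float hi) { /* its own private copy, no symbol clash */
    return v < lo ? lo : (v > hi ? hi : v);
}

/* Dropping `static` from both would leave two external definitions of the
 * same symbol, which the linker (and LTO's whole-program view) flags. */
```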

@slaren (Collaborator) commented Oct 30, 2023

No, it is just this warning repeated many times:

```
lto-wrapper: warning: using serial compilation of 3 LTRANS jobs
lto-wrapper: note: see the ‘-flto’ option documentation for more information
```

@ggerganov (Owner, Author)

I also get these warnings; not sure how to fix them. Any alternatives?
Without LTO, on master I get:

| model | size | params | backend | ngl | threads | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- |
| llama 7B mostly Q4_0 | 3.56 GiB | 6.74 B | Metal | 0 | 8 | tg 128 | 19.03 ± 0.03 |

build: 6e08281 (1445)

With LTO, and also before #3833, I get:

| model | size | params | backend | ngl | threads | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- |
| llama 7B mostly Q4_0 | 3.56 GiB | 6.74 B | Metal | 0 | 8 | tg 128 | 33.89 ± 0.12 |

build: a6aba2c (1448)

Almost 2x slowdown.

Also, on master with this patch the performance is restored:

```diff
diff --git a/ggml-quants.c b/ggml-quants.c
index fd4ee1be..5a5ed16f 100644
--- a/ggml-quants.c
+++ b/ggml-quants.c
@@ -6,6 +6,9 @@
 #include <assert.h>
 #include <float.h>
 
+#define ggml_fp16_to_fp32
+#define ggml_fp32_to_fp16
+
 #ifdef __ARM_NEON
 
 // if YCM cannot find <arm_neon.h>, make a symbolic link to it, for example:
```

But this works only for ARM_NEON, where there is a native F16 <-> F32 cast.
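
For context on why the patch is NEON-only: on ARM with NEON, ggml's half type is the native `__fp16`, so once the call is gone a plain cast remains and the compiler lowers it to a single conversion instruction; on x86 the half type is just a 16-bit bit pattern and the conversion is real code in another translation unit. A rough sketch (not the exact ggml definitions):

```c
#if defined(__ARM_NEON) && !defined(_MSC_VER)
typedef __fp16 ggml_fp16_t;             /* native half type */
static inline float fp16_to_fp32(ggml_fp16_t h) {
    return (float) h;                   /* one vcvt instruction once inlined */
}
#else
typedef unsigned short ggml_fp16_t;     /* just a bit pattern elsewhere */
/* the real conversion (F16C intrinsic or table lookup) is defined in
 * another TU, so every call pays function-call overhead unless the
 * definition is visible for inlining (shared header or LTO) */
float fp16_to_fp32(ggml_fp16_t h);
#endif
```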

@slaren (Collaborator) commented Oct 30, 2023

The warnings disappear with -flto=auto:

```diff
diff --git a/Makefile b/Makefile
index 2a2ac850..348143e0 100644
--- a/Makefile
+++ b/Makefile
@@ -124,7 +124,7 @@ MK_CFLAGS        += -Ofast -flto
 MK_HOST_CXXFLAGS += -Ofast
 MK_CUDA_CXXFLAGS += -O3
 else
-MK_CFLAGS        += -O3 -flto
+MK_CFLAGS        += -O3 -flto=auto
 MK_CXXFLAGS      += -O3
 endif
```
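
For reference: with `-flto=auto`, GCC sizes the LTRANS job pool automatically, using the GNU make jobserver when available and otherwise the detected number of CPU threads, which both parallelizes the link-time step and silences the serial-compilation warning above.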

@Green-Sky (Collaborator)

We could leave LTO OFF by default, like before, but turn it ON in the CI (not a real solution, though).

@slaren (Collaborator) commented Oct 30, 2023

I guess we could move the fp16 conversion functions to an internal header ggml-impl.h.
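
A rough sketch of that idea (illustrative contents only; the actual header landed in #3861 and may differ): the converters become `static inline` in a private header, so every TU that includes it gets an inlinable definition without needing LTO.

```c
/* ggml-impl.h: internal header shared by ggml.c and ggml-quants.c (sketch) */
#pragma once

#if defined(__ARM_NEON) && !defined(_MSC_VER)

typedef __fp16 ggml_fp16_internal_t;

static inline float ggml_compute_fp16_to_fp32(ggml_fp16_internal_t h) {
    return (float) h;                   /* native cast, inlined at every call site */
}

static inline ggml_fp16_internal_t ggml_compute_fp32_to_fp16(float f) {
    return (ggml_fp16_internal_t) f;
}

#else
/* other targets would carry equivalent static inline converters here
 * (F16C intrinsics or bit manipulation), so includers can still inline them */
#endif
```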

@slaren (Collaborator) commented Oct 30, 2023

The inlining issue also used to be a problem with early versions of ggml-cuda, and may still be a problem with ggml-opencl, since it inherited this code:

llama.cpp/ggml-opencl.cpp, lines 1618 to 1623 in 6e08281:

```cpp
for (int64_t i11 = 0; i11 < ne11; i11++) {
    for (int64_t i10 = 0; i10 < ne10; i10++) {
        // very slow due to no inlining
        tmp[i11*ne10 + i10] = ggml_fp32_to_fp16(*(float *) (src1i + i11*nb11 + i10*nb10));
    }
}
```
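
(The per-element cost in that loop is the same cross-TU call overhead discussed above; moving the converter into a shared `static inline` header would address this spot as well.)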

@Green-Sky (Collaborator) commented Oct 30, 2023

We might get away with something like -finline-limit=

Since ggml uses a lot of loops, -faggressive-loop-optimizations sounds interesting too.
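
If someone wants to experiment, a sketch of how those flags might be wired into the Makefile (placeholder value, untested). One caveat: `-finline-limit` only raises the inlining budget for definitions the compiler can already see in the current TU, so by itself it would not inline the cross-TU fp16 calls without LTO or the header move.

```make
# experiment only: placeholder value, not tuned
# (-faggressive-loop-optimizations is believed to already be on by default in GCC)
MK_CFLAGS += -O3 -finline-limit=8192 -faggressive-loop-optimizations
```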

@ggerganov (Owner, Author)

> I guess we could move the fp16 conversion functions to an internal header ggml-impl.h.

This can work. Implemented here: #3861

@wro52 commented Oct 30, 2023

> @wro52 Could you test this branch and see if it fixes the performance in your environment?

Tried every possible speed setting; no significant influence.

@ggerganov
Copy link
Owner Author

Merged #3861 instead.

@ggerganov ggerganov closed this Oct 30, 2023