
build : enable link-time optimizations #3859

Closed · wants to merge 5 commits

Conversation

@ggerganov (Owner)

ref #3858

Try to restore the performance to what it was before the refactoring in #3833.
It seems the ggml_fp16_to_fp32 and ggml_fp32_to_fp16 calls slow down processing significantly, at least with ARM_NEON; this hasn't been confirmed for x86 architectures.

@ggerganov (Owner, Author)

@wro52 Could you test this branch and see if it fixes the performance in your environment?

@slaren (Collaborator) commented Oct 30, 2023

For me, with gcc 12.3.0 under Linux, it doesn't seem to change anything either way, but neither did #3833. It does, however, increase full build time (with make) by ~20% and add a lot of warnings from gcc.

| model | size | params | backend | threads | test | t/s |
| --- | --- | --- | --- | --- | --- | --- |
| llama 7B mostly Q4_0 | 3.56 GiB | 6.74 B | CPU | 16 | tg 128 | 13.39 ± 0.37 |
| llama 7B mostly F16 | 12.55 GiB | 6.74 B | CPU | 8 | tg 32 | 4.25 ± 0.04 |

build: 1206b5f (1446) build time: 27.87 secs

| model | size | params | backend | threads | test | t/s |
| --- | --- | --- | --- | --- | --- | --- |
| llama 7B mostly Q4_0 | 3.56 GiB | 6.74 B | CPU | 16 | tg 128 | 13.30 ± 0.36 |
| llama 7B mostly F16 | 12.55 GiB | 6.74 B | CPU | 8 | tg 32 | 4.23 ± 0.03 |

build: 6e08281 (1445) build time: 23.15 secs

| model | size | params | backend | threads | test | t/s |
| --- | --- | --- | --- | --- | --- | --- |
| llama 7B mostly Q4_0 | 3.56 GiB | 6.74 B | CPU | 16 | tg 128 | 13.43 ± 0.05 |
| llama 7B mostly F16 | 12.55 GiB | 6.74 B | CPU | 8 | tg 32 | 4.20 ± 0.04 |

build: 82a6646 (1440) build time: 22.41 secs

@Green-Sky (Collaborator)

> adds a lot of warnings from gcc.

Mostly duplicate symbols, right? I think I saw those when I activated LTO locally. Having the same non-private symbols, or the same symbol names, across multiple TUs is bad practice anyway, so those warnings are legitimate.
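
For illustration, a minimal sketch (hypothetical file and function names) of the pattern in question: two translation units each carrying a copy of a small helper. With external linkage the two definitions collide at link/LTO time; marking them `static` keeps each copy private to its TU.

```c
/* util_a.c */
static float clampf(float v, float lo, float hi) { /* internal linkage: private to this TU */
    return v < lo ? lo : (v > hi ? hi : v);
}

/* util_b.c */
static float clampf(float v, float lo, float hi) { /* its own private copy, no symbol clash */
    return v < lo ? lo : (v > hi ? hi : v);
}

/* Dropping `static` from both would leave two external definitions of the
 * same symbol, which the linker (and LTO's whole-program view) flags. */
```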

@slaren (Collaborator) commented Oct 30, 2023

No, it is just this warning repeated many times:

```
lto-wrapper: warning: using serial compilation of 3 LTRANS jobs
lto-wrapper: note: see the ‘-flto’ option documentation for more information
```

@ggerganov (Owner, Author)

I also get these warnings; not sure how to fix them. Any alternatives?
Without LTO, on master I get:

| model | size | params | backend | ngl | threads | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- |
| llama 7B mostly Q4_0 | 3.56 GiB | 6.74 B | Metal | 0 | 8 | tg 128 | 19.03 ± 0.03 |

build: 6e08281 (1445)

With LTO, and also before #3833, I get:

| model | size | params | backend | ngl | threads | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- |
| llama 7B mostly Q4_0 | 3.56 GiB | 6.74 B | Metal | 0 | 8 | tg 128 | 33.89 ± 0.12 |

build: a6aba2c (1448)

Almost 2x slowdown.

Also, on master with this patch the performance is restored:

```diff
diff --git a/ggml-quants.c b/ggml-quants.c
index fd4ee1be..5a5ed16f 100644
--- a/ggml-quants.c
+++ b/ggml-quants.c
@@ -6,6 +6,9 @@
 #include <assert.h>
 #include <float.h>
 
+#define ggml_fp16_to_fp32
+#define ggml_fp32_to_fp16
+
 #ifdef __ARM_NEON
 
 // if YCM cannot find <arm_neon.h>, make a symbolic link to it, for example:
```

But this works only for ARM_NEON, where there is a native F16 <-> F32 cast.
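
For context on why the patch is NEON-only: on ARM with NEON, ggml's half type is the native `__fp16`, so once the call is gone a plain cast remains and the compiler lowers it to a single conversion instruction; on x86 the half type is just a 16-bit bit pattern and the conversion is real code in another translation unit. A rough sketch (not the exact ggml definitions):

```c
#if defined(__ARM_NEON) && !defined(_MSC_VER)
typedef __fp16 ggml_fp16_t;             /* native half type */
static inline float fp16_to_fp32(ggml_fp16_t h) {
    return (float) h;                   /* one vcvt instruction once inlined */
}
#else
typedef unsigned short ggml_fp16_t;     /* just a bit pattern elsewhere */
/* the real conversion (F16C intrinsic or table lookup) is defined in
 * another TU, so every call pays function-call overhead unless the
 * definition is visible for inlining (shared header or LTO) */
float fp16_to_fp32(ggml_fp16_t h);
#endif
```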

@slaren (Collaborator) commented Oct 30, 2023

The warnings disappear with -flto=auto:

```diff
diff --git a/Makefile b/Makefile
index 2a2ac850..348143e0 100644
--- a/Makefile
+++ b/Makefile
@@ -124,7 +124,7 @@ MK_CFLAGS        += -Ofast -flto
 MK_HOST_CXXFLAGS += -Ofast
 MK_CUDA_CXXFLAGS += -O3
 else
-MK_CFLAGS        += -O3 -flto
+MK_CFLAGS        += -O3 -flto=auto
 MK_CXXFLAGS      += -O3
 endif
```
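
For reference: with `-flto=auto`, GCC sizes the LTRANS job pool automatically, using the GNU make jobserver when available and otherwise the detected number of CPU threads, which both parallelizes the link-time step and silences the serial-compilation warning above.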

@Green-Sky (Collaborator)

We could leave LTO OFF by default, like before, but turn it ON in the CI (not a real solution, though).

@slaren (Collaborator) commented Oct 30, 2023

I guess we could move the fp16 conversion functions to an internal header ggml-impl.h.
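
A rough sketch of that idea (illustrative contents only; the actual header landed in #3861 and may differ): the converters become `static inline` in a private header, so every TU that includes it gets an inlinable definition without needing LTO.

```c
/* ggml-impl.h: internal header shared by ggml.c and ggml-quants.c (sketch) */
#pragma once

#if defined(__ARM_NEON) && !defined(_MSC_VER)

typedef __fp16 ggml_fp16_internal_t;

static inline float ggml_compute_fp16_to_fp32(ggml_fp16_internal_t h) {
    return (float) h;                   /* native cast, inlined at every call site */
}

static inline ggml_fp16_internal_t ggml_compute_fp32_to_fp16(float f) {
    return (ggml_fp16_internal_t) f;
}

#else
/* other targets would carry equivalent static inline converters here
 * (F16C intrinsics or bit manipulation), so includers can still inline them */
#endif
```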

@slaren (Collaborator) commented Oct 30, 2023

The inlining issue also used to be a problem with early versions of ggml-cuda, and may still be a problem with ggml-opencl, since it inherited this code:

llama.cpp/ggml-opencl.cpp, lines 1618 to 1623 in 6e08281:

```cpp
for (int64_t i11 = 0; i11 < ne11; i11++) {
    for (int64_t i10 = 0; i10 < ne10; i10++) {
        // very slow due to no inlining
        tmp[i11*ne10 + i10] = ggml_fp32_to_fp16(*(float *) (src1i + i11*nb11 + i10*nb10));
    }
}
```
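
(The per-element cost in that loop is the same cross-TU call overhead discussed above; moving the converter into a shared `static inline` header would address this spot as well.)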

@Green-Sky (Collaborator) commented Oct 30, 2023

We might get away with something like -finline-limit=

Since ggml uses a lot of loops, -faggressive-loop-optimizations sounds interesting too.
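
If someone wants to experiment, a sketch of how those flags might be wired into the Makefile (placeholder value, untested). One caveat: `-finline-limit` only raises the inlining budget for definitions the compiler can already see in the current TU, so by itself it would not inline the cross-TU fp16 calls without LTO or the header move.

```make
# experiment only: placeholder value, not tuned
# (-faggressive-loop-optimizations is believed to already be on by default in GCC)
MK_CFLAGS += -O3 -finline-limit=8192 -faggressive-loop-optimizations
```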

@ggerganov (Owner, Author)

> I guess we could move the fp16 conversion functions to an internal header ggml-impl.h.

This can work. Implemented here: #3861

@wro52 commented Oct 30, 2023

> @wro52 Could you test this branch and see if it fixes the performance in your environment?

Tried every possible speed setting; no significant influence.

@ggerganov
Copy link
Owner Author

Merged #3861 instead.

@ggerganov ggerganov closed this Oct 30, 2023