
Use -march=native -mtune=native on x86 (Also Enables AVX512 on macOS) #609

Merged (1 commit) on Apr 2, 2023

Conversation

cmdrf (Contributor) commented Mar 29, 2023

On my 2019 Mac Pro I have these CPU features:

$ sysctl machdep.cpu.leaf7_features
machdep.cpu.leaf7_features: RDWRFSGS TSC_THREAD_OFFSET BMI1 AVX2 FDPEO SMEP BMI2 ERMS INVPCID PQM FPU_CSDS MPX PQE AVX512F AVX512DQ RDSEED ADX SMAP CLFSOPT CLWB IPT AVX512CD AVX512BW AVX512VL PKU AVX512VNNI MDCLEAR IBRS STIBP L1DF ACAPMSR SSBD

Although I was wondering: Why not use -march=native?

EDIT:

Using -march=native -mtune=native on x86 now. Could potentially be extended to other architectures, although the meaning of -march, -mtune and -mcpu is a bit convoluted across different architectures.

slaren (Member) commented Mar 30, 2023

Although I was wondering: Why not use -march=native?

That's a good question. I cannot see any disadvantages on my system, and it significantly simplifies the Makefile.

Ameobea (Contributor) commented Mar 31, 2023

@cmdrf would you be able to do a little performance comparison with and without these changes on your system? I'd be very curious to see what kind of benefit you see.

We've seen rather small improvements (~10% or so at best) on desktop systems with AMD and Intel CPUs, and I'd be interested to see if those benefits translate through to a laptop.

cmdrf (Contributor, Author) commented Mar 31, 2023

Sure, here you go! Although it's also a desktop system (Xeon W-3223):
Invocation:

./main -m ./models/65B/ggml-model-q4_0.bin -p "Building a website can be done in 10 simple steps:" -n 512 -s 123 -t 16

Without AVX512:

llama_print_timings:        load time = 11347.83 ms
llama_print_timings:      sample time =   604.25 ms /   512 runs   (    1.18 ms per run)
llama_print_timings: prompt eval time = 89523.36 ms /   271 tokens (  330.34 ms per token)
llama_print_timings:        eval time = 931716.51 ms /   510 runs   ( 1826.90 ms per run)
llama_print_timings:       total time = 1023737.14 ms

With AVX512:

llama_print_timings:        load time = 10401.94 ms
llama_print_timings:      sample time =   617.68 ms /   512 runs   (    1.21 ms per run)
llama_print_timings: prompt eval time = 87499.03 ms /   271 tokens (  322.87 ms per token)
llama_print_timings:        eval time = 897070.17 ms /   510 runs   ( 1758.96 ms per run)
llama_print_timings:       total time = 987099.97 ms

Which translates to an improvement of 3.7%, taking the total time into account.
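For reference, the 3.7% figure follows from comparing the two total times, with the AVX512 run as the denominator (a quick awk check using the numbers above):

```shell
# Relative speedup from the total times: 1023737.14 ms without AVX512,
# 987099.97 ms with AVX512.
awk 'BEGIN { printf "%.1f%%\n", (1023737.14 / 987099.97 - 1) * 100 }'
# prints 3.7%
```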

Btw: I passed in the same RNG seed in both runs and was expecting to get the same output, but it was different.

Ameobea (Contributor) commented Mar 31, 2023

That is great to see!

I've heard reports of AVX512 hurting performance on certain Intel CPUs due to frequency throttling, so I'm happy to see that's not the case for this particular application.

Thanks very much for benchmarking!

cmdrf (Contributor, Author) commented Mar 31, 2023

I did the same benchmark with -march=native -mtune=native, and it is now 8.3% faster than the original! I'll change the PR accordingly.

One interesting thing I noticed: The output was now exactly the same as with my other AVX512 run. This implies that the output is not only dependent on the prompt and the RNG seed, but also on AVX512 being enabled or not. Could this potentially be a bug in the hand-crafted routines for AVX512? Or is this to be expected?

cmdrf changed the title from "Enable AVX512 on macOS" to "Use -march=native -mtune=native on x86 (Also Enables AVX512 on macOS)" on Mar 31, 2023
slaren (Member) commented Mar 31, 2023

The output was now exactly the same as with my other AVX512 run. This implies that the output is not only dependent on the prompt and the RNG seed, but also on AVX512 being enabled or not. Could this potentially be a bug in the hand-crafted routines for AVX512? Or is this to be expected?

It is expected: not every code path performs the exact same floating-point operations in the same order, and that can produce slightly different results. As long as the generation quality (as measured by perplexity) isn't affected, this is not an issue.
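The order-dependence is a general property of IEEE floating-point arithmetic, not something specific to this code. A minimal illustration (using awk, which computes in doubles): summing the same three values in a different grouping yields results that differ in the last bit.

```shell
# (0.1 + 0.2) + 0.3 and 0.1 + (0.2 + 0.3) are not equal in doubles:
# floating-point addition is not associative.
awk 'BEGIN {
    a = (0.1 + 0.2) + 0.3
    b = 0.1 + (0.2 + 0.3)
    printf "%.17g\n%.17g\n%s\n", a, b, (a == b) ? "equal" : "different"
}'
# last line printed: different
```

Vectorized paths (SSE, AVX2, AVX512) accumulate in different lane orders and widths, so tiny differences like this are expected to propagate through the computation.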

ggerganov (Member) commented

Although I was wondering: Why not use -march=native?

There is no good reason not to use it.
I somehow thought it was better to be explicit with these flags, but experience shows that this causes more problems than it solves.

So yes, let's start using -march=native from now on.

Regarding different results - correct.
Since we are dealing with floating point numbers and various storage formats depending on the CPU flags, we will observe variability in the results. It's not ideal, but it's not a bug. Maybe in the future we can make the computations reproducible across different platforms. Not sure how difficult that would be though.
