
Use -march=native -mtune=native on x86 (Also Enables AVX512 on macOS) #609

Merged (1 commit) on Apr 2, 2023

Conversation

cmdrf (Contributor) commented Mar 29, 2023

On my 2019 Mac Pro I have these CPU features:

$ sysctl machdep.cpu.leaf7_features
machdep.cpu.leaf7_features: RDWRFSGS TSC_THREAD_OFFSET BMI1 AVX2 FDPEO SMEP BMI2 ERMS INVPCID PQM FPU_CSDS MPX PQE AVX512F AVX512DQ RDSEED ADX SMAP CLFSOPT CLWB IPT AVX512CD AVX512BW AVX512VL PKU AVX512VNNI MDCLEAR IBRS STIBP L1DF ACAPMSR SSBD

Although I was wondering: Why not use -march=native?

EDIT:

Using -march=native -mtune=native on x86 now. Could potentially be extended to other architectures, although the meaning of -march, -mtune and -mcpu is a bit convoluted across different architectures.

slaren (Member) commented Mar 30, 2023

Although I was wondering: Why not use -march=native?

That's a good question. I cannot see any disadvantages on my system, and it significantly simplifies the Makefile.

Ameobea (Contributor) commented Mar 31, 2023

@cmdrf would you be able to do a little performance comparison with and without these changes on your system? I'd be very curious to see what kind of benefit you see.

We've seen rather small improvements (~10% or so at best) on desktop systems with AMD and Intel CPUs, and I'd be interested to see if those benefits translate through to a laptop.

cmdrf (Contributor, Author) commented Mar 31, 2023

Sure, here you go! Although it's also a desktop system (Xeon W-3223):
Invocation:

./main -m ./models/65B/ggml-model-q4_0.bin -p "Building a website can be done in 10 simple steps:" -n 512 -s 123 -t 16

Without AVX512:

llama_print_timings:        load time = 11347.83 ms
llama_print_timings:      sample time =   604.25 ms /   512 runs   (    1.18 ms per run)
llama_print_timings: prompt eval time = 89523.36 ms /   271 tokens (  330.34 ms per token)
llama_print_timings:        eval time = 931716.51 ms /   510 runs   ( 1826.90 ms per run)
llama_print_timings:       total time = 1023737.14 ms

With AVX512:

llama_print_timings:        load time = 10401.94 ms
llama_print_timings:      sample time =   617.68 ms /   512 runs   (    1.21 ms per run)
llama_print_timings: prompt eval time = 87499.03 ms /   271 tokens (  322.87 ms per token)
llama_print_timings:        eval time = 897070.17 ms /   510 runs   ( 1758.96 ms per run)
llama_print_timings:       total time = 987099.97 ms

Which translates to an improvement of 3.7%, taking the total time into account.
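For reference, the 3.7% figure follows from comparing the two total times, with the AVX512 run as the denominator (a quick awk check using the numbers above):

```shell
# Relative speedup from the total times: 1023737.14 ms without AVX512,
# 987099.97 ms with AVX512.
awk 'BEGIN { printf "%.1f%%\n", (1023737.14 / 987099.97 - 1) * 100 }'
# prints 3.7%
```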

Btw: I passed in the same RNG seed in both runs and was expecting to get the same output, but it was different.

Ameobea (Contributor) commented Mar 31, 2023

That is great to see!

I've heard reports of AVX512 hurting performance on certain Intel CPUs due to frequency throttling, so I'm happy to see that's not the case for this particular application.

Thanks very much for benchmarking!

cmdrf (Contributor, Author) commented Mar 31, 2023

I did the same benchmark with -march=native -mtune=native, and it is now 8.3% faster than the original! I'll change the PR accordingly.

One interesting thing I noticed: The output was now exactly the same as with my other AVX512 run. This implies that the output is not only dependent on the prompt and the RNG seed, but also on AVX512 being enabled or not. Could this potentially be a bug in the hand-crafted routines for AVX512? Or is this to be expected?

cmdrf changed the title from "Enable AVX512 on macOS" to "Use -march=native -mtune=native on x86 (Also Enables AVX512 on macOS)" on Mar 31, 2023
slaren (Member) commented Mar 31, 2023

The output was now exactly the same as with my other AVX512 run. This implies that the output is not only dependent on the prompt and the RNG seed, but also on AVX512 being enabled or not. Could this potentially be a bug in the hand-crafted routines for AVX512? Or is this to be expected?

It is expected: not every code path performs the exact same floating-point operations in the same order, and that can produce slightly different results. As long as the generation quality (as measured by perplexity) isn't affected, this is not an issue.
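The order-dependence is a general property of IEEE floating-point arithmetic, not something specific to this code. A minimal illustration (using awk, which computes in doubles): summing the same three values in a different grouping yields results that differ in the last bit.

```shell
# (0.1 + 0.2) + 0.3 and 0.1 + (0.2 + 0.3) are not equal in doubles:
# floating-point addition is not associative.
awk 'BEGIN {
    a = (0.1 + 0.2) + 0.3
    b = 0.1 + (0.2 + 0.3)
    printf "%.17g\n%.17g\n%s\n", a, b, (a == b) ? "equal" : "different"
}'
# last line printed: different
```

Vectorized paths (SSE, AVX2, AVX512) accumulate in different lane orders and widths, so tiny differences like this are expected to propagate through the computation.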

ggerganov (Member) commented

Although I was wondering: Why not use -march=native?

There is no good reason not to use it.
I somehow thought it was better to be explicit with these flags, but experience shows that this causes more problems than it solves.

So yes, let's start using -march=native from now on.

Regarding different results - correct.
Since we are dealing with floating point numbers and various storage formats depending on the CPU flags, we will observe variability in the results. It's not ideal, but it's not a bug. Maybe in the future we can make the computations reproducible across different platforms. Not sure how difficult that would be though.
