@duduta duduta commented Sep 12, 2025

This PR optimizes the ggml norm operation.

  • use ggml_vec_sum_f32 instead of summing in a loop
  • if available, use Accelerate to compute the variance
  • implement ggml_vec_centered_variance_f32 with intrinsics to compute the variance
  • add performance tests for NORM to test-backend-ops

The implementation of ggml_vec_centered_variance_f32 mirrors
ggml_vec_soft_max_f32 for consistency.
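
For context, here is a minimal sketch of the two variance paths described above. This is illustrative only, not the PR's actual code; the helper names and the way the GGML_USE_ACCELERATE guard is used here are assumptions:

```c
#include <stddef.h>
#if defined(GGML_USE_ACCELERATE)
#include <Accelerate/Accelerate.h>
#endif

// Centered variance: sum((x[i] - mean)^2) / n for a precomputed mean.
// ggml_vec_centered_variance_f32 vectorizes a loop like this with
// intrinsics, mirroring the dispatch style of ggml_vec_soft_max_f32.
static float centered_variance_f32(size_t n, const float * x, float mean) {
    float sum = 0.0f;
    for (size_t i = 0; i < n; ++i) {
        const float d = x[i] - mean;
        sum += d * d;
    }
    return sum / (float) n;
}

#if defined(GGML_USE_ACCELERATE)
// One possible Accelerate route (also illustrative): vDSP_measqv yields
// the mean of squares, so variance = E[x^2] - mean^2. Note the
// uncentered form can lose precision when mean^2 dominates the spread.
static float variance_accelerate_f32(size_t n, const float * x, float mean) {
    float meansq = 0.0f;
    vDSP_measqv(x, 1, &meansq, (vDSP_Length) n);
    return meansq - mean * mean;
}
#endif
```

The mean itself comes from ggml_vec_sum_f32 divided by the row length, replacing the previous open-coded accumulation loop.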

I tested on an AVX2-capable CPU:

Device description: Intel(R) Core(TM) i7-4750HQ CPU @ 2.00GHz
Device memory: 16384 MB (16384 MB free)
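
The perf cases can be reproduced like this (a sketch assuming a standard CMake build; the binary path may differ on other setups):

```sh
cmake -B build
cmake --build build --target test-backend-ops
./build/bin/test-backend-ops perf -b CPU -o NORM
```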

Results from `test-backend-ops perf -b CPU -o NORM`:

BEFORE:

  NORM(type=f32,ne=[64,5,4,3],v=0,eps=0.000000):    262144 runs -  3.82 us/run -  30 kB/run -   7.48 GB/s
  NORM(type=f32,ne=[64,5,4,3],v=1,eps=0.000000):    843673 runs -  1.19 us/run -   2 kB/run -   1.71 GB/s
  NORM(type=f32,ne=[64,5,4,3],v=0,eps=0.000001):    270336 runs -  3.78 us/run -  30 kB/run -   7.58 GB/s
  NORM(type=f32,ne=[64,5,4,3],v=1,eps=0.000001):    851864 runs -  1.19 us/run -   2 kB/run -   1.71 GB/s
  NORM(type=f32,ne=[64,5,4,3],v=0,eps=0.000100):    270336 runs -  3.75 us/run -  30 kB/run -   7.64 GB/s
  NORM(type=f32,ne=[64,5,4,3],v=1,eps=0.000100):    909201 runs -  1.10 us/run -   2 kB/run -   1.84 GB/s
  NORM(type=f32,ne=[64,5,4,3],v=0,eps=0.100000):    253952 runs -  3.94 us/run -  30 kB/run -   7.25 GB/s
  NORM(type=f32,ne=[64,5,4,3],v=1,eps=0.100000):    909201 runs -  1.11 us/run -   2 kB/run -   1.83 GB/s

AFTER:

  NORM(type=f32,ne=[64,5,4,3],v=0,eps=0.000000):    450560 runs -  2.22 us/run -  30 kB/run -  12.89 GB/s
  NORM(type=f32,ne=[64,5,4,3],v=1,eps=0.000000):    851864 runs -  1.19 us/run -   2 kB/run -   1.70 GB/s
  NORM(type=f32,ne=[64,5,4,3],v=0,eps=0.000001):    450560 runs -  2.23 us/run -  30 kB/run -  12.86 GB/s
  NORM(type=f32,ne=[64,5,4,3],v=1,eps=0.000001):    991111 runs -  1.01 us/run -   2 kB/run -   2.00 GB/s
  NORM(type=f32,ne=[64,5,4,3],v=0,eps=0.000100):    458752 runs -  2.20 us/run -  30 kB/run -  12.99 GB/s
  NORM(type=f32,ne=[64,5,4,3],v=1,eps=0.000100):    991111 runs -  1.02 us/run -   2 kB/run -   1.99 GB/s
  NORM(type=f32,ne=[64,5,4,3],v=0,eps=0.100000):    450560 runs -  2.25 us/run -  30 kB/run -  12.74 GB/s
  NORM(type=f32,ne=[64,5,4,3],v=1,eps=0.100000):    974729 runs -  1.03 us/run -   2 kB/run -   1.96 GB/s
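
In short: the v=0 cases go from roughly 7.5 GB/s to roughly 12.9 GB/s (about a 1.7x speedup), while the v=1 cases stay around 1.7-2.0 GB/s.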

@github-actions github-actions bot added testing Everything test related ggml changes relating to the ggml tensor library for machine learning labels Sep 12, 2025
@duduta duduta force-pushed the optimize-ggml-cpu-norm branch from c38f290 to df16d10 Compare September 13, 2025 15:27
@duduta duduta requested a review from taronaeo September 13, 2025 20:59
@duduta duduta force-pushed the optimize-ggml-cpu-norm branch from af54a93 to 2853109 Compare September 22, 2025 11:53
@duduta duduta force-pushed the optimize-ggml-cpu-norm branch 2 times, most recently from 654f1e6 to fc759be Compare September 22, 2025 14:46

duduta commented Sep 22, 2025

Thank you @ggerganov for your review. I applied your suggestions and rebased.

@duduta duduta requested a review from slaren as a code owner September 22, 2025 15:34

CISC commented Oct 7, 2025

This looks ready to merge; was it forgotten?


taronaeo commented Oct 7, 2025

I was actually waiting to see if @/slaren had any comments on this, since he is the code owner. But yeah, if there are no further comments I'll merge it tomorrow morning.

@duduta duduta force-pushed the optimize-ggml-cpu-norm branch from 04d56f9 to 31eb135 Compare October 7, 2025 16:23

@ggerganov ggerganov left a comment

Minor whitespace cleanup

@duduta duduta force-pushed the optimize-ggml-cpu-norm branch from 7dae677 to 7e986ec Compare October 8, 2025 10:36
duduta and others added 3 commits October 8, 2025 13:38
  • rename function
  • add endif macro comment

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Co-authored-by: Aaron Teo <taronaeo@gmail.com>

CISC commented Oct 9, 2025

@duduta Please re-apply the whitespace cleanup suggestions I just unresolved; then I think we are good to merge.

@CISC CISC merged commit 1deee0f into ggml-org:master Oct 9, 2025
69 checks passed
anyshu pushed a commit to anyshu/llama.cpp that referenced this pull request Oct 10, 2025
* master: (113 commits)
  webui: updated the chat service to only include max_tokens in the req… (ggml-org#16489)
  cpu : optimize the ggml NORM operation (ggml-org#15953)
  server : host-memory prompt caching (ggml-org#16391)
  No markdown in cot (ggml-org#16483)
  model-conversion : add support for SentenceTransformers (ggml-org#16387)
  ci: add ARM64 Kleidiai build and test support (ggml-org#16462)
  CANN: Improve ACL graph matching (ggml-org#16166)
  kleidiai: kernel interface refactoring (ggml-org#16460)
  [SYCL] refactor soft_max, add soft_max_back (ggml-org#16472)
  model: EmbeddingGemma Adding Support for SentenceTransformers Dense Modules (ggml-org#16367)
  refactor: centralize CoT parsing in backend for streaming mode (ggml-org#16394)
  Disable CUDA host buffers on integrated GPUs (ggml-org#16308)
  server : fix cancel pending task (ggml-org#16467)
  metal : mark FA blocks (ggml-org#16372)
  server : improve context checkpoint logic (ggml-org#16440)
  ggml webgpu: profiling, CI updates, reworking of command submission (ggml-org#16452)
  llama : support LiquidAI LFM2-MoE hybrid model (ggml-org#16464)
  server : add `/v1/health` endpoint (ggml-org#16461)
  webui : added download action (ggml-org#13552) (ggml-org#16282)
  presets : fix pooling param for embedding models (ggml-org#16455)
  ...

LostRuins commented Oct 13, 2025

Hello, it seems this PR causes audio degradation when used with TTS.cpp running Kokoro. Reverting all changes made to ops.cpp in this commit resolves the issue, so I suspect there are scenarios where it returns significantly different outputs than before.

Using an Intel i9 13980hx CPU (avx2 enabled, no avx512)

Audio before: before.mp4

Audio after: after.mp4

LostRuins added a commit to LostRuins/koboldcpp that referenced this pull request Oct 13, 2025
ggerganov commented:

@LostRuins Should be fixed in #16558

LostRuins added a commit to LostRuins/koboldcpp that referenced this pull request Oct 13, 2025
LostRuins commented:

Thanks, it seems to be working from a quick test.

duduta commented Oct 13, 2025

Sorry, @LostRuins, and thanks @ggerganov for fixing this.

LostRuins commented:

All good, I'll let you know if any other issues come up.

@duduta duduta deleted the optimize-ggml-cpu-norm branch October 14, 2025 05:20
yael-works pushed a commit to yael-works/llama.cpp that referenced this pull request Oct 15, 2025
* ggml-cpu: optimize norm operation to use intrinsics or Accelerate

  rename function

  add endif macro comment

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Co-authored-by: Aaron Teo <taronaeo@gmail.com>

* implement s390x SIMD suggested by @taronaeo

* add TODO comment

* tidy up spaces

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Co-authored-by: Aaron Teo <taronaeo@gmail.com>