@duduta duduta commented Sep 12, 2025

This PR optimizes the ggml norm operation.

  • use ggml_vec_sum_f32 instead of summing in a loop
  • if available, use Accelerate to compute the variance
  • implement ggml_vec_centered_variance_f32 with intrinsics to compute the variance
  • add performance tests for NORM to test-backend-ops

The implementation of ggml_vec_centered_variance_f32 mirrors
ggml_vec_soft_max_f32 for consistency.
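
For context, here is a minimal sketch of the two variance paths described above. This is illustrative only, not the PR's actual code; the helper names and the way the GGML_USE_ACCELERATE guard is used here are assumptions:

```c
#include <stddef.h>
#if defined(GGML_USE_ACCELERATE)
#include <Accelerate/Accelerate.h>
#endif

// Centered variance: sum((x[i] - mean)^2) / n for a precomputed mean.
// ggml_vec_centered_variance_f32 vectorizes a loop like this with
// intrinsics, mirroring the dispatch style of ggml_vec_soft_max_f32.
static float centered_variance_f32(size_t n, const float * x, float mean) {
    float sum = 0.0f;
    for (size_t i = 0; i < n; ++i) {
        const float d = x[i] - mean;
        sum += d * d;
    }
    return sum / (float) n;
}

#if defined(GGML_USE_ACCELERATE)
// One possible Accelerate route (also illustrative): vDSP_measqv yields
// the mean of squares, so variance = E[x^2] - mean^2. Note the
// uncentered form can lose precision when mean^2 dominates the spread.
static float variance_accelerate_f32(size_t n, const float * x, float mean) {
    float meansq = 0.0f;
    vDSP_measqv(x, 1, &meansq, (vDSP_Length) n);
    return meansq - mean * mean;
}
#endif
```

The mean itself comes from ggml_vec_sum_f32 divided by the row length, replacing the previous open-coded accumulation loop.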

I tested on an AVX2-capable CPU:

Device description: Intel(R) Core(TM) i7-4750HQ CPU @ 2.00GHz
Device memory: 16384 MB (16384 MB free)
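
The perf cases can be reproduced like this (a sketch assuming a standard CMake build; the binary path may differ on other setups):

```sh
cmake -B build
cmake --build build --target test-backend-ops
./build/bin/test-backend-ops perf -b CPU -o NORM
```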

Results from `test-backend-ops perf -b CPU -o NORM`:

BEFORE:

  NORM(type=f32,ne=[64,5,4,3],v=0,eps=0.000000):    262144 runs -  3.82 us/run -  30 kB/run -   7.48 GB/s
  NORM(type=f32,ne=[64,5,4,3],v=1,eps=0.000000):    843673 runs -  1.19 us/run -   2 kB/run -   1.71 GB/s
  NORM(type=f32,ne=[64,5,4,3],v=0,eps=0.000001):    270336 runs -  3.78 us/run -  30 kB/run -   7.58 GB/s
  NORM(type=f32,ne=[64,5,4,3],v=1,eps=0.000001):    851864 runs -  1.19 us/run -   2 kB/run -   1.71 GB/s
  NORM(type=f32,ne=[64,5,4,3],v=0,eps=0.000100):    270336 runs -  3.75 us/run -  30 kB/run -   7.64 GB/s
  NORM(type=f32,ne=[64,5,4,3],v=1,eps=0.000100):    909201 runs -  1.10 us/run -   2 kB/run -   1.84 GB/s
  NORM(type=f32,ne=[64,5,4,3],v=0,eps=0.100000):    253952 runs -  3.94 us/run -  30 kB/run -   7.25 GB/s
  NORM(type=f32,ne=[64,5,4,3],v=1,eps=0.100000):    909201 runs -  1.11 us/run -   2 kB/run -   1.83 GB/s

AFTER:

  NORM(type=f32,ne=[64,5,4,3],v=0,eps=0.000000):    450560 runs -  2.22 us/run -  30 kB/run -  12.89 GB/s
  NORM(type=f32,ne=[64,5,4,3],v=1,eps=0.000000):    851864 runs -  1.19 us/run -   2 kB/run -   1.70 GB/s
  NORM(type=f32,ne=[64,5,4,3],v=0,eps=0.000001):    450560 runs -  2.23 us/run -  30 kB/run -  12.86 GB/s
  NORM(type=f32,ne=[64,5,4,3],v=1,eps=0.000001):    991111 runs -  1.01 us/run -   2 kB/run -   2.00 GB/s
  NORM(type=f32,ne=[64,5,4,3],v=0,eps=0.000100):    458752 runs -  2.20 us/run -  30 kB/run -  12.99 GB/s
  NORM(type=f32,ne=[64,5,4,3],v=1,eps=0.000100):    991111 runs -  1.02 us/run -   2 kB/run -   1.99 GB/s
  NORM(type=f32,ne=[64,5,4,3],v=0,eps=0.100000):    450560 runs -  2.25 us/run -  30 kB/run -  12.74 GB/s
  NORM(type=f32,ne=[64,5,4,3],v=1,eps=0.100000):    974729 runs -  1.03 us/run -   2 kB/run -   1.96 GB/s
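
In short: the v=0 cases go from roughly 7.5 GB/s to roughly 12.9 GB/s (about a 1.7x speedup), while the v=1 cases stay around 1.7-2.0 GB/s.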

@github-actions github-actions bot added testing Everything test related ggml changes relating to the ggml tensor library for machine learning labels Sep 12, 2025
@duduta duduta force-pushed the optimize-ggml-cpu-norm branch from c38f290 to df16d10 Compare September 13, 2025 15:27
@duduta duduta requested a review from taronaeo September 13, 2025 20:59
@duduta duduta force-pushed the optimize-ggml-cpu-norm branch from af54a93 to 2853109 Compare September 22, 2025 11:53
@duduta duduta force-pushed the optimize-ggml-cpu-norm branch 2 times, most recently from 654f1e6 to fc759be Compare September 22, 2025 14:46

duduta commented Sep 22, 2025

Thank you @ggerganov for your review. I applied your suggestions and rebased.

@duduta duduta requested a review from slaren as a code owner September 22, 2025 15:34

CISC commented Oct 7, 2025

This looks ready to merge; was it forgotten?


taronaeo commented Oct 7, 2025

I was actually waiting to see if @/slaren had any comments on this, since he is the code owner. But yeah, if there are no further comments I'll merge it tomorrow morning.

@duduta duduta force-pushed the optimize-ggml-cpu-norm branch from 04d56f9 to 31eb135 Compare October 7, 2025 16:23

@ggerganov ggerganov left a comment

Minor whitespace cleanup

@duduta duduta force-pushed the optimize-ggml-cpu-norm branch from 7dae677 to 7e986ec Compare October 8, 2025 10:36
duduta and others added 3 commits October 8, 2025 13:38
  • rename function
  • add endif macro comment

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Co-authored-by: Aaron Teo <taronaeo@gmail.com>

CISC commented Oct 9, 2025

@duduta Please re-apply the whitespace cleanup suggestions I just unresolved; then I think we are good to merge.

@CISC CISC merged commit 1deee0f into ggml-org:master Oct 9, 2025
69 checks passed
anyshu pushed a commit to anyshu/llama.cpp that referenced this pull request Oct 10, 2025
* master: (113 commits)
  webui: updated the chat service to only include max_tokens in the req… (ggml-org#16489)
  cpu : optimize the ggml NORM operation (ggml-org#15953)
  server : host-memory prompt caching (ggml-org#16391)
  No markdown in cot (ggml-org#16483)
  model-conversion : add support for SentenceTransformers (ggml-org#16387)
  ci: add ARM64 Kleidiai build and test support (ggml-org#16462)
  CANN: Improve ACL graph matching (ggml-org#16166)
  kleidiai: kernel interface refactoring (ggml-org#16460)
  [SYCL] refactor soft_max, add soft_max_back (ggml-org#16472)
  model: EmbeddingGemma Adding Support for SentenceTransformers Dense Modules (ggml-org#16367)
  refactor: centralize CoT parsing in backend for streaming mode (ggml-org#16394)
  Disable CUDA host buffers on integrated GPUs (ggml-org#16308)
  server : fix cancel pending task (ggml-org#16467)
  metal : mark FA blocks (ggml-org#16372)
  server : improve context checkpoint logic (ggml-org#16440)
  ggml webgpu: profiling, CI updates, reworking of command submission (ggml-org#16452)
  llama : support LiquidAI LFM2-MoE hybrid model (ggml-org#16464)
  server : add `/v1/health` endpoint (ggml-org#16461)
  webui : added download action (ggml-org#13552) (ggml-org#16282)
  presets : fix pooling param for embedding models (ggml-org#16455)
  ...

LostRuins commented Oct 13, 2025

Hello, it seems this PR causes audio degradation when used with TTS.cpp running Kokoro. Reverting all changes made to ops.cpp in this commit resolves the issue, so I suspect there are scenarios where it returns significantly different outputs than before.

Using an Intel i9 13980hx CPU (avx2 enabled, no avx512)

Audio before: before.mp4

Audio after: after.mp4

LostRuins added a commit to LostRuins/koboldcpp that referenced this pull request Oct 13, 2025
ggerganov commented:

@LostRuins Should be fixed in #16558

LostRuins added a commit to LostRuins/koboldcpp that referenced this pull request Oct 13, 2025
LostRuins commented:

Thanks, it seems to be working from a quick test.

duduta commented Oct 13, 2025

Sorry, @LostRuins, and thanks @ggerganov for fixing this.

LostRuins commented:

All good, I'll let you know if any other issues come up.

@duduta duduta deleted the optimize-ggml-cpu-norm branch October 14, 2025 05:20
yael-works pushed a commit to yael-works/llama.cpp that referenced this pull request Oct 15, 2025
* ggml-cpu: optimize norm operation to use intrinsics or Accelerate

  rename function

  add endif macro comment

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Co-authored-by: Aaron Teo <taronaeo@gmail.com>

* implement s390x SIMD suggested by @taronaeo

* add TODO comment

* tidy up spaces

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Co-authored-by: Aaron Teo <taronaeo@gmail.com>