Add SIMD implementation of ggml_compute_forward_rms_norm_f32 #450

slaren · 2023-03-24T01:53:09Z

This uses the GGML SIMD macros, so it should hopefully work on different architectures, but I have only tested it with AVX2.

Don't expect any meaningful performance improvement; the function is not very hot.

Perplexity after this change (7B, q4_0): 6.5980

Full run output

./main -m ./models/7B/ggml-model-q4_0.bin --perplexity -t 12 -f wikitext-2-raw/wiki.test.raw
main: seed = 1679622068
llama_model_load: loading model from './models/7B/ggml-model-q4_0.bin' - please wait ...
llama_model_load: n_vocab = 32000
llama_model_load: n_ctx = 512
llama_model_load: n_embd = 4096
llama_model_load: n_mult = 256
llama_model_load: n_head = 32
llama_model_load: n_layer = 32
llama_model_load: n_rot = 128
llama_model_load: f16 = 2
llama_model_load: n_ff = 11008
llama_model_load: n_parts = 1
llama_model_load: ggml ctx size = 4529.34 MB
llama_model_load: memory_size = 512.00 MB, n_mem = 16384
llama_model_load: loading model part 1/1 from './models/7B/ggml-model-q4_0.bin'
llama_model_load: .................................... done
llama_model_load: model size = 4017.27 MB / num tensors = 291

system_info: n_threads = 12 / 16 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | VSX = 0 |
perplexity : calculating perplexity over 655 chunks
44.42 seconds per pass - ETA 8.08 hours
[1]4.7114,[2]5.1906,[3]6.0631,[4]6.7550,[5]6.8446,[6]6.8386,[7]7.0409,[8]7.1527,[9]7.5598,[10]7.7969,[11]8.0295,[12]8.0414,[13]7.9639,[14]8.0505,[15]8.3085,[16]7.8944,[17]7.7598,[18]7.7382,[19]7.3351,[20]7.3199,[21]7.2169,[22]7.0259,[23]6.9849,[24]6.8973,[25]6.9077,[26]6.7192,[27]6.5322,[28]6.4300,[29]6.3396,[30]6.1645,[31]6.1348,[32]6.1517,[33]6.0782,[34]6.1209,[35]6.1473,[36]6.1898,[37]6.1899,[38]6.2031,[39]6.2446,[40]6.3025,[41]6.3138,[42]6.3502,[43]6.3025,[44]6.3605,[45]6.3612,[46]6.3311,[47]6.3573,[48]6.3292,[49]6.3372,[50]6.2981,[51]6.2917,[52]6.2815,[53]6.3305,[54]6.3116,[55]6.2910,[56]6.3316,[57]6.3587,[58]6.3818,[59]6.3926,[60]6.4389,[61]6.4274,[62]6.4922,[63]6.5292,[64]6.5446,[65]6.5959,[66]6.6112,[67]6.6276,[68]6.6449,[69]6.6738,[70]6.7064,[71]6.7333,[72]6.7657,[73]6.8332,[74]6.8364,[75]6.8536,[76]6.8687,[77]6.8844,[78]6.8698,[79]6.8997,[80]6.8911,[81]6.9051,[82]6.9124,[83]6.8562,[84]6.8423,[85]6.8377,[86]6.8144,[87]6.7518,[88]6.7245,[89]6.7046,[90]6.6873,[91]6.7184,[92]6.7128,[93]6.7157,[94]6.7149,[95]6.7477,[96]6.7466,[97]6.7416,[98]6.7350,[99]6.7177,[100]6.7154,[101]6.7402,[102]6.7337,[103]6.7586,[104]6.7642,[105]6.7633,[106]6.7790,[107]6.7792,[108]6.7870,[109]6.7803,[110]6.7725,[111]6.7956,[112]6.8164,[113]6.8175,[114]6.8138,[115]6.8237,[116]6.8156,[117]6.8208,[118]6.8504,[119]6.8723,[120]6.9123,[121]6.9303,[122]6.9570,[123]6.9971,[124]7.0147,[125]7.0044,[126]7.0463,[127]7.0859,[128]7.1153,[129]7.0964,[130]7.1063,[131]7.1002,[132]7.0924,[133]7.0800,[134]7.0907,[135]7.0879,[136]7.0736,[137]7.0657,[138]7.0504,[139]7.0392,[140]7.0361,[141]7.0095,[142]7.0062,[143]6.9775,[144]6.9563,[145]6.9482,[146]6.9339,[147]6.9402,[148]6.9417,[149]6.9379,[150]6.9329,[151]6.9353,[152]6.9261,[153]6.9078,[154]6.8983,[155]6.9044,[156]6.9000,[157]6.9182,[158]6.9210,[159]6.9267,[160]6.9309,[161]6.9435,[162]6.9103,[163]6.8968,[164]6.8695,[165]6.8349,[166]6.8041,[167]6.7628,[168]6.7285,[169]6.7135,[170]6.7002,[171]6.6698,[172]6.6515,[173]6.6329,[174]6.5997,[175]6.5754,[176]6.5626,[177]6.5421,[178]6.5181,[179]6.5002,[180]6.4903,[181]6.4657,[182]6.4461,[183]6.4302,[184]6.4303,[185]6.4221,[186]6.4243,[187]6.4300,[188]6.4266,[189]6.4453,[190]6.4473,[191]6.4681,[192]6.4844,[193]6.5029,[194]6.5150,[195]6.5373,[196]6.5548,[197]6.5795,[198]6.5967,[199]6.5986,[200]6.6019,[201]6.5977,[202]6.6201,[203]6.6265,[204]6.6283,[205]6.6391,[206]6.6467,[207]6.6421,[208]6.6507,[209]6.6569,[210]6.6620,[211]6.6745,[212]6.6828,[213]6.6932,[214]6.6982,[215]6.7015,[216]6.7164,[217]6.7344,[218]6.7478,[219]6.7485,[220]6.7446,[221]6.7388,[222]6.7352,[223]6.7233,[224]6.7170,[225]6.7116,[226]6.7334,[227]6.7449,[228]6.7513,[229]6.7583,[230]6.7549,[231]6.7718,[232]6.7580,[233]6.7394,[234]6.7227,[235]6.7078,[236]6.6991,[237]6.6887,[238]6.6921,[239]6.6745,[240]6.6629,[241]6.6677,[242]6.6716,[243]6.6695,[244]6.6565,[245]6.6527,[246]6.6399,[247]6.6273,[248]6.6190,[249]6.6171,[250]6.6223,[251]6.6147,[252]6.6105,[253]6.6001,[254]6.5960,[255]6.5826,[256]6.5625,[257]6.5502,[258]6.5418,[259]6.5396,[260]6.5309,[261]6.5267,[262]6.5210,[263]6.5152,[264]6.4968,[265]6.4964,[266]6.4952,[267]6.4883,[268]6.5001,[269]6.4982,[270]6.4991,[271]6.5069,[272]6.5119,[273]6.5104,[274]6.5121,[275]6.5214,[276]6.5264,[277]6.5448,[278]6.5559,[279]6.5643,[280]6.5684,[281]6.5794,[282]6.5856,[283]6.6008,[284]6.6084,[285]6.6175,[286]6.6323,[287]6.6317,[288]6.6380,[289]6.6283,[290]6.6129,[291]6.5968,[292]6.5802,[293]6.5651,[294]6.5679,[295]6.5664,[296]6.5702,[297]6.5689,[298]6.5720,[299]6.5691,[300]6.5572,[301]6.5569,[302]6.5489,[303]6.5398,[304]6.5298,[305]6.5279,[306]6.5141,[307]6.5170,[308]6.5205,[309]6.5038,[310]6.4971,[311]6.4907,[312]6.4925,[313]6.4870,[314]6.4867,[315]6.4689,[316]6.4651,[317]6.4476,[318]6.4244,[319]6.4370,[320]6.4510,[321]6.4545,[322]6.4497,[323]6.4427,[324]6.4403,[325]6.4499,[326]6.4502,[327]6.4524,[328]6.4572,[329]6.4636,[330]6.4661,[331]6.4793,[332]6.4760,[333]6.4837,[334]6.4771,[335]6.4699,[336]6.4734,[337]6.4695,[338]6.4683,[339]6.4627,[340]6.4580,[341]6.4661,[342]6.4688,[343]6.4744,[344]6.4746,[345]6.4746,[346]6.4714,[347]6.4760,[348]6.4803,[349]6.4819,[350]6.4784,[351]6.4788,[352]6.4786,[353]6.4732,[354]6.4738,[355]6.4794,[356]6.4822,[357]6.4782,[358]6.4877,[359]6.4910,[360]6.4871,[361]6.4870,[362]6.4944,[363]6.5066,[364]6.5130,[365]6.5190,[366]6.5199,[367]6.5286,[368]6.5261,[369]6.5274,[370]6.5282,[371]6.5219,[372]6.5270,[373]6.5327,[374]6.5307,[375]6.5295,[376]6.5383,[377]6.5327,[378]6.5349,[379]6.5415,[380]6.5315,[381]6.5270,[382]6.5203,[383]6.5184,[384]6.5177,[385]6.5165,[386]6.5160,[387]6.5152,[388]6.5104,[389]6.5041,[390]6.4966,[391]6.4883,[392]6.4841,[393]6.4827,[394]6.4852,[395]6.4835,[396]6.4754,[397]6.4834,[398]6.4876,[399]6.4972,[400]6.4971,[401]6.4985,[402]6.4995,[403]6.5011,[404]6.5080,[405]6.4991,[406]6.4954,[407]6.4952,[408]6.4959,[409]6.5087,[410]6.5207,[411]6.5333,[412]6.5502,[413]6.5618,[414]6.5694,[415]6.5756,[416]6.5837,[417]6.5978,[418]6.6021,[419]6.6099,[420]6.6196,[421]6.6315,[422]6.6367,[423]6.6440,[424]6.6563,[425]6.6661,[426]6.6733,[427]6.6780,[428]6.6866,[429]6.6908,[430]6.7000,[431]6.7147,[432]6.7190,[433]6.7174,[434]6.7120,[435]6.7125,[436]6.7151,[437]6.7250,[438]6.7328,[439]6.7290,[440]6.7280,[441]6.7224,[442]6.7208,[443]6.7218,[444]6.7228,[445]6.7206,[446]6.7228,[447]6.7260,[448]6.7301,[449]6.7274,[450]6.7279,[451]6.7233,[452]6.7121,[453]6.7037,[454]6.6979,[455]6.6985,[456]6.7033,[457]6.7055,[458]6.7031,[459]6.7040,[460]6.7135,[461]6.7108,[462]6.7094,[463]6.7146,[464]6.7136,[465]6.7108,[466]6.7027,[467]6.7036,[468]6.7038,[469]6.7060,[470]6.7068,[471]6.7016,[472]6.7068,[473]6.7008,[474]6.7029,[475]6.6973,[476]6.6996,[477]6.6927,[478]6.6921,[479]6.6996,[480]6.7051,[481]6.7071,[482]6.7030,[483]6.6990,[484]6.7020,[485]6.7000,[486]6.6941,[487]6.6946,[488]6.6929,[489]6.6875,[490]6.6843,[491]6.6814,[492]6.6756,[493]6.6726,[494]6.6708,[495]6.6707,[496]6.6675,[497]6.6622,[498]6.6607,[499]6.6553,[500]6.6454,[501]6.6386,[502]6.6379,[503]6.6377,[504]6.6279,[505]6.6308,[506]6.6315,[507]6.6257,[508]6.6216,[509]6.6201,[510]6.6240,[511]6.6291,[512]6.6326,[513]6.6340,[514]6.6411,[515]6.6351,[516]6.6339,[517]6.6345,[518]6.6341,[519]6.6370,[520]6.6400,[521]6.6418,[522]6.6447,[523]6.6456,[524]6.6526,[525]6.6567,[526]6.6575,[527]6.6597,[528]6.6540,[529]6.6548,[530]6.6494,[531]6.6476,[532]6.6525,[533]6.6549,[534]6.6524,[535]6.6553,[536]6.6498,[537]6.6471,[538]6.6526,[539]6.6534,[540]6.6576,[541]6.6587,[542]6.6596,[543]6.6612,[544]6.6628,[545]6.6606,[546]6.6614,[547]6.6565,[548]6.6504,[549]6.6504,[550]6.6469,[551]6.6433,[552]6.6412,[553]6.6366,[554]6.6341,[555]6.6310,[556]6.6309,[557]6.6336,[558]6.6299,[559]6.6293,[560]6.6289,[561]6.6291,[562]6.6272,[563]6.6273,[564]6.6315,[565]6.6336,[566]6.6331,[567]6.6310,[568]6.6312,[569]6.6295,[570]6.6322,[571]6.6327,[572]6.6333,[573]6.6334,[574]6.6299,[575]6.6300,[576]6.6299,[577]6.6288,[578]6.6262,[579]6.6271,[580]6.6203,[581]6.6161,[582]6.6151,[583]6.6154,[584]6.6156,[585]6.6077,[586]6.6004,[587]6.6010,[588]6.6060,[589]6.6120,[590]6.6149,[591]6.6167,[592]6.6153,[593]6.6115,[594]6.6123,[595]6.6099,[596]6.6143,[597]6.6116,[598]6.6080,[599]6.6104,[600]6.6103,[601]6.6091,[602]6.6112,[603]6.6140,[604]6.6152,[605]6.6187,[606]6.6208,[607]6.6196,[608]6.6157,[609]6.6167,[610]6.6204,[611]6.6183,[612]6.6212,[613]6.6174,[614]6.6119,[615]6.6039,[616]6.6071,[617]6.6007,[618]6.5950,[619]6.5888,[620]6.5736,[621]6.5660,[622]6.5641,[623]6.5657,[624]6.5661,[625]6.5661,[626]6.5648,[627]6.5672,[628]6.5677,[629]6.5671,[630]6.5709,[631]6.5772,[632]6.5829,[633]6.5810,[634]6.5845,[635]6.5851,[636]6.5819,[637]6.5787,[638]6.5816,[639]6.5786,[640]6.5796,[641]6.5797,[642]6.5868,[643]6.5890,[644]6.5903,[645]6.5879,[646]6.5922,[647]6.5892,[648]6.5903,[649]6.5902,[650]6.5939,[651]6.5999,[652]6.6006,[653]6.6051,[654]6.5984,[655]6.5980,

ggerganov · 2023-03-24T15:07:36Z

I think that if the performance does not change, it is not worth making the code too cumbersome.
I will think about this some more and maybe merge at a later stage.

slaren · 2023-04-29T16:17:04Z

Closing for now since it doesn't look like this is going to be useful any time soon; there are many other ops that would be more important to optimize than this.

Add AVX2 implementation of ggml_compute_forward_rms_norm_f32

acc36eb

slaren marked this pull request as ready for review March 24, 2023 12:27

anzz1 added the enhancement (New feature or request), performance (Speed related topics), and hardware (Hardware related) labels Mar 27, 2023

slaren closed this Apr 29, 2023

Bearsaerker mentioned this pull request Mar 12, 2025

Eval bug: Gemma 3 extremly slow prompt processing when using quantized kv cache. #12352

Open
