Avx512 perf improvements #996

cchudant · 2023-03-12T17:58:16Z

This PR brings a few things to the avx512 linalg kernels added in #989:

Kernels are now auto-generated based on mr/nr from the build.rs script
Wider kernels are available
Fixed: The benchmarks were not taking cold cache into account anymore. Turns out (0..10000000).collect::<Vec<_>>() gets optimized out completely by llvm now, meaning all my cold cache results were wrong :)
Added a benchmark that tests every avx512 kernel size, and a python script that a csv with the cache-cold throughputs for all the kernels
Kernel selection has been updated with the wider kernels and to account for the cold cache results
Various improvements on the assembly code - scatter gather ops

Kernel results

Results are in Gelem/s. from a cold start by ruining the cache beforehand and M=1024, K=1000
Intel(R) Xeon(R) Gold 6334 CPU @ 3.60GHz

future work

I think i'm pretty much done on the asm kernel part, I think I won't be able to squeeze out any more perf there

On a slightly higher level, border tile handling is still suboptimal: we can still improve the perf when N is low.
As you can see from the graph:

Gelem/s

Between N=14 and N=15, we have a big drop from 95 Gelem/s with the 32x14 kernel to 63 Gelem/s with the 23x8 kernel
This is because with the x14 kernel, we compute 13 useless elements when N=15 from the border tile of the C matrix.
This gets better as N gets bigger, as the waste ratio is lower.

Gelem/s

On a higher level,
@kali is working on collapsing consecutive dimensions in input matrices in matmuls.
I think this may solve why I'm still getting a run time of 6s instead of 300ms on my transformer models, so I don't think I'll start bothering with the border tiles just yet

Other than that, I don't think we're that far from being on par with onnxruntime perf! :)

…el selection

kali · 2023-03-13T12:14:10Z

Awesome work again @cchudant :)

I don't have much in terms of setup for double checking intel implementations... So... @tgolsson, would you like to do a review here ? Even better, check on some of your use-cases if this has unfavorable impact ?

tgolsson · 2023-03-13T13:32:37Z

Awesome! Yes; I'm quite busy rest of the day but I can look tomorrow! I haven't had time to test the AVX512 changes yet; so I'll try to do base->AVX512->this deltas!

tgolsson · 2023-03-14T11:53:22Z

FYI started looking at benchmarks but since we hadn't bumped from version 0.17 I've got a bit of work to do. Symbolication/batch dimensions have changed a bit so it's crashing in concretization. I'll do a review after lunch at least.

In case you have a thought @kali this is the issue:

[/home/ts/Repositories/cervo/crates/cervo-core/src/inferer/helpers.rs:38] &model.symbol_table = unk__16819 unk__16820 unk__16821 unk__16822 N
thread 'main' panicked at 'called `Result::unwrap()` on an `Err` value: Failed analyse for node #65 "model_4_concatenate_concat" InferenceConcat

Caused by:
    0: Infering facts
    1: Applying rule inputs[0].shape[0] == inputs[1].shape[0]
    2: Impossible to unify Sym(unk__16820) with Sym(unk__16819).', src/main.rs:66:62

Happens here on this upgrade PR: https://github.com/EmbarkStudios/cervo/pull/44/files#diff-4a7fb40e77a22ccba64ba761a0c31ab388127a6309b79e1c7832602c1755d3dcR39. If you want to try the code,

$ cd benchmarks/perf-test
$ cargo run batch-scaling 100 "/tmp/dump" -o ../../brains/test-large.onnx -b `seq -s, 1 24`

Will try batch-sizes 1..24 of all the various batching mechanisms cervo has.

kali · 2023-03-14T11:57:33Z

It looks to me like a typical case of overspecification of inputs and output in ONNX. (Prior to 0.19, tract was ignoring them, this was a bug. Now it's trying to make sense of them.) Try to cancel the the output_shapes: .with_output_facts(0, Default::default()) immediately after loading the onnx file, before it gets analysed

tgolsson · 2023-03-14T13:50:33Z

OK; now I've managed to get this to run. I've compared my current production (0.17) to 0.19.2¹, main, and this branch. You can see this below. This is a conv-stack of 32x32 followed an MLP [1024,1024]. This branch is definitely an improvement. But there's some weird ones compared to avx256 like getting much worse in the 6-wide case and then much better in the 7-wide case? Maybe I'll see the reason why in the code.

Starting review now!

It looks like something's fishy with version - main is on 0.19.3-pre, but latest release is 0.19.7? Not sure what to compare against. But I'm hoping there's not a huge difference between 0.19.2 and whatever is used here right now. ↩

cchudant · 2023-03-14T14:53:51Z

ah, that's kinda bad
my kernels are actually slower than avx256?
there is something wrong with my methodology then?

kali · 2023-03-14T15:06:31Z

Just ignore the pre-version tag on main. main will become 0.20. There is a maintenance branch for 0.19 called 0.19.x

tgolsson

Awesome work! This looks so much cleaner, and only a tiny bit more complex.

As noted, I'm a bit saddened by the performance. I've pointed out a few concerns, but I'm really not seeing anything super-obvious that would lead to this kind of slowdown. I'll see if I can repro your measurements with the script too :-)

tgolsson · 2023-03-14T14:36:33Z

linalg/src/x86_64_fma.rs

@@ -96,14 +100,71 @@ fn plug_fma(ops: &mut Ops) {
 fn plug_avx512f(ops: &mut Ops) {
    ops.mmv_f32 = Box::new(|m, _k| match m {
        Some(m) if m < 31 => mmm::avx512_mmm_f32_16x1::mmm(),
-        _ => mmm::avx512_mmm_f32_128x1::mmm(),
+        _ => mmm::avx512_mmm_f32_96x1::mmm(),


Why 96x1 over 128x1? :-)

(I see it looks better in the table now; which I guess makes sense why this is like it is. It's unexpected to me.)

For MMV, we're hitting very low throughput all across the board, regardless of M
My thinking was that according to my benchmarks, lowering 128 to 96 does not cause harm, and it could help with border kernels on matrices that are not multiples of 128

tgolsson · 2023-03-14T14:53:02Z

linalg/x86_64/avx512/f32_add_mat_mul.tmpliq

+    {% for cur_unroll_count in (0..unroll_min_1) %}
+
+        {% for i in (0..prefetches_to_issue_min_1) %}
+            prefetcht0 [rax + {{i | times:64}} + {{m_total_bytes | times:prefetch_dist}} + {{cur_unroll_count | times:m_total_bytes}}]


Have you measured that this is worth it? My tests on AVX256 saw perf losses from any type of prefetching.

I haven't mesured it, but it did not seem to cause harm according to vtune, and onnxruntime also does it
I'll measure it

tgolsson · 2023-03-14T15:01:28Z

linalg/x86_64/avx512/f32_add_mat_mul.tmpliq

+        {% endfor %}
+
+        {% for i in (0..nr_min_1) %}
+            vbroadcastss zmm{{col_reg}}, dword ptr [rcx + {{i | times:4}} + {{cur_unroll_count | times:n_total_bytes}}]


Have you tested using two alternating column regs? It'd maybe alleviate some register pressure.

I'll try, I thought this couldnt be an issue because of register renaming

It shouldn't, but you never know. :-)

tgolsson · 2023-03-14T15:08:00Z

linalg/x86_64/avx512/f32_store_clear.tmpliq

+
+    {% for row in (0..mr_arch_min_1) %}
+        kxnorw      k1,k1,k1            // set writemask to ones
+        vscatterdps [r9 + zmm31]{k1}, zmm{{col | times:mr_arch | plus:row}}  


I'm eyeing this line as a potential slowdown. vscatterdps is hugely expensive. Each row here is 43 uops!

Woah, I did not know about that

It's only 36 on Cannon Lake and Ice Lake, 43 on SKX; fwiw. https://www.agner.org/optimize/instruction_tables.pdf

tgolsson · 2023-03-14T15:13:59Z

Out of curiosity, have you ran this under anything like VTune or Advisor? I'm curious what they'd say and identify as concerns.

cchudant · 2023-03-14T15:29:42Z

Out of curiosity, have you ran this under anything like VTune or Advisor? I'm curious what they'd say and identify as concerns.

Yes, vtune. Obviously there is something I'm missing here though, so i'll get back on this.

Do you have a script so that I can try and reproduce your results?

tgolsson · 2023-03-14T15:49:01Z

Clone cervo on the ts/tract-0193 branch. You'll need to edit the root Cargo.toml and benchmarks/perf-test/Cargo.toml to override to whichever git you want to test with. To use from crates.io you'll also need to change the pinned versions in all other Cargo.toml-s to something like =0.19.2 since 0.19.3-pre isn't on crates.io

Then you can run it like this:

$ cd benchmarks/perf-test
$ cargo run batch-scaling 10000 "some-path/some-name.csv" -o ../../brains/test-large.onnx -b `seq -s, 1 24`

Once you've tested a bunch of PRs you can do

$ python3 ../python/compare_batchsize.py \
       some-path/first.csv,some-path/second.csv,... \
       test1label,test2label,... \
       10000 \
       some-path/output.png

The script should handle any number of comparisons, they're all relative to the first file.
E.g., mine was

python3 ../python/compare_batchsize.py \
      ../out/batchsize-avx256-017.csv,../out/batchsize-avx256.csv,../out/batchsize-avx512.csv,../out/batchsize-avx512-imp.csv \
      avx256-0.17,avx256-0.19.2,avx512,avx512b \
      10000 \
      ../out/batchsize-compare-dynamic.png

tgolsson · 2023-03-14T17:04:06Z

Nowhere near as pretty, but more importantly worse values than yours:

cchudant · 2023-03-15T20:35:43Z

Okay, I did not have that much time to look at this today, but I managed to replicate the regression locally, and looking at vtune, apparently i'm spending wayy too much time on the MMV kernels it looks like?

I was very worried that this was something I wouldnt be able to replicate, seeing that your kernel times are so different. Fortunately it seems that's not the case

I'll look further tomorrow, but something is fishy here and it's my fault :)

tgolsson · 2023-03-16T08:44:44Z

How easy would it be to remove loop unrolling? We could be cramping the instruction decoder/cache.

cchudant · 2023-03-16T13:42:39Z

Very easy, it's just setting the loop unroll variable to 1 and commenting the jumps
That's a good point, i'll test that when i get back to it

kali · 2023-04-18T07:08:01Z

Hey folks, I'm planning on cutting 0.20.x in a week or so. Do we have a path towards making these optimisations a part of it ?

cchudant and others added 7 commits March 12, 2023 16:42

kernel generation

d5f38b8

fix kernel gen, throughput benchmarks

5109d97

unroll on k

38b1791

better scatter/gather, feature for compiling all kernels, better kern…

c2304c1

…el selection

fix CI issues

9fa9601

fix compile issue

8ed3477

fix kernel definition on rust side

e8f82bc

tgolsson reviewed Mar 14, 2023

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Avx512 perf improvements #996

Avx512 perf improvements #996

cchudant commented Mar 12, 2023 •

edited

Loading

kali commented Mar 13, 2023

tgolsson commented Mar 13, 2023

tgolsson commented Mar 14, 2023 •

edited

Loading

kali commented Mar 14, 2023

tgolsson commented Mar 14, 2023

cchudant commented Mar 14, 2023

kali commented Mar 14, 2023

tgolsson left a comment

tgolsson Mar 14, 2023

tgolsson Mar 14, 2023

cchudant Mar 14, 2023

tgolsson Mar 14, 2023

cchudant Mar 14, 2023

tgolsson Mar 14, 2023

cchudant Mar 14, 2023

tgolsson Mar 14, 2023

tgolsson Mar 14, 2023

cchudant Mar 14, 2023

tgolsson Mar 14, 2023

tgolsson commented Mar 14, 2023

cchudant commented Mar 14, 2023

tgolsson commented Mar 14, 2023

tgolsson commented Mar 14, 2023

cchudant commented Mar 15, 2023

tgolsson commented Mar 16, 2023

cchudant commented Mar 16, 2023

kali commented Apr 18, 2023

Avx512 perf improvements #996

Are you sure you want to change the base?

Avx512 perf improvements #996

Conversation

cchudant commented Mar 12, 2023 • edited Loading

future work

kali commented Mar 13, 2023

tgolsson commented Mar 13, 2023

tgolsson commented Mar 14, 2023 • edited Loading

kali commented Mar 14, 2023

tgolsson commented Mar 14, 2023

Footnotes

cchudant commented Mar 14, 2023

kali commented Mar 14, 2023

tgolsson left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

tgolsson commented Mar 14, 2023

cchudant commented Mar 14, 2023

tgolsson commented Mar 14, 2023

tgolsson commented Mar 14, 2023

cchudant commented Mar 15, 2023

tgolsson commented Mar 16, 2023

cchudant commented Mar 16, 2023

kali commented Apr 18, 2023

cchudant commented Mar 12, 2023 •

edited

Loading

tgolsson commented Mar 14, 2023 •

edited

Loading