Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Avx512 perf improvements #996

Open
wants to merge 7 commits into
base: main
Choose a base branch
from
Open

Avx512 perf improvements #996

wants to merge 7 commits into from

Conversation

cchudant
Copy link
Contributor

@cchudant cchudant commented Mar 12, 2023

This PR brings a few things to the avx512 linalg kernels added in #989:

  • Kernels are now auto-generated based on mr/nr from the build.rs script
  • Wider kernels are available
  • Fixed: The benchmarks were not taking cold cache into account anymore. Turns out (0..10000000).collect::<Vec<_>>() gets optimized out completely by llvm now, meaning all my cold cache results were wrong :)
  • Added a benchmark that tests every avx512 kernel size, and a python script that a csv with the cache-cold throughputs for all the kernels
  • Kernel selection has been updated with the wider kernels and to account for the cold cache results
  • Various improvements on the assembly code - scatter gather ops

Kernel results
image
Results are in Gelem/s. from a cold start by ruining the cache beforehand and M=1024, K=1000
Intel(R) Xeon(R) Gold 6334 CPU @ 3.60GHz

future work

I think i'm pretty much done on the asm kernel part, I think I won't be able to squeeze out any more perf there

On a slightly higher level, border tile handling is still suboptimal: we can still improve the perf when N is low.
As you can see from the graph:
image
Gelem/s

Between N=14 and N=15, we have a big drop from 95 Gelem/s with the 32x14 kernel to 63 Gelem/s with the 23x8 kernel
This is because with the x14 kernel, we compute 13 useless elements when N=15 from the border tile of the C matrix.
This gets better as N gets bigger, as the waste ratio is lower.
image
Gelem/s

On a higher level,
@kali is working on collapsing consecutive dimensions in input matrices in matmuls.
I think this may solve why I'm still getting a run time of 6s instead of 300ms on my transformer models, so I don't think I'll start bothering with the border tiles just yet

Other than that, I don't think we're that far from being on par with onnxruntime perf! :)

@kali
Copy link
Collaborator

kali commented Mar 13, 2023

Awesome work again @cchudant :)

I don't have much in terms of setup for double checking intel implementations... So... @tgolsson, would you like to do a review here ? Even better, check on some of your use-cases if this has unfavorable impact ?

@tgolsson
Copy link
Contributor

Awesome! Yes; I'm quite busy rest of the day but I can look tomorrow! I haven't had time to test the AVX512 changes yet; so I'll try to do base->AVX512->this deltas!

@tgolsson
Copy link
Contributor

tgolsson commented Mar 14, 2023

FYI started looking at benchmarks but since we hadn't bumped from version 0.17 I've got a bit of work to do. Symbolication/batch dimensions have changed a bit so it's crashing in concretization. I'll do a review after lunch at least.

In case you have a thought @kali this is the issue:

[/home/ts/Repositories/cervo/crates/cervo-core/src/inferer/helpers.rs:38] &model.symbol_table = unk__16819 unk__16820 unk__16821 unk__16822 N
thread 'main' panicked at 'called `Result::unwrap()` on an `Err` value: Failed analyse for node #65 "model_4_concatenate_concat" InferenceConcat

Caused by:
    0: Infering facts
    1: Applying rule inputs[0].shape[0] == inputs[1].shape[0]
    2: Impossible to unify Sym(unk__16820) with Sym(unk__16819).', src/main.rs:66:62

Happens here on this upgrade PR: https://github.com/EmbarkStudios/cervo/pull/44/files#diff-4a7fb40e77a22ccba64ba761a0c31ab388127a6309b79e1c7832602c1755d3dcR39. If you want to try the code,

$ cd benchmarks/perf-test
$ cargo run batch-scaling 100 "/tmp/dump" -o ../../brains/test-large.onnx -b `seq -s, 1 24`

Will try batch-sizes 1..24 of all the various batching mechanisms cervo has.

@kali
Copy link
Collaborator

kali commented Mar 14, 2023

It looks to me like a typical case of overspecification of inputs and output in ONNX. (Prior to 0.19, tract was ignoring them, this was a bug. Now it's trying to make sense of them.) Try to cancel the the output_shapes: .with_output_facts(0, Default::default()) immediately after loading the onnx file, before it gets analysed

@tgolsson
Copy link
Contributor

OK; now I've managed to get this to run. I've compared my current production (0.17) to 0.19.21, main, and this branch. You can see this below. This is a conv-stack of 32x32 followed an MLP [1024,1024]. This branch is definitely an improvement. But there's some weird ones compared to avx256 like getting much worse in the 6-wide case and then much better in the 7-wide case? Maybe I'll see the reason why in the code.

avx512

Starting review now!

Footnotes

  1. It looks like something's fishy with version - main is on 0.19.3-pre, but latest release is 0.19.7? Not sure what to compare against. But I'm hoping there's not a huge difference between 0.19.2 and whatever is used here right now.

@cchudant
Copy link
Contributor Author

ah, that's kinda bad
my kernels are actually slower than avx256?
there is something wrong with my methodology then?

@kali
Copy link
Collaborator

kali commented Mar 14, 2023

Just ignore the pre-version tag on main. main will become 0.20. There is a maintenance branch for 0.19 called 0.19.x

Copy link
Contributor

@tgolsson tgolsson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome work! This looks so much cleaner, and only a tiny bit more complex.

As noted, I'm a bit saddened by the performance. I've pointed out a few concerns, but I'm really not seeing anything super-obvious that would lead to this kind of slowdown. I'll see if I can repro your measurements with the script too :-)

@@ -96,14 +100,71 @@ fn plug_fma(ops: &mut Ops) {
fn plug_avx512f(ops: &mut Ops) {
ops.mmv_f32 = Box::new(|m, _k| match m {
Some(m) if m < 31 => mmm::avx512_mmm_f32_16x1::mmm(),
_ => mmm::avx512_mmm_f32_128x1::mmm(),
_ => mmm::avx512_mmm_f32_96x1::mmm(),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why 96x1 over 128x1? :-)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(I see it looks better in the table now; which I guess makes sense why this is like it is. It's unexpected to me.)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For MMV, we're hitting very low throughput all across the board, regardless of M
My thinking was that according to my benchmarks, lowering 128 to 96 does not cause harm, and it could help with border kernels on matrices that are not multiples of 128

{% for cur_unroll_count in (0..unroll_min_1) %}

{% for i in (0..prefetches_to_issue_min_1) %}
prefetcht0 [rax + {{i | times:64}} + {{m_total_bytes | times:prefetch_dist}} + {{cur_unroll_count | times:m_total_bytes}}]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Have you measured that this is worth it? My tests on AVX256 saw perf losses from any type of prefetching.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I haven't mesured it, but it did not seem to cause harm according to vtune, and onnxruntime also does it
I'll measure it

{% endfor %}

{% for i in (0..nr_min_1) %}
vbroadcastss zmm{{col_reg}}, dword ptr [rcx + {{i | times:4}} + {{cur_unroll_count | times:n_total_bytes}}]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Have you tested using two alternating column regs? It'd maybe alleviate some register pressure.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll try, I thought this couldnt be an issue because of register renaming

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It shouldn't, but you never know. :-)


{% for row in (0..mr_arch_min_1) %}
kxnorw k1,k1,k1 // set writemask to ones
vscatterdps [r9 + zmm31]{k1}, zmm{{col | times:mr_arch | plus:row}}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm eyeing this line as a potential slowdown. vscatterdps is hugely expensive. Each row here is 43 uops!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Woah, I did not know about that

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's only 36 on Cannon Lake and Ice Lake, 43 on SKX; fwiw. https://www.agner.org/optimize/instruction_tables.pdf

@tgolsson
Copy link
Contributor

Out of curiosity, have you ran this under anything like VTune or Advisor? I'm curious what they'd say and identify as concerns.

@cchudant
Copy link
Contributor Author

Out of curiosity, have you ran this under anything like VTune or Advisor? I'm curious what they'd say and identify as concerns.

Yes, vtune. Obviously there is something I'm missing here though, so i'll get back on this.

Do you have a script so that I can try and reproduce your results?

@tgolsson
Copy link
Contributor

Clone cervo on the ts/tract-0193 branch. You'll need to edit the root Cargo.toml and benchmarks/perf-test/Cargo.toml to override to whichever git you want to test with. To use from crates.io you'll also need to change the pinned versions in all other Cargo.toml-s to something like =0.19.2 since 0.19.3-pre isn't on crates.io

Then you can run it like this:

$ cd benchmarks/perf-test
$ cargo run batch-scaling 10000 "some-path/some-name.csv" -o ../../brains/test-large.onnx -b `seq -s, 1 24`

Once you've tested a bunch of PRs you can do

$ python3 ../python/compare_batchsize.py \
       some-path/first.csv,some-path/second.csv,... \
       test1label,test2label,... \
       10000 \
       some-path/output.png

The script should handle any number of comparisons, they're all relative to the first file.
E.g., mine was

python3 ../python/compare_batchsize.py \
      ../out/batchsize-avx256-017.csv,../out/batchsize-avx256.csv,../out/batchsize-avx512.csv,../out/batchsize-avx512-imp.csv \
      avx256-0.17,avx256-0.19.2,avx512,avx512b \
      10000 \
      ../out/batchsize-compare-dynamic.png

@tgolsson
Copy link
Contributor

Nowhere near as pretty, but more importantly worse values than yours:
image

@cchudant
Copy link
Contributor Author

Okay, I did not have that much time to look at this today, but I managed to replicate the regression locally, and looking at vtune, apparently i'm spending wayy too much time on the MMV kernels it looks like?

I was very worried that this was something I wouldnt be able to replicate, seeing that your kernel times are so different. Fortunately it seems that's not the case

I'll look further tomorrow, but something is fishy here and it's my fault :)

@tgolsson
Copy link
Contributor

How easy would it be to remove loop unrolling? We could be cramping the instruction decoder/cache.

@cchudant
Copy link
Contributor Author

Very easy, it's just setting the loop unroll variable to 1 and commenting the jumps
That's a good point, i'll test that when i get back to it

@kali
Copy link
Collaborator

kali commented Apr 18, 2023

Hey folks, I'm planning on cutting 0.20.x in a week or so. Do we have a path towards making these optimisations a part of it ?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants