Enable exhaustive x86 TPP testing #832

alheinecke · 2023-11-15T05:03:44Z

This PR attempts to fix several transient issues/errors in the GEMM CI and adds support for SPR as a CI environment.

- improved allow list for ADL
- bug fix for binary on SKX machines

alheinecke · 2023-11-19T22:28:30Z

@egeor, @rengolin: please treat this PR with priority, as I also had to change settings on buildkite which conflict with other branches' testing. So we need to land it ASAP.

In a nutshell, these are the ideas after long pondering on how to fix the transient issues we have had now for ~1hr (we even discussed them in our F2F in March'23).

a) added all archs we support on x86; there are a handful of bugfixes in this PR to fix broken arch decisions for chips we don't have in our hands (SNB, etc.).
b) for layernorm I ran an exhaustive 100x100x100 grid under GCC and made GCC the CI compiler. The largest error was "0.006932419423397524933794", so I set the CI boundary to 0.007. I have full logs from this run if you guys have any better idea.
c) the GEMM tests were the toughest nut to crack as they had really bad norms from time to time. I played with several strategies and none of them led anywhere, as we want the freedom to reorder "K" in our GEMM kernels (we might want to add an env variable to switch that feature off, I will create an issue). So, I settled for the following fix:
I) when the check norm is outrageously high (meaning close to 1), we look at the eltwise inf-norm; if this is in check, we use the inf-norm instead of the check norm. This is normally the case for M=1-3, N=1-3 and K>60, so it is accumulation-chain related.
II) when this new norm from I) is bigger than 0.009, we assume an accumulation reorder issue. In this case we replay the current test with matrix A set to the identity matrix; B and C remain fully random. If the norm is still bigger than 0.009, we flag the test as a failure. If the issue was accumulation related, the norm should be 0.0, so the test automatically passes and we move on to the next test case.
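
The two steps in c) can be sketched roughly as below. This is only an illustration of the decision flow, not the code in gemm_kernel.c: the type and function names are hypothetical, the 0.9 cut-off for "close to 1" is an assumption, and "in check" is assumed to mean "not above the 0.009 bound". Only the 0.009 value comes from the description above.

```c
#include <stddef.h>

/* Only 0.009 is taken from the discussion; the 0.9 cut-off is an assumption. */
#define RETRY_THRESHOLD 0.009 /* above this we suspect an accumulation reorder */
#define CHECK_NORM_HIGH 0.9   /* assumed reading of "close to 1" */

/* Hypothetical container for the two norms of one test run. */
typedef struct {
  double check_norm; /* matrix-level check norm */
  double inf_norm;   /* max. error of the element-wise comparison */
} error_norms_t;

typedef enum { TEST_PASS = 0, TEST_RETRY_IDENTITY = 1, TEST_FAIL = 2 } verdict_t;

/* Step I: fall back to the inf-norm when the check norm is outrageously
 * high but the element-wise error itself is in check. */
static double select_norm(error_norms_t n) {
  if (n.check_norm >= CHECK_NORM_HIGH && n.inf_norm <= RETRY_THRESHOLD) {
    return n.inf_norm;
  }
  return n.check_norm;
}

/* Step II, first run: a norm above the threshold requests a replay with
 * A set to identity instead of failing right away. */
static verdict_t judge_first_run(error_norms_t n) {
  return (select_norm(n) > RETRY_THRESHOLD) ? TEST_RETRY_IDENTITY : TEST_PASS;
}

/* Step II, replay: with A = identity there is no accumulation chain left,
 * so a norm that is still above the threshold is treated as a failure. */
static verdict_t judge_identity_replay(error_norms_t n) {
  return (select_norm(n) > RETRY_THRESHOLD) ? TEST_FAIL : TEST_PASS;
}
```

A driver would call `judge_first_run` after the regular test and only set up the identity replay when it returns `TEST_RETRY_IDENTITY`.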

My ask now:
a) we need agreement that strategy c) from above is safe and sound.
b) please check my logic in gemm_kernel.c for the retry very carefully

rengolin

While I cannot comment on the instruction sequence and rerun strategy (I'll leave that for @egeor), I think it makes sense to try to control the execution and understand what is happening, because the inputs are random and we don't control the source.

One way we tried to control this in tpp-mlir is to generate a "reasonable" random tensor by getting a normal distribution around 0.2 and small standard deviation, clamped at [-1.0, 1.0]. We stopped having random crashes (infs and nans etc), but it doesn't "fix" everything, so I think what you're trying to do is still a valid strategy.
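
A minimal sketch of such a "reasonable" random init is shown below. It is not the tpp-mlir code: the function names are hypothetical, the standard deviation of 0.1 is an assumed value for "small", and the normal draw is approximated by a sum of twelve uniforms (Irwin-Hall) to stay within plain `rand()`.

```c
#include <stddef.h>
#include <stdlib.h>

/* Approximate standard normal draw: the sum of 12 uniforms in [0,1) minus 6
 * has mean 0 and variance 1 (Irwin-Hall approximation, not an exact normal). */
static double approx_std_normal(void) {
  double s = 0.0;
  int i;
  for (i = 0; i < 12; ++i) {
    s += (double)rand() / ((double)RAND_MAX + 1.0);
  }
  return s - 6.0;
}

/* Draw around `mean` with spread `stddev`, then clamp to [lo, hi]. */
static float clamped_normal(double mean, double stddev, double lo, double hi) {
  double v = mean + stddev * approx_std_normal();
  if (v < lo) v = lo;
  if (v > hi) v = hi;
  return (float)v;
}

/* Fill a buffer with values centered at 0.2 and clamped to [-1.0, 1.0].
 * The stddev of 0.1 is an assumption; the comment only says "small". */
static void fill_reasonable_random(float *buf, size_t n, unsigned int seed) {
  size_t i;
  srand(seed);
  for (i = 0; i < n; ++i) {
    buf[i] = clamped_normal(0.2, 0.1, -1.0, 1.0);
  }
}
```

Keeping the values in a narrow band avoids overflow to inf/NaN, but it does not remove rounding differences caused by a different accumulation order.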

I did add some comments on the code, but it may be just my poor understanding of the original code, so take them with a grain of salt.

samples/xgemm/gemm_kernel.c

src/generator_common_x86.c

alheinecke · 2023-11-20T00:10:10Z

> One way we tried to control this in tpp-mlir is to generate a "reasonable" random tensor by getting a normal distribution around 0.2 and small standard deviation, clamped at [-1.0, 1.0]. We stopped having random crashes (infs and nans etc), but it doesn't "fix" everything, so I think what you're trying to do is still a valid strategy.

Yepp, we have had that for a long time, and this is not about NaN/Inf etc. It's about having the tightest possible thresholds in magic numbers, especially for the low precision cases (sgemm/dgemm were never affected), mainly BF8/int8/int4/f16. For some degenerate cases (like M=N=1 and K very long), we see, depending on compiler and datatype, errors of up to 2.0 in matrix norms. However, they are not real: they are floating point accumulation errors, and with an output matrix of size 1 these norms explode (hence we go to the max value in inf_norm, which is the max. error in an eltwise comparison). For cases with moderate M and N and K still very long, now the matrix norm is equal to max inf norm... so we cannot do the same trick, e.g. matrix norm is 0.09 and max inf is 0.009... In this case we now set A to identity (in the retry added by this PR), as this removes any reorder in the accumulation chain. If we now get lower numbers, we assume there is no bug.
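
To illustrate why the identity replay removes the reorder effect, here is a small sketch with a plain scalar GEMM that can walk K forward or backward. It is not the library kernel, all names are hypothetical, and reversing K is only a stand-in for whatever reorder a generated kernel applies. With A set to identity, every output element receives exactly one non-zero product, so the K order cannot change the result.

```c
#include <stddef.h>

/* C = A * B for row-major float matrices, walking K forward or backward.
 * Reversing K stands in for an arbitrary accumulation reorder. */
static void gemm_k_order(const float *a, const float *b, float *c,
                         size_t m, size_t n, size_t k, int reverse_k) {
  size_t i, j, p;
  for (i = 0; i < m; ++i) {
    for (j = 0; j < n; ++j) {
      float acc = 0.0f;
      for (p = 0; p < k; ++p) {
        size_t kk = reverse_k ? (k - 1 - p) : p;
        acc += a[i * k + kk] * b[kk * n + j];
      }
      c[i * n + j] = acc;
    }
  }
}

/* Ones on the diagonal, zeros elsewhere (also for non-square m x k). */
static void set_identity(float *a, size_t m, size_t k) {
  size_t i, p;
  for (i = 0; i < m; ++i) {
    for (p = 0; p < k; ++p) {
      a[i * k + p] = (i == p) ? 1.0f : 0.0f;
    }
  }
}

/* Largest absolute element-wise difference between two buffers. */
static float max_abs_diff(const float *x, const float *y, size_t len) {
  float worst = 0.0f;
  size_t i;
  for (i = 0; i < len; ++i) {
    float d = x[i] - y[i];
    if (d < 0.0f) d = -d;
    if (d > worst) worst = d;
  }
  return worst;
}
```

For a general A, the forward and backward results can differ in the last bits because float addition is not associative; with A from `set_identity`, both orders only ever add zeros to a single value of B, so `max_abs_diff` between them is 0.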

The tests here are much more challenging to control than in tpp-mlir, as we run ~10 million (~60K per precision combination and arch) TPP executions for GEMM alone, on randomly drawn shapes between 1x1x1 and 100x100x100 with random values. When you go back in the commit history of this PR, you see I disabled K-reorder and everything passed right away. So the goal of the retry is to keep K-reorder enabled in unit tests without adding 100+ magic numbers for the shapes and datatypes we analyzed, and without setting the passing bar to 0.1 either... shape-based thresholds are not scalable. This PR "suppresses" such outliers automatically by taking, in this case, the error norm of the multiplication where A is the identity matrix.

egeor

Looks good to me (I also looked carefully at the retry logic in the gemm driver). For now this seems the best option given the idiosyncrasies of the acc chain reorders happening in the kernel. The only other way I can think of (as we also discussed privately) to deal with this without any specific init and retry would be: 1) add a way to query the library whether a kernel triggers the k-reorder/multiple accs code gen, and 2) have the driver in this case invoke a reference code that observes the multiple accumulators strategy/k acc. ordering. Still, this strategy is brittle...
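
A reference that observes a multiple-accumulator order could look roughly like the sketch below. This is purely illustrative: the function name, the round-robin assignment of K to accumulators and the sequential final reduction are assumptions, and the actual layout used by the generated kernels may differ, which is exactly why the approach is brittle.

```c
#include <stddef.h>

#define MAX_ACCS 8 /* hypothetical upper bound for this sketch */

/* Dot product over K using n_acc partial sums. Element p goes to
 * accumulator p % n_acc (assumed round-robin); the partial sums are
 * reduced sequentially at the end. */
static float dot_multi_acc(const float *a, const float *b, size_t k, size_t n_acc) {
  float acc[MAX_ACCS] = { 0.0f };
  float sum = 0.0f;
  size_t p, i;
  if (n_acc == 0 || n_acc > MAX_ACCS) {
    n_acc = 1; /* fall back to a single sequential accumulator */
  }
  for (p = 0; p < k; ++p) {
    acc[p % n_acc] += a[p] * b[p];
  }
  for (i = 0; i < n_acc; ++i) {
    sum += acc[i];
  }
  return sum;
}
```

With `n_acc = 1` this is the usual sequential reference; a driver following idea 2) would pick `n_acc` from whatever the library query in 1) reports.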

alheinecke added 2 commits November 14, 2023 19:05

added loser boundary for bf8f16f16f16 GEMM

fc57b2b

catching corner cases

c1c987b

alheinecke requested a review from egeor November 15, 2023 05:03

alheinecke added 16 commits November 14, 2023 21:17

more losened thresholds.... not happy

ea3ccd9

tightend acc threshold somewhat

14025c2

more threshold changes, added CPX & SPR allow lists

85922a5

threshold adjustment for brgemm

d99ec8b

added one more threshold case

1b173a2

fixing convert condition on CPX VL=256

48f19b2

slightly looser threshhold for bf8f16f16f32 gemm

008ee12

temp. disabling reorder of inner products

51ad17a

updated threshold

c99d136

updated threshold

a42109f

fixed eltwise CI

6ca2f26

- improved allow list for ADL - bug fix for binary on SKX machines

switched to spr-all parition for CI

64f3ee8

fixed arch boundary in unary test

18f1f37

SNB is WSM for eltwise

a4defa7

fixed SNB GEMM with fusion

28d2775

attempt to fix issues by addind retrys setting A to identity

a4a5acf

alheinecke changed the title ~~Fix GEMM CI~~ Enable exhaustive x86 TPP testing Nov 18, 2023

alheinecke added 5 commits November 18, 2023 00:33

more boundaries

b296f7c

update to retry logic

ff0def1

updated layernorm threshold

363cb8c

updated threshold

c54de8f

added scripts for bf16 / f16f16f132 for SRF and GNR testing

b60fd6a

alheinecke marked this pull request as ready for review November 19, 2023 22:12

alheinecke requested a review from rengolin November 19, 2023 22:28

rengolin reviewed Nov 19, 2023

samples/xgemm/gemm_kernel.c (4 resolved threads)

src/generator_common_x86.c (1 resolved thread)

alheinecke mentioned this pull request Nov 20, 2023

Fix Testing Infrastructure for eltwise,equation and GEMM TPPs #714

Open

added TODO for adding some non void return types to some functions in future

07a67f5

egeor approved these changes Nov 20, 2023

alheinecke merged commit 0d9be90 into main Nov 20, 2023
3 checks passed

alheinecke deleted the fix_gemm_ci branch November 20, 2023 21:16
