
Enable exhaustive x86 TPP testing #832

Merged
merged 24 commits into main from fix_gemm_ci on Nov 20, 2023
Conversation

alheinecke (Collaborator)

This PR attempts to fix several transient issues/errors in the GEMM CI and adds support for SPR as a CI environment.

@alheinecke alheinecke requested a review from egeor November 15, 2023 05:03
@alheinecke alheinecke changed the title Fix GEMM CI Enable exhaustive x86 TPP testing Nov 18, 2023
@alheinecke alheinecke marked this pull request as ready for review November 19, 2023 22:12
@alheinecke (Collaborator, Author)

@egeor, @rengolin: please treat this PR with priority, as I also had to change settings on Buildkite that conflict with other branches' testing. So we need to land it ASAP.

In a nutshell, these are the ideas after long pondering over the transient issues we have been seeing for some time now (we even discussed them at our F2F in March '23).

a) added all archs we support on x86; there are a handful of bugfixes in this PR to fix the broken arch selection for chips we don't have in our hands (SNB, etc.)
b) for layernorm I ran an exhaustive 100x100x100 grid under GCC and made GCC the CI compiler. The largest error was 0.006932419423397524933794, so I set the CI bound to 0.007. I have the full logs from this run if you have a better idea.
c) the GEMM tests were the toughest nut to crack, as they had really bad norms from time to time. I played with several strategies, and none of them led anywhere, since we want the freedom to reorder "K" in our GEMM kernels (we might want to add an env variable to switch that feature off; I will create an issue). So I settled for the following fix:
I) when the check norm is outrageously high (meaning close to 1), we look at the eltwise inf-norm; if that is in check, we use the inf-norm instead of the check norm. This is normally the case for M=1-3, N=1-3 and K>60, so it is accumulation-chain related.
II) when this new norm from I) is bigger than 0.009, we assume an accumulation-reorder issue. In this case we replay the current test with matrix A set to the identity matrix; B and C remain fully random. If the norm is still bigger than 0.009, we flag the test as a failure. If the error was accumulation-related, the norm should be 0.0, so the test automatically passes and we move on to the next test case.

My ask now:
a) we need agreement that strategy c) above is safe and sound.
b) please check my retry logic in gemm_kernel.c very carefully.

@alheinecke alheinecke requested a review from rengolin November 19, 2023 22:28
@rengolin (Contributor) left a comment


While I cannot comment on the instruction sequence and rerun strategy (I'll leave that for @egeor), I think it makes sense to try to control the execution and understand what is happening, since the inputs are random and we don't control the source.

One way we tried to control this in tpp-mlir is to generate a "reasonable" random tensor from a normal distribution around 0.2 with a small standard deviation, clamped to [-1.0, 1.0]. We stopped having random crashes (infs and nans, etc.), but it doesn't "fix" everything, so I think what you're trying to do is still a valid strategy.

I did add some comments on the code, but they may just reflect my poor understanding of the original code, so take them with a grain of salt.

Review comments (resolved): samples/xgemm/gemm_kernel.c (4), src/generator_common_x86.c (1)
@alheinecke (Collaborator, Author)

> One way we tried to control this in tpp-mlir is to generate a "reasonable" random tensor by getting a normal distribution around 0.2 and small standard deviation, clamped at [-1.0, 1.0]. We stopped having random crashes (infs and nans etc), but it doesn't "fix" everything, so I think what you're trying to do is still a valid strategy.

Yepp, we have had that for a long time, and this is not about NaN/Inf etc. It's about having the tightest possible thresholds as magic numbers, especially for the low-precision cases (sgemm/dgemm were never affected), mainly BF8/int8/int4/f16. For some degenerate cases (like M=N=1 and K very long), we see, depending on compiler and datatype, errors of up to 2.0 in the matrix norms. However, they are not real, as they are floating-point accumulation errors, but with an output matrix size of 1 these norms explode (hence we fall back to the max value in the inf-norm, which is the max error in an eltwise comparison). For cases with moderate M and N and K still very long, the matrix norm equals the max inf-norm, so we cannot do the same trick; e.g. the matrix norm is 0.09 and the max inf is 0.009. In this case we now set A to identity in the retry (with this PR), as this removes any reorder in the accumulation chain. If we then get lower numbers, we assume there is no bug.

The tests here are much more challenging to control than tpp-mlir, as we run ~10 million TPP executions (~60K per precision combination and arch) for GEMM alone, on randomly drawn shapes between 1x1x1 and 100x100x100 with random values. If you go back in the commit history of this PR, you see that I disabled the K-reorder and everything passed right away. So the goal of the retry is to keep K-reorder enabled in the unit tests without adding 100+ magic numbers for the shapes and datatypes we analyzed, and without setting the passing bar to 0.1 either; shape-based thresholds are not scalable. This PR "suppresses" such outliers automatically by taking, in that case, the error norm of the multiplication where A was the identity matrix.

@egeor (Collaborator) left a comment


Looks good to me (I looked carefully at the retry logic in the gemm driver as well). This seems, for now, the best option given the idiosyncrasies of the accumulation-chain reorders happening in the kernel. The only other way I can think of (as we also discussed privately) to deal with this without any specific init and retry would be: 1) a way to query the library whether a kernel triggers the k-reorder/multiple-accumulators code gen, and 2) in that case, the driver would invoke a reference code that observes the multiple-accumulators strategy/K accumulation ordering. Still, that strategy is brittle...

@alheinecke alheinecke merged commit 0d9be90 into main Nov 20, 2023
3 checks passed
@alheinecke alheinecke deleted the fix_gemm_ci branch November 20, 2023 21:16
3 participants