Enable exhaustive x86 TPP testing #832
- improved allow list for ADL
- bug fix for binary on SKX machines
@egeor, @rengolin: please treat this PR with priority, as I also had to change settings on buildkite which conflict with other branches' testing. So we need to land it ASAP. In a nutshell, these are the ideas after long pondering on fixing the transient issues we have had now for ~1hr (we even discussed them in our F2F in March '23). a) Added all archs we support on x86; there are a handful of bugfixes in this PR to fix broken arch decisions for chips we don't have in our hands (SNB, etc.). My ask now:
While I cannot comment on the instruction sequence and rerun strategy (I'll leave that for @egeor), I think it makes sense to try to control the execution and understand what is happening, because the inputs are random and we don't control the source.
One way we tried to control this in tpp-mlir is to generate a "reasonable" random tensor by getting a normal distribution around 0.2 and small standard deviation, clamped at [-1.0, 1.0]. We stopped having random crashes (infs and nans etc.), but it doesn't "fix" everything, so I think what you're trying to do is still a valid strategy.
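The initialization described above could look roughly like the following. This is only a sketch of the idea, not the tpp-mlir code: the function name `clamped_normal`, the default `stddev=0.1` (the comment only says "small") and the `seed` parameter are assumptions made for illustration.

```python
import random

def clamped_normal(n, mean=0.2, stddev=0.1, lo=-1.0, hi=1.0, seed=None):
    """Draw n values from N(mean, stddev) and clamp each one to [lo, hi].

    mean=0.2 and the [-1.0, 1.0] range follow the comment above;
    stddev=0.1 is an assumed value for "small standard deviation".
    """
    rng = random.Random(seed)  # private generator, so runs are reproducible
    return [min(hi, max(lo, rng.gauss(mean, stddev))) for _ in range(n)]
```

Clamping keeps every input bounded, which limits how large products and partial sums can grow, but it does not remove the rounding differences caused by a different accumulation order.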
I did add some comments on the code, but it may be just my poor understanding of the original code, so take them with a grain of salt.
Yep, we have had that for a long time, and this is not about NaN/Inf etc. It's about having the tightest possible thresholds in magic numbers, especially for low-precision cases (sgemm/dgemm were never affected), mainly for BF8/int8/int4/f16.

For some degenerate cases (like M=N=1 and K very long), we see, depending on compiler and datatype, errors of up to 2.0 in matrix norms. However, they are not real, as they are floating-point accumulation errors, but with an output matrix size of 1 these norms explode (hence we go to the max value in inf_norm, which is the max error in an eltwise comparison). For cases with moderate M and N and K still very long, now the matrix norm is equal to max inf norm... so we cannot do the same trick, e.g. matrix norm is 0.09 and max inf is 0.009... In this case we now set A to identity (with this PR, in the retry), as this removes any reorder in the accumulation chain. If we now get lower numbers, we assume there is no bug.

The tests here are much more challenging to control than in tpp-mlir, as we run ~10 million (~60K per precision combination and arch) TPP executions for GEMM alone, on randomly drawn shapes between 1x1x1 and 100x100x100 with random values. When you go back in the commit history of this PR, you see that I disabled K-reorder and everything passed right away. So the goal of the retry is to enable K-reorder in unit tests without adding 100+ magic numbers for the shapes and datatypes we analyzed, and without setting the passing bar to 0.1 either... shape-based thresholds are not scalable. This PR "suppresses" such outliers automatically, by taking in this case the error norm of the multiplication where A was the identity matrix.
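To make the retry idea concrete, here is a minimal Python sketch. It is not the driver code from this PR: `gemm`, `check_with_retry`, the two stand-in kernels and the pass rule (retry error at or below the same threshold) are all hypothetical, chosen only to illustrate why an identity A takes accumulation order out of the picture. With A as identity, every output element has at most one non-zero product, so any accumulation order gives the same result.

```python
import random

def gemm(a, b, m, n, k, num_acc=1):
    """C = A * B on nested lists; num_acc > 1 spreads K over several accumulators."""
    c = [[0.0] * n for _ in range(m)]
    for i in range(m):
        for j in range(n):
            acc = [0.0] * num_acc
            for p in range(k):
                acc[p % num_acc] += a[i][p] * b[p][j]
            total = 0.0
            for part in acc:  # combine the partial sums at the end
                total += part
            c[i][j] = total
    return c

def max_abs_err(c_ref, c_test):
    """Largest element-wise absolute difference (the max error of an eltwise comparison)."""
    return max(abs(r - t)
               for row_r, row_t in zip(c_ref, c_test)
               for r, t in zip(row_r, row_t))

def identity(m, k):
    """Rectangular m x k matrix with ones on the main diagonal."""
    return [[1.0 if i == p else 0.0 for p in range(k)] for i in range(m)]

def check_with_retry(kernel, m, n, k, threshold, seed=0):
    """First try random A and B; on failure, retry once with A set to identity."""
    rng = random.Random(seed)
    a = [[rng.uniform(-1.0, 1.0) for _ in range(k)] for _ in range(m)]
    b = [[rng.uniform(-1.0, 1.0) for _ in range(n)] for _ in range(k)]
    if max_abs_err(gemm(a, b, m, n, k), kernel(a, b, m, n, k)) <= threshold:
        return True
    # Retry: at most one non-zero product per output element, so a different
    # accumulation order can no longer change the result.
    a_id = identity(m, k)
    err_id = max_abs_err(gemm(a_id, b, m, n, k), kernel(a_id, b, m, n, k))
    return err_id <= threshold

def reordered_kernel(a, b, m, n, k):
    """Stand-in for a correct kernel that reorders the K accumulation."""
    return gemm(a, b, m, n, k, num_acc=4)

def buggy_kernel(a, b, m, n, k):
    """Stand-in for a genuinely wrong kernel (every result scaled by 1.01)."""
    return [[1.01 * v for v in row] for row in gemm(a, b, m, n, k)]
```

In this sketch, `check_with_retry(reordered_kernel, 1, 1, 1000, 0.0)` passes even with a zero threshold, because the identity retry is exact, while `check_with_retry(buggy_kernel, 4, 4, 4, 1e-6)` still fails, because a real error survives the identity multiplication.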
Looks good to me (I also looked carefully at the retry logic in the gemm driver). This seems for now the best option given the idiosyncrasies of the acc chain reorders happening in the kernel. The only other way I can think of (as we also discussed privately) to deal with this without any specific init and retry would be: 1) have a way to query the library whether a kernel triggers the k-reorder/multiple-accs code gen, and 2) have the driver in this case invoke a reference code that observes the multiple-accumulators strategy/k acc. ordering. Still, this strategy is brittle...
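As a rough illustration of step 2, the sketch below shows a reference dot product that mirrors a kernel's accumulator layout. Everything here is hypothetical: `kernel_dot` is only a stand-in for a vectorized kernel, `lanes`/`num_acc` stand for whatever the library query in step 1 would report, and no such query is modeled.

```python
def kernel_dot(x, y, lanes):
    """Stand-in for a vectorized kernel: walks K in chunks of `lanes`,
    keeping one accumulator per lane, then reduces the lanes."""
    acc = [0.0] * lanes
    k = len(x)
    for base in range(0, k, lanes):
        for lane in range(min(lanes, k - base)):
            acc[lane] += x[base + lane] * y[base + lane]
    total = 0.0
    for part in acc:
        total += part
    return total

def reference_dot(x, y, num_acc=1):
    """Reference that observes the same multiple-accumulator ordering.
    num_acc=1 is the plain sequential accumulation."""
    acc = [0.0] * num_acc
    for p, (xv, yv) in enumerate(zip(x, y)):
        acc[p % num_acc] += xv * yv
    total = 0.0
    for part in acc:
        total += part
    return total

# Inputs where the accumulation order visibly changes the double-precision result:
x = [1e16, 1.0, -1e16, 1.0]
y = [1.0, 1.0, 1.0, 1.0]
sequential = reference_dot(x, y, num_acc=1)  # the first 1.0 is absorbed by 1e16
mirrored = reference_dot(x, y, num_acc=2)    # same ordering as kernel_dot(x, y, 2)
```

When the reference uses the same number of accumulators as the kernel, the two results agree bit for bit and no threshold is needed. The brittleness is that the reference has to track every change in the kernel's accumulation strategy.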
This PR attempts to fix several transient issues/errors with GEMM CI and adds support for SPR as a CI env.