Igemm experiment #43
base: i32-gemm-experiment
@bluss: I have adjusted Can you have a look?
Yet just a few lines above I clearly have |
@SuperFluffy I don't have time to look at it all, but associated constants cannot be used in array sizes. Rust can't do that. |
@bluss Thanks for the note, was looking through RFCs. That's another use for const generics, I guess. I wonder if replacing the buffer by a |
Wow, for small matrices using a
|
Blowing up the masked buffer to 1024 (16*32*2; the kernel is 16x32 and i16 takes 2 bytes) elements at least doesn't seem to affect performance:
It makes me unhappy to do that, but maybe it's ok? The only other option I see is to make a macro for each |
This certainly needs more tuning. This is some terrible performance, as of now:
|
The reason for those atrocious numbers is probably that I did not take the number of available vector registers into account. The issue becomes obvious when looking at the kernel size I have chosen: in the kernel, I accumulate in an array of size 32. That means that for the accumulation variables alone I need 32 registers. Add on top the column from a and each element in the row from b, and the accumulation step within the unrolled loop needs 65 registers. The benchmarks above thus most probably measure the CPU moving data in and out of memory. |
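To make the register count above concrete, here is a minimal sketch of the arithmetic. It assumes an x86-64 target with AVX2, which provides 16 ymm vector registers; the function name and the constant are made up for illustration and are not part of this PR.

```rust
/// Number of 256-bit (ymm) vector registers on x86-64 with AVX2.
/// Assumption for illustration: the target CPU has AVX2 but not AVX-512.
const AVX2_VECTOR_REGISTERS: usize = 16;

/// Rough count of vector registers live in one unrolled accumulation step:
/// one per accumulator, plus the loaded column(s) of `a`, plus one per
/// element taken from the row of `b`.
fn registers_needed(accumulators: usize, a_columns: usize, b_elements: usize) -> usize {
    accumulators + a_columns + b_elements
}

/// True if the step cannot keep everything in registers, so the compiler
/// has to spill values to the stack and reload them.
fn must_spill(needed: usize) -> bool {
    needed > AVX2_VECTOR_REGISTERS
}
```

With 32 accumulators, one column from a and 32 elements from b, `registers_needed(32, 1, 32)` gives 65, far above the 16 registers available, so `must_spill` returns true.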
Sounds interesting. :) I'd propose some targets:
|
The way forward to implement an efficient integer gemm is to change the packing of the matrices that the kernel accesses. Google's gemmlowp is doing exactly that. An example can be seen here: https://github.com/google/gemmlowp/blob/master/internal/kernel.h#L116-L119
Here, two sub-blocks (or “cells” in their terminology) of dimensions 4x5 each are placed right next to each other. Each cell is row-major in itself. For an AVX2 kernel performing matrix multiplication between two
We can load 16
An alternative is to consider multiplying 8x2 and 2x8 (two registers total). The intrinsic
The best option
I think the optimal solution is to do what gemmlowp is doing: multiply 3 8x2 cells of matrix
@bluss How much of the loops do you think would need to be adapted to permit such packing? |
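As a rough illustration of the cell-based packing discussed above, the following sketch copies a row-major matrix into consecutive cells, each of which is row-major in itself. This is neither gemmlowp's code nor the packing routine of this PR; `pack_cells` and its parameters are hypothetical, and it assumes the matrix dimensions are exact multiples of the cell dimensions.

```rust
/// Packs a row-major `rows x cols` matrix into consecutive cells of
/// `cell_rows x cell_cols`. Cells are emitted one column block at a time,
/// walking down the row blocks; inside a cell the elements are row-major.
/// Hypothetical helper for illustration only.
fn pack_cells(
    src: &[i8],
    rows: usize,
    cols: usize,
    cell_rows: usize,
    cell_cols: usize,
) -> Vec<i8> {
    assert_eq!(src.len(), rows * cols);
    assert!(rows % cell_rows == 0 && cols % cell_cols == 0);
    let mut packed = Vec::with_capacity(src.len());
    for col_block in (0..cols).step_by(cell_cols) {
        for row_block in (0..rows).step_by(cell_rows) {
            // Copy one cell, row by row, so it ends up contiguous in memory.
            for r in row_block..row_block + cell_rows {
                let start = r * cols + col_block;
                packed.extend_from_slice(&src[start..start + cell_cols]);
            }
        }
    }
    packed
}
```

A kernel reading the packed buffer can then load a whole cell with contiguous reads instead of striding through the original matrix.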
This is a first shot at implementing gemm using i8 as input, and i16 as output.

The main change is to the gemm loops, which are now able to have different types as input and output.
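To show what different input and output types mean in practice, here is a naive reference sketch, not the blocked loops of this PR. The name `gemm_naive` and its signature are assumptions for illustration; it computes C += A * B on row-major slices and widens every input element to the output type before multiplying.

```rust
use std::ops::{AddAssign, Mul};

/// Naive reference C += A * B with separate input and output element types.
/// A is `m x k`, B is `k x n`, C is `m x n`, all row-major.
/// Hypothetical helper; the real loops in the PR are blocked and packed.
fn gemm_naive<Tin, Tout>(m: usize, k: usize, n: usize, a: &[Tin], b: &[Tin], c: &mut [Tout])
where
    Tin: Copy,
    Tout: Copy + From<Tin> + Mul<Output = Tout> + AddAssign,
{
    assert_eq!(a.len(), m * k);
    assert_eq!(b.len(), k * n);
    assert_eq!(c.len(), m * n);
    for i in 0..m {
        for j in 0..n {
            for p in 0..k {
                // Widen both operands first so the product is formed in Tout.
                c[i * n + j] += Tout::from(a[i * k + p]) * Tout::from(b[p * n + j]);
            }
        }
    }
}
```

Calling it with `Tin = i8` and `Tout = i16` gives the combination described above. Note that an i16 accumulator can still overflow for large k, which this sketch does not guard against.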
Todo:
i8gemm
;