Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Igemm experiment #43

Open
wants to merge 6 commits into
base: i32-gemm-experiment
Choose a base branch
from

Conversation

SuperFluffy
Copy link
Contributor

@SuperFluffy SuperFluffy commented Dec 9, 2018

This is a first shot at implementing gemm using i8 as input, and i16 as output.

The main change is to the gemm loops, which are now able to have different types as input and output.

Todo:

  • Adapt integration tests for i8gemm;
  • Fix the masked buffers to work with different kernel dimensions;

@SuperFluffy
Copy link
Contributor Author

SuperFluffy commented Dec 9, 2018

@bluss: I have adjusted gemm_packed to work with kernels of shapes other then 8x8 and 8x4. However, Rust seems to have issues finding the correct associated types and constants of the K: GemmKernel, and I can't figure out, why. The suggested fixes are basically what's already written.

Can you have a look?

 % cargo check

    Checking matrixmultiply v0.2.1 (/home/janis/foreign_git/matrixmultiply)
error[E0599]: no associated item named `MR` found for type `K` in the current scope
   --> src/gemm.rs:213:30
    |
213 |     let mut mask_buf = [0u8; K::MR * K::NR * size_of::<K::ElemIn>() + 31];
    |                              ^^^^^ associated item not found in `K`
    |
    = help: items from traits can only be used if the trait is implemented and in scope
    = note: the following trait defines an item `MR`, perhaps you need to implement it:
            candidate #1: `kernel::GemmKernel`

error[E0599]: no associated item named `NR` found for type `K` in the current scope
   --> src/gemm.rs:213:38
    |
213 |     let mut mask_buf = [0u8; K::MR * K::NR * size_of::<K::ElemIn>() + 31];
    |                                      ^^^^^ associated item not found in `K`
    |
    = help: items from traits can only be used if the trait is implemented and in scope
    = note: the following trait defines an item `NR`, perhaps you need to implement it:
            candidate #1: `kernel::GemmKernel`

error[E0220]: associated type `ElemIn` not found for `K`
   --> src/gemm.rs:213:56
    |
213 |     let mut mask_buf = [0u8; K::MR * K::NR * size_of::<K::ElemIn>() + 31];
    |                                                        ^^^^^^^^^ associated type `ElemIn` not found

error: aborting due to 3 previous errors

Yet just a few lines above I clearly have K: GemmKernel... The extra bounds on the associated types ElemIn and ElemOut are probably redundant. I just added them to help the compiler.

@bluss
Copy link
Owner

bluss commented Dec 9, 2018

@SuperFluffy don't have time to look at it all, but associated constants can not be used in array sizes. Rust can't do that.

@SuperFluffy
Copy link
Contributor Author

SuperFluffy commented Dec 9, 2018

@bluss Thanks for the note, was looking through RFCs. That's another use for const generics, I guess.

I wonder if replacing the buffer by a Vec would be very bad. Will test it.

@SuperFluffy
Copy link
Contributor Author

Wow, for small matrices using a Vec leads to some serious slowdowns:

 name                 buf_mask ns/iter  vec_mask ns/iter  diff ns/iter  diff %  speedup 
 layout_f64_032::ccc  2,187             2,244                       57   2.61%   x 0.97 
 layout_f64_032::ccf  2,150             2,174                       24   1.12%   x 0.99 
 layout_f64_032::cfc  2,371             2,418                       47   1.98%   x 0.98 
 layout_f64_032::cff  2,334             2,361                       27   1.16%   x 0.99 
 layout_f64_032::fcc  2,005             2,048                       43   2.14%   x 0.98 
 layout_f64_032::fcf  1,972             1,991                       19   0.96%   x 0.99 
 layout_f64_032::ffc  2,173             2,216                       43   1.98%   x 0.98 
 layout_f64_032::fff  2,114             2,150                       36   1.70%   x 0.98 
 mat_mul_f64::m004    143               187                         44  30.77%   x 0.76 
 mat_mul_f64::m006    216               249                         33  15.28%   x 0.87 
 mat_mul_f64::m008    225               262                         37  16.44%   x 0.86 
 mat_mul_f64::m012    410               423                         13   3.17%   x 0.97 
 mat_mul_f64::m016    480               517                         37   7.71%   x 0.93 
 mat_mul_f64::m032    2,191             2,215                       24   1.10%   x 0.99 
 mat_mul_f64::m064    14,075            13,805                    -270  -1.92%   x 1.02 
 mat_mul_f64::m127    97,784            98,549                     765   0.78%   x 0.99 

@SuperFluffy
Copy link
Contributor Author

Blowing up the masked buffer to 1024 (16322, kernel is 16x32, i16 takes 2 bytes) elements at least doesn't seem to affect performance:

running 16 tests
test layout_f64_032::ccc ... bench:       2,189 ns/iter (+/- 79)
test layout_f64_032::ccf ... bench:       2,141 ns/iter (+/- 55)
test layout_f64_032::cfc ... bench:       2,373 ns/iter (+/- 72)
test layout_f64_032::cff ... bench:       2,313 ns/iter (+/- 83)
test layout_f64_032::fcc ... bench:       2,015 ns/iter (+/- 60)
test layout_f64_032::fcf ... bench:       1,960 ns/iter (+/- 49)
test layout_f64_032::ffc ... bench:       2,185 ns/iter (+/- 69)
test layout_f64_032::fff ... bench:       2,115 ns/iter (+/- 57)
test mat_mul_f64::m004   ... bench:         154 ns/iter (+/- 6)
test mat_mul_f64::m006   ... bench:         228 ns/iter (+/- 1)
test mat_mul_f64::m008   ... bench:         231 ns/iter (+/- 6)
test mat_mul_f64::m012   ... bench:         419 ns/iter (+/- 13)
test mat_mul_f64::m016   ... bench:         475 ns/iter (+/- 21)
test mat_mul_f64::m032   ... bench:       2,213 ns/iter (+/- 59)
test mat_mul_f64::m064   ... bench:      14,128 ns/iter (+/- 339)
test mat_mul_f64::m127   ... bench:      98,621 ns/iter (+/- 2,590)

It makes me unhappy to do that, but maybe it's ok?

The only other option I see is to make a macro for each *gemm that passes in a buffer size. But that still doesn't solve that ideally we'd have a different buffer size for each kernel type.

@SuperFluffy
Copy link
Contributor Author

This certainly needs more tuning. This is some terrible performance, as of now:

cargo bench i8

running 18 tests
test layout_i8_128::ccc  ... bench:      86,794 ns/iter (+/- 3,997)
test layout_i8_128::ccf  ... bench:      88,833 ns/iter (+/- 4,850)
test layout_i8_128::cfc  ... bench:      93,464 ns/iter (+/- 5,792)
test layout_i8_128::cff  ... bench:      82,951 ns/iter (+/- 2,452)
test layout_i8_128::fcc  ... bench:      75,052 ns/iter (+/- 11,858)
test layout_i8_128::fcf  ... bench:      75,082 ns/iter (+/- 14,399)
test layout_i8_128::ffc  ... bench:      78,170 ns/iter (+/- 12,755)
test layout_i8_128::fff  ... bench:      79,103 ns/iter (+/- 11,437)
test mat_mul_i8::m004    ... bench:         338 ns/iter (+/- 19)
test mat_mul_i8::m006    ... bench:         473 ns/iter (+/- 64)
test mat_mul_i8::m008    ... bench:         591 ns/iter (+/- 50)
test mat_mul_i8::m012    ... bench:         887 ns/iter (+/- 54)
test mat_mul_i8::m016    ... bench:       1,161 ns/iter (+/- 262)
test mat_mul_i8::m032    ... bench:       2,889 ns/iter (+/- 639)
test mat_mul_i8::m064    ... bench:      14,749 ns/iter (+/- 3,077)
test mat_mul_i8::m127    ... bench:      93,135 ns/iter (+/- 11,694)
test mat_mul_i8::m128    ... bench:      95,444 ns/iter (+/- 14,300)
test mat_mul_i8::m255    ... bench:     607,507 ns/iter (+/- 85,093)

@SuperFluffy
Copy link
Contributor Author

The reason for those atrocious numbers is that I probably don't have the number of available vector, ymm*, registers in mind. avx has 16 vector registers in total.

This issue becomes easily obvious by looking at the kernel size I have chosen: in the kernel, I accumulate in an array of size 32. That means that for the accumulation variables alone I need 32 registers. Add on top the column from a and each element in the row from b, and the accumulation step within the unrolled loop needs 65 registers.

The benchmarks above thus most probably measure the CPU loading data in and out of memory.

@bluss
Copy link
Owner

bluss commented Dec 11, 2018

Sounds interesting. :)

I'd propose some targets:

  • rebase/focus on merging with master
  • separate out refactorings that are unrelated to the actual feature
  • use a new crate feature for this. It can be default during development and for docs.rs, but probably not default for standard build

@SuperFluffy
Copy link
Contributor Author

The way forward to implement an efficient integer gemm is to change the packing of the matrices that the kernel accesses. Google's gemmlowp is doing exactly that. An example can be seen here: https://github.com/google/gemmlowp/blob/master/internal/kernel.h#L116-L119

0  1  2  3  4  20 21 22 23 24
5  6  7  8  9  25 26 27 28 29
10 11 12 13 14 30 31 32 33 34
15 16 17 18 19 35 36 37 38 39

Here, two sub blocks (or “cell” in their terminology) of dimensions 4x5 each are placed right next to each other. Each cell is row major in itself.

For an avx2 kernel performing matrix multiplication between two i8 matrices, an optimal kernel would look something like this

    b 0 1 2 3
 a    4 5 6 7
0 8
1 9
2 a
3 b
4 c
5 d
6 e
7 f

0 8
1 9
2 a
3 b
4 c
5 d
6 e
7 f

0 8
1 9
2 a
3 b
4 c
5 d
6 e
7 f

We can load 16 i8 and sign-extend them to 16 i16 into one register. This takes up ymm0 and ymm1 for a and b. The problem is that the outer product between one column, 16x1, and one row, 1x16, requires 16 registers to accumulate their product. We only have 16 registers in total, which is already more than we have available.

An alternative is to consider multiplying 8x2 and 2x8 (two registers total). The intrinsic vpmaddwd allows doing the multiplication and addition in one go, accumulating the result in 32-bit vectors. We'd need 8 of those, so we have now used 10, leaving us with 6. That's too many spare ones (we don't need so many to do the intermediate steps like shuffling, etc). Using two columns, so having a 16x2 panel), would again exceed the number of available registers.

The best option

I think the optimal solution is to do what gemmlowp is doing: multiply 3 8x2 cells of matrix a and one 2x4 cell matrix b. This way, we can accumulate in 3*4 = 12 registers, loading new data from a three times. This takes 14 registers for accumulation and cells, leaving 2 registers to perform intermediate operations in. The main task is hence to pack 8x2 cells such that they are adjacent to each other, and pack 2x4 cells such that they have additional padding of 8 bytes.

@bluss How much of the loops do you think would need to be adapted to permit such packing?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants