A simple GEMM kernel for sparse convolution. It is mainly an implicit GEMM CUDA kernel with naive tensor core usage. The following tricks are used to improve overall runtime:
- tensor-core
- float16 arithmetic
- half2 intrinsics (hfma2, hmul2, hadd2)
- software pipelining
- combined (128-bit vectorized) memory access (ldg128, stg128). Note that on A100 we should consider using ldgsts (the asynchronous global-to-shared copy, exposed in PTX as cp.async)
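
As a rough illustration of the half2 and 128-bit access items in the list above, here is a minimal sketch of an elementwise fused multiply-add over float16 data. It is not taken from the actual kernel: the names `fma_half8` and `run_fma_half8` are hypothetical, the element count is assumed to be a multiple of 8, and the pointers are assumed to be 16-byte aligned (which holds for `cudaMalloc` allocations). Each thread moves 8 halves per operand with a single 16-byte load, which the compiler can typically lower to a 128-bit load/store, and processes them as four `half2` pairs with `__hfma2`. Building for `sm_53` or newer (for example `-arch=sm_70`) is assumed.

```cuda
#include <cuda_fp16.h>
#include <cuda_runtime.h>
#include <vector>

// Hypothetical kernel: out[i] = a[i] * b[i] + c[i] over fp16 data.
// Each thread handles one 16-byte vector (8 halves) per operand.
__global__ void fma_half8(const half* a, const half* b, const half* c,
                          half* out, int n_vec) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n_vec) return;

    // One 128-bit load per operand (requires 16-byte aligned pointers).
    uint4 va = reinterpret_cast<const uint4*>(a)[i];
    uint4 vb = reinterpret_cast<const uint4*>(b)[i];
    uint4 vc = reinterpret_cast<const uint4*>(c)[i];
    uint4 vo;

    // View each 16-byte register group as four half2 pairs.
    const half2* pa = reinterpret_cast<const half2*>(&va);
    const half2* pb = reinterpret_cast<const half2*>(&vb);
    const half2* pc = reinterpret_cast<const half2*>(&vc);
    half2* po = reinterpret_cast<half2*>(&vo);

#pragma unroll
    for (int k = 0; k < 4; ++k) {
        // Two fp16 fused multiply-adds per instruction.
        po[k] = __hfma2(pa[k], pb[k], pc[k]);
    }

    // One 128-bit store for the result.
    reinterpret_cast<uint4*>(out)[i] = vo;
}

// Hypothetical host helper: fills a, b, c with constants, runs the kernel,
// and returns the last output element as float. n must be a multiple of 8.
float run_fma_half8(int n, float av, float bv, float cv) {
    size_t bytes = static_cast<size_t>(n) * sizeof(half);
    std::vector<half> ha(n, __float2half(av));
    std::vector<half> hb(n, __float2half(bv));
    std::vector<half> hc(n, __float2half(cv));
    std::vector<half> hout(n);

    half *da, *db, *dc, *dout;
    cudaMalloc(&da, bytes);
    cudaMalloc(&db, bytes);
    cudaMalloc(&dc, bytes);
    cudaMalloc(&dout, bytes);
    cudaMemcpy(da, ha.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(db, hb.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dc, hc.data(), bytes, cudaMemcpyHostToDevice);

    int n_vec = n / 8;  // number of 16-byte vectors
    int threads = 128;
    int blocks = (n_vec + threads - 1) / threads;
    fma_half8<<<blocks, threads>>>(da, db, dc, dout, n_vec);

    // Blocking copy; also waits for the kernel to finish.
    cudaMemcpy(hout.data(), dout, bytes, cudaMemcpyDeviceToHost);

    cudaFree(da);
    cudaFree(db);
    cudaFree(dc);
    cudaFree(dout);
    return __half2float(hout[n - 1]);
}
```

In a real GEMM kernel the same two ideas would apply to the tile loads from global memory and to the epilogue arithmetic rather than to a standalone elementwise pass. Error checking on the CUDA calls is omitted here for brevity.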