Use int for int8x4 due to performance overhead of char4 #1569

vinx13 · 2018-08-08T05:46:17Z

Loading four int8 elements as char4 is likely to produce more integer instructions. When we use int8 intrinsics (e.g. dp4a), we need packed 32-bit data, which requires extra operations to pack the int8 elements.

For example, below is the PTX generated for the following call:
__dp4a((( char4*)(( signed char*)A_shared_local + ((k_inner_outer_outer % 2) * 32)))[0], (( char4*)(( signed char*)B_shared_local + ((k_inner_outer_outer % 2) * 32)))[0], C_local[0]);

ld.shared.v4.u8 {%rs577, %rs578, %rs579, %rs580}, [%r5+24];
ld.shared.v4.u8 {%rs625, %rs626, %rs627, %rs628}, [%r6+24];
...
cvt.u32.u16 %r2873, %rs580;
mul.wide.u16 %r2874, %rs578, 256;
cvt.u32.u16 %r2875, %rs577;
cvt.u32.u16 %r2876, %rs579;
prmt.b32 %r2877, %r2876, %r2875, 28756;
prmt.b32 %r2878, %r2873, %r2877, 1620;
or.b32 %r2010, %r2878, %r2874;

cvt.u32.u16 %r2879, %rs628;
mul.wide.u16 %r2880, %rs626, 256;
cvt.u32.u16 %r2881, %rs625;
cvt.u32.u16 %r2882, %rs627;
prmt.b32 %r2883, %r2882, %r2881, 28756;
prmt.b32 %r2884, %r2879, %r2883, 1620;
or.b32 %r1815, %r2884, %r2880;

dp4a.s32.s32 %r1785, %r2010, %r1815, %r1529;

We would like to use ld.shared.u32 in this case so that 32-bit data can be directly loaded.
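To make the packing argument concrete, here is a minimal host-side C++ sketch of what dp4a computes on two packed 32-bit operands, and of loading four int8 values as one 32-bit word. The helper names `dp4a_ref` and `load_int8x4` are hypothetical and exist only for illustration; they are not CUDA or TVM functions, and this is an emulation rather than the generated device code.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Host-side reference for dp4a.s32.s32: treat each 32-bit operand as four
// signed 8-bit lanes, multiply lane by lane, and accumulate into c.
// (Hypothetical helper for illustration only.)
int32_t dp4a_ref(int32_t a, int32_t b, int32_t c) {
  int8_t a_lanes[4];
  int8_t b_lanes[4];
  std::memcpy(a_lanes, &a, sizeof(a));
  std::memcpy(b_lanes, &b, sizeof(b));
  for (int i = 0; i < 4; ++i) {
    c += static_cast<int32_t>(a_lanes[i]) * static_cast<int32_t>(b_lanes[i]);
  }
  return c;
}

// Read four consecutive int8 values as a single 32-bit word. This is the
// host analogue of a 32-bit load: one load, no per-byte repacking.
// (Hypothetical helper for illustration only.)
int32_t load_int8x4(const int8_t* p) {
  assert(p != nullptr);
  int32_t packed;
  std::memcpy(&packed, p, sizeof(packed));
  return packed;
}
```

With the data already packed in a 32-bit word, the operands can be passed to the intrinsic directly, which is the situation the `cvt`/`prmt`/`or` sequence above has to reconstruct after a byte-wise load.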

This change disables support for vectorized int8 arithmetic operations. Since these operations are rarely used, we prefer better performance here.
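One way to see why ordinary arithmetic no longer works once int8x4 is emitted as `int`: a plain 32-bit add lets carries cross lane boundaries, so it is not a lane-wise int8 add. The following host-side C++ sketch compares the two; `lane`, `add_lanes`, and `add_word` are hypothetical names used only for this illustration.

```cpp
#include <cassert>
#include <cstdint>

// Extract 8-bit lane i (0 = least significant byte) from a packed word.
uint32_t lane(uint32_t word, int i) {
  assert(i >= 0 && i < 4);
  return (word >> (8 * i)) & 0xFFu;
}

// Lane-wise add: each 8-bit lane wraps on its own, which is what a
// vectorized int8 add is expected to compute.
uint32_t add_lanes(uint32_t a, uint32_t b) {
  uint32_t result = 0;
  for (int i = 0; i < 4; ++i) {
    result |= ((lane(a, i) + lane(b, i)) & 0xFFu) << (8 * i);
  }
  return result;
}

// Plain 32-bit add: what `a + b` means when the operands are typed as int.
// A carry out of one lane spills into the next.
uint32_t add_word(uint32_t a, uint32_t b) { return a + b; }
```

For example, with lanes {-1, 1, 0, 0} (0x000001FF) and {1, 1, 0, 0} (0x00000101), the lane-wise result is 0x00000200 while the 32-bit add gives 0x00000300, because the carry from lane 0 leaks into lane 1.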

vinx13 · 2018-08-08T05:48:32Z

@tqchen Please review.

tqchen · 2018-08-08T16:55:12Z

cc @nishi-t

tqchen · 2018-08-08T16:55:42Z

src/codegen/codegen_cuda.cc

@@ -90,7 +90,7 @@ void CodeGenCUDA::PrintType(Type t, std::ostream& os) {  // NOLINT(*)
        if (t.lanes() == 4) {
          // directly 4 8 bit int in integer.
          enable_int8_ = true;
-          os << "char4"; return;
+          os << "int"; return;


Please add a comment block here explaining why we are making this choice, so people won't change it back.

nishi-t · 2018-08-09T03:05:40Z

vectorized_add test will not work for int8 anymore. Please remove this:

check_cuda("int8", 64, 4)

https://github.com/dmlc/tvm/blob/master/tests/python/unittest/test_codegen_cuda.py#L34

vinx13 · 2018-08-09T03:35:13Z

@tqchen Please review again.

tqchen requested changes Aug 8, 2018


tqchen added the "status: review in progress" and "status: need update" labels and removed the "status: review in progress" label Aug 8, 2018

vinx13 added 3 commits August 9, 2018 11:10

Use int for int8x4 due to performance overhead of char4

87cc7c7

Add a comment about using int

68f50e3

Remove invalid test

112c633

vinx13 force-pushed the feature/int_for_int8x4 branch from 27e14ab to 112c633 Compare August 9, 2018 03:11

tqchen approved these changes Aug 9, 2018


tqchen merged commit 41d4dd6 into apache:master Aug 9, 2018

llehtahw mentioned this pull request Sep 10, 2019

Fix CUDA int8x4 vectorize #3928

Merged
