[TOPI] VNNI support for int8 dense #10230
Merged
Conversation
masahi requested review from Laurawly, Huyuwei, kevinthesun, jwfromm, vinx13, yzhliu, mbrookhart and ZihengJiang as code owners on February 14, 2022 at 00:19
masahi requested review from jcf94, jroesch, slyubomirsky, icemelon, MarisaKirisame, zhiics, anijain2305, wweic, junrushao, comaniac, tqchen, areusch and merrymercy as code owners on February 14, 2022 at 00:19
elvin-n approved these changes on Feb 14, 2022
LGTM

Another perf result, this time on a desktop CPU, with TVM showing excellent performance!
junrushao approved these changes on Feb 14, 2022
Thanks! This is just amazing!!
ylc pushed a commit to ylc/tvm that referenced this pull request on Feb 16, 2022
* wip * revert for now * simplify blocking * add bench script * update type rel * refactor tests * end to end compilation working * paralleize outer loop * add shape check * fused schedule first cut * restore original test * black * add vnni check * add relay test * skip on ci * check dtype * lint * make it tunable * minor cleanup
pfk-beta pushed a commit to pfk-beta/tvm that referenced this pull request on Apr 11, 2022
I started off with the test code in tvm/tests/python/contrib/test_gemm_acc32_vnni.py (line 30 in 720e7b1).
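For readers unfamiliar with VNNI, the sketch below (illustrative only, not code from this PR or from that test file) shows the computation the AVX-512 VNNI dot-product instruction (vpdpbusd) accelerates: each int32 output lane accumulates a 4-element uint8 x int8 dot product, which is why the int8 dense kernel wants the weight pre-packed into a blocked layout. The function name and the exact (N // 16, K // 4, 16, 4) blocking used here are assumptions for illustration.

```python
import numpy as np

# Illustrative NumPy reference for a VNNI-style int8 dense with a packed weight
# layout of (N // 16, K // 4, 16, 4). Names and blocking are for illustration only.
def dense_vnni_reference(data_u8, weight_packed_s8):
    m, k = data_u8.shape
    n_o, k_o, n_i, k_i = weight_packed_s8.shape  # n_i = 16 int32 lanes, k_i = 4 bytes
    out = np.zeros((m, n_o * n_i), dtype="int32")
    for i in range(m):
        for jo in range(n_o):
            for ji in range(n_i):
                acc = 0
                for ko in range(k_o):
                    # One vpdpbusd step: dot product of 4 uint8 values with
                    # 4 int8 values, accumulated into a 32-bit integer.
                    a = data_u8[i, ko * 4 : ko * 4 + 4].astype("int32")
                    b = weight_packed_s8[jo, ko, ji].astype("int32")
                    acc += int(np.dot(a, b))
                out[i, jo * n_i + ji] = acc
    return out

# Sanity check against a plain int32 matmul (K a multiple of 4, N a multiple of 16).
rng = np.random.default_rng(0)
data = rng.integers(0, 255, size=(8, 64), dtype=np.uint8)
weight = rng.integers(-128, 127, size=(32, 64), dtype=np.int8)
packed = weight.reshape(2, 16, 16, 4).transpose(0, 2, 1, 3)  # -> (n_o, k_o, n_i, k_i)
assert np.array_equal(
    dense_vnni_reference(data, packed),
    data.astype("int32") @ weight.astype("int32").T,
)
```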
Moreover, since I rely on alter op layout to enable this op (see tvm/python/tvm/topi/x86/dense_alter_op.py, lines 40 to 49 in 35f6bb1), the AutoTVM task extraction path (see tvm/python/tvm/autotvm/task/relay_integration.py, lines 52 to 54 in 187aeb5) needs to apply AlterOpLayout before extracting tasks. I refuse to add an ugly code path to work around this strange issue like the existing code does (cc @tkonolige).

cc @vinx13 @junrushao1994 @mbrookhart @tkonolige @elvin-n
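To make that concrete, here is a minimal sketch, using standard TVM APIs, of what applying AlterOpLayout ahead of AutoTVM task extraction can look like. The function name and target string are assumptions for illustration; this is not the code path in relay_integration.py referenced above.

```python
import tvm
from tvm import autotvm, relay

# Hedged sketch: run AlterOpLayout so dense is already rewritten to the packed,
# VNNI-friendly form before AutoTVM collects tuning tasks. The target string is
# an assumption; any x86 target with VNNI support would play the same role.
def extract_vnni_tasks(mod, params, target_str="llvm -mcpu=cascadelake"):
    target = tvm.target.Target(target_str)
    seq = tvm.transform.Sequential(
        [relay.transform.InferType(), relay.transform.AlterOpLayout()]
    )
    # AlterOpLayout dispatches on the current target, so enter the target context.
    with target, tvm.transform.PassContext(opt_level=3):
        mod = seq(mod)
    return autotvm.task.extract_from_program(mod, params=params, target=target)
```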
Current perf results (also see more results in #10230 (comment))
Comparison is against FBGEMM using their benchmark executable: https://github.com/pytorch/FBGEMM/blob/main/bench/GEMMsBenchmark.cc

The CPU is a Tiger Lake i7-1195G7 @ 2.90 GHz; all numbers are giga-ops per second (GOPS). I didn't spend much time on perf tuning, but the results look promising. Perf on bigger workloads doesn't look great and might need further investigation.
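As a side note on the units: GOPS figures for a dense/GEMM workload are typically derived from the operation count 2 * M * N * K (one multiply and one add per accumulation) divided by the measured runtime. A tiny sketch, with illustrative names only:

```python
# Illustrative only: convert a measured GEMM runtime into giga-ops per second.
def gops(m, n, k, seconds_per_run):
    # 2 * M * N * K multiply-add operations per run.
    return 2.0 * m * n * k / seconds_per_run / 1e9

# Example: a 1024 x 1024 x 1024 matmul that takes 2 ms is about 1074 GOPS.
print(gops(1024, 1024, 1024, 0.002))
```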
Also, I found that autotvm tuning (there is only one knob) helped single-threaded perf on some workloads, but it didn't help multi-threaded perf at all.
Single thread
4 threads