[MatMul] Update generated code after memory index hoisting #1974

shmsong · 2022-09-13T20:10:27Z

All changes in this PR are minor tweaks to the generated CUDA code, made to get the intended assembly after compilation.

Here is the list of minor codegen tweaks grouped into this PR (for more details, see the comments and the internal doc):

1. Double Buffer Switch: ensures UR usage in double-buffered indexing

Before transform:

for i in ... // double buffer loop:
  .. = ld.shared [... + i%3 *double_buffer_size]

After transform:

double_buffer_switch=0;
for i in ... // double buffer loop:
  .. = ld.shared [... + double_buffer_switch]
  double_buffer_switch = update(double_buffer_switch, double_buffer_size);
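
A minimal host-side sketch of the equivalence behind this transform, assuming `update` simply advances the offset by one stage and wraps after the last stage. The PR description does not spell out `update`, so `update_switch` and the two offset helpers are hypothetical names used only for illustration:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for `update`: advance the byte offset by one stage
// and wrap back to 0 after the last stage.
inline std::size_t update_switch(std::size_t offset, std::size_t stage_bytes,
                                 std::size_t num_stages) {
  offset += stage_bytes;
  return offset == stage_bytes * num_stages ? 0 : offset;
}

// Offsets as computed before the transform: (i % num_stages) * stage_bytes.
inline std::vector<std::size_t> offsets_modulo(std::size_t iters,
                                               std::size_t stage_bytes,
                                               std::size_t num_stages) {
  std::vector<std::size_t> out;
  for (std::size_t i = 0; i < iters; ++i) {
    out.push_back((i % num_stages) * stage_bytes);
  }
  return out;
}

// Offsets after the transform: one running value carried across iterations.
inline std::vector<std::size_t> offsets_switch(std::size_t iters,
                                               std::size_t stage_bytes,
                                               std::size_t num_stages) {
  std::vector<std::size_t> out;
  std::size_t double_buffer_switch = 0;
  for (std::size_t i = 0; i < iters; ++i) {
    out.push_back(double_buffer_switch);
    double_buffer_switch =
        update_switch(double_buffer_switch, stage_bytes, num_stages);
  }
  return out;
}
```

Both helpers yield the same offset sequence. The second form keeps a single loop-carried value that is identical for every thread, which is presumably what lets the compiler place it in a uniform register.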

2. Double Buffer Update: saves registers when UR is not available, at the cost of more instructions.

Before transform:

for i in ... // double buffer loop:
  .. = ld.shared [... + i%3 *double_buffer_size]

After transform:

for j in ...
  R[j] = ...
for i in ... // double buffer loop:
  .. = ld.shared [R[j]]
  for j in ...
     R[j]= update(R[j], double_buffer_size);
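
A sketch of the in-place variant, again on the host and under assumptions: `update_address` is a hypothetical helper that uses the loop index to decide whether to advance by one stage or step back to stage 0. The real `update` may be implemented differently.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical in-place update: if iteration i used the last stage, step back
// to stage 0; otherwise advance by one stage.
inline std::size_t update_address(std::size_t addr, std::size_t i,
                                  std::size_t stage_bytes,
                                  std::size_t num_stages) {
  return (i % num_stages == num_stages - 1)
             ? addr - stage_bytes * (num_stages - 1)
             : addr + stage_bytes;
}

// Before: every address is recomputed from its base and i % num_stages.
inline std::vector<std::vector<std::size_t>> addresses_modulo(
    const std::vector<std::size_t>& bases, std::size_t iters,
    std::size_t stage_bytes, std::size_t num_stages) {
  std::vector<std::vector<std::size_t>> out;
  for (std::size_t i = 0; i < iters; ++i) {
    std::vector<std::size_t> row;
    for (std::size_t base : bases) {
      row.push_back(base + (i % num_stages) * stage_bytes);
    }
    out.push_back(row);
  }
  return out;
}

// After: R[j] holds the full address and is updated in place each iteration.
inline std::vector<std::vector<std::size_t>> addresses_in_place(
    const std::vector<std::size_t>& bases, std::size_t iters,
    std::size_t stage_bytes, std::size_t num_stages) {
  std::vector<std::vector<std::size_t>> out;
  std::vector<std::size_t> R = bases;  // for j in ...: R[j] = ...
  for (std::size_t i = 0; i < iters; ++i) {
    out.push_back(R);                  // addresses used by the loads
    for (std::size_t& r : R) {
      r = update_address(r, i, stage_bytes, num_stages);
    }
  }
  return out;
}
```

Each `R[j]` replaces a separate base register and stage offset with one register holding the full address, but every `R[j]` needs its own update per iteration, which is where the extra instructions come from.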

3. Casting lifted component to byte pointer: lifts some instructions and regs out of main loop

Before transform:

nvfuser_index_t base = ...
for i in ... // main loop
  .. = ld.global &T0[base+123]

After transform:

char* base = (char*)&T0[...]
for i in ... // main loop
  .. = ld.global base+123*sizeof(T0.dtype)
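
To illustrate that the byte-pointer form addresses the same element, here is a small host-side sketch. The function names and the `float` element type are assumptions made for the example; the constant 123 is taken from the pseudocode above.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

constexpr std::size_t kOffset = 123;  // constant element offset from the example

// Before: element indexing, the address is &t0[base + 123].
inline float load_element_indexed(const float* t0, std::size_t base) {
  return t0[base + kOffset];
}

// After: the base is resolved once to a byte pointer, and the constant part
// becomes a byte offset (123 * sizeof(element type)).
inline float load_byte_pointer(const float* t0, std::size_t base) {
  const char* base_ptr = reinterpret_cast<const char*>(&t0[base]);
  const char* addr = base_ptr + kOffset * sizeof(float);
  return *reinterpret_cast<const float*>(addr);  // points at a real float
}
```

Once the base is a byte pointer computed outside the loop, the remaining constant byte offset can typically be folded into the load instruction rather than recomputed with index arithmetic on every iteration.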

4. Increment gmem pointer: lifts some more instructions and regs out of main loop

Before transform:

char* base = ...
for i in ... // main loop
  .. = ld.global base+123* i

After transform:

char* base = ...
for i in ... // main loop
  .. = ld.global base
  base+=123
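
The increment form is a strength reduction: the multiply by `i` is replaced with a running pointer. A host-side sketch follows, with hypothetical helper names and a plain byte buffer standing in for global memory:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Before: the address is recomputed each iteration as base + stride * i.
inline std::vector<unsigned char> loads_multiplied(const unsigned char* base,
                                                   std::size_t stride,
                                                   std::size_t iters) {
  std::vector<unsigned char> out;
  for (std::size_t i = 0; i < iters; ++i) {
    out.push_back(base[stride * i]);
  }
  return out;
}

// After: the pointer itself is bumped by the stride after each load.
// The buffer must hold at least stride * iters bytes so that the final bump
// stays within one past the end of the allocation.
inline std::vector<unsigned char> loads_incremented(const unsigned char* base,
                                                    std::size_t stride,
                                                    std::size_t iters) {
  std::vector<unsigned char> out;
  const unsigned char* p = base;
  for (std::size_t i = 0; i < iters; ++i) {
    out.push_back(*p);
    p += stride;
  }
  return out;
}
```

Both loops read the same bytes; the second needs only one add per iteration and no longer depends on the loop index for addressing.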

5. Decrement gmem pointer: improves the instruction schedule

Before transform:

char* base = ...
for i in ... // main loop
  .. = ld.global base
  base+=123

After transform:

char* base = ...
base -=123
for i in ... // main loop
  base+=123
  .. = ld.global base
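
A host-side sketch of the same reordering, using a signed byte offset instead of a raw pointer so the example never forms a pointer before the start of the buffer. The helper names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Before: load first, then bump the offset.
inline std::vector<unsigned char> loads_post_increment(
    const std::vector<unsigned char>& buf, std::ptrdiff_t stride,
    std::size_t iters) {
  std::vector<unsigned char> out;
  std::ptrdiff_t off = 0;
  for (std::size_t i = 0; i < iters; ++i) {
    out.push_back(buf[static_cast<std::size_t>(off)]);
    off += stride;
  }
  return out;
}

// After: start one stride early, bump first, then load.
inline std::vector<unsigned char> loads_pre_increment(
    const std::vector<unsigned char>& buf, std::ptrdiff_t stride,
    std::size_t iters) {
  std::vector<unsigned char> out;
  std::ptrdiff_t off = -stride;  // base -= 123, done once before the loop
  for (std::size_t i = 0; i < iters; ++i) {
    off += stride;               // non-negative from here on
    out.push_back(buf[static_cast<std::size_t>(off)]);
  }
  return out;
}
```

The two loops issue identical loads; only the position of the increment relative to the load changes, which is the part that affects how the compiler schedules the instructions.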

6. Lift cvta out of main loop for cp.async: ensures usage of the immediate field.

Before transform:

char* smem_ptr = ...
for i in ... // main loop
 cp.async smem_ptr+123, ...

After transform:

char* smem_ptr = ...
unsigned smem_address = cvta(smem_ptr);
for i in ... // main loop
 cp.async smem_address +123, ...

naoyam · 2022-09-22T19:34:57Z

@shmsong Just skimmed through the changes. As far as I can see, there's no fundamentally new analysis in this PR; all of the changes are localized tweaks to the generated code, as you explained in the comment above. In the interest of time, I'll prioritize reviewing the other PRs.

shmsong added 17 commits August 22, 2022 16:35

add base address field in tensor index

3340b8f

add pointer data type

00a53b4

codegen for base address option

b349586

pointer mod take 1

f995fd1

minor update

84d8f04

(wip) increment mode

250e46b

Merge branch 'predicate_shift' into index_codegen

906edb4

lift read db index

ce3f1e1

inplace write double buffer update

16e9c4a

lift cvta out of main loop

5bdeea2

increment gmem load

578dcfe

[hack] decrement index

9f731a3

rebase fix

334e81a

clean up

139368b

comment ; cleanup

211cc5c

minor fix

b623514

minor fix

add6fec

shmsong changed the title ~~WIP: [Not ready for Review] Update generated code after memory index hoisting~~ Update generated code after memory index hoisting Sep 20, 2022

zasdfgbnm added 3 commits September 29, 2022 21:28

Merge branch 'predicate_shift-rebase' into index_codegen-rebase

6bf40cf

fix

78a80f7

Merge branch 'predicate_shift' into index_codegen

c277942

csarofeen changed the title ~~Update generated code after memory index hoisting~~ [MatMul] Update generated code after memory index hoisting Oct 19, 2022

zasdfgbnm mentioned this pull request Dec 19, 2022

cp.async access global tensor via pointer #2282

Merged
