Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[MatMul] Update generated code after memory index hoisting #1974

Open
wants to merge 20 commits into
base: predicate_shift
Choose a base branch
from

Conversation

shmsong
Copy link

@shmsong shmsong commented Sep 13, 2022

All changes in this PR are minor tweaking on the generated cuda code to get the intended assembly code after compilation.

Here is a list of minor codegen tweaking grouped into this PR: (more details see comments and internal doc)


1.Double Buffer Swtich to ensure UR usage in double buffered indexing

Before transform:

for i in ... // double buffer loop:
  .. = ld.shared [... + i%3 *double_buffer_size]

After transform:

double_buffer_switch=0;
for i in ... // double buffer loop:
  .. = ld.shared [... + double_buffer_switch]
  double_buffer_switch = update(double_buffer_switch, double_buffer_size);

  1. Double Buffer Update to save register when UR is not available, this costs more instructions.

Before transform:

for i in ... // double buffer loop:
  .. = ld.shared [... + i%3 *double_buffer_size]

After transform:

for j in ...
  R[j] = ...
for i in ... // double buffer loop:
  .. = ld.shared [R[J]]
  for j in ...
     R[j]= update(R[j], double_buffer_size);

3.Casting lifted component to byte pointer: lifts some instructions and regs out of main loop

Before transform:

nvfuser_index_t base = ...
for i in ... // main loop
  .. = ld.global &T0[base+123]

After transform:

char* base = &T0[[...]
for i in ... // main loop
  .. = ld.global base+123*sizeof(T0.dtype)

4.Increment gmem pointer: lifts some more instructions and regs out of main loop

Before transform:

char* base = ...
for i in ... // main loop
  .. = ld.global base+123* i

After transform:

char* base = ...
for i in ... // main loop
  .. = ld.global base
  base+=123

5.Decrement gmem pointer: improves the instruction schedule
Before transform:

char* base = ...
for i in ... // main loop
  .. = ld.global base
  base+=123

After transform:

char* base = ...
base -=123
for i in ... // main loop
  base+=123
  .. = ld.global base

  1. lift cvta out of main loop for cp.async: ensures usage of immediate field.
    Before transform:
char* smem_ptr = ...
for i in ... // main loop
 cp.async smem_ptr+123, ...

After transform:

char* smem_ptr = ...
unsigned smem_address = cvta(smem_ptr);
for i in ... // main loop
 cp.async smem_address +123, ...

@shmsong shmsong changed the title WIP: [Not ready for Review] Update generated code after memory index hoisting Update generated code after memory index hoisting Sep 20, 2022
@naoyam
Copy link
Collaborator

naoyam commented Sep 22, 2022

@shmsong Just skimmed through the changes. As far as I cans see, there's nothing fundamentally new analysis in this PR, but all of the changes are more like localized tweaking of generated codes as you explained in the above comment. In the interest of the time, I'll prioritize the other PRs to review.

@csarofeen csarofeen changed the title Update generated code after memory index hoisting [MatMul] Update generated code after memory index hoisting Oct 19, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants