
[PASS] InjectDoubleBuffer #405

Merged
merged 1 commit into apache:master on Sep 1, 2017

Conversation

@tqchen (Member) commented Sep 1, 2017

This enables double-buffered prefetching, which can be useful for shared-memory prefetching. One advantage of double buffering is that the logic explicitly prefetches the next stage's input into the shared memory buffer.

Source

for (i, 0, 100) {
  allocate B[float32 * 4]
  for (j, 0, 4) {
    B[j] = A[((i*4) + j)]
  }
  for (j, 0, 4) {
    A[j] = (B[j] + 1.000000f)
  }
}

Target

allocate B[float32 * 2 * 4]
for (j, 0, 4) {
  B[j] = A[j]
}
for (i, 0, 99) {
  // prefetch next iteration
  for (j, 0, 4) {
    B[((((i + 1) % 2)*4) + j)] = A[(((i*4) + j) + 4)]
  }
  for (j, 0, 4) {
    A[j] = (B[(((i % 2)*4) + j)] + 1.000000f)
  }
}
for (j, 0, 4) {
  A[j] = (B[(j + 4)] + 1.000000f)
}

Note

Usually when GPU fetches memory, there is a big latency before the data arrives. There are two ways to hide this cost:

  • Context switch to another GPU thread on the same block; this requires us to launch many GPU threads, limiting the resources (registers) available to each block
  • Do double buffering, to prefetch the data needed in the next iteration.

There is a tradeoff here. Bigger tiles mean more resources (registers) and more reuse, but make it harder to hide loading cost (because we launch fewer threads). Smaller tiles mean more threads and make it easier to hide loading cost, but give less reuse.

Enabling double buffering allows us to get bigger tiles and more reuse, with less reliance on context switching.

So directly enabling it may not speed things up (because the old schedule is tuned to contain enough threads to hide the latency). We might need to enable it and also increase the tile size to get a schedule that has more reuse and still hides the loading cost.
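The tradeoff can be illustrated with back-of-the-envelope arithmetic. The register-file size below is typical of recent NVIDIA SMs, but the per-thread register counts are hypothetical and only show the direction of the tradeoff, not numbers from any actual schedule:

```python
# Hypothetical occupancy arithmetic: a fixed register file is divided
# among resident threads, so bigger tiles (more registers per thread)
# leave fewer threads available for latency-hiding context switches.
REGISTERS_PER_SM = 65536  # typical of many NVIDIA SMs; adjust per GPU

def max_threads(regs_per_thread):
    # Upper bound on resident threads imposed by register usage alone
    # (real GPUs have additional limits: shared memory, thread slots, ...).
    return REGISTERS_PER_SM // regs_per_thread

small_tile = max_threads(32)   # 2048 threads: easy latency hiding, less reuse
big_tile = max_threads(128)    # 512 threads: more reuse, harder latency hiding
assert small_tile == 2048 and big_tile == 512
```

Double buffering effectively lets the `big_tile` configuration hide load latency without needing the extra threads of the `small_tile` one.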

@tqchen (Member, Author) commented Sep 1, 2017

@Huyuwei @wetliu @Laurawly @icemelon9

@tqchen tqchen merged commit a45d3b0 into apache:master Sep 1, 2017
@domin1985 (Contributor) commented Dec 11, 2019

Hi @tqchen, may I ask a question?
Why should we erase the entry for a Variable op here?

void Visit_(const Variable* op) final {
  if (touched_.count(op)) {
    touched_.erase(op);
  }
}

@cee1 commented Jan 21, 2022

Hi @tqchen , May I ask a question? Why should we erase for Variable op?

void Visit_(const Variable* op) final { if (touched_.count(op)) { touched_.erase(op); } }

We've experienced a problem due to this "touched_.erase(...)"

Background: trying to combine double buffering with the CUDA WMMA intrinsics, the TIR looks like

for (k.outer.outer.outer: int32, 0, 2) {
  attr [im2col_reshape.shared] "double_buffer_write" = 1;
  for (...) {
    xxx_shared[...] = place_holder[...]  // load next part
  }
}

...
for (...) {
    @tir.tvm_load_matrix_sync(..., @tir.tvm_access_ptr(@tir.type_annotation(, dtype=int8),  xxx_shared, ...), ...)
}

Here, the CallNode @tir.tvm_access_ptr references VarNode xxx_shared as its parameter.

Then it will be removed from the result of DoubleBufferDetector via touched_.erase(op).
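The behavior being described can be sketched in Python (a hypothetical simulation, not the actual TVM C++ visitor): a buffer annotated for double buffering stays eligible only while every reference to it goes through a regular Load/Store; a bare Variable reference, such as one passed to an opaque call like tir.tvm_access_ptr, erases it from the touched set.

```python
def detect_double_buffer(annotated, accesses):
    """Simulate DoubleBufferDetector's conservative behavior.

    annotated: set of buffer names carrying a double_buffer annotation.
    accesses:  (kind, buffer) pairs, kind in {"load", "store", "opaque"}.
    Returns the buffers the pass would still rewrite.
    """
    touched = set(annotated)
    for kind, buf in accesses:
        if kind == "opaque" and buf in touched:
            # A bare Variable reference (e.g. inside tvm_access_ptr) cannot
            # be rewritten by the pass, so it gives up on this buffer.
            touched.discard(buf)
    return touched

# Regular Load/Store only: the buffer stays eligible for double buffering.
assert detect_double_buffer({"B"}, [("store", "B"), ("load", "B")]) == {"B"}
# An opaque use (as in the WMMA case above) silently disables the rewrite.
assert detect_double_buffer({"B"}, [("store", "B"), ("opaque", "B")]) == set()
```

This matches the symptom reported here: the annotation is not rejected, it is just silently dropped once an opaque access is seen.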

@tqchen (Member, Author) commented Jan 21, 2022

cc @vinx13 to see if you have some comments

@vinx13 (Member) commented Jan 22, 2022

@domin1985 @cee1 This pass only works on regular buffer accesses (e.g. Load / Store), so in your case the double-buffer annotation will be ignored. It is possible to support these use cases: we would need to detect such opaque usage and do a specific rewrite. Meanwhile, I'm working on a software pipelining pass that does a similar rewrite (https://github.com/vinx13/tvm/blob/feat/software_pipeline/src/tir/transforms/inject_software_pipeline.cc#L163). I'll send a PR to upstream it next week
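What a software-pipelining rewrite does can be sketched in plain Python (a hypothetical illustration, not the actual inject_software_pipeline.cc logic): with two stages, a "load" stage and a "compute" stage, the pipelined version shifts the compute one iteration behind the load so the two can overlap, with a prologue and an epilogue handling the boundary iterations.

```python
def naive(xs):
    # Each iteration loads, then computes, fully serialized.
    out = []
    for x in xs:
        buf = x * 2          # stage 0: "load"
        out.append(buf + 1)  # stage 1: "compute"
    return out

def pipelined(xs):
    # Stage 1 for iteration i-1 runs alongside stage 0 for iteration i,
    # so on real hardware the load latency can be hidden by the compute.
    out = []
    buf = xs[0] * 2                  # prologue: stage 0 for iteration 0
    for i in range(1, len(xs)):
        nxt = xs[i] * 2              # stage 0 for iteration i
        out.append(buf + 1)          # stage 1 for iteration i - 1
        buf = nxt
    out.append(buf + 1)              # epilogue: stage 1 for the last iteration
    return out

data = list(range(10))
assert naive(data) == pipelined(data)
```

Double buffering is the special case with two stages and a two-slot buffer; a general pipelining pass can handle more stages and deeper buffers.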

@cee1 commented Mar 29, 2022

@domin1985 @cee1 This pass only works on regular buffer accesses (e.g. Load / Store), so in your case the double-buffer annotation will be ignored. It is possible to support these use cases: we would need to detect such opaque usage and do a specific rewrite. Meanwhile, I'm working on a software pipelining pass that does a similar rewrite (https://github.com/vinx13/tvm/blob/feat/software_pipeline/src/tir/transforms/inject_software_pipeline.cc#L163). I'll send a PR to upstream it next week

Hi @vinx13, is there any way to trigger this pass? (I haven't found any "te" way of adding the annotation "software_pipeline_stage" or "software_pipeline_order" ...)

@vinx13 (Member) commented Mar 31, 2022

@cee1 It is only supported in the TIR schedule, because block information is needed for the analysis. In TIR, there is a schedule primitive sch.annotate that can be used to add such annotations
