
[PASS] InjectDoubleBuffer #405

Merged
merged 1 commit into apache:master on Sep 1, 2017

Conversation

@tqchen (Member) commented Sep 1, 2017

This enables double-buffered prefetching, which can be useful for shared-memory prefetching. One advantage of double buffering is that the logic explicitly prefetches the next stage's input into the shared memory buffer.

Source

for (i, 0, 100) {
  allocate B[float32 * 4]
  for (j, 0, 4) {
    B[j] = A[((i*4) + j)]
  }
  for (j, 0, 4) {
    A[j] = (B[j] + 1.000000f)
  }
}

Target

allocate B[float32 * 2 * 4]
for (j, 0, 4) {
  B[j] = A[j]
}
for (i, 0, 99) {
  // prefetch next iteration
  for (j, 0, 4) {
    B[((((i + 1) % 2)*4) + j)] = A[(((i*4) + j) + 4)]
  }
  for (j, 0, 4) {
    A[j] = (B[(((i % 2)*4) + j)] + 1.000000f)
  }
}
for (j, 0, 4) {
  A[j] = (B[(j + 4)] + 1.000000f)
}

Note

Usually when GPU fetches memory, there is a big latency before the data arrives. There are two ways to hide this cost:

  • Context switch to another GPU thread on the same block; this requires us to launch many GPU threads, limiting the resources (registers) available to each block
  • Do double buffering, to prefetch the data needed in the next iteration.

There is a tradeoff here. Bigger tiles mean more resources (registers) and more reuse, but make it harder to hide loading cost (because we launch fewer threads). Smaller tiles mean more threads and make it easier to hide loading cost, but give less reuse.

Enabling double buffering allows us to get bigger tiles and more reuse, with less reliance on context switching.

So directly enabling it may not speed things up (because the old schedule is tuned to contain enough threads to hide the latency). We might need to enable it and also increase the tile size to get a schedule that has more reuse and still hides the loading cost.
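The tradeoff can be illustrated with back-of-the-envelope arithmetic. The register-file size below is typical of recent NVIDIA SMs, but the per-thread register counts are hypothetical and only show the direction of the tradeoff, not numbers from any actual schedule:

```python
# Hypothetical occupancy arithmetic: a fixed register file is divided
# among resident threads, so bigger tiles (more registers per thread)
# leave fewer threads available for latency-hiding context switches.
REGISTERS_PER_SM = 65536  # typical of many NVIDIA SMs; adjust per GPU

def max_threads(regs_per_thread):
    # Upper bound on resident threads imposed by register usage alone
    # (real GPUs have additional limits: shared memory, thread slots, ...).
    return REGISTERS_PER_SM // regs_per_thread

small_tile = max_threads(32)   # 2048 threads: easy latency hiding, less reuse
big_tile = max_threads(128)    # 512 threads: more reuse, harder latency hiding
assert small_tile == 2048 and big_tile == 512
```

Double buffering effectively lets the `big_tile` configuration hide load latency without needing the extra threads of the `small_tile` one.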

@tqchen (Member, Author) commented Sep 1, 2017

@Huyuwei @wetliu @Laurawly @icemelon9

@tqchen tqchen merged commit a45d3b0 into apache:master Sep 1, 2017
@domin1985 (Contributor) commented Dec 11, 2019

Hi @tqchen, may I ask a question?
Why should we erase the entry for a Variable op here?

void Visit_(const Variable* op) final {
  if (touched_.count(op)) {
    touched_.erase(op);
  }
}

@cee1 commented Jan 21, 2022

Hi @tqchen , May I ask a question? Why should we erase for Variable op?

void Visit_(const Variable* op) final { if (touched_.count(op)) { touched_.erase(op); } }

We've experienced a problem due to this "touched_.erase(...)"

Background: trying to combine double buffering with the CUDA WMMA intrinsics, the TIR looks like

for (k.outer.outer.outer: int32, 0, 2) {
  attr [im2col_reshape.shared] "double_buffer_write" = 1;
  for (...) {
    xxx_shared[...] = place_holder[...]  // load next part
  }
}

...
for (...) {
    @tir.tvm_load_matrix_sync(..., @tir.tvm_access_ptr(@tir.type_annotation(, dtype=int8),  xxx_shared, ...), ...)
}

Here, the CallNode @tir.tvm_access_ptr references VarNode xxx_shared as its parameter.

Then it will be removed from the result of DoubleBufferDetector via touched_.erase(op).
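The behavior being described can be sketched in Python (a hypothetical simulation, not the actual TVM C++ visitor): a buffer annotated for double buffering stays eligible only while every reference to it goes through a regular Load/Store; a bare Variable reference, such as one passed to an opaque call like tir.tvm_access_ptr, erases it from the touched set.

```python
def detect_double_buffer(annotated, accesses):
    """Simulate DoubleBufferDetector's conservative behavior.

    annotated: set of buffer names carrying a double_buffer annotation.
    accesses:  (kind, buffer) pairs, kind in {"load", "store", "opaque"}.
    Returns the buffers the pass would still rewrite.
    """
    touched = set(annotated)
    for kind, buf in accesses:
        if kind == "opaque" and buf in touched:
            # A bare Variable reference (e.g. inside tvm_access_ptr) cannot
            # be rewritten by the pass, so it gives up on this buffer.
            touched.discard(buf)
    return touched

# Regular Load/Store only: the buffer stays eligible for double buffering.
assert detect_double_buffer({"B"}, [("store", "B"), ("load", "B")]) == {"B"}
# An opaque use (as in the WMMA case above) silently disables the rewrite.
assert detect_double_buffer({"B"}, [("store", "B"), ("opaque", "B")]) == set()
```

This matches the symptom reported here: the annotation is not rejected, it is just silently dropped once an opaque access is seen.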

@tqchen (Member, Author) commented Jan 21, 2022

cc @vinx13 to see if you have some comments

@vinx13 (Member) commented Jan 22, 2022

@domin1985 @cee1 This pass only works on regular buffer accesses (e.g. Load / Store), so in your case the double-buffer annotation will be ignored. It is possible to support these use cases: we would need to detect such opaque usage and do a specific rewrite. Meanwhile, I'm working on a software pipelining pass that does a similar rewrite (https://github.com/vinx13/tvm/blob/feat/software_pipeline/src/tir/transforms/inject_software_pipeline.cc#L163). I'll send a PR to upstream it next week
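What a software-pipelining rewrite does can be sketched in plain Python (a hypothetical illustration, not the actual inject_software_pipeline.cc logic): with two stages, a "load" stage and a "compute" stage, the pipelined version shifts the compute one iteration behind the load so the two can overlap, with a prologue and an epilogue handling the boundary iterations.

```python
def naive(xs):
    # Each iteration loads, then computes, fully serialized.
    out = []
    for x in xs:
        buf = x * 2          # stage 0: "load"
        out.append(buf + 1)  # stage 1: "compute"
    return out

def pipelined(xs):
    # Stage 1 for iteration i-1 runs alongside stage 0 for iteration i,
    # so on real hardware the load latency can be hidden by the compute.
    out = []
    buf = xs[0] * 2                  # prologue: stage 0 for iteration 0
    for i in range(1, len(xs)):
        nxt = xs[i] * 2              # stage 0 for iteration i
        out.append(buf + 1)          # stage 1 for iteration i - 1
        buf = nxt
    out.append(buf + 1)              # epilogue: stage 1 for the last iteration
    return out

data = list(range(10))
assert naive(data) == pipelined(data)
```

Double buffering is the special case with two stages and a two-slot buffer; a general pipelining pass can handle more stages and deeper buffers.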

@cee1 commented Mar 29, 2022

@domin1985 @cee1 This pass only works on regular buffer accesses (e.g. Load / Store), so in your case the double-buffer annotation will be ignored. It is possible to support these use cases: we would need to detect such opaque usage and do a specific rewrite. Meanwhile, I'm working on a software pipelining pass that does a similar rewrite (https://github.com/vinx13/tvm/blob/feat/software_pipeline/src/tir/transforms/inject_software_pipeline.cc#L163). I'll send a PR to upstream it next week

Hi @vinx13, is there any way to trigger this pass? (I haven't found any "te" way of adding the annotation "software_pipeline_stage" or "software_pipeline_order" ...)

@vinx13 (Member) commented Mar 31, 2022

@cee1 It is only supported in the TIR schedule, because block information is needed for the analysis. In TIR, there is a schedule primitive sch.annotate that can be used to add such annotations
