[Bug] Employing double_buffer in tensor core conv2d results in error and precision dropping #10652
cc @FrozenGene

Hi @FrozenGene, the pipeline of checks …

I found this PR #10687 related to my CI check problem. I will try again after that PR is merged.

…cannot find allocated buffer for buffer'

@FrozenGene All tests are passing. Please review.
There is a bug where double_buffer doesn't work in the tensor core conv2d template. The test code is the same as the attachment above. After lowering I found the buffer was never doubled. `DoubleBufferDetector` first records every buffer annotated with `double_buffer_scope`:

```c++
// inject_double_buffer.cc : DoubleBufferDetector
void VisitStmt_(const AttrStmtNode* op) final {
  if (op->attr_key == attr::double_buffer_scope) {
    touched_.insert(op->node.as<VarNode>());
    StmtExprVisitor::VisitStmt_(op);
  } else {
    StmtExprVisitor::VisitStmt_(op);
  }
}
```

As the tensor core conv2d template employs a tensor intrinsic, the buffer's Var is brought in as an argument of a call node (`tvm_access_ptr`). When the detector later visits that Var, it erases the buffer from `touched_` again:

```c++
// inject_double_buffer.cc : DoubleBufferDetector
void VisitExpr_(const VarNode* op) final {
  if (touched_.count(op)) {
    touched_.erase(op);
  }
}
```

As the code shows, any appearance of the Var as a plain expression cancels double buffering for that buffer, so the annotation is silently dropped. Skipping the vars inside `tvm_access_ptr` fixes the detection:

```c++
// inject_double_buffer.cc : DoubleBufferDetector
void VisitExpr_(const CallNode* op) final {
  // do not visit var in tvm_access_ptr
  if (op->op.same_as(builtin::tvm_access_ptr())) {
    return;
  }
  StmtExprVisitor::VisitExpr_(op);
}
```

Reference to the original PR: #405
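As a side note, here is a minimal sketch, using TVM's C++ API, of how a tensor intrinsic hands the shared buffer's Var to the detector inside a `tvm_access_ptr` call; the buffer name and the offset/extent values are illustrative, taken from the dump below:

```c++
#include <tvm/tir/builtin.h>
#include <tvm/tir/expr.h>
#include <tvm/tir/op.h>

using namespace tvm;
using namespace tvm::tir;

int main() {
  // The shared-memory buffer var that double_buffer was applied to.
  Var buf("im2col_reshape.shared", DataType::Handle());
  // Tensor intrinsics lower to calls such as
  //   tvm_load_matrix_sync(..., tvm_access_ptr(type_annotation, buf, offset, extent, rw_mask), ...)
  // so `buf` shows up as a call argument rather than inside a load/store.
  PrimExpr access = Call(DataType::Handle(), builtin::tvm_access_ptr(),
                         {TypeAnnotation(DataType::Int(8)), buf, 0, 1280, 1});
  // Without the CallNode override above, walking `access` reaches
  // VisitExpr_(const VarNode*) for `buf` and erases it from touched_,
  // silently disabling double buffering.
  (void)access;
  return 0;
}
```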
After fixing the two bugs above, double buffer shows up in the final CUDA code, but it causes precision dropping. The transformation is supposed to work like this:

```
for (i, 0, 100) {
  allocate B[float32 * 4]
  for (j, 0, 4) {
    B[j] = A[((i*4) + j)]
  }
  for (j, 0, 4) {
    A[j] = (B[j] + 1.000000f)
  }
}
```

Target:

```
allocate B[float32 * 2 * 4]
for (j, 0, 4) {
  B[j] = A[j]
}
for (i, 0, 99) {
  // prefetch next iteration
  for (j, 0, 4) {
    B[((((i + 1) % 2)*4) + j)] = A[(((i*4) + j) + 4)]
  }
  for (j, 0, 4) {
    A[j] = (B[(((i % 2)*4) + j)] + 1.000000f)
  }
}
for (j, 0, 4) {
  A[j] = (B[(j + 4)] + 1.000000f)
}
```

In the target code, the size of B is doubled. In the main loop, each iteration first reads the next tile into one half of B and then processes the other half: for even `i` the prefetch fills `B[4..7]` while the compute reads `B[0..3]`, and for odd `i` the halves swap. This way the computation can hide the latency of reading global memory.

The lowered tensor core code, however, looks like this:

```
for (k.outer.outer.outer: int32, 0, 2) {
  if ((k.outer.outer.outer + 1) < 3) {
    attr [im2col_reshape.shared] "double_buffer_write" = 1;
    for (ax0.ax1.outer.fused.outer.outer.outer_1: int32, 0, 4) {
      attr [IterVar(threadIdx.z, (nullptr), "ThreadIndex", "threadIdx.z")] "thread_extent" = 1;
      attr [IterVar(threadIdx.y, (nullptr), "ThreadIndex", "threadIdx.y")] "thread_extent" = 2;
      attr [IterVar(threadIdx.x, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 32;
      im2col_reshape.shared[(broadcast((floormod((k.outer.outer.outer + 1), 2)*2560), 8) + ramp(((((ax0.ax1.outer.fused.outer.outer.outer_1*640) + (threadIdx.y*320)) + (floordiv(threadIdx.x, 4)*40)) + (floormod(threadIdx.x, 4)*8)), 1, 8))] = (int8x8*)placeholder_7[ramp((((((((blockIdx.x*12288) + (ax0.outer.outer*6144)) + (ax0.ax1.outer.fused.outer.outer.outer_1*1536)) + (threadIdx.y*768)) + (floordiv(threadIdx.x, 4)*96)) + ((k.outer.outer.outer + 1)*32)) + (floormod(threadIdx.x, 4)*8)), 1, 8)]
    }
  }
  if ((k.outer.outer.outer + 1) < 3) {
    attr [placeholder_reshape.shared] "double_buffer_write" = 1;
    attr [IterVar(threadIdx.z, (nullptr), "ThreadIndex", "threadIdx.z")] "thread_extent" = 1;
    attr [IterVar(threadIdx.y, (nullptr), "ThreadIndex", "threadIdx.y")] "thread_extent" = 2;
    attr [IterVar(threadIdx.x, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 32;
    if (((threadIdx.y*8) + floordiv(threadIdx.x, 4)) < 8) {
      if (((threadIdx.y*32) + threadIdx.x) < 32) {
        if (threadIdx.y < 1) {
          placeholder_reshape.shared[(broadcast((floormod((k.outer.outer.outer + 1), 2)*320), 8) + ramp((((threadIdx.y*320) + (floordiv(threadIdx.x, 4)*40)) + (floormod(threadIdx.x, 4)*8)), 1, 8))] = (int8x8*)placeholder_8[ramp((((((threadIdx.y*768) + (blockIdx.y*768)) + (floordiv(threadIdx.x, 4)*96)) + ((k.outer.outer.outer + 1)*32)) + (floormod(threadIdx.x, 4)*8)), 1, 8)]
        }
      }
    }
  }
  for (k.outer.inner: int32, 0, 2) {
    allocate(im2col_reshape.shared.wmma.matrix_a: Pointer(wmma.matrix_a int8), int8, [64, 16]), storage_scope = wmma.matrix_a {
      for (ax0.outer: int32, 0, 2) {
        @tir.tvm_load_matrix_sync(im2col_reshape.shared.wmma.matrix_a, 32, 8, 16, ax0.outer, @tir.tvm_access_ptr(@tir.type_annotation(, dtype=int8), im2col_reshape.shared, ((ax0.outer*1280) + (k.outer.inner*16)), 1280, 1, dtype=handle), 40, "row_major", dtype=handle)
      }
      allocate(placeholder_reshape.shared.wmma.matrix_b: Pointer(wmma.matrix_b int8), int8, [8, 16]), storage_scope = wmma.matrix_b {
        @tir.tvm_load_matrix_sync(placeholder_reshape.shared.wmma.matrix_b, 32, 8, 16, 0, @tir.tvm_access_ptr(@tir.type_annotation(, dtype=int8), placeholder_reshape.shared, (k.outer.inner*16), 320, 1, dtype=handle), 40, "col_major", dtype=handle)
        for (i.c.outer: int32, 0, 2) {
          @tir.tvm_mma_sync(implicit_gemm_conv.wmma.accumulator, i.c.outer, im2col_reshape.shared.wmma.matrix_a, i.c.outer, placeholder_reshape.shared.wmma.matrix_b, 0, implicit_gemm_conv.wmma.accumulator, i.c.outer, dtype=handle)
        }
      }
    }
  }
}
```

In the first iteration of the loop the read still hits the right data, but the reads never switch halves: the `double_buffer_write` stores index the half selected by `floormod((k.outer.outer.outer + 1), 2)`, while the `tvm_access_ptr` offsets passed to `@tir.tvm_load_matrix_sync` (`(ax0.outer*1280) + (k.outer.inner*16)` and `(k.outer.inner*16)`) contain no such switch term, so every iteration reads the first half of the doubled buffer. That is where the precision drop comes from.
As I guessed in the previous comment, the author didn't expect the double-buffered var to appear as a parameter of a call node. So the solution is to also process the double buffer inside call nodes:

```c++
// inject_double_buffer.cc : DoubleBufferInjector
PrimExpr VisitExpr_(const CallNode* op) final {
  if (op->op.same_as(builtin::tvm_access_ptr())) {
    const VarNode* buf = op->args[1].as<VarNode>();
    auto it = dbuffer_info_.find(buf);
    if (it != dbuffer_info_.end()) {
      const StorageEntry& e = it->second;
      ICHECK(e.stride.defined());
      ICHECK(e.switch_read_var.defined());
      Array<PrimExpr> args;
      // dtype
      args.push_back(op->args[0]);
      // data
      args.push_back(op->args[1]);
      // offset: add the half-switch term
      args.push_back(e.switch_read_var * e.stride + op->args[2]);
      // extent
      args.push_back(op->args[3]);
      // rw_mask
      args.push_back(op->args[4]);
      return Call(op->dtype, op->op, args);
    } else {
      return GetRef<PrimExpr>(op);
    }
  } else {
    return StmtExprMutator::VisitExpr_(op);
  }
}
```
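With this rewrite, the offset argument of `tvm_access_ptr` becomes `switch_read_var * stride + original_offset`, matching the switched indices the injector already generates for plain loads and stores. In the dump above, the matrix_a load offset would then become `(floormod(k.outer.outer.outer, 2)*2560) + (ax0.outer*1280) + (k.outer.inner*16)`, so the reads alternate between the two halves just like the writes.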
PR #10066 has a more elegant way of bringing the double buffer into the final generated code. This PR will be closed in a few days if no one is interested in it.
When employing double_buffer in the tensor core conv2d template, it results in an assert: 'Cannot find allocated buffer for buffer …'.

Attached is my test script.
test_double_buffer.py.txt
As described above, buffer `im2col_reshape.shared` will be doubled. There are several passes involved in lowering double_buffer; one of them is storage_flatten. The lowered TIR fed to the storage_flatten pass contains two kinds of attribute statements that reference buffer `im2col_reshape.shared`. Notice that at the beginning of this pass the pointers to buffer `im2col_reshape.shared` are the same.

There is a buffer-stride step in the storage_flatten pass that processes attribute statements. In the branch `op->attr_key == attr::buffer_bind_scope`, buffer `im2col_reshape.shared` is passed to `WithStrides` and modified in it. The attribute then becomes `[buffer(buffer, 0x55e3524f44d0), buffer(im2col_reshape.shared, 0x55e352500860)], op->attr_key: buffer_bind_scope`. Notice that the pointer to `im2col_reshape.shared` has changed. Because there is no branch for processing the `double_buffer_scope` attribute, the pointer it holds to buffer `im2col_reshape.shared` no longer matches, which triggers the assert above. Adding the branch sketched below solves this issue.
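A minimal sketch of such a branch, assuming the `WithStrides` helper and the visitor structure that storage_flatten.cc's stride legalization already uses, and assuming the attribute node holds a `Buffer`:

```c++
// storage_flatten.cc : stride-legalization visitor (sketch, not the actual patch)
Stmt VisitStmt_(const AttrStmtNode* op) final {
  // ... existing branches, including attr::buffer_bind_scope, go here ...
  if (op->attr_key == attr::double_buffer_scope &&
      op->node->IsInstance<BufferNode>()) {
    // Remap the double-buffered buffer through the same WithStrides()
    // helper used by the buffer_bind_scope branch, so both attribute
    // statements keep pointing at the same Buffer object.
    Buffer buffer = Downcast<Buffer>(op->node);
    Stmt body = this->VisitStmt(op->body);
    return AttrStmt(WithStrides(buffer), op->attr_key, op->value, body);
  }
  return StmtExprMutator::VisitStmt_(op);
}
```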