
[Bug] Employing double_buffer in tensor core conv2d results in error and precision dropping #10652

Closed

Conversation

DzAvril commented Mar 17, 2022

When employing double_buffer in the tensor core conv2d template, for example:

# conv2d_implicit_gemm_tensorcore.py
def schedule_implicit_gemm_tensorcore(cfg, sch, Conv):
    ...
    # double buffer
    sch[AS].double_buffer()
    ...

it results in the following assertion failure:

Check failed: (it != buf_map_.end()) is false: Cannot find allocated buffer for buffer(im2col_reshape.shared, 0x5619189281f0)

Attached is my test script.
test_double_buffer.py.txt
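
For a smaller, self-contained illustration of the primitive itself, the sketch below applies double_buffer to a shared-memory cache stage of a plain elementwise compute. It is only an assumed stand-in for the real template: the tensors, split factors, and stage names (A, C, AA) are made up for illustration and do not come from the attached script.

# Minimal sketch (assumed example, not the attached test script): double_buffer
# on a shared-memory cache stage with cooperative fetching.
import tvm
from tvm import te

n = 1024
A = te.placeholder((n,), name="A", dtype="float32")
C = te.compute((n,), lambda i: A[i] + 1.0, name="C")

s = te.create_schedule(C.op)
AA = s.cache_read(A, "shared", [C])          # stage backed by shared memory

xo, xi = s[C].split(C.op.axis[0], factor=64)
xoo, xoi = s[C].split(xo, factor=4)          # xoi is the serial loop to pipeline over
tx = te.thread_axis("threadIdx.x")
s[C].bind(xoo, te.thread_axis("blockIdx.x"))
s[C].bind(xi, tx)

s[AA].compute_at(s[C], xoi)
_, aa_tx = s[AA].split(s[AA].op.axis[0], factor=64)
s[AA].bind(aa_tx, tx)                        # cooperative fetch into shared memory
s[AA].double_buffer()                        # same primitive as sch[AS].double_buffer()

print(tvm.lower(s, [A, C], simple_mode=True))

Depending on the TVM version, the printed TIR should show the shared buffer allocated at twice its size with a prologue before the xoi loop.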

As described above, buffer im2col_reshape.shared is to be double-buffered. Several passes are involved in lowering double_buffer; one of them is StorageFlatten. The lowered TIR fed into StorageFlatten carries two kinds of attribute statements that reference buffer im2col_reshape.shared:

  • buffer(im2col_reshape.shared, 0x55e3524f0a40), op->attr_key: double_buffer_scope
  • [buffer(buffer, 0x55e3524f44d0), buffer(im2col_reshape.shared, 0x55e3524f0a40)], op->attr_key: buffer_bind_scope

Note that at the beginning of this pass the pointers to buffer im2col_reshape.shared are the same in both attributes. StorageFlatten contains a sub-mutator, BufferStrideLegalize, which handles attribute statements as follows:

// storage_flatten.cc:BufferStrideLegalize
Stmt VisitStmt_(const AttrStmtNode* op) final {
    if (op->attr_key == attr::buffer_dim_align) {
      auto buffer = Downcast<tir::Buffer>(op->node);
      const CallNode* tuple = op->value.as<CallNode>();
      ICHECK(tuple && tuple->op.same_as(builtin::tvm_tuple()));
      auto& vinfo = dim_align_[buffer];
      int dim = tuple->args[0].as<IntImmNode>()->value;
      if (static_cast<size_t>(dim) >= vinfo.size()) {
        vinfo.resize(dim + 1);
      }
      vinfo[dim].align_factor = tuple->args[1].as<IntImmNode>()->value;
      vinfo[dim].align_offset = tuple->args[2].as<IntImmNode>()->value;
      return this->VisitStmt(op->body);
    } else if (op->attr_key == attr::buffer_bind_scope) {
      Array<ObjectRef> arr = Downcast<Array<ObjectRef>>(op->node);
      ICHECK_EQ(arr.size(), 2U);
      Buffer source = Downcast<Buffer>(arr[0]);
      Buffer target_with_strides = WithStrides(Downcast<Buffer>(arr[1]));
      Buffer source_with_strides = WithStrides(source);

      {
        BufferEntry entry;
        entry.remap_to = source_with_strides;
        entry.in_scope = true;
        entry.is_external = false;
        buf_map_[source] = entry;
      }
      Stmt body = this->VisitStmt(op->body);
      return AttrStmt(Array<ObjectRef>{source_with_strides, target_with_strides}, op->attr_key,
                      op->value, body, op->span);
    } else {
      return StmtExprMutator::VisitStmt_(op);
    }
  }

In the op->attr_key == attr::buffer_bind_scope branch, buffer im2col_reshape.shared is passed to WithStrides and rewritten there, after which the buffer_bind_scope attribute becomes [buffer(buffer, 0x55e3524f44d0), buffer(im2col_reshape.shared, 0x55e352500860)]. Note that the pointer to im2col_reshape.shared has changed. Since there is no branch that handles the double_buffer_scope attribute, the buffer recorded for that attribute still points to the old object, and the pointer mismatch triggers the assertion above.
Adding the branch below solves this issue:

    else if (op->attr_key == attr::double_buffer_scope) {
      auto buffer = Downcast<tir::Buffer>(op->node);
      Buffer buffer_with_strides = WithStrides(buffer);
      Stmt body = this->VisitStmt(op->body);
      return AttrStmt(buffer_with_strides, op->attr_key, op->value, body, op->span);
    } 

DzAvril commented Mar 17, 2022

cc @FrozenGene

DzAvril commented Mar 20, 2022

Hi @FrozenGene, the tvm-ci/pr-merge check reports "These pytest invocations failed, the results can be found in the Jenkins 'Tests' tab or by scrolling up through the raw logs here." But the 'Tests' tab says "All tests are passing. Nice one! This run fixed 6 tests and now all 31661 tests for this pipeline are passing." I have no idea how to locate the failed test; can you give me a hint?

DzAvril commented Mar 20, 2022

Regarding the CI problem described in my previous comment: I found PR #10687, which seems related. I will try again after that PR is merged.

DzAvril commented Mar 21, 2022

@FrozenGene All tests are passing. Please review.

DzAvril changed the title from "[Bug] Employ double_buffer in tensorcore conv2d results assert error" to "[Bug] Employing double_buffer in tensor core conv2d results in error and precision dropping" on Mar 23, 2022
DzAvril commented Mar 23, 2022

There is another bug: double_buffer does not take effect in the tensor core conv2d template. The test code is the same as the attachment above. After lowering, I found that buffers AS and BS were not doubled.
During lowering, the pass first has to detect double-buffer variables, i.e. those carrying attr_key double_buffer_scope, and add them to the unordered_set touched_; in this case the double-buffer variable is named T_reshape.shared.

// inject_double_buffer.cc:DoubleBufferDetector
void VisitStmt_(const AttrStmtNode* op) final {
  if (op->attr_key == attr::double_buffer_scope) {
    touched_.insert(op->node.as<VarNode>());
    StmtExprVisitor::VisitStmt_(op);
  } else {
    StmtExprVisitor::VisitStmt_(op);
  }
}

Since the tensor core conv2d template employs a tensor intrinsic, the lowered TIR contains a call node tir.tvm_access_ptr whose arguments include T_reshape.shared. When a call node is visited by StmtExprVisitor, its arguments are visited as well, so execution reaches this function:

// inject_double_buffer.cc:DoubleBufferDetector
void VisitExpr_(const VarNode* op) final {
    if (touched_.count(op)) {
        touched_.erase(op);
    }
}

As the code shows, T_reshape.shared is erased from touched_, so double_buffer has no effect in the end.
Why erase a variable that appears as an argument of a call node? My guess is that the author only expected double-buffered variables in load and store nodes, so a double-buffered variable inside a call node was not anticipated.
The solution is simple and specific to the tensor core conv2d template, which employs tensor intrinsics: when visiting a call node whose op is tvm_access_ptr, skip visiting its arguments.

void VisitExpr_(const CallNode* op) final {
    // do not visit var in tvm_access_ptr
    if (op->op.same_as(builtin::tvm_access_ptr())) {
        return;
    }
    StmtExprVisitor::VisitExpr_(op);
}

Reference to the original PR: #405

DzAvril commented Mar 23, 2022

After fixing the two bugs above, the double buffer shows up in the final CUDA code, but it causes a precision drop.
Quoting the description in [PASS] InjectDoubleBuffer #405: double buffering transforms the source code into the target code shown below.
Source

for (i, 0, 100) {
  allocate B[float32 * 4]
  for (i, 0, 4) {
    B[i] = A[((i*4) + i)]
  }
  for (i, 0, 4) {
    A[i] = (B[i] + 1.000000f)
  }
}

Target

allocate B[float32 * 2 * 4]
for (i, 0, 4) {
  B[i] = A[i]
}
for (i, 0, 99) { 
  // prefetch next iteration
  for (i, 0, 4) {
    B[((((i + 1) % 2)*4) + i)] = A[(((i*4) + i) + 4)]
  }
  for (i, 0, 4) {
    A[i] = (B[(((i % 2)*4) + i)] + 1.000000f)
  }
}
for (i, 0, 4) {
  A[i] = (B[(i + 4)] + 1.000000f)
}

In the target code, the size of B is doubled. In each iteration of the main loop, data for the next iteration is prefetched into one half of B while the half filled in the previous iteration is processed, so computation can hide the latency of reading global memory.
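
To make the ping-pong indexing concrete, here is a small plain-Python mimic of the target pattern above (my own simplified sketch, not TVM output; it uses a separate output array and distinct loop variables for readability):

# Simplified mimic of the double-buffered target code above (not generated TIR).
N, TILE = 100, 4
A = [float(x) for x in range(N * TILE)]
B = [0.0] * (2 * TILE)                 # doubled buffer: two halves of size TILE
out = [0.0] * TILE

# prologue: fill half 0 with the first tile
for i in range(TILE):
    B[i] = A[i]

for it in range(N - 1):
    # prefetch the next tile into the other half: index ((it + 1) % 2) * TILE
    for i in range(TILE):
        B[((it + 1) % 2) * TILE + i] = A[(it + 1) * TILE + i]
    # consume the half filled in the previous step: index (it % 2) * TILE
    for i in range(TILE):
        out[i] = B[(it % 2) * TILE + i] + 1.0

# epilogue: consume the last prefetched tile
for i in range(TILE):
    out[i] = B[((N - 1) % 2) * TILE + i] + 1.0

The essential point is that the write index uses (it + 1) % 2 while the read index uses it % 2, so reads and writes never touch the same half in the same iteration.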
As described in the previous comment, double buffering in the tensor core conv2d template involves a call node tir.tvm_access_ptr; this intrinsic reads data from the doubled buffer im2col_reshape.shared and passes it on to another intrinsic for processing. Part of the lowered TIR is shown below (note: this block is not generated by the test script attached above; it is only shown for easier explanation):

for (k.outer.outer.outer: int32, 0, 2) {
    if ((k.outer.outer.outer + 1) < 3) {
    attr [im2col_reshape.shared] "double_buffer_write" = 1;
    for (ax0.ax1.outer.fused.outer.outer.outer_1: int32, 0, 4) {
        attr [IterVar(threadIdx.z, (nullptr), "ThreadIndex", "threadIdx.z")] "thread_extent" = 1;
        attr [IterVar(threadIdx.y, (nullptr), "ThreadIndex", "threadIdx.y")] "thread_extent" = 2;
        attr [IterVar(threadIdx.x, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 32;
        im2col_reshape.shared[(broadcast((floormod((k.outer.outer.outer + 1), 2)*2560), 8) + ramp(((((ax0.ax1.outer.fused.outer.outer.outer_1*640) + (threadIdx.y*320)) + (floordiv(threadIdx.x, 4)*40)) + (floormod(threadIdx.x, 4)*8)), 1, 8))] = (int8x8*)placeholder_7[ramp((((((((blockIdx.x*12288) + (ax0.outer.outer*6144)) + (ax0.ax1.outer.fused.outer.outer.outer_1*1536)) + (threadIdx.y*768)) + (floordiv(threadIdx.x, 4)*96)) + ((k.outer.outer.outer + 1)*32)) + (floormod(threadIdx.x, 4)*8)), 1, 8)]
    }
    }
    if ((k.outer.outer.outer + 1) < 3) {
    attr [placeholder_reshape.shared] "double_buffer_write" = 1;
    attr [IterVar(threadIdx.z, (nullptr), "ThreadIndex", "threadIdx.z")] "thread_extent" = 1;
    attr [IterVar(threadIdx.y, (nullptr), "ThreadIndex", "threadIdx.y")] "thread_extent" = 2;
    attr [IterVar(threadIdx.x, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 32;
    if (((threadIdx.y*8) + floordiv(threadIdx.x, 4)) < 8) {
        if (((threadIdx.y*32) + threadIdx.x) < 32) {
        if (threadIdx.y < 1) {
            placeholder_reshape.shared[(broadcast((floormod((k.outer.outer.outer + 1), 2)*320), 8) + ramp((((threadIdx.y*320) + (floordiv(threadIdx.x, 4)*40)) + (floormod(threadIdx.x, 4)*8)), 1, 8))] = (int8x8*)placeholder_8[ramp((((((threadIdx.y*768) + (blockIdx.y*768)) + (floordiv(threadIdx.x, 4)*96)) + ((k.outer.outer.outer + 1)*32)) + (floormod(threadIdx.x, 4)*8)), 1, 8)]
        }
        }
    }
    }
    for (k.outer.inner: int32, 0, 2) {
    allocate(im2col_reshape.shared.wmma.matrix_a: Pointer(wmma.matrix_a int8), int8, [64, 16]), storage_scope = wmma.matrix_a {
        for (ax0.outer: int32, 0, 2) {
        @tir.tvm_load_matrix_sync(im2col_reshape.shared.wmma.matrix_a, 32, 8, 16, ax0.outer, @tir.tvm_access_ptr(@tir.type_annotation(, dtype=int8), im2col_reshape.shared, ((ax0.outer*1280) + (k.outer.inner*16)), 1280, 1, dtype=handle), 40, "row_major", dtype=handle)
        }
        allocate(placeholder_reshape.shared.wmma.matrix_b: Pointer(wmma.matrix_b int8), int8, [8, 16]), storage_scope = wmma.matrix_b {
        @tir.tvm_load_matrix_sync(placeholder_reshape.shared.wmma.matrix_b, 32, 8, 16, 0, @tir.tvm_access_ptr(@tir.type_annotation(, dtype=int8), placeholder_reshape.shared, (k.outer.inner*16), 320, 1, dtype=handle), 40, "col_major", dtype=handle)
        for (i.c.outer: int32, 0, 2) {
            @tir.tvm_mma_sync(implicit_gemm_conv.wmma.accumulator, i.c.outer, im2col_reshape.shared.wmma.matrix_a, i.c.outer, placeholder_reshape.shared.wmma.matrix_b, 0, implicit_gemm_conv.wmma.accumulator, i.c.outer, dtype=handle)
        }
        }
    }
    }
}

In the first iteration of the loop for (k.outer.outer.outer: int32, 0, 2), data is loaded from global memory into the second half of im2col_reshape.shared while the first half is processed. In the second iteration, data is loaded into the first half of im2col_reshape.shared, yet it is still the first half that is processed, because the offset passed to tir.tvm_access_ptr never switches halves.

My guess is that the author only expected double-buffered variables in load and store nodes, so a double-buffered variable inside a call node was not anticipated.

As I guessed in the previous comment, the author did not expect a double-buffered variable to appear as an argument of a call node. So the solution is to handle double buffering in call nodes as well:

// inject_double_buffer.cc:DoubleBufferInjector
  PrimExpr VisitExpr_(const CallNode* op) final {
    if (op->op.same_as(builtin::tvm_access_ptr())) {
      const VarNode* buf = op->args[1].as<VarNode>();
      auto it = dbuffer_info_.find(buf);
      if (it != dbuffer_info_.end()) {
        const StorageEntry& e = it->second;
        ICHECK(e.stride.defined());
        ICHECK(e.switch_read_var.defined());
        Array<PrimExpr> args;
        // dtype
        args.push_back(op->args[0]);
        // data
        args.push_back(op->args[1]);
        // offset
        args.push_back(e.switch_read_var * e.stride + op->args[2]);
        // extent
        args.push_back(op->args[3]);
        // rw_mask
        args.push_back(op->args[4]);
        return Call(op->dtype, op->op, args);
      } else {
        return GetRef<PrimExpr>(op);
      }
    } else {
      return StmtExprMutator::VisitExpr_(op);
    }
  }
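
With this change, the offset handed to tir.tvm_access_ptr alternates between the two halves. As a rough arithmetic sketch (my own illustration, assuming e.stride is the 2560-element half size visible in the lowered TIR above and that e.switch_read_var lowers to k.outer.outer.outer % 2), the rewritten read offset for the matrix_a load behaves like this:

# Assumed arithmetic sketch (not generated TIR): the fixed read offset
# e.switch_read_var * e.stride + op->args[2] selects alternating halves.
STRIDE = 2560  # elements per half of im2col_reshape.shared, from the TIR above

def read_offset(k_outer_outer_outer, ax0_outer, k_outer_inner):
    switch_read = k_outer_outer_outer % 2            # assumed form of e.switch_read_var
    base = ax0_outer * 1280 + k_outer_inner * 16     # original op->args[2]
    return switch_read * STRIDE + base

# k = 0 reads half 0 (filled before the loop), k = 1 reads half 1 (filled while
# k = 0 was computing), k = 2 reads half 0 again, matching the write side
# floormod(k + 1, 2) * 2560 in the TIR above.
for k in range(3):
    print(k, read_offset(k, ax0_outer=0, k_outer_inner=0))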

DzAvril commented Apr 11, 2022

PR #10066 has a more elegant way to bring the double buffer into the final generated code. This PR will be closed in a few days if no one is interested in it.

DzAvril closed this Apr 17, 2022