
Fix EVT for S32 accum and BF16 C/output tensors #1826

Conversation

@alexsamardzic (Contributor) commented Sep 18, 2024

To reproduce the problem:

First, apply the patch below to change the 47_ampere_gemm_universal_streamk_broadcast example so that an S8/S8 GEMM is performed, producing an S32 result, and the accumulator is then combined with some F16 values in the epilogue to produce an F16 result. After these changes, the example builds and runs fine. However, if cutlass::half_t is then replaced with cutlass::bfloat16_t in examples/47_ampere_gemm_universal_streamk/ampere_gemm_universal_streamk_broadcast.cu, the example no longer builds. The failure is caused by a missing specialization of DefaultIteratorsTensorOp, which this PR adds; see the sketches after the patch below.

The patch for the 47_ampere_gemm_universal_streamk_broadcast example, to reproduce the problem:
diff --git a/examples/47_ampere_gemm_universal_streamk/ampere_gemm_universal_streamk_broadcast.cu b/examples/47_ampere_gemm_universal_streamk/ampere_gemm_universal_streamk_broadcast.cu
index ed65e58c..e2125bdf 100644
--- a/examples/47_ampere_gemm_universal_streamk/ampere_gemm_universal_streamk_broadcast.cu
+++ b/examples/47_ampere_gemm_universal_streamk/ampere_gemm_universal_streamk_broadcast.cu
@@ -96,13 +96,13 @@
 /////////////////////////////////////////////////////////////////////////////////////////////////
 
 // A matrix configuration
-using         ElementA         = cutlass::half_t;                                  // Element type for A matrix operand
+using         ElementA         = int8_t;                                  // Element type for A matrix operand
 using         LayoutA          = cutlass::layout::RowMajor;                        // Layout type for A matrix operand
 constexpr int AlignmentA       = 128 / cutlass::sizeof_bits<ElementA>::value;      // Memory access granularity/alignment of A matrix in units of elements (up to 16 bytes)
 
 // B matrix configuration
-using         ElementB         = cutlass::half_t;                                  // Element type for B matrix operand
-using         LayoutB          = cutlass::layout::RowMajor;                        // Layout type for B matrix operand
+using         ElementB         = int8_t;                                  // Element type for B matrix operand
+using         LayoutB          = cutlass::layout::ColumnMajor;                        // Layout type for B matrix operand
 constexpr int AlignmentB       = 128 / cutlass::sizeof_bits<ElementB>::value;      // Memory access granularity/alignment of B matrix in units of elements (up to 16 bytes)
 
 // C1/C2/D matrix configuration
@@ -116,13 +116,13 @@ using         LayoutOutput     = cutlass::layout::RowMajor;
 // constexpr int AlignmentOutput  = 128 / cutlass::sizeof_bits<ElementOutput>::value; // Memory access granularity/alignment of output matrices in units of elements (up to 16 bytes)
 
 // Multiply-accumulate blocking/pipelining details
-using ElementAccumulator  = cutlass::half_t;                          // Element type for internal accumulation
-using ElementCompute      = cutlass::half_t;                          // Element type for compute
+using ElementAccumulator  = int32_t;                          // Element type for internal accumulation
+using ElementCompute      = float;                          // Element type for compute
 using ArchTag             = cutlass::arch::Sm80;                      // Tag indicating the minimum SM that supports the intended feature
 using OperatorClass       = cutlass::arch::OpClassTensorOp;           // Operator class tag
-using ThreadblockShape    = cutlass::gemm::GemmShape<128, 128, 32>;   // Threadblock-level tile size (concept: GemmShape)
-using WarpShape           = cutlass::gemm::GemmShape<64, 64, 32>;     // Warp-level tile size (concept: GemmShape)
-using InstructionShape    = cutlass::gemm::GemmShape<16, 8, 16>;      // Instruction-level tile size (concept: GemmShape)
+using ThreadblockShape    = cutlass::gemm::GemmShape<128, 128, 128>;   // Threadblock-level tile size (concept: GemmShape)
+using WarpShape           = cutlass::gemm::GemmShape<64, 64, 64>;     // Warp-level tile size (concept: GemmShape)
+using InstructionShape    = cutlass::gemm::GemmShape<16, 8, 32>;      // Instruction-level tile size (concept: GemmShape)
 constexpr int NumStages   = 4;                                        // Number of global->shared pipeline stages used in the GEMM mainloop
 constexpr int EVTEpilogueStages = 1;                                  // Number of epilogue stages in EVT
 
@@ -253,7 +253,7 @@ using EVTKernelStreamK =
     EVTD,
     cutlass::gemm::threadblock::ThreadblockSwizzleStreamK,
     NumStages,
-    cutlass::arch::OpMultiplyAdd,
+    cutlass::arch::OpMultiplyAddSaturate,
     EVTEpilogueStages
 >::GemmKernel;
 
@@ -707,32 +707,32 @@ int main(int argc, const char **argv)
   if (options.split_k_factor == 1)
   {
     // Compare basic data-parallel version versus StreamK version using default load-balancing heuristics
-    Result basic_dp         = run<DeviceGemmBasic>("Basic data-parallel GEMM", options);
+    // Result basic_dp         = run<DeviceGemmBasic>("Basic data-parallel GEMM", options);
     Result streamk_default  = run<DeviceGemmStreamK>("StreamK GEMM with default load-balancing", options);
 
-    printf("  Speedup vs Basic-DP: %.3f\n", (basic_dp.avg_runtime_ms / streamk_default.avg_runtime_ms));
+    // printf("  Speedup vs Basic-DP: %.3f\n", (basic_dp.avg_runtime_ms / streamk_default.avg_runtime_ms));
 
     // Show that StreamK can emulate basic data-parallel GEMM when we set the number of SMs to load-balance across = 1
     options.avail_sms       = 1;        // Set loadbalancing width to 1 SM (no load balancing)
     Result streamk_dp       = run<DeviceGemmStreamK>("StreamK emulating basic data-parallel GEMM", options);
     options.avail_sms       = -1;       // Reset loadbalancing width to unspecified SMs (i.e., the number of device SMs)
 
-    printf("  Speedup vs Basic-DP: %.3f\n", (basic_dp.avg_runtime_ms / streamk_dp.avg_runtime_ms));
+    // printf("  Speedup vs Basic-DP: %.3f\n", (basic_dp.avg_runtime_ms / streamk_dp.avg_runtime_ms));
 
     options.split_k_factor++;     // Increment splitting factor for next evaluation
 
   }
 
   // Show that StreamK can emulate "Split-K" with a tile-splitting factor
-  Result basic_splitk = run<DeviceGemmBasic>(
-    std::string("Basic split-K GEMM with tile-splitting factor ") + std::to_string(options.split_k_factor),
-    options);
+  // Result basic_splitk = run<DeviceGemmBasic>(
+  //   std::string("Basic split-K GEMM with tile-splitting factor ") + std::to_string(options.split_k_factor),
+  //   options);
 
   Result streamk_splitk = run<DeviceGemmStreamK>(
     std::string("StreamK emulating Split-K GEMM with tile-splitting factor ") + std::to_string(options.split_k_factor),
     options);
 
-  printf("  Speedup vs Basic-SplitK: %.3f\n", (basic_splitk.avg_runtime_ms / streamk_splitk.avg_runtime_ms));
+  // printf("  Speedup vs Basic-SplitK: %.3f\n", (basic_splitk.avg_runtime_ms / streamk_splitk.avg_runtime_ms));
 
   return 0;
 }
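
For reference, the additional change mentioned above is just a type substitution in the example's C1/C2/D matrix configuration. A minimal sketch of that substitution (ElementOutput appears in the patch context above; the ElementC alias name is illustrative and may not match the example file exactly):

 // Replace cutlass::half_t with cutlass::bfloat16_t for the C/output tensors
 using ElementC      = cutlass::bfloat16_t;   // Element type for C matrix operands (was cutlass::half_t)
 using ElementOutput = cutlass::bfloat16_t;   // Element type for output matrix D (was cutlass::half_t)

With S32 accumulation, this combination hits the missing DefaultIteratorsTensorOp specialization and the build fails.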

(Note: This PR is practically a completion of #812. BTW, the issue was found in the context of this work.)
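
For context, the fix amounts to adding a partial specialization of cutlass::epilogue::threadblock::detail::DefaultIteratorsTensorOp for a bfloat16_t output with an int32_t accumulator. The sketch below only illustrates the general shape of such a specialization, modeled after the existing half_t/int32_t one; the iterator types and their template arguments are assumptions, and the code actually added by this PR may differ.

 // Rough sketch (not the PR's actual code): a DefaultIteratorsTensorOp partial
 // specialization for bfloat16_t output with int32_t accumulation.
 template <
   typename ThreadblockShape,
   typename WarpShape,
   typename InstructionShape,
   typename ThreadMap
 >
 struct DefaultIteratorsTensorOp<
   cutlass::bfloat16_t,   // ElementOutput
   int32_t,               // ElementAccumulator
   8,                     // ElementsPerAccess
   ThreadblockShape,
   WarpShape,
   InstructionShape,
   ThreadMap> {

   // Iterator choices mirror the existing half_t/int32_t specialization
   // (assumed here; the actual iterator arguments may differ).
   using WarpTileIterator = cutlass::epilogue::warp::TileIteratorTensorOpMixed<
     WarpShape, InstructionShape, int32_t, 32, 16, 8, 8>;

   using SharedLoadIterator = cutlass::epilogue::threadblock::SharedLoadIteratorMixed<
     ThreadMap, int32_t, 32, 16, 8, 8>;

   static int const kFragmentsPerIteration = 2;
 };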

@alexsamardzic force-pushed the fix-evt-int32-accum-bfloat16-c branch from 8dbe183 to fe42718 on October 14, 2024 at 18:38
@alexsamardzic (Contributor, Author) commented Oct 14, 2024

@hwu36: Could someone please check this and, if appropriate, merge it?

Apparently, this fix is included in 3.6.0. Closing the PR.

@alexsamardzic deleted the fix-evt-int32-accum-bfloat16-c branch on October 14, 2024 at 22:03