Fp8 Fast Accumulation support for cublasLt #6599
Conversation
Compare: 5eacd0f to a4140da
// accumulation is enabled. When Precision is set to HIGHEST, indicative of
// scenarios in backward propagation, a higher precision accumulation method
// is utilized.
bool fast_accum = (xla::primitive_util::IsF8Type(lhs_layout.dtype) ||
Can we call it use_fast_accum or enable_fast_accum to imply it is a bool? Also, shouldn't it be is_fp8(lhs) && is_fp8(rhs) && cfg.compute_precision == 0?
Based on the current rewrite rule, FP8 matmuls have to take both inputs as FP8 types, so either one of them being an FP8 type is enough to indicate an FP8 matmul.
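For concreteness, a minimal standalone sketch of the condition the reviewer suggests; the enum and IsF8Type below are stand-ins for xla::primitive_util, and compute_precision == 0 stands for PrecisionConfig::DEFAULT (all names here are illustrative, not the PR's actual code):

// Stand-ins for XLA's primitive types; only what the condition needs.
enum class DType { kF8E4M3FN, kF8E5M2, kBF16, kF32 };

bool IsF8Type(DType t) {
  return t == DType::kF8E4M3FN || t == DType::kF8E5M2;
}

// compute_precision == 0 corresponds to PrecisionConfig::DEFAULT.
bool UseFastAccum(DType lhs, DType rhs, int compute_precision) {
  return IsF8Type(lhs) && IsF8Type(rhs) && compute_precision == 0;
}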
// encountered during forward propagation with E4M3 operands, fast
// accumulation is enabled. When Precision is set to HIGHEST, indicative of
// scenarios in backward propagation, a higher precision accumulation method
// is utilized.
There's no need to specify the particular cases for using different precisions. Let's simply state that for FP8 matmuls, there are two options available: fast accumulation (PrecisionConfig.Precision.DEFAULT) and higher precision accumulation (PrecisionConfig.Precision.HIGHEST).
@@ -210,6 +210,8 @@ cudaDataType_t BlasLt::MatrixLayout::type() const {
                         AsCublasOperation(trans_b)));
  TF_ASSIGN_OR_RETURN(cublasLtEpilogue_t epi, AsCublasLtEpilogue(epilogue));
  TF_RETURN_IF_ERROR(SetAttr(cu_desc, CUBLASLT_MATMUL_DESC_EPILOGUE, epi));
  TF_RETURN_IF_ERROR(
      SetAttr(cu_desc, CUBLASLT_MATMUL_DESC_FAST_ACCUM, int8_t(fast_accum)));
Can we use static_cast<int8_t>?
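For reference, a hedged sketch of what the static_cast version could look like if written directly against the cuBLAS Lt C API (the SetAttr helper in the diff wraps a call like this; error handling is left to the caller, and the function name is illustrative):

#include <cstdint>
#include <cublasLt.h>

// Sets the FP8 fast-accumulation attribute on an existing matmul descriptor.
cublasStatus_t SetFastAccum(cublasLtMatmulDesc_t desc, bool fast_accum) {
  const int8_t value = static_cast<int8_t>(fast_accum);
  return cublasLtMatmulDescSetAttribute(
      desc, CUBLASLT_MATMUL_DESC_FAST_ACCUM, &value, sizeof(value));
}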
Is it possible to add a test checking that an FP8 matmul output is accurate enough when the PrecisionConfig is HIGHEST? I'm OK with having no test if this is not easy to test.
// For FP8 matmuls, there are two options available: fast
// accumulation (PrecisionConfig.Precision.DEFAULT) and
// higher precision accumulation (PrecisionConfig.Precision.HIGHEST).
You don't mention the HIGH case. I would phrase this as:
FP8 matmuls have a fast accumulation mode that is less precise than the default accumulation mode. Use the fast accumulation mode if the compute precision is DEFAULT.
Added a test in gemm_rewrite_test.cc
  replacements["<<precision>>"] = "default";
  const auto hlo_text_default = absl::StrReplaceAll(hlo_template, replacements);
  EXPECT_TRUE(RunAndCompare(hlo_text_default, ErrorSpec{1e-3, 1e-3}));
  EXPECT_FALSE(RunAndCompare(hlo_text_default, ErrorSpec{1e-4, 1e-4}));
This expectation seems to fail: it looks like, if we are lucky, the result already has enough precision to pass with a tolerance of 1e-4.
Right. Do you suggest removing this test or replacing it with some file check?
This should be guarded by a check for Ada/Hopper, because it only affects those GPUs and our tests are running on pre-Ada GPUs. But it's possible new GPUs will treat the fast-accumulation flag differently, so we should not do this check anyway.
I'll remove this line when merging.
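For illustration only, a minimal sketch of the Ada/Hopper guard mentioned above (ultimately the strict-tolerance line was removed instead); GetCudaComputeCapability and the test name below are placeholders, not the helpers gemm_rewrite_test.cc actually uses:

#include <gtest/gtest.h>

// Placeholder capability type; XLA's tests query this from the GPU backend.
struct ComputeCapability {
  int major = 0, minor = 0;
  bool IsAtLeast(int maj, int min) const {
    return major > maj || (major == maj && minor >= min);
  }
};

// Placeholder: returns a fixed pre-Ada capability so the sketch is runnable.
ComputeCapability GetCudaComputeCapability() { return {8, 0}; }

TEST(Fp8FastAccumTest, DefaultPrecisionLosesAccuracy) {
  // Fast accumulation only changes results on Ada (SM 8.9) and Hopper (SM 9.0).
  if (!GetCudaComputeCapability().IsAtLeast(8, 9)) {
    GTEST_SKIP() << "FP8 fast accumulation only affects Ada/Hopper GPUs.";
  }
  // ... run the FP8 matmul with PrecisionConfig DEFAULT here and compare
  // against a tightened ErrorSpec, as in the snippet above ...
}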
Imported from GitHub PR openxla/xla#6599

FP8 cublasLt matmul uses fast accumulation when both operands' precisions are DEFAULT; otherwise it falls back to high-precision accumulation. Issue: openxla/xla#6168

This PR is closely related to Flax PR google/flax#3416.

Copybara import of the project:

-- a4140da8ca08cd2d4796a7b8f032827867a361bc by shuw <shuw@nvidia.com>: Add FP8 fast accumulation support for cublasLt.

-- 96845683cc4b1e7b947bc919fbf97d8865abeac9 by shuw <shuw@nvidia.com>: Improve based on review #1

-- e906d7620780d2cf1fe8433c933648dcb98dc61d by shuw <shuw@nvidia.com>: Improve based on review #2

Merging this change closes #6599

PiperOrigin-RevId: 578948593
cuBLAS LT has a flag, CUBLASLT_MATMUL_DESC_FAST_ACCUM, that can be set for FP8 gemms. This flag causes the matmul to run faster but with lower accumulation precision. NVIDIA recommends using this flag on the forward pass of FP8 models but not the backward pass, since the backward pass needs more accumulation precision.

PR #6599 enabled fast accumulation on FP8 dots whose PrecisionConfig is DEFAULT (but not HIGH or HIGHEST). This allows layers in frameworks to use fast accumulation on the forward pass but not the backward pass, by setting the PrecisionConfig of the backward pass to HIGH or HIGHEST.

The issue is that Flax and Praxis do not yet set the PrecisionConfig to HIGH or HIGHEST on the backward pass, so the PR will cause poor FP8 training quality. The PR should not have been merged until Flax and Praxis set the PrecisionConfig, but I didn't realize this and merged it anyway. Reverting the PR is a pain, so instead this CL just removes the line that sets CUBLASLT_MATMUL_DESC_FAST_ACCUM, while keeping most of the plumbing around it. This CL will be rolled back once Flax and Praxis set the PrecisionConfig.

PiperOrigin-RevId: 579018421
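As a rough, hedged illustration of the framework-side fix described above (not code from this PR or from Flax/Praxis): a backward-pass dot can request high-precision accumulation through the XLA client API by passing a PrecisionConfig with HIGHEST operand precisions, which keeps the fast-accumulation path off for that dot. The header path, dimension numbers, and function name below are assumptions for illustration.

// Sketch only: shapes, dimension numbers, and the surrounding builder setup
// are placeholders, not the real Flax/Praxis lowering.
#include "xla/client/xla_builder.h"

xla::XlaOp BackwardDotHighPrecision(xla::XlaOp grad_out, xla::XlaOp weights) {
  xla::DotDimensionNumbers dnums;
  dnums.add_lhs_contracting_dimensions(1);
  dnums.add_rhs_contracting_dimensions(0);

  // HIGHEST (or HIGH) opts this dot out of FP8 fast accumulation.
  xla::PrecisionConfig precision;
  precision.add_operand_precision(xla::PrecisionConfig::HIGHEST);
  precision.add_operand_precision(xla::PrecisionConfig::HIGHEST);

  return xla::DotGeneral(grad_out, weights, dnums, &precision);
}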