
Conversation

@xwhzz (Contributor) commented Jul 21, 2025

This PR fixes the CI bugs on Hopper tracked in #506, including the following issues:

  • TVM Error InternalError
  • TMA (Tensor Memory Accelerator) Related Failures
  • Numerical Mismatch / Assertion Error
  • Stuck / Hung Tests

It also fixes #479:

  • CUDA Runtime Error: Misaligned Address
    • test_tilelang_issue_101.py
    • test_tilelang_kernel_deepseek_nsa.py
    • test_tilelang_kernel_dequantize_gemm.py
  • NVCC Compilation Failures
  • TMA (Tensor Memory Accelerator) Related Failures
  • Numerical Mismatch / Assertion Error
  • Fatal Python Error: Floating-Point Exception
  • Stuck / Hung Tests

@github-actions

👋 Hi! Thank you for contributing to the TileLang project.

Please remember to run bash format.sh in the root directory of the project to ensure your changes are properly linted and formatted. This will help ensure your contribution passes the format check.

We appreciate you taking this step! Our team will review your contribution, and we look forward to your awesome work!

🚀

@gemini-code-assist bot left a comment

Summary of Changes

Hello @xwhzz, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request addresses several CI stability issues specific to the Hopper architecture, primarily focusing on improving the robustness of Tensor Memory Accelerator (TMA) operations and refining the conditions under which Warp Group Matrix Multiply Accumulate (WGMMA) is applied. The changes aim to prevent internal errors, resolve numerical mismatches, and ensure correct code generation for Hopper-specific optimizations, contributing to a more stable CI pipeline.

Highlights

  • TMA Robustness: Implemented stricter checks and graceful fallbacks for Tensor Memory Accelerator (TMA) bulk copy operations in src/op/bulk_copy.cc to prevent errors from unsupported strides and layouts. This includes checks on global stride alignment and size, and replaces hard ICHECK(0) failures with return Stmt() so lowering can fall back gracefully (see the stride-check sketch after this list).
  • WGMMA Precision Control: Introduced a CheckWGMMA function in src/op/gemm.cc and src/op/gemm.h to precisely determine when Warp Group Matrix Multiply Accumulate (WGMMA) can be utilized. This function evaluates input/output data types (e.g., FP16, BF16, FP32, NVFloat8, Int8) and K-dimension alignment, ensuring numerical correctness and proper application of Hopper-specific optimizations. The Gemm::Lower and Gemm::InferLayout methods now leverage this check.
  • Dynamic TMA Lowering: Added logic in src/transform/lower_tile_op.cc to dynamically set the kDisableTMALower context configuration based on whether TMA operations (tma_load, tma_store, etc.) are actually present in the code. This allows for more optimized compilation by running TMA-specific lowering passes only when necessary (see the detection sketch after this list).
  • Buffer Access Accuracy: Enhanced buffer access region analysis in src/transform/warp_specialized_rewriter.cc to correctly handle remapped buffers. The WSCodeEmitter now checks the buffer_data_to_buffer_ map when populating read and write sets, improving the accuracy of buffer tracking during warp specialization.
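
A minimal, self-contained sketch of the stride validation described in the TMA Robustness item above. The constant names (kTMAStrideAlignment, kTMAMaxStride) and the free-standing helper are illustrative assumptions, not the PR's actual identifiers; the PR performs equivalent checks inside LowerBulkCopy and returns an empty Stmt() instead of failing with ICHECK(0):

// Sketch only: mirrors the checks described above, not the PR's exact code.
#include <cstdint>
#include <vector>

constexpr uint64_t kTMAStrideAlignment = 16;    // TMA global strides must be 16-byte aligned
constexpr uint64_t kTMAMaxStride = 1ULL << 40;  // and must fit in 40 bits

// Returns true when every stride beyond the innermost dimension is
// TMA-compatible; on false, the caller would emit a plain copy instead.
bool StridesSupportTMA(const std::vector<uint64_t>& global_stride) {
  for (size_t i = 1; i < global_stride.size(); ++i) {
    const uint64_t stride = global_stride[i];
    if (stride % kTMAStrideAlignment != 0 || stride >= kTMAMaxStride) {
      return false;  // graceful fallback, analogous to `return Stmt();`
    }
  }
  return true;
}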
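
A similar hedged sketch of the detection step behind kDisableTMALower, written against TVM's C++ visitor API (tvm/tir/stmt_functor.h). The intrinsic names "tl.tma_load" and "tl.tma_store" are assumptions about how the ops are registered; the actual pass in src/transform/lower_tile_op.cc may match a wider set of TMA operations:

// Sketch only: scan a function body once and report whether any TMA
// intrinsic appears, so TMA-specific lowering can be skipped otherwise.
#include <string>
#include <tvm/tir/expr.h>
#include <tvm/tir/op.h>
#include <tvm/tir/stmt_functor.h>

bool HasTMAIntrinsics(const tvm::tir::Stmt& body) {
  bool found = false;
  tvm::tir::PostOrderVisit(body, [&](const tvm::ObjectRef& node) {
    if (const auto* call = node.as<tvm::tir::CallNode>()) {
      if (const auto* op = call->op.as<tvm::OpNode>()) {
        const std::string name = op->name;  // assumed registration names below
        if (name == "tl.tma_load" || name == "tl.tma_store") {
          found = true;
        }
      }
    }
  });
  return found;
}
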
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page; here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point in your pull request via creating an issue comment (i.e. comment on the pull request page) using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands.

Feature                Command               Description
Code Review            /gemini review        Performs a code review for the current pull request in its current state.
Pull Request Summary   /gemini summary       Provides a summary of the current pull request in its current state.
Comment                @gemini-code-assist   Responds in comments when explicitly tagged, both in issue comments and review comments.
Help                   /gemini help          Displays a list of available commands.

Customization

To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist is currently in preview and may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments to provide feedback.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double-check it and use code with caution.

@gemini-code-assist bot left a comment

Code Review

This pull request introduces bug fixes for Hopper architecture support, focusing on TMA and WGMMA operations. The changes add checks and fallback mechanisms to prevent crashes. The review focuses on correctness, maintainability, and C++ best practices, addressing potential null-pointer dereferences, code complexity, and duplication.

Comment on lines +222 to +270
bool Gemm::CheckWGMMA() const {
  if (C->dtype == DataType::Float(16)) {
    if (A->dtype == DataType::Float(16) && B->dtype == DataType::Float(16))
      return K % 16 == 0;
    else if (A->dtype == DataType::NVFloat8E4M3() &&
             B->dtype == DataType::NVFloat8E4M3())
      return (!trans_A) && trans_B && K % 32 == 0;
    else if (A->dtype == DataType::NVFloat8E4M3() &&
             B->dtype == DataType::NVFloat8E5M2())
      return (!trans_A) && trans_B && K % 32 == 0;
    else if (A->dtype == DataType::NVFloat8E5M2() &&
             B->dtype == DataType::NVFloat8E4M3())
      return (!trans_A) && trans_B && K % 32 == 0;
    else if (A->dtype == DataType::NVFloat8E5M2() &&
             B->dtype == DataType::NVFloat8E5M2())
      return (!trans_A) && trans_B && K % 32 == 0;
    else
      return false;
  } else if (C->dtype == DataType::Float(32)) {
    if (A->dtype == DataType::Float(16) && B->dtype == DataType::Float(16))
      return K % 16 == 0;
    else if (A->dtype == DataType::BFloat(16) &&
             B->dtype == DataType::BFloat(16))
      return K % 16 == 0;
    else if (A->dtype == DataType::Float(32) && B->dtype == DataType::Float(32))
      return (!trans_A) && trans_B && K % 8 == 0;
    else if (A->dtype == DataType::NVFloat8E4M3() &&
             B->dtype == DataType::NVFloat8E4M3())
      return (!trans_A) && trans_B && K % 32 == 0;
    else if (A->dtype == DataType::NVFloat8E4M3() &&
             B->dtype == DataType::NVFloat8E5M2())
      return (!trans_A) && trans_B && K % 32 == 0;
    else if (A->dtype == DataType::NVFloat8E5M2() &&
             B->dtype == DataType::NVFloat8E4M3())
      return (!trans_A) && trans_B && K % 32 == 0;
    else if (A->dtype == DataType::NVFloat8E5M2() &&
             B->dtype == DataType::NVFloat8E5M2())
      return (!trans_A) && trans_B && K % 32 == 0;
    else
      return false;
  } else if (C->dtype == DataType::Int(32)) {
    if (A->dtype == DataType::Int(8) && B->dtype == DataType::Int(8))
      return (!trans_A) && trans_B && K % 32 == 0;
    else if (A->dtype == DataType::Int(8) && B->dtype == DataType::UInt(8))
      return (!trans_A) && trans_B && K % 32 == 0;
    else if (A->dtype == DataType::UInt(8) && B->dtype == DataType::Int(8))
      return (!trans_A) && trans_B && K % 32 == 0;
    else if (A->dtype == DataType::UInt(8) && B->dtype == DataType::UInt(8))
      return (!trans_A) && trans_B && K % 32 == 0;

Severity: high

The CheckWGMMA function contains a lot of repetitive conditions, making it difficult to read and maintain. Refactor this function to group conditions by data type and factor out common logic.

bool Gemm::CheckWGMMA() const {
  auto is_fp8 = [](const DataType& dtype) {
    return dtype == DataType::NVFloat8E4M3() || dtype == DataType::NVFloat8E5M2();
  };
  auto is_int8 = [](const DataType& dtype) {
    return dtype == DataType::Int(8) || dtype == DataType::UInt(8);
  };

  if (C->dtype == DataType::Float(16)) {
    if (A->dtype == DataType::Float(16) && B->dtype == DataType::Float(16)) return K % 16 == 0;
    if (is_fp8(A->dtype) && is_fp8(B->dtype)) return !trans_A && trans_B && K % 32 == 0;
  } else if (C->dtype == DataType::Float(32)) {
    if (A->dtype == DataType::Float(16) && B->dtype == DataType::Float(16)) return K % 16 == 0;
    if (A->dtype == DataType::BFloat(16) && B->dtype == DataType::BFloat(16)) return K % 16 == 0;
    if (A->dtype == DataType::Float(32) && B->dtype == DataType::Float(32)) return !trans_A && trans_B && K % 8 == 0;
    if (is_fp8(A->dtype) && is_fp8(B->dtype)) return !trans_A && trans_B && K % 32 == 0;
  } else if (C->dtype == DataType::Int(32)) {
    if (is_int8(A->dtype) && is_int8(B->dtype)) return !trans_A && trans_B && K % 32 == 0;
  }
  return false;
}


for (size_t i{1}; i < desc.global_stride.size(); i++) {
  unsigned long long stride = desc.global_stride[i].as<IntImmNode>()->value;
  if (stride % 16 != 0 || stride >= (1ULL << 40)) {

Severity: medium

The magic number 16 should be replaced with a named constant to improve readability and maintainability.

Suggested change
- if (stride % 16 != 0 || stride >= (1ULL << 40)) {
+ constexpr int kTMAStrideAlignment = 16;
+ if (stride % kTMAStrideAlignment != 0 || stride >= (1ULL << 40)) {


for (size_t i{1}; i < desc.global_stride.size(); i++) {
  unsigned long long stride = desc.global_stride[i].as<IntImmNode>()->value;
  if (stride % 16 != 0 || stride >= (1ULL << 40)) {

Severity: medium

The magic number (1ULL << 40) should be replaced with a named constant to improve readability and maintainability.

Suggested change
- if (stride % 16 != 0 || stride >= (1ULL << 40)) {
+ constexpr unsigned long long kTMAMaxStride = (1ULL << 40);
+ if (stride % 16 != 0 || stride >= kTMAMaxStride) {

Comment on lines +267 to +268
if (desc.swizzle == static_cast<int>(CU_TENSOR_MAP_SWIZZLE_NONE) &&
    inner_box_dim_ % 256 != 0)

Severity: medium

Consider adding braces {} to the if statement for better readability and to avoid potential dangling-else issues.

Suggested change
- if (desc.swizzle == static_cast<int>(CU_TENSOR_MAP_SWIZZLE_NONE) &&
-     inner_box_dim_ % 256 != 0)
+ if (desc.swizzle == static_cast<int>(CU_TENSOR_MAP_SWIZZLE_NONE) && inner_box_dim_ % 256 != 0) {
+   return Stmt();
+ }


int inner_box_dim_ = instruction_dim * shared_tensor->dtype.bytes();

if (desc.swizzle == static_cast<int>(CU_TENSOR_MAP_SWIZZLE_NONE) &&

Severity: medium

The magic number 256 should be replaced with a named constant to improve readability and maintainability.

Suggested change
- if (desc.swizzle == static_cast<int>(CU_TENSOR_MAP_SWIZZLE_NONE) &&
+ constexpr int kInnerBoxDimThreshold = 256;
+ if (desc.swizzle == static_cast<int>(CU_TENSOR_MAP_SWIZZLE_NONE) && inner_box_dim_ % kInnerBoxDimThreshold != 0)

Comment on lines +766 to +781
for (auto region : access[0]) {
  auto var = region->buffer->data;
  if (buffer_data_to_buffer_.count(var)) {
    read_set.insert(buffer_data_to_buffer_[var].get());
  } else {
    read_set.insert(region->buffer.get());
  }
}
for (auto region : access[1]) {
  auto var = region->buffer->data;
  if (buffer_data_to_buffer_.count(var)) {
    write_set.insert(buffer_data_to_buffer_[var].get());
  } else {
    write_set.insert(region->buffer.get());
  }
}

Severity: medium

The logic for populating read_set and write_set is duplicated. Extract the common logic into a helper function or lambda to improve code maintainability.

Suggested change
- for (auto region : access[0]) {
-   auto var = region->buffer->data;
-   if (buffer_data_to_buffer_.count(var)) {
-     read_set.insert(buffer_data_to_buffer_[var].get());
-   } else {
-     read_set.insert(region->buffer.get());
-   }
- }
- for (auto region : access[1]) {
-   auto var = region->buffer->data;
-   if (buffer_data_to_buffer_.count(var)) {
-     write_set.insert(buffer_data_to_buffer_[var].get());
-   } else {
-     write_set.insert(region->buffer.get());
-   }
- }
+ auto populate_access_set = [&](std::set<const BufferNode*>& set,
+                                const Array<BufferRegion>& regions) {
+   for (auto region : regions) {
+     auto it = buffer_data_to_buffer_.find(region->buffer->data);
+     if (it != buffer_data_to_buffer_.end()) {
+       set.insert(it->second.get());
+     } else {
+       set.insert(region->buffer.get());
+     }
+   }
+ };
+ populate_access_set(read_set, access[0]);
+ populate_access_set(write_set, access[1]);

LeiWang1999 and others added 8 commits July 21, 2025 23:08
- Removed unnecessary blank lines for improved code readability.
- Enhanced stride validation by checking for null pointers in global stride calculations, ensuring robustness against symbolic strides.
- Updated pass configuration handling in dynamic tile language tests to streamline dynamic alignment and TMA lower pass settings.
- Downgraded `flash-attn` dependency version in `requirements-test.txt` to `<=2.2.0`.
- Removed unused imports and commented-out code in various example files to enhance readability and maintainability.
- Updated the `flashattn` function signature to include default parameters for `block_M`, `block_N`, `num_stages`, and `threads`.
- Cleaned up the `example_mha_fwd_varlen.py` and `example_mha_bwd_wgmma_pipelined.py` files by removing unnecessary comments and improving code clarity.
- Deleted the `example_mha_inference.py` file as it is no longer needed.
- Removed the `--user` flag from the pip install commands in both the development and testing sections of the CI workflow to ensure proper installation of dependencies in the virtual environment.
- Added the `--no-user` flag to the pip install commands in both the development and testing sections of the CI workflow to ensure dependencies are installed correctly within the virtual environment.
LeiWang1999 marked this pull request as ready for review July 22, 2025 18:13
Update CI workflow to include `--no-user` flag in pip install command for wheel mode

- Added the `--no-user` flag to the pip install command in the wheel mode section of the CI workflow to ensure dependencies are installed correctly within the virtual environment.
@LeiWang1999 (Member) commented
Phase out test_tilelang_issue_101.py as it has never been correct.

LeiWang1999 changed the title from "[Bugfix] Fix CI bugs in Hopper" to "[Bugfix][CI] Bug fixing and migrate CI from ada to hopper" Jul 23, 2025
LeiWang1999 merged commit e9a608e into tile-ai:main Jul 23, 2025
1 of 3 checks passed
xwhzz deleted the ci_fix branch July 24, 2025 06:40
RubiaCx pushed a commit to RubiaCx/tilelang that referenced this pull request Nov 24, 2025
* fix CI bugs in hopper

* lint fix

* Update bulk_copy.cc

* Refactor bulk copy logic in LowerBulkCopy function

- Removed unnecessary blank lines for improved code readability.
- Enhanced stride validation by checking for null pointers in global stride calculations, ensuring robustness against symbolic strides.
- Updated pass configuration handling in dynamic tile language tests to streamline dynamic alignment and TMA lower pass settings.

* test fix

* ci fix

* Update flash-attention dependencies and clean up example code

- Downgraded `flash-attn` dependency version in `requirements-test.txt` to `<=2.2.0`.
- Removed unused imports and commented-out code in various example files to enhance readability and maintainability.
- Updated the `flashattn` function signature to include default parameters for `block_M`, `block_N`, `num_stages`, and `threads`.
- Cleaned up the `example_mha_fwd_varlen.py` and `example_mha_bwd_wgmma_pipelined.py` files by removing unnecessary comments and improving code clarity.
- Deleted the `example_mha_inference.py` file as it is no longer needed.

* Update CI workflow to remove `--user` flag from pip install commands

- Removed the `--user` flag from the pip install commands in both the development and testing sections of the CI workflow to ensure proper installation of dependencies in the virtual environment.

* Update CI workflow to include `--no-user` flag in pip install commands

- Added the `--no-user` flag to the pip install commands in both the development and testing sections of the CI workflow to ensure dependencies are installed correctly within the virtual environment.

* Update CI workflow to include `--no-user` flag in pip install command for wheel mode

- Added the `--no-user` flag to the pip install command in the wheel mode section of the CI workflow to ensure dependencies are installed correctly within the virtual environment.

* test fix

* avoid conflict with system environments

* test fix

* add comments

---------

Co-authored-by: Lei Wang <34334180+LeiWang1999@users.noreply.github.com>
Co-authored-by: LeiWang1999 <leiwang1999@outlook.com>

Development

Successfully merging this pull request may close these issues.

  • [Bug] PyTest Failures on H100
  • [Bug] Multiple Test Failures on H100
