
Conversation

@Paran0idy Paran0idy (Contributor) commented Sep 18, 2025

  • fix bf16x2 dtype codegen

Summary by CodeRabbit

  • Bug Fixes
    • Improved bfloat16 vector load handling on HIP, enhancing stability and correctness on AMD GPUs.
  • Refactor
    • Harmonized bfloat16 pair representation to improve clarity and cross-backend consistency.
  • Tests
    • Expanded GEMM MFMA tests to use a 4D shared-memory layout for C, increasing coverage of tiling and write-back paths.

coderabbitai bot (Contributor) commented Sep 18, 2025

Walkthrough

Updates adjust HIP bfloat16 vector handling and modify a GEMM test’s shared-memory tiling. Specifically, the HIP codegen switches a load cast to bfloat16x2, the bfloat16x2 struct changes from an array to named fields, and the test expands C_shared to a 4D layout with corresponding index mapping.

Changes

Cohort / File(s) and change summary:

  • HIP bfloat16 vector handling (src/target/codegen_hip.cc, src/tl_templates/hip/common.h): Codegen: changed the bfloat16 load cast from nv_bfloat162 to bfloat16x2. Struct: bfloat16x2 storage updated from bfloat16_t data[2] to bfloat16_t x, y, altering field access semantics. See the sketch after this list.
  • AMD GEMM test tiling update (testing/python/amd/test_tilelang_gemm_mfma_intrinsic.py): Expanded C_shared_shape from 3D to 4D: (block_M // micro_size_x, block_N // micro_size_y, micro_size_x, micro_size_y); adjusted indexing to match. Other shared shapes unchanged.
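
To make the list above concrete, here is a minimal sketch of the two HIP-side changes. The bfloat16_t typedef and the small main() are stand-ins for illustration only; the real definitions live in src/tl_templates/hip/common.h and the real cast is emitted by src/target/codegen_hip.cc:

#include <cstdio>

// Stand-in for the HIP bfloat16 storage type (assumption for this sketch).
typedef unsigned short bfloat16_t;

// Before: array storage, accessed as pair.data[0] / pair.data[1].
struct bfloat16x2_old { bfloat16_t data[2]; };

// After: named fields, accessed as pair->x / pair->y, which is what the
// generated "->x" / "->y" member accesses rely on.
struct bfloat16x2 { bfloat16_t x, y; };

int main() {
  // The codegen previously cast through the NVIDIA-specific nv_bfloat162;
  // on HIP it now reinterprets a packed 32-bit lane pair as bfloat16x2.
  unsigned int packed = 0x3F803F80u;  // two arbitrary 16-bit payloads
  bfloat16x2 *pair = (bfloat16x2 *)(&packed);
  printf("x=0x%04x y=0x%04x\n", pair->x, pair->y);
  return 0;
}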

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

Suggested reviewers

  • LeiWang1999

Poem

I twitch my ears at tiles that grew,
From three to four, a tidy view.
Two bfloats dance as x and y,
In HIP they pair and swiftly fly.
I stamp my paw—tests still hum—
Bits aligned, the carrots come! 🥕

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
  • Docstring Coverage (⚠️ Warning): Docstring coverage is 0.00%, which is insufficient; the required threshold is 80.00%. Resolution: run @coderabbitai generate docstrings to improve docstring coverage.
✅ Passed checks (2 passed)
  • Description Check (✅ Passed): Check skipped - CodeRabbit’s high-level summary is enabled.
  • Title Check (✅ Passed): The title "[AMD] fix bf16x2 dtype codegen" is concise and accurately describes the main intent of the changeset—fixing bf16x2/bfloat16 code generation for the AMD/HIP target as evidenced by updates to codegen and the bfloat16x2 struct. It is specific, relevant to the modified files, and clear for a reviewer scanning history.
✨ Finishing touches
  • 📝 Generate Docstrings
🧪 Generate unit tests
  • Create PR with unit tests
  • Post copyable unit tests in a comment

Tip

👮 Agentic pre-merge checks are now available in preview!

Pro plan users can now enable pre-merge checks in their settings to enforce checklists before merging PRs.

  • Built-in checks – Quickly apply ready-made checks to enforce title conventions, require pull request descriptions that follow templates, validate linked issues for compliance, and more.
  • Custom agentic checks – Define your own rules using CodeRabbit’s advanced agentic capabilities to enforce organization-specific policies and workflows. For example, you can instruct CodeRabbit’s agent to verify that API documentation is updated whenever API schema files are modified in a PR. Note: Up to 5 custom checks are currently allowed during the preview period. Pricing for this feature will be announced in a few weeks.

Please see the documentation for more information.

Example:

reviews:
  pre_merge_checks:
    custom_checks:
      - name: "Undocumented Breaking Changes"
        mode: "warning"
        instructions: |
          Pass/fail criteria: All breaking changes to public APIs, CLI flags, environment variables, configuration keys, database schemas, or HTTP/GraphQL endpoints must be documented in the "Breaking Change" section of the PR description and in CHANGELOG.md. Exclude purely internal or private changes (e.g., code not exported from package entry points or explicitly marked as internal).

Please share your feedback with us on this Discord post.




Comment @coderabbitai help to get the list of available commands and usage tips.

@github-actions

👋 Hi! Thank you for contributing to the TileLang project.

Please remember to run bash format.sh in the root directory of the project to ensure your changes are properly linted and formatted. This will help ensure your contribution passes the format check.

We appreciate you taking this step! Our team will review your contribution, and we look forward to your awesome work!

🚀

@gemini-code-assist gemini-code-assist bot (Contributor) left a comment


Summary of Changes

Hello @Paran0idy, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request addresses an issue with bfloat16x2 data type codegen for AMD's HIP backend. It corrects the type casting used in the codegen and refines the bfloat16x2 struct definition for better compatibility. Additionally, a related test case has been adjusted to ensure accurate shape calculations.

Highlights

  • HIP Codegen Correction: The codegen for bfloat16x2 data types on HIP targets has been fixed by replacing the NVIDIA-specific nv_bfloat162 type with the more generic bfloat16x2.
  • bfloat16x2 Struct Definition: The bfloat16x2 struct in common.h has been updated to define its components as x and y members instead of a data[2] array, aligning with common vector type conventions.
  • Test Shape Adjustment: A test for TileLang GEMM with MFMA intrinsics for AMD has been updated to correctly calculate the C_shared_shape by including block_M // micro_size_x.
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page; here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature | Command | Description
Code Review | /gemini review | Performs a code review for the current pull request in its current state.
Pull Request Summary | /gemini summary | Provides a summary of the current pull request in its current state.
Comment | @gemini-code-assist | Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help | /gemini help | Displays a list of available commands.

Customization

To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.

You can also get AI-powered code generation, chat, and code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

@gemini-code-assist gemini-code-assist bot (Contributor) left a comment


Code Review

This pull request fixes bf16x2 dtype codegen for AMD targets. The changes involve updating the bfloat16x2 struct definition to use named members x and y and modifying the HIP codegen to use this new struct, replacing the NVIDIA-specific nv_bfloat162. A related fix is included for a test file.

The changes are correct and address the issue. I've added a high-severity comment in codegen_hip.cc to fix a latent bug in vector element loading for 2-lane vectors, which would cause compilation errors. I've also added a medium-severity comment in common.h suggesting that the other bfloat16xN structs be made consistent with the new bfloat16x2 definition for better maintainability.

Comment on lines +483 to 484
os << "((bfloat16x2*)(&(" << vec << "." << access[i / 2] << ")))->"
<< access[i % 2];

Severity: high

While this change from nv_bfloat162 to bfloat16x2 is correct for HIP, there's an underlying issue with vector element loading for 2-element vectors.

For a bfloat16 vector with 2 lanes, PrintType generates a scalar uint type for the vector variable vec. The current code then attempts to access a member (e.g., .x) of this scalar uint (vec << "." << access[i / 2]), which is incorrect and will cause compilation errors in the generated code.

The logic should handle 2-lane vectors (represented as a scalar uint) differently from wider vectors (e.g., uint2, uint4). For the 2-lane case, the address of the scalar vec should be cast directly, without member access.

A similar issue exists for float16 vectors on lines 480-481.

Suggested change

-os << "((bfloat16x2*)(&(" << vec << "." << access[i / 2] << ")))->"
-   << access[i % 2];
+os << "((bfloat16x2*)(&(" << (t.lanes() == 2 ? vec : (vec + "." + access[i / 2])) << ")))->"
+   << access[i % 2];
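
For reviewers who want to see what the suggestion changes in the emitted HIP source, below is a rough host-side sketch of the two cases the comment describes, assuming a 2-lane bf16 vector stored as a plain uint and a 4-lane vector stored as a uint2; the names v2, v4, bfloat16_t, and uint2_sim are hypothetical stand-ins, not code from the repository:

#include <cstdio>

typedef unsigned short bfloat16_t;        // stand-in for the HIP type
struct bfloat16x2 { bfloat16_t x, y; };   // named fields, as in this PR
struct uint2_sim { unsigned int x, y; };  // stand-in for HIP's uint2

int main() {
  // 2 lanes: PrintType yields a scalar uint, so the emitted code must take
  // the address of the scalar itself; "v2.x" on a plain uint would not compile.
  unsigned int v2 = 0x40004000u;
  bfloat16_t lane1 = ((bfloat16x2 *)(&v2))->y;

  // 4+ lanes: the vector is a uint2/uint4, so ".x"/".y" first selects the
  // 32-bit chunk holding the two target lanes, then the cast exposes them.
  uint2_sim v4 = {0x3F803F80u, 0x40004000u};
  bfloat16_t lane3 = ((bfloat16x2 *)(&v4.y))->y;

  printf("lane1=0x%04x lane3=0x%04x\n", lane1, lane3);
  return 0;
}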


 struct bfloat16x2 {
-  bfloat16_t data[2];
+  bfloat16_t x, y;

Severity: medium

This change to bfloat16x2 is correct to support member access like .x and .y. For consistency, you should consider updating bfloat16x4, bfloat16x8, and bfloat16x16 to also use named members (x, y, z, w, etc.) or nested structs instead of a data array. This would make the API for these vector-like structs more uniform and easier to use.
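
One possible shape for that consistency change is sketched below; this is only an illustration of the suggestion, not code from the PR, and the nested-pair layout for the wider types is an assumption:

typedef unsigned short bfloat16_t;  // stand-in for the HIP type

// bfloat16x2 as updated in this PR.
struct bfloat16x2 { bfloat16_t x, y; };

// One way the wider structs could follow suit: named members for x4 and
// nested pairs for x8/x16, so each element stays reachable through member
// access instead of data[] indexing.
struct bfloat16x4 { bfloat16_t x, y, z, w; };
struct bfloat16x8 { bfloat16x4 lo, hi; };
struct bfloat16x16 { bfloat16x8 lo, hi; };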

@coderabbitai coderabbitai bot (Contributor) left a comment


Actionable comments posted: 0

🧹 Nitpick comments (1)
testing/python/amd/test_tilelang_gemm_mfma_intrinsic.py (1)

214-214: Fix typo in matrix transpose operation.

There's a typo in the matrix transpose operation - A.Tto should be A.T.to.

Apply this diff to fix the typo:

-        ref_c = torch.matmul(A.Tto(torch.float32),
+        ref_c = torch.matmul(A.T.to(torch.float32),
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between ebea77d and 944b0a1.

📒 Files selected for processing (3)
  • src/target/codegen_hip.cc (1 hunks)
  • src/tl_templates/hip/common.h (1 hunks)
  • testing/python/amd/test_tilelang_gemm_mfma_intrinsic.py (1 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: bot-task
🔇 Additional comments (3)
src/tl_templates/hip/common.h (1)

69-71: LGTM! Improved field access pattern for bfloat16x2.

The change from an array data[2] to named fields x, y aligns with HIP's standard bfloat16x2 conventions. Based on the AMD HIP documentation, this pattern is consistent with the __hip_bfloat162 structure which uses named x, y fields. This improves code readability and maintains compatibility with HIP's vectorized type access patterns.

testing/python/amd/test_tilelang_gemm_mfma_intrinsic.py (1)

58-63: LGTM! C_shared shape expanded to align with micro-tile access pattern.

The change from 3D to 4D shared memory layout creates a more structured tiling pattern that aligns with the micro-tile sizes. The new shape (block_M // micro_size_x, block_N // micro_size_y, micro_size_x, micro_size_y) enables efficient access patterns for the updated bfloat16 vectorization, where elements can be accessed using the pattern shown in lines 155-160.

This change is consistent with the bfloat16x2 improvements in the HIP codegen and header files, ensuring proper memory layout for vectorized operations.
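
As a rough illustration of the index mapping this describes, the sketch below maps a flat (i, j) element position within the block tile to the four C_shared coordinates; the tile sizes used here are hypothetical placeholders rather than values taken from the test:

#include <cstdio>

// Hypothetical micro-tile sizes for illustration; the test derives the 4D
// shape from block_M, block_N, micro_size_x, micro_size_y.
constexpr int micro_size_x = 16;
constexpr int micro_size_y = 16;

// Map element (i, j) of the block tile onto the 4D C_shared layout
// (block_M // micro_size_x, block_N // micro_size_y, micro_size_x, micro_size_y).
void c_shared_index(int i, int j, int idx[4]) {
  idx[0] = i / micro_size_x;  // which micro-tile row within the block
  idx[1] = j / micro_size_y;  // which micro-tile column within the block
  idx[2] = i % micro_size_x;  // row inside the micro-tile
  idx[3] = j % micro_size_y;  // column inside the micro-tile
}

int main() {
  int idx[4];
  c_shared_index(37, 50, idx);
  printf("(%d, %d, %d, %d)\n", idx[0], idx[1], idx[2], idx[3]);  // prints (2, 3, 5, 2)
  return 0;
}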

src/target/codegen_hip.cc (1)

483-484: Approve — bfloat16x2 access matches updated struct.

src/tl_templates/hip/common.h defines struct bfloat16x2 { bfloat16_t x, y; } and src/target/codegen_hip.cc (lines 483–484) uses ((bfloat16x2*)(&(...)))->access[i%2], which correctly maps to x/y. Search found no remaining bfloat16x2.data[...] usages.
