[AMD] fix bf16x2 dtype codegen #847
Conversation
Walkthrough

Updates adjust HIP bfloat16 vector handling and modify a GEMM test's shared-memory tiling. Specifically, the HIP codegen switches a load cast to bfloat16x2, the bfloat16x2 struct changes from an array to named fields, and the test expands C_shared to a 4D layout with corresponding index mapping.
Summary of Changes
Hello @Paran0idy, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
This pull request addresses an issue with bfloat16x2 data type codegen for AMD's HIP backend. It corrects the type casting used in the codegen and refines the bfloat16x2 struct definition for better compatibility. Additionally, a related test case has been adjusted to ensure accurate shape calculations.
Highlights
- HIP Codegen Correction: The codegen for bfloat16x2 data types on HIP targets has been fixed by replacing the NVIDIA-specific `nv_bfloat162` type with the more generic `bfloat16x2`.
- bfloat16x2 Struct Definition: The `bfloat16x2` struct in `common.h` has been updated to define its components as `x` and `y` members instead of a `data[2]` array, aligning with common vector type conventions.
- Test Shape Adjustment: A test for TileLang GEMM with MFMA intrinsics for AMD has been updated to correctly calculate the `C_shared_shape` by including `block_M // micro_size_x`.
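The reshaped shared buffer described above can be sketched in Python. The concrete tile sizes below are hypothetical stand-ins for illustration; the PR only changes the shape expression, not these values:

```python
# Hypothetical tile sizes chosen for illustration only.
block_M, block_N = 128, 128
micro_size_x, micro_size_y = 16, 16

# 4D C_shared layout: one (micro_size_x, micro_size_y) micro-tile per cell
# of a (block_M // micro_size_x, block_N // micro_size_y) grid.
C_shared_shape = (
    block_M // micro_size_x,
    block_N // micro_size_y,
    micro_size_x,
    micro_size_y,
)

# The reshape does not change the total element count.
num_elements = 1
for dim in C_shared_shape:
    num_elements *= dim
```

The micro-tile grid dimensions come first so that a whole micro-tile is contiguous in the innermost two axes.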
Code Review
This pull request fixes bf16x2 dtype codegen for AMD targets. The changes involve updating the bfloat16x2 struct definition to use named members x and y and modifying the HIP codegen to use this new struct, replacing the NVIDIA-specific nv_bfloat162. A related fix is included for a test file.
The changes are correct and address the issue. I've added a high-severity comment in codegen_hip.cc to fix a latent bug in vector element loading for 2-lane vectors, which would cause compilation errors. I've also added a medium-severity comment in common.h suggesting to make other bfloat16xN structs consistent with the new bfloat16x2 definition for better maintainability.
os << "((bfloat16x2*)(&(" << vec << "." << access[i / 2] << ")))->"
   << access[i % 2];
While this change from nv_bfloat162 to bfloat16x2 is correct for HIP, there's an underlying issue with vector element loading for 2-element vectors.
For a bfloat16 vector with 2 lanes, PrintType generates a scalar uint type for the vector variable vec. The current code then attempts to access a member (e.g., .x) of this scalar uint (vec << "." << access[i / 2]), which is incorrect and will cause compilation errors in the generated code.
The logic should handle 2-lane vectors (represented as a scalar uint) differently from wider vectors (e.g., uint2, uint4). For the 2-lane case, the address of the scalar vec should be cast directly, without member access.
A similar issue exists for float16 vectors on lines 480-481.
Suggested change:

- os << "((bfloat16x2*)(&(" << vec << "." << access[i / 2] << ")))->"
-    << access[i % 2];
+ os << "((bfloat16x2*)(&(" << (t.lanes() == 2 ? vec : (vec + "." + access[i / 2])) << ")))->"
+    << access[i % 2];
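The branching this suggestion introduces can be modeled as a small Python sketch of the emitted C expression. The helper name `lane_expr` and the access table are illustrative stand-ins, not the actual codegen API:

```python
# Lane-name table the codegen indexes into, as described in the review.
ACCESS = ("x", "y", "z", "w")

def lane_expr(vec: str, lanes: int, i: int) -> str:
    """Illustrative model of the emitted C expression for bf16 lane i.

    A 2-lane vector is stored as a bare scalar (uint), so its address is
    cast directly; wider vectors (uint2/uint4) first select a 32-bit
    member such as .x or .y before the cast.
    """
    base = vec if lanes == 2 else f"{vec}.{ACCESS[i // 2]}"
    return f"((bfloat16x2*)(&({base})))->{ACCESS[i % 2]}"
```

For a 2-lane vector this yields `((bfloat16x2*)(&(v)))->y` for lane 1, with no member access on the scalar, which is the compilation error the review points out.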
struct bfloat16x2 {
-  bfloat16_t data[2];
+  bfloat16_t x, y;
This change to bfloat16x2 is correct to support member access like .x and .y. For consistency, you should consider updating bfloat16x4, bfloat16x8, and bfloat16x16 to also use named members (x, y, z, w, etc.) or nested structs instead of a data array. This would make the API for these vector-like structs more uniform and easier to use.
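A quick way to see that the named-member layout is byte-compatible with the old `data[2]` array is to model the 32-bit packing in Python. The bit patterns below are arbitrary stand-ins for bfloat16 values, and little-endian layout is assumed:

```python
import struct

def pack_lanes(x_bits: int, y_bits: int) -> int:
    """Pack two 16-bit lane patterns into one little-endian 32-bit word,
    mirroring how both {bfloat16_t x, y;} and {bfloat16_t data[2];} lay
    out memory: x / data[0] in the low half, y / data[1] in the high half."""
    return struct.unpack("<I", struct.pack("<HH", x_bits, y_bits))[0]

def unpack_lanes(word: int):
    """Inverse mapping: recover the (x, y) lane patterns from the word."""
    return struct.unpack("<HH", struct.pack("<I", word))

word = pack_lanes(0x3F80, 0x4000)  # arbitrary bf16 bit patterns
```

Because only field names change and not the layout, existing packed loads and stores keep working after the struct update.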
Actionable comments posted: 0
🧹 Nitpick comments (1)
testing/python/amd/test_tilelang_gemm_mfma_intrinsic.py (1)

214-214: Fix typo in matrix transpose operation.

There's a typo in the matrix transpose operation: `A.Tto` should be `A.T.to`. Apply this diff to fix it:

- ref_c = torch.matmul(A.Tto(torch.float32),
+ ref_c = torch.matmul(A.T.to(torch.float32),
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
- src/target/codegen_hip.cc (1 hunk)
- src/tl_templates/hip/common.h (1 hunk)
- testing/python/amd/test_tilelang_gemm_mfma_intrinsic.py (1 hunk)
⏰ Context from checks skipped due to timeout of 90000ms. (1)
- GitHub Check: bot-task
🔇 Additional comments (3)
src/tl_templates/hip/common.h (1)
69-71: LGTM! Improved field access pattern for bfloat16x2.

The change from an array `data[2]` to named fields `x, y` aligns with HIP's standard bfloat16x2 conventions. Based on the AMD HIP documentation, this pattern is consistent with the `__hip_bfloat162` structure, which uses named `x, y` fields. This improves code readability and maintains compatibility with HIP's vectorized type access patterns.

testing/python/amd/test_tilelang_gemm_mfma_intrinsic.py (1)
58-63: LGTM! C_shared shape expanded to align with micro-tile access pattern.

The change from a 3D to a 4D shared-memory layout creates a more structured tiling pattern that aligns with the micro-tile sizes. The new shape `(block_M // micro_size_x, block_N // micro_size_y, micro_size_x, micro_size_y)` enables efficient access patterns for the updated bfloat16 vectorization, where elements can be accessed using the pattern shown in lines 155-160.

This change is consistent with the bfloat16x2 improvements in the HIP codegen and header files, ensuring proper memory layout for vectorized operations.
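The index mapping implied by that 4D shape can be sketched as a pair of small helpers. The function names are illustrative, not identifiers from the test file:

```python
def to_micro_tiled(i, j, micro_size_x, micro_size_y):
    """Map a flat (i, j) element of the block_M x block_N tile onto the
    4D C_shared layout: the first two coordinates select the micro-tile,
    the last two select the element within it."""
    return (i // micro_size_x, j // micro_size_y,
            i % micro_size_x, j % micro_size_y)

def from_micro_tiled(ti, tj, ii, jj, micro_size_x, micro_size_y):
    """Inverse mapping back to flat (i, j) coordinates in the block."""
    return (ti * micro_size_x + ii, tj * micro_size_y + jj)
```

The round trip is exact for any in-range element, so the reshape only regroups elements into contiguous micro-tiles without reordering the data they hold.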
src/target/codegen_hip.cc (1)
483-484: Approve — bfloat16x2 access matches updated struct.

src/tl_templates/hip/common.h defines `struct bfloat16x2 { bfloat16_t x, y; }` and src/target/codegen_hip.cc (lines 483-484) uses `((bfloat16x2*)(&(...)))->access[i%2]`, which correctly maps to `x`/`y`. Search found no remaining `bfloat16x2.data[...]` usages.