[AMD] fix bugs in warp shuffle #790
Note: CodeRabbit has detected other AI code review bot(s) in this pull request and will avoid duplicating their findings in the review comments. This may lead to a less comprehensive review.

Walkthrough

Adds a HIP availability check and a module flag, used by the warp-shuffle wrappers to choose between HIP and CUDA intrinsics.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    autonumber
    participant C as Caller
    participant B as builtin.shuffle (wrapper)
    participant F as _IS_HIP_AVAILABLE (module flag)
    participant H as HIP intrinsic (__shfl_*)
    participant U as CUDA intrinsic (__shfl_*_sync)
    C->>B: shfl_{xor,down,up}(value, offset)
    B->>F: read _IS_HIP_AVAILABLE
    alt HIP available (true)
        B->>H: __shfl_*(value, offset)
        H-->>B: result
    else HIP not available (false)
        B->>U: __shfl_*_sync(0xffffffff, value, offset)
        U-->>B: result
    end
    B-->>C: result
```
Estimated code review effort: 🎯 2 (Simple) | ⏱️ ~10 minutes
👋 Hi! Thank you for contributing to the TileLang project. Please remember to run the project's format checks before requesting review. We appreciate you taking this step! Our team will review your contribution, and we look forward to your awesome work! 🚀
Summary of Changes
Hello @txs19991, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
This pull request resolves critical stability issues on AMD GPUs related to warp shuffle operations, which previously led to core dumps. By intelligently adapting the shuffle intrinsic calls based on the target platform, it ensures correct and more performant execution on AMD hardware, leveraging their unique architectural guarantees.
Highlights
- AMD Warp Shuffle Fix: Addresses core dumps on AMD GPUs when using `shfl_xor`, `shfl_down`, and `shfl_up` by adapting the shuffle operations for AMD's 64-lane warps.
- Conditional Shuffle Implementation: Implements conditional logic to use non-synchronizing shuffle intrinsics (`__shfl_xor`, `__shfl_down`, `__shfl_up`) when running on AMD platforms (checked via `check_hip_availability`), while retaining the `_sync` versions for other platforms.
- Performance Improvement: Leverages AMD GPUs' lockstep execution guarantee to use shuffle operations without additional synchronization, leading to improved performance.
Code Review
This pull request correctly fixes a bug with warp shuffle operations on AMD GPUs by using the appropriate non-synchronized shuffle intrinsics for HIP environments. My review includes suggestions to improve code readability and performance.
Specifically, I've pointed out that check_hip_availability() is called multiple times, which could be inefficient. Caching the result at the module level would be a better approach. Additionally, the conditional logic in the shuffle functions is written using long ternary expressions, which harm readability. I've suggested refactoring these into if/else blocks.
Addressing these points will make the code more maintainable and performant.
```python
from tilelang import tvm as tvm
from tilelang.language import ptx_arrive_barrier, evaluate
from tilelang.language.kernel import get_thread_bindings, get_block_extents
from tilelang.utils.target import check_hip_availability
```
The imported function check_hip_availability() is called every time one of the shuffle functions (shfl_xor, shfl_down, shfl_up) is invoked. The implementation of this check may perform file system lookups, which can be inefficient if called repeatedly. To improve performance, the result of this check should be cached at the module level.
For example:

```python
# After imports
_IS_HIP_AVAILABLE = check_hip_availability()

# In shuffle functions
if _IS_HIP_AVAILABLE:
    # ...
```
tilelang/language/builtin.py (outdated)
```diff
         tir.Call: A handle to the shuffle operation
     """
-    return tir.call_extern(value.dtype, "__shfl_xor_sync", 0xffffffff, value, offset)
+    return tir.call_extern(value.dtype, "__shfl_xor", value, offset) if check_hip_availability() else tir.call_extern(value.dtype, "__shfl_xor_sync", 0xffffffff, value, offset)
```
This line is quite long and can be difficult to read. Using an if/else block would improve code clarity and maintainability.
Suggested change:

```python
    if check_hip_availability():
        return tir.call_extern(value.dtype, "__shfl_xor", value, offset)
    else:
        return tir.call_extern(value.dtype, "__shfl_xor_sync", 0xffffffff, value, offset)
```
tilelang/language/builtin.py (outdated)
```diff
         The value to shuffle
     """
-    return tir.call_extern(value.dtype, "__shfl_down_sync", 0xffffffff, value, offset)
+    return tir.call_extern(value.dtype, "__shfl_down", value, offset) if check_hip_availability() else tir.call_extern(value.dtype, "__shfl_down_sync", 0xffffffff, value, offset)
```
For consistency and readability, this line should also be refactored into an if/else block, similar to the suggestion for shfl_xor.
Suggested change:

```python
    if check_hip_availability():
        return tir.call_extern(value.dtype, "__shfl_down", value, offset)
    else:
        return tir.call_extern(value.dtype, "__shfl_down_sync", 0xffffffff, value, offset)
```
tilelang/language/builtin.py (outdated)
```diff
         The value to shuffle
     """
-    return tir.call_extern(value.dtype, "__shfl_up_sync", 0xffffffff, value, offset)
+    return tir.call_extern(value.dtype, "__shfl_up", value, offset) if check_hip_availability() else tir.call_extern(value.dtype, "__shfl_up_sync", 0xffffffff, value, offset)
```
To maintain a consistent and readable style across all shuffle functions, please refactor this line into an if/else block.
Suggested change:

```python
    if check_hip_availability():
        return tir.call_extern(value.dtype, "__shfl_up", value, offset)
    else:
        return tir.call_extern(value.dtype, "__shfl_up_sync", 0xffffffff, value, offset)
```
Actionable comments posted: 4
🧹 Nitpick comments (1)
tilelang/language/builtin.py (1)
299-319: Consider an optional width to preserve 32-lane algorithms on AMD (wave64).

HIP defaults width=warpSize (64 on AMD). If existing kernels assume 32-lane reductions, expose an optional width to force 32 when needed, e.g., __shfl_down(val, off, 32). Backward-compatible API:

```diff
-def shfl_xor(value: Union[int, PrimExpr, tir.Call], offset: Union[int, PrimExpr, tir.Call]):
+def shfl_xor(value: Union[int, PrimExpr, tir.Call], offset: Union[int, PrimExpr, tir.Call],
+             width: Union[int, PrimExpr, None] = None):
@@
-    if _is_hip_target():
-        return tir.call_extern(value.dtype, "__shfl_xor", value, offset)
+    if _is_hip_target():
+        return tir.call_extern(value.dtype, "__shfl_xor", value, offset) if width is None \
+            else tir.call_extern(value.dtype, "__shfl_xor", value, offset, width)
@@
-    return tir.call_extern(value.dtype, "__shfl_xor_sync", active, value, offset)
+    return tir.call_extern(value.dtype, "__shfl_xor_sync", active, value, offset) if width is None \
+        else tir.call_extern(value.dtype, "__shfl_xor_sync", active, value, offset, width)
```

Apply similarly to shfl_down and shfl_up. If you prefer to keep the public API unchanged, at least audit call sites that rely on 32-wide behavior. Would you like a quick grep script to find likely 32-lane reduction loops?
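The wave32/wave64 width concern can be illustrated with a lane-level simulation; `shfl_down_sim` and `warp_reduce_sum` are hypothetical pure-Python helpers that model the intrinsic's documented behavior, not TileLang code:

```python
def shfl_down_sim(lanes, offset, width):
    """Model __shfl_down(value, offset, width): lane i reads lane i+offset
    within its width-sized sub-group, otherwise keeps its own value."""
    out = []
    for i, v in enumerate(lanes):
        src = i + offset
        # A shuffle never crosses a width-aligned sub-group boundary.
        if src < len(lanes) and src // width == i // width:
            out.append(lanes[src])
        else:
            out.append(v)
    return out


def warp_reduce_sum(lanes, width):
    """Tree reduction via shfl_down; lane 0 of each sub-group ends with its sum."""
    off = width // 2
    while off > 0:
        shifted = shfl_down_sim(lanes, off, width)
        lanes = [a + b for a, b in zip(lanes, shifted)]
        off //= 2
    return lanes


wave = list(range(64))                                 # one AMD wave64 of lanes
assert warp_reduce_sum(wave, 32)[0] == sum(range(32))  # 32-lane algorithm: half the wave
assert warp_reduce_sum(wave, 64)[0] == sum(range(64))  # HIP default width=warpSize=64
```

This is exactly why a reduction written for 32-lane warps silently changes meaning when the default width becomes 64.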
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (1)
tilelang/language/builtin.py (4 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
tilelang/language/builtin.py (2)
tilelang/utils/target.py (1)
check_hip_availability (30-40)
tilelang/language/tir/op.py (1)
call_extern (172-194)
```python
from tilelang import tvm as tvm
from tilelang.language import ptx_arrive_barrier, evaluate
from tilelang.language.kernel import get_thread_bindings, get_block_extents
from tilelang.utils.target import check_hip_availability
```
Don't select HIP/CUDA via host availability; dispatch by TVM Target.
check_hip_availability() reflects the build host, not the compilation target. On machines with both ROCm and CUDA installed (or during cross-compilation), this can emit HIP intrinsics while targeting CUDA (or vice versa), leading to compile errors or miscompiled kernels.
Introduce a target-aware helper and use it in the shfl wrappers.
```diff
+def _is_hip_target() -> bool:
+    tgt = tvm.target.Target.current(allow_none=True)
+    if tgt is not None:
+        kind = getattr(tgt, "kind", None)
+        name = getattr(kind, "name", "")
+        return name in ("rocm", "hip", "amdgpu")
+    # Fallback for contexts where Target is not set yet.
+    return check_hip_availability()
```

Committable suggestion skipped: line range outside the PR's diff.
tilelang/language/builtin.py (outdated)
```diff
         tir.Call: A handle to the shuffle operation
     """
-    return tir.call_extern(value.dtype, "__shfl_xor_sync", 0xffffffff, value, offset)
+    return tir.call_extern(value.dtype, "__shfl_xor", value, offset) if check_hip_availability() else tir.call_extern(value.dtype, "__shfl_xor_sync", 0xffffffff, value, offset)
```
💡 Verification agent
🧩 Analysis chain
Use target-aware dispatch and pass the active mask (not 0xffffffff) on CUDA.
- Switch to _is_hip_target() (see previous comment).
- Use __activemask() to avoid deadlocks in divergent warps; 0xffffffff assumes all lanes active.
```diff
-    return tir.call_extern(value.dtype, "__shfl_xor", value, offset) if check_hip_availability() else tir.call_extern(value.dtype, "__shfl_xor_sync", 0xffffffff, value, offset)
+    if _is_hip_target():
+        return tir.call_extern(value.dtype, "__shfl_xor", value, offset)
+    active = tir.call_extern("uint32", "__activemask")
+    return tir.call_extern(value.dtype, "__shfl_xor_sync", active, value, offset)
```

Note: If callers pass Python ints for value, value.dtype will fail. If that's intended, consider wrapping ints with tir.const(..) at call sites or enforcing PrimExpr here.
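The int-vs-PrimExpr note above can be sketched as a minimal coercion guard; `Const` and `as_typed` are hypothetical stand-ins for `tir.const` and typed expressions, not the TVM API:

```python
class Const:
    """Minimal stand-in for a typed constant expression (cf. tir.const)."""

    def __init__(self, value, dtype):
        self.value = value
        self.dtype = dtype


def as_typed(value, default_dtype="int32"):
    """Coerce a plain Python int into a typed wrapper so callers can always
    read .dtype; anything else is assumed to expose .dtype already."""
    if isinstance(value, int):
        return Const(value, default_dtype)
    return value


assert as_typed(7).dtype == "int32"                        # plain int gets a dtype
assert as_typed(Const(1.5, "float32")).dtype == "float32"  # typed values pass through
```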
🏁 Script executed:
```bash
#!/bin/bash
set -e
echo "Searching for check_hip_availability definitions and usages"
rg -n "def check_hip_availability" -C2 .
rg -n "check_hip_availability\(" -C2 .
echo "Searching for _is_hip_target definitions and usages"
rg -n "def _is_hip_target" -C2 .
rg -n "_is_hip_target\(" -C2 .
echo "Searching for __activemask usages"
rg -n "__activemask" -C2 .
echo "Inspecting shfl_xor implementation context"
rg -n "shfl_xor" -C3 tilelang/language/builtin.py
```

Length of output: 593
🏁 Script executed:
```bash
#!/bin/bash
set -e
echo "=== Opening tilelang/utils/target.py (lines 1-200) ==="
sed -n '1,200p' tilelang/utils/target.py || true
echo "=== Searching for is_hip_target or hip_target aliases ==="
rg -n "hip_target" -C2 tilelang/utils/target.py || true
rg -n "is_hip_target" -C2 tilelang/utils/target.py || true
echo "=== Searching for CUDA/ROCm availability utilities ==="
rg -n "check_cuda" -C2 tilelang/utils/target.py || true
rg -n "cuda" -C2 tilelang/utils/target.py || true
echo "=== Inspecting shfl_xor context in builtin.py (lines 250-350) ==="
sed -n '250,350p' tilelang/language/builtin.py || true
echo "=== Searching for other shuffle/intrinsic patterns ==="
rg -n "__shfl" -C5 tilelang/language/builtin.py || true
rg -n "activemask" -C2 tilelang/language/builtin.py || true
```

Length of output: 9113
🏁 Script executed:
```bash
#!/bin/bash
set -e
echo "=== Opening tilelang/language/builtin.py (lines 1-50) ==="
sed -n '1,50p' tilelang/language/builtin.py
```

Length of output: 1632
Use compile-time target dispatch and real active mask
Switch from the host‐availability check to inspecting the compilation target via tvm.target.Target.current(), and replace the hard-coded 0xffffffff with __activemask(). For example:
```diff
-from tilelang.utils.target import check_hip_availability
+from tilelang import tvm

 def shfl_xor(value: Union[int, PrimExpr, tir.Call],
              offset: Union[int, PrimExpr, tir.Call]):
-    return tir.call_extern(value.dtype, "__shfl_xor", value, offset) \
-        if check_hip_availability() \
-        else tir.call_extern(value.dtype, "__shfl_xor_sync", 0xffffffff, value, offset)
+    tgt = tvm.target.Target.current(allow_none=False)
+    if tgt.kind.name == "rocm":
+        return tir.call_extern(value.dtype, "__shfl_xor", value, offset)
+    active = tir.call_extern("uint32", "__activemask")
+    return tir.call_extern(value.dtype, "__shfl_xor_sync", active, value, offset)
```

Apply the same pattern to shfl_down and shfl_up.
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
Suggested change:

```python
from tilelang import tvm

def shfl_xor(value: Union[int, PrimExpr, tir.Call],
             offset: Union[int, PrimExpr, tir.Call]):
    tgt = tvm.target.Target.current(allow_none=False)
    if tgt.kind.name == "rocm":
        return tir.call_extern(value.dtype, "__shfl_xor", value, offset)
    active = tir.call_extern("uint32", "__activemask")
    return tir.call_extern(value.dtype, "__shfl_xor_sync", active, value, offset)
```
tilelang/language/builtin.py (outdated)
```diff
         The value to shuffle
     """
-    return tir.call_extern(value.dtype, "__shfl_down_sync", 0xffffffff, value, offset)
+    return tir.call_extern(value.dtype, "__shfl_down", value, offset) if check_hip_availability() else tir.call_extern(value.dtype, "__shfl_down_sync", 0xffffffff, value, offset)
```
🛠️ Refactor suggestion
Mirror the XOR fix for shfl_down.
Apply target-aware dispatch and active mask usage.
```diff
-    return tir.call_extern(value.dtype, "__shfl_down", value, offset) if check_hip_availability() else tir.call_extern(value.dtype, "__shfl_down_sync", 0xffffffff, value, offset)
+    if _is_hip_target():
+        return tir.call_extern(value.dtype, "__shfl_down", value, offset)
+    active = tir.call_extern("uint32", "__activemask")
+    return tir.call_extern(value.dtype, "__shfl_down_sync", active, value, offset)
```

📝 Committable suggestion
Suggested change:

```python
    if _is_hip_target():
        return tir.call_extern(value.dtype, "__shfl_down", value, offset)
    active = tir.call_extern("uint32", "__activemask")
    return tir.call_extern(value.dtype, "__shfl_down_sync", active, value, offset)
```
tilelang/language/builtin.py (outdated)
```diff
         The value to shuffle
     """
-    return tir.call_extern(value.dtype, "__shfl_up_sync", 0xffffffff, value, offset)
+    return tir.call_extern(value.dtype, "__shfl_up", value, offset) if check_hip_availability() else tir.call_extern(value.dtype, "__shfl_up_sync", 0xffffffff, value, offset)
```
🛠️ Refactor suggestion
Mirror the XOR fix for shfl_up.
Apply target-aware dispatch and active mask usage.
```diff
-    return tir.call_extern(value.dtype, "__shfl_up", value, offset) if check_hip_availability() else tir.call_extern(value.dtype, "__shfl_up_sync", 0xffffffff, value, offset)
+    if _is_hip_target():
+        return tir.call_extern(value.dtype, "__shfl_up", value, offset)
+    active = tir.call_extern("uint32", "__activemask")
+    return tir.call_extern(value.dtype, "__shfl_up_sync", active, value, offset)
```

📝 Committable suggestion
Suggested change:

```python
    if _is_hip_target():
        return tir.call_extern(value.dtype, "__shfl_up", value, offset)
    active = tir.call_extern("uint32", "__activemask")
    return tir.call_extern(value.dtype, "__shfl_up_sync", active, value, offset)
```
* [AMD] fix bugs in warp shuffle
* format

---------

Co-authored-by: tangxinsheng.txs <tangxinsheng.txs@alibaba-inc.com>
Each warp on AMD contains 64 lanes, so calling T.shfl_xor, T.shfl_down, and T.shfl_up caused a core dump. Moreover, AMD GPUs guarantee that all warp lanes execute in lockstep; therefore, we use shuffle operations without additional synchronization, which provides better performance.
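The butterfly exchange behind T.shfl_xor can be sketched with a lane-level simulation; `shfl_xor_sim` and `xor_reduce_sum` are illustrative pure-Python helpers, not the TileLang API:

```python
def shfl_xor_sim(lanes, mask):
    """Model __shfl_xor: lane i reads the value held by lane i ^ mask."""
    return [lanes[i ^ mask] for i in range(len(lanes))]


def xor_reduce_sum(lanes):
    """Butterfly all-reduce: after log2(n) steps every lane holds the sum.
    The step count is fixed by the warp width (32 on NVIDIA, 64 on AMD),
    which is why a wave64 warp needs one more exchange than a 32-lane one."""
    mask = len(lanes) // 2
    while mask > 0:
        partner = shfl_xor_sim(lanes, mask)
        lanes = [a + b for a, b in zip(lanes, partner)]
        mask //= 2
    return lanes


wave = list(range(64))  # AMD wave64
assert all(v == sum(range(64)) for v in xor_reduce_sum(wave))
```

Because all 64 lanes run these exchange steps in lockstep on AMD hardware, no extra synchronization is needed between steps.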