
Conversation


@txs19991 txs19991 commented Sep 5, 2025

Each warp on an AMD GPU contains 64 lanes, so calling T.shfl_xor, T.shfl_down, or T.shfl_up (which previously lowered to the CUDA __shfl_*_sync intrinsics) causes a core dump. Moreover, AMD GPUs guarantee that all lanes of a warp execute in lockstep, so we can use the shuffle operations without additional synchronization, which gives better performance.
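The dispatch this PR introduces can be sketched in plain Python. Everything below is a hypothetical model, not tilelang code: `emit` stands in for `tir.call_extern`, and the `hip` flag for the result of `check_hip_availability()`.

```python
# Hypothetical sketch of the HIP/CUDA shuffle dispatch this PR adds.
# `emit` stands in for tilelang's tir.call_extern; `hip` for the result
# of check_hip_availability().

FULL_MASK = 0xFFFFFFFF  # CUDA: all 32 lanes assumed active


def emit(intrinsic, *args):
    """Record which device intrinsic would be emitted, with its arguments."""
    return (intrinsic, args)


def shfl_xor(value, offset, hip=False):
    if hip:
        # AMD wavefronts (64 lanes) run in lockstep, so the plain,
        # non-_sync intrinsic is both correct and faster there.
        return emit("__shfl_xor", value, offset)
    return emit("__shfl_xor_sync", FULL_MASK, value, offset)


print(shfl_xor(7, 1, hip=True))   # ('__shfl_xor', (7, 1))
print(shfl_xor(7, 1, hip=False))  # ('__shfl_xor_sync', (4294967295, 7, 1))
```

The same branch is applied to shfl_down and shfl_up; only the intrinsic name changes.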

Summary by CodeRabbit

  • New Features
    • Improved cross-platform GPU compatibility by automatically using native shuffle operations on HIP-enabled environments.
  • Bug Fixes
    • Ensures reliable warp-level data exchange across different GPU backends without requiring changes to user code.
  • Chores
    • Added environment-aware handling for GPU shuffle behavior while keeping the public API and function signatures unchanged.


coderabbitai bot commented Sep 5, 2025


Walkthrough

Adds a HIP availability check and a module flag in tilelang/language/builtin.py. The warp shuffle wrappers (shfl_xor, shfl_down, shfl_up) now query HIP availability and dispatch to HIP intrinsics (__shfl_*) when available; otherwise they call CUDA sync intrinsics (__shfl_*_sync(0xffffffff, ...)). Signatures unchanged.

Changes

Cohort / File(s): HIP-aware shuffle wrappers (tilelang/language/builtin.py)
Summary of changes: Imported check_hip_availability, added module flag _IS_HIP_AVAILABLE = check_hip_availability(), and updated shfl_xor, shfl_down, and shfl_up to call __shfl_* when HIP is available; otherwise they call __shfl_*_sync(0xffffffff, ...). Public function signatures unchanged.

Sequence Diagram(s)

```mermaid
sequenceDiagram
  autonumber
  participant C as Caller
  participant B as builtin.shuffle (wrapper)
  participant F as _IS_HIP_AVAILABLE (module flag)
  participant H as HIP intrinsic (__shfl_*)
  participant U as CUDA intrinsic (__shfl_*_sync)

  C->>B: shfl_{xor,down,up}(value, offset)
  B->>F: read _IS_HIP_AVAILABLE
  alt HIP available (true)
    B->>H: __shfl_*(value, offset)
    H-->>B: result
  else HIP not available (false)
    B->>U: __shfl_*_sync(0xffffffff, value, offset)
    U-->>B: result
  end
  B-->>C: result
```

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes



📜 Recent review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro


📥 Commits

Reviewing files that changed from the base of the PR and between d26fd59 and 2e033c9.

📒 Files selected for processing (1)
  • tilelang/language/builtin.py (4 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • tilelang/language/builtin.py



github-actions bot commented Sep 5, 2025

👋 Hi! Thank you for contributing to the TileLang project.

Please remember to run bash format.sh in the root directory of the project to ensure your changes are properly linted and formatted. This will help ensure your contribution passes the format check.

We appreciate you taking this step! Our team will review your contribution, and we look forward to your awesome work!

🚀

@gemini-code-assist bot left a comment
Summary of Changes

Hello @txs19991, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request resolves critical stability issues on AMD GPUs related to warp shuffle operations, which previously led to core dumps. By intelligently adapting the shuffle intrinsic calls based on the target platform, it ensures correct and more performant execution on AMD hardware, leveraging their unique architectural guarantees.

Highlights

  • AMD Warp Shuffle Fix: Addresses core dumps on AMD GPUs when using shfl_xor, shfl_down, and shfl_up by adapting the shuffle operations for AMD's 64-lane warps.
  • Conditional Shuffle Implementation: Implements conditional logic to use non-synchronizing shuffle intrinsics (__shfl_xor, __shfl_down, __shfl_up) when running on AMD platforms (checked via check_hip_availability), while retaining _sync versions for other platforms.
  • Performance Improvement: Leverages AMD GPU's lockstep execution guarantee to use shuffle operations without additional synchronization, leading to improved performance.

@gemini-code-assist bot left a comment
Code Review

This pull request correctly fixes a bug with warp shuffle operations on AMD GPUs by using the appropriate non-synchronized shuffle intrinsics for HIP environments. My review includes suggestions to improve code readability and performance.

Specifically, I've pointed out that check_hip_availability() is called multiple times, which could be inefficient. Caching the result at the module level would be a better approach. Additionally, the conditional logic in the shuffle functions is written using long ternary expressions, which harm readability. I've suggested refactoring these into if/else blocks.

Addressing these points will make the code more maintainable and performant.

```python
from tilelang import tvm as tvm
from tilelang.language import ptx_arrive_barrier, evaluate
from tilelang.language.kernel import get_thread_bindings, get_block_extents
from tilelang.utils.target import check_hip_availability
```

high

The imported function check_hip_availability() is called every time one of the shuffle functions (shfl_xor, shfl_down, shfl_up) is invoked. The implementation of this check may perform file system lookups, which can be inefficient if called repeatedly. To improve performance, the result of this check should be cached at the module level.

For example:

```python
# After imports
_IS_HIP_AVAILABLE = check_hip_availability()

# In shuffle functions
if _IS_HIP_AVAILABLE:
    # ...
```
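As a toy illustration of this suggestion, the probe below is a hypothetical stand-in for check_hip_availability (it just counts its own invocations), showing that a module-level flag runs the expensive check once at import time rather than on every shuffle call:

```python
# Toy demonstration of module-level caching: the (hypothetical) probe
# runs once at import time, not once per shuffle call.
probe_calls = {"n": 0}


def check_hip_availability():
    probe_calls["n"] += 1  # imagine file-system lookups happening here
    return False


_IS_HIP_AVAILABLE = check_hip_availability()  # evaluated exactly once


def shfl_xor_intrinsic():
    return "__shfl_xor" if _IS_HIP_AVAILABLE else "__shfl_xor_sync"


for _ in range(10_000):
    shfl_xor_intrinsic()
print(probe_calls["n"])  # 1
```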

```diff
         tir.Call: A handle to the shuffle operation
     """
-    return tir.call_extern(value.dtype, "__shfl_xor_sync", 0xffffffff, value, offset)
+    return tir.call_extern(value.dtype, "__shfl_xor", value, offset) if check_hip_availability() else tir.call_extern(value.dtype, "__shfl_xor_sync", 0xffffffff, value, offset)
```

medium

This line is quite long and can be difficult to read. Using an if/else block would improve code clarity and maintainability.

Suggested change

```diff
-    return tir.call_extern(value.dtype, "__shfl_xor", value, offset) if check_hip_availability() else tir.call_extern(value.dtype, "__shfl_xor_sync", 0xffffffff, value, offset)
+    if check_hip_availability():
+        return tir.call_extern(value.dtype, "__shfl_xor", value, offset)
+    else:
+        return tir.call_extern(value.dtype, "__shfl_xor_sync", 0xffffffff, value, offset)
```

```diff
             The value to shuffle
     """
-    return tir.call_extern(value.dtype, "__shfl_down_sync", 0xffffffff, value, offset)
+    return tir.call_extern(value.dtype, "__shfl_down", value, offset) if check_hip_availability() else tir.call_extern(value.dtype, "__shfl_down_sync", 0xffffffff, value, offset)
```

medium

For consistency and readability, this line should also be refactored into an if/else block, similar to the suggestion for shfl_xor.

Suggested change

```diff
-    return tir.call_extern(value.dtype, "__shfl_down", value, offset) if check_hip_availability() else tir.call_extern(value.dtype, "__shfl_down_sync", 0xffffffff, value, offset)
+    if check_hip_availability():
+        return tir.call_extern(value.dtype, "__shfl_down", value, offset)
+    else:
+        return tir.call_extern(value.dtype, "__shfl_down_sync", 0xffffffff, value, offset)
```

```diff
             The value to shuffle
     """
-    return tir.call_extern(value.dtype, "__shfl_up_sync", 0xffffffff, value, offset)
+    return tir.call_extern(value.dtype, "__shfl_up", value, offset) if check_hip_availability() else tir.call_extern(value.dtype, "__shfl_up_sync", 0xffffffff, value, offset)
```

medium

To maintain a consistent and readable style across all shuffle functions, please refactor this line into an if/else block.

Suggested change

```diff
-    return tir.call_extern(value.dtype, "__shfl_up", value, offset) if check_hip_availability() else tir.call_extern(value.dtype, "__shfl_up_sync", 0xffffffff, value, offset)
+    if check_hip_availability():
+        return tir.call_extern(value.dtype, "__shfl_up", value, offset)
+    else:
+        return tir.call_extern(value.dtype, "__shfl_up_sync", 0xffffffff, value, offset)
```

@coderabbitai bot left a comment

Actionable comments posted: 4

🧹 Nitpick comments (1)
tilelang/language/builtin.py (1)

299-319: Consider optional width to preserve 32-lane algorithms on AMD (wave64).

HIP defaults width=warpSize (64 on AMD). If existing kernels assume 32-lane reductions, expose an optional width to force 32 when needed, e.g., __shfl_down(val, off, 32). Backward-compatible API:

```diff
-def shfl_xor(value: Union[int, PrimExpr, tir.Call], offset: Union[int, PrimExpr, tir.Call]):
+def shfl_xor(value: Union[int, PrimExpr, tir.Call], offset: Union[int, PrimExpr, tir.Call], width: Union[int, PrimExpr, None] = None):
@@
-    if _is_hip_target():
-        return tir.call_extern(value.dtype, "__shfl_xor", value, offset)
+    if _is_hip_target():
+        return tir.call_extern(value.dtype, "__shfl_xor", value, offset) if width is None \
+            else tir.call_extern(value.dtype, "__shfl_xor", value, offset, width)
@@
-    return tir.call_extern(value.dtype, "__shfl_xor_sync", active, value, offset)
+    return tir.call_extern(value.dtype, "__shfl_xor_sync", active, value, offset) if width is None \
+        else tir.call_extern(value.dtype, "__shfl_xor_sync", active, value, offset, width)
```

Apply similarly to shfl_down and shfl_up. If you prefer to keep the public API unchanged, at least audit call sites that rely on 32-wide behavior. Would you like a quick grep script to find likely 32-lane reduction loops?
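The lane arithmetic behind such a width parameter can be modeled in plain Python. This is a sketch of the documented HIP/CUDA semantics, where width partitions the warp into independent sub-groups; it is not tilelang code, and the out-of-range behavior (a lane keeping its own value) is a simplifying assumption:

```python
def shfl_down_src(lane, delta, width, warp_size=64):
    """Source lane for __shfl_down(value, delta, width) on a wave of warp_size.

    Lanes whose source would fall outside their width-sized sub-group
    keep their own value (modeled here as reading from themselves).
    """
    group_end = (lane // width) * width + width
    src = lane + delta
    return src if src < group_end else lane


# width=32 recovers the familiar CUDA behavior inside each half of a wave64:
assert shfl_down_src(0, 16, width=32) == 16    # lane 0 reads lane 16
assert shfl_down_src(20, 16, width=32) == 20   # source 36 falls outside lanes 0-31
assert shfl_down_src(40, 16, width=64) == 56   # full-wave shuffle crosses the halves
```

This is why kernels written against 32-lane reductions can silently change meaning on wave64 unless width is pinned to 32.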

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro


📥 Commits

Reviewing files that changed from the base of the PR and between 6e0c350 and d26fd59.

📒 Files selected for processing (1)
  • tilelang/language/builtin.py (4 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
tilelang/language/builtin.py (2)
tilelang/utils/target.py (1)
  • check_hip_availability (30-40)
tilelang/language/tir/op.py (1)
  • call_extern (172-194)

```python
from tilelang import tvm as tvm
from tilelang.language import ptx_arrive_barrier, evaluate
from tilelang.language.kernel import get_thread_bindings, get_block_extents
from tilelang.utils.target import check_hip_availability
```

⚠️ Potential issue

Don't select HIP/CUDA via host availability; dispatch by TVM Target.

check_hip_availability() reflects the build host, not the compilation target. On machines with both ROCm and CUDA installed (or during cross-compilation), this can emit HIP intrinsics while targeting CUDA (or vice versa), leading to compile errors or miscompiled kernels.

Introduce a target-aware helper and use it in the shfl wrappers.

```diff
+def _is_hip_target() -> bool:
+    tgt = tvm.target.Target.current(allow_none=True)
+    if tgt is not None:
+        kind = getattr(tgt, "kind", None)
+        name = getattr(kind, "name", "")
+        return name in ("rocm", "hip", "amdgpu")
+    # Fallback for contexts where Target is not set yet.
+    return check_hip_availability()
```

Committable suggestion skipped: line range outside the PR's diff.

🤖 Prompt for AI Agents
In tilelang/language/builtin.py around line 6, the code imports and uses
check_hip_availability (which reflects the host) to choose HIP vs CUDA; replace
that with a TVM-target-aware helper. Add a small helper (e.g.,
is_target_hip(target) / resolve_accelerator_for_target(target)) that inspects
the provided TVM Target object or target string (checking target.kind.name or
target.arch/target.attrs or substring matching like "rocm"/"amdgcn" vs
"cuda"/"nvptx") and returns a boolean or enum indicating HIP vs CUDA; remove the
import of check_hip_availability and update the shfl wrapper functions to
accept/receive the TVM target (or derive it from context) and call the new
helper to decide which intrinsics to emit so selection is based on compilation
target rather than host availability.
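The host-vs-target distinction the comment raises can be modeled without TVM. In this sketch, `shuffle_intrinsic` is a hypothetical helper and the target-kind names are assumptions modeled on TVM target kinds; the real code would read `tvm.target.Target.current().kind.name`:

```python
# Hypothetical target-kind dispatch: the intrinsic follows the
# compilation target, not what happens to be installed on the host.
HIP_TARGET_KINDS = {"rocm", "hip", "amdgpu"}


def shuffle_intrinsic(target_kind, op):
    """Pick the device intrinsic name from the target kind."""
    if target_kind in HIP_TARGET_KINDS:
        return f"__{op}"           # e.g. __shfl_down on AMD
    return f"__{op}_sync"          # e.g. __shfl_down_sync on CUDA


assert shuffle_intrinsic("rocm", "shfl_xor") == "__shfl_xor"
assert shuffle_intrinsic("cuda", "shfl_xor") == "__shfl_xor_sync"
```

On a machine with both ROCm and CUDA installed, a host probe would pick one answer for every kernel, while this dispatch follows whatever target each kernel is actually compiled for.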

```diff
         tir.Call: A handle to the shuffle operation
     """
-    return tir.call_extern(value.dtype, "__shfl_xor_sync", 0xffffffff, value, offset)
+    return tir.call_extern(value.dtype, "__shfl_xor", value, offset) if check_hip_availability() else tir.call_extern(value.dtype, "__shfl_xor_sync", 0xffffffff, value, offset)
```

💡 Verification agent

🧩 Analysis chain

Use target-aware dispatch and pass the active mask (not 0xffffffff) on CUDA.

  • Switch to _is_hip_target() (see previous comment).
  • Use __activemask() to avoid deadlocks in divergent warps; 0xffffffff assumes all lanes active.
```diff
-    return tir.call_extern(value.dtype, "__shfl_xor", value, offset) if check_hip_availability() else tir.call_extern(value.dtype, "__shfl_xor_sync", 0xffffffff, value, offset)
+    if _is_hip_target():
+        return tir.call_extern(value.dtype, "__shfl_xor", value, offset)
+    active = tir.call_extern("uint32", "__activemask")
+    return tir.call_extern(value.dtype, "__shfl_xor_sync", active, value, offset)
```

Note: If callers pass Python ints for value, value.dtype will fail. If that’s intended, consider wrapping ints with tir.const(..) at call sites or enforcing PrimExpr here.
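To see why a hard-coded 0xffffffff can be unsafe, here is the mask __activemask() would report for a warp in which only the even lanes entered the branch. This is pure bit arithmetic, not device code:

```python
# Build the active mask of a divergent 32-lane warp where only
# even-numbered lanes took the branch containing the shuffle.
active = 0
for lane in range(32):
    if lane % 2 == 0:
        active |= 1 << lane

print(hex(active))  # 0x55555555, not 0xffffffff
# Passing 0xffffffff to __shfl_*_sync here would name lanes that
# never reach the call, risking deadlock or undefined behavior.
```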


🏁 Script executed:

```shell
#!/bin/bash
set -e

echo "Searching for check_hip_availability definitions and usages"
rg -n "def check_hip_availability" -C2 .
rg -n "check_hip_availability(" -C2 .

echo "Searching for _is_hip_target definitions and usages"
rg -n "def _is_hip_target" -C2 .
rg -n "_is_hip_target(" -C2 .

echo "Searching for __activemask usages"
rg -n "__activemask" -C2 .

echo "Inspecting shfl_xor implementation context"
rg -n "shfl_xor" -C3 tilelang/language/builtin.py
```

Length of output: 593


🏁 Script executed:

```shell
#!/bin/bash
set -e

echo "=== Opening tilelang/utils/target.py (lines 1-200) ==="
sed -n '1,200p' tilelang/utils/target.py || true

echo "=== Searching for is_hip_target or hip_target aliases ==="
rg -n "hip_target" -C2 tilelang/utils/target.py || true
rg -n "is_hip_target" -C2 tilelang/utils/target.py || true

echo "=== Searching for CUDA/ROCm availability utilities ==="
rg -n "check_cuda" -C2 tilelang/utils/target.py || true
rg -n "cuda" -C2 tilelang/utils/target.py || true

echo "=== Inspecting shfl_xor context in builtin.py (lines 250-350) ==="
sed -n '250,350p' tilelang/language/builtin.py || true

echo "=== Searching for other shuffle/intrinsic patterns ==="
rg -n "__shfl" -C5 tilelang/language/builtin.py || true
rg -n "activemask" -C2 tilelang/language/builtin.py || true
```

Length of output: 9113


🏁 Script executed:

```shell
#!/bin/bash
set -e

echo "=== Opening tilelang/language/builtin.py (lines 1-50) ==="
sed -n '1,50p' tilelang/language/builtin.py
```

Length of output: 1632


Use compile-time target dispatch and a real active mask
Switch from the host-availability check to inspecting the compilation target via tvm.target.Target.current(), and replace the hard-coded 0xffffffff with __activemask(). For example:

```diff
-from tilelang.utils.target import check_hip_availability
+from tilelang import tvm

 def shfl_xor(value: Union[int, PrimExpr, tir.Call],
              offset: Union[int, PrimExpr, tir.Call]):
-    return tir.call_extern(value.dtype, "__shfl_xor", value, offset) \
-        if check_hip_availability() \
-        else tir.call_extern(value.dtype, "__shfl_xor_sync", 0xffffffff, value, offset)
+    tgt = tvm.target.Target.current(allow_none=False)
+    if tgt.kind.name == "rocm":
+        return tir.call_extern(value.dtype, "__shfl_xor", value, offset)
+    active = tir.call_extern("uint32", "__activemask")
+    return tir.call_extern(value.dtype, "__shfl_xor_sync", active, value, offset)
```

Apply the same pattern to shfl_down and shfl_up.

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

```python
from tilelang import tvm

def shfl_xor(value: Union[int, PrimExpr, tir.Call],
             offset: Union[int, PrimExpr, tir.Call]):
    tgt = tvm.target.Target.current(allow_none=False)
    if tgt.kind.name == "rocm":
        return tir.call_extern(value.dtype, "__shfl_xor", value, offset)
    active = tir.call_extern("uint32", "__activemask")
    return tir.call_extern(value.dtype, "__shfl_xor_sync", active, value, offset)
```
🤖 Prompt for AI Agents
In tilelang/language/builtin.py around line 299, the current runtime host check
and hard-coded mask for calling CUDA/HIP shuffle intrinsics should be replaced
with compile-time target dispatch: use tvm.target.Target.current() to detect if
the target is an AMD HSA/ROCM target (HIP) versus NVPTX/CUDA, and call the
appropriate intrinsic name accordingly; also replace the literal 0xffffffff with
the device intrinsic __activemask() when emitting the sync variant. Make the
same changes in shfl_down and shfl_up: use Target.current() to choose
"__shfl_xor" vs "__shfl_xor_sync" (or corresponding down/up names), and use
__activemask() as the active mask argument for the _sync calls.

```diff
             The value to shuffle
     """
-    return tir.call_extern(value.dtype, "__shfl_down_sync", 0xffffffff, value, offset)
+    return tir.call_extern(value.dtype, "__shfl_down", value, offset) if check_hip_availability() else tir.call_extern(value.dtype, "__shfl_down_sync", 0xffffffff, value, offset)
```

🛠️ Refactor suggestion

Mirror the XOR fix for shfl_down.

Apply target-aware dispatch and active mask usage.

```diff
-    return tir.call_extern(value.dtype, "__shfl_down", value, offset) if check_hip_availability() else tir.call_extern(value.dtype, "__shfl_down_sync", 0xffffffff, value, offset)
+    if _is_hip_target():
+        return tir.call_extern(value.dtype, "__shfl_down", value, offset)
+    active = tir.call_extern("uint32", "__activemask")
+    return tir.call_extern(value.dtype, "__shfl_down_sync", active, value, offset)
```
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

```python
if _is_hip_target():
    return tir.call_extern(value.dtype, "__shfl_down", value, offset)
active = tir.call_extern("uint32", "__activemask")
return tir.call_extern(value.dtype, "__shfl_down_sync", active, value, offset)
```
🤖 Prompt for AI Agents
In tilelang/language/builtin.py around line 309, the shfl_down call must mirror
the XOR fix: make the call target-aware and supply the CUDA active mask; replace
the current single-path call with a conditional that calls "__shfl_down" (no
mask) when HIP is active and calls "__shfl_down_sync" with the active mask
(0xffffffff) as the first argument when CUDA is active, and ensure the same
XOR-based lane handling used elsewhere is preserved for shfl_down.

```diff
             The value to shuffle
     """
-    return tir.call_extern(value.dtype, "__shfl_up_sync", 0xffffffff, value, offset)
+    return tir.call_extern(value.dtype, "__shfl_up", value, offset) if check_hip_availability() else tir.call_extern(value.dtype, "__shfl_up_sync", 0xffffffff, value, offset)
```

🛠️ Refactor suggestion

Mirror the XOR fix for shfl_up.

Apply target-aware dispatch and active mask usage.

```diff
-    return tir.call_extern(value.dtype, "__shfl_up", value, offset) if check_hip_availability() else tir.call_extern(value.dtype, "__shfl_up_sync", 0xffffffff, value, offset)
+    if _is_hip_target():
+        return tir.call_extern(value.dtype, "__shfl_up", value, offset)
+    active = tir.call_extern("uint32", "__activemask")
+    return tir.call_extern(value.dtype, "__shfl_up_sync", active, value, offset)
```
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

```python
if _is_hip_target():
    return tir.call_extern(value.dtype, "__shfl_up", value, offset)
active = tir.call_extern("uint32", "__activemask")
return tir.call_extern(value.dtype, "__shfl_up_sync", active, value, offset)
```
🤖 Prompt for AI Agents
In tilelang/language/builtin.py around line 319, the shfl_up call must mirror
the XOR fix used for other shuffle ops by using target-aware dispatch and
providing the active mask for the sync variant; replace the single inline
expression with a branch that calls tir.call_extern(value.dtype, "__shfl_up",
value, offset) when check_hip_availability() is true, and
tir.call_extern(value.dtype, "__shfl_up_sync", 0xffffffff, value, offset) when
false, ensuring the call signature/order matches the other shfl_* helpers and
any XOR-based lane-index adjustment applied elsewhere is applied here too.

@LeiWang1999 LeiWang1999 merged commit cda5ea1 into tile-ai:main Sep 5, 2025
6 checks passed
RubiaCx pushed a commit to RubiaCx/tilelang that referenced this pull request Nov 24, 2025
* [AMD] fix bugs in warp shuffle

* format

---------

Co-authored-by: tangxinsheng.txs <tangxinsheng.txs@alibaba-inc.com>
