[AMD] enable amd ci test & fix bug & fix dockerfile #1244
Conversation
👋 Hi! Thank you for contributing to the TileLang project. We appreciate you taking this step! Our team will review your contribution, and we look forward to your awesome work! 🚀
Walkthrough

This PR consolidates Docker installation methods across multiple CUDA versions and ROCm by replacing shell script invocations with Python package installs driven by environment variables, broadens ROCm test discovery scope, adds float8_e4m3fnuz dtype support to AMD gemm tests, and refactors MFMA intrinsic emitter constructors to centralize base class initialization.

Changes
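The env-var-driven install the walkthrough describes (e.g. `USE_CUDA=1 pip install -e . -v`) relies on the build reading toggles from the environment. A minimal Python sketch of that pattern — the helper name and exact semantics are illustrative assumptions, not TileLang's actual build code:

```python
import os

def backend_flags() -> dict:
    """Read backend toggles from the environment.

    USE_CUDA / USE_ROCM are the variables used in this PR's Dockerfiles;
    this helper is a hypothetical illustration of an env-gated build
    switch, not TileLang's real build logic.
    """
    return {
        "use_cuda": os.environ.get("USE_CUDA", "0") == "1",
        "use_rocm": os.environ.get("USE_ROCM", "0") == "1",
    }
```

A Dockerfile line like `RUN USE_CUDA=1 pip install -e . -v` then selects the CUDA path without a separate install script per image.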
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes
Areas requiring extra attention:
Possibly related PRs
Suggested reviewers
Poem
Pre-merge checks and finishing touches

❌ Failed checks (1 warning, 1 inconclusive)
✅ Passed checks (1 passed)
✨ Finishing touches
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.
Actionable comments posted: 1
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (13)
- .github/workflows/ci.yml (1 hunks)
- docker/Dockerfile.cu118 (1 hunks)
- docker/Dockerfile.cu120 (1 hunks)
- docker/Dockerfile.cu121 (1 hunks)
- docker/Dockerfile.cu123 (1 hunks)
- docker/Dockerfile.cu124 (1 hunks)
- docker/Dockerfile.cu125 (1 hunks)
- docker/Dockerfile.cu126 (1 hunks)
- docker/Dockerfile.cu128 (1 hunks)
- docker/Dockerfile.rocm (1 hunks)
- testing/python/amd/test_tilelang_gemm_mfma_intrinsic.py (1 hunks)
- testing/python/amd/test_tilelang_gemm_mfma_preshuffle.py (3 hunks)
- tilelang/intrinsics/mfma_macro_generator.py (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (3)
testing/python/amd/test_tilelang_gemm_mfma_preshuffle.py (1)
- testing/python/amd/test_tilelang_gemm_mfma_intrinsic.py (1)
  - assert_tl_matmul_correctness (163-220)

tilelang/intrinsics/mfma_macro_generator.py (3)
- tilelang/tileop/gemm/gemm_base.py (3)
  - accum_dtype (59-60), chunk (63-64), k_pack (111-112)
- tilelang/primitives/gemm/gemm_mma.py (1)
  - accum_dtype (252-259)
- tilelang/tileop/gemm/__init__.py (1)
  - k_pack (129-130)

testing/python/amd/test_tilelang_gemm_mfma_intrinsic.py (1)
- tilelang/primitives/gemm/gemm_mma.py (1)
  - in_dtype (235-249)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Quick Lint
🔇 Additional comments (2)
.github/workflows/ci.yml (1)
372-382: All test functions in ./testing/python/amd/ carry the @tilelang.testing.requires_rocm decorator (backed by TVM's tvm.testing.utils requires_rocm, a standard pytest skip decorator), which skips a test when ROCm support is unavailable. The hardcoded device="cuda" calls are correct: PyTorch built with ROCm support automatically routes CUDA device operations to the HIP backend. If ROCm is unavailable, the tests are skipped; if it is available, device="cuda" resolves to HIP. Broadening test discovery from a single file to the entire ./python/amd directory is therefore intentional and safe, since every test within it is properly gated. No issues found.

docker/Dockerfile.cu126 (1)

25-26: No changes exist for docker/Dockerfile.cu126 in this PR. The git diff shows that docker/Dockerfile.cu126 is unchanged between HEAD and the current branch; the file already contains USE_CUDA=1 pip install -e . -v in both versions. The review comment references a change from ./install_cuda.sh, but this file does not appear to have been modified in this PR, nor was install_cuda.sh the previous state of this Dockerfile. While the underlying concern about environment variable support is technically valid (TileLang's CMakeLists.txt properly reads the $ENV{USE_CUDA} flag at lines 167-174, and scikit-build-core correctly passes environment variables to CMake), there is no actual code change in this file to verify. Likely an incorrect or invalid review comment.
    super().__init__(
        a_dtype=a_dtype,
        b_dtype=b_dtype,
        accum_dtype=accum_dtype,
        a_transposed=a_transposed,
        b_transposed=b_transposed,
        block_row_warps=block_row_warps,
        block_col_warps=block_col_warps,
        warp_row_tiles=warp_row_tiles,
        warp_col_tiles=warp_col_tiles,
        chunk=chunk,
        reduce_k=reduce_k,
        num_elems_per_byte=num_elems_per_byte,
        k_pack=k_pack,
        is_m_first=is_m_first,
        thread_var=thread_var,
    )
Ensure preshuffle emitter honors custom thread binding
The new thread_var argument is exposed here, but ldmatrix_a/ldmatrix_b still fetch the binding from T.KernelLaunchFrame.Current(). If a caller now supplies thread_var, those methods ignore it and still assert on an active kernel frame, so the preshuffle path breaks for the newly supported use case. Please route both loaders through self.get_thread_binding() like the base emitter.
Apply this diff:
@@ def ldmatrix_a(...):
- current_frame = T.KernelLaunchFrame.Current()
- thread_binding = current_frame.get_thread_binding()
+ thread_binding = self.get_thread_binding()
@@ def ldmatrix_b(...):
- current_frame = T.KernelLaunchFrame.Current()
- thread_binding = current_frame.get_thread_binding()
+ thread_binding = self.get_thread_binding()

Committable suggestion skipped: line range outside the PR's diff.
🤖 Prompt for AI Agents
In tilelang/intrinsics/mfma_macro_generator.py around lines 681 to 697, the
constructor now accepts thread_var but the preshuffle loaders
ldmatrix_a/ldmatrix_b still read thread binding from
T.KernelLaunchFrame.Current(); update those loader calls so they use
self.get_thread_binding() (which respects the supplied thread_var) instead of
directly querying KernelLaunchFrame.Current(), and remove or replace the
assertion that an active kernel frame is required so the preshuffle path works
with an externally provided thread_var.
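The suggested fix routes both loaders through the base class helper so that a caller-supplied binding wins over the active-frame lookup. A minimal sketch of that pattern, with hypothetical class names (BaseEmitter and PreshuffleEmitter stand in for TileLang's MFMA emitter classes, and the frame lookup is stubbed):

```python
class BaseEmitter:
    def __init__(self, thread_var=None):
        # A caller may supply a thread binding; None means "look it up
        # from the active kernel launch frame later".
        self._thread_var = thread_var

    def get_thread_binding(self):
        # Honor an externally supplied binding first; only fall back to
        # the frame lookup when none was given.
        if self._thread_var is not None:
            return self._thread_var
        return self._current_frame_binding()

    def _current_frame_binding(self):
        # Stand-in for T.KernelLaunchFrame.Current().get_thread_binding();
        # here it simply signals that no frame is active.
        raise RuntimeError("no active kernel launch frame")

class PreshuffleEmitter(BaseEmitter):
    def ldmatrix_a(self):
        # Route through the helper instead of querying the frame
        # directly, so a caller-provided thread_var is respected.
        return self.get_thread_binding()
```

Querying the frame inside ldmatrix_a/ldmatrix_b, as the current code does, bypasses the `thread_var is not None` branch entirely, which is exactly the bug the review comment flags.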
Summary by CodeRabbit
New Features
Tests
Chores