[Compatibility] Support CUDA 11.3 #1290
Conversation
Walkthrough

This pull request restricts CUDA atomic operations and FP8-specific debug functionality to supported CUDA and GPU architecture versions via preprocessor guards. It introduces a `TL_NOT_IMPLEMENTED()` macro as a runtime fallback for unsupported paths.
Estimated code review effort: 🎯 2 (Simple) | ⏱️ ~12 minutes
Pre-merge checks: ❌ Failed checks (1 warning) · ✅ Passed checks (2 passed)
👋 Hi! Thank you for contributing to the TileLang project. Please remember to run the project's formatting and lint checks. We appreciate you taking this step! Our team will review your contribution, and we look forward to your awesome work! 🚀
Actionable comments posted: 2
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
- src/tl_templates/cuda/atomic.h (8 hunks)
- src/tl_templates/cuda/debug.h (5 hunks)
- src/tl_templates/cuda/gemm_mma.h (0 hunks)
💤 Files with no reviewable changes (1)
- src/tl_templates/cuda/gemm_mma.h
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-09-15T10:51:06.985Z
Learnt from: botbw
Repo: tile-ai/tilelang PR: 691
File: src/tl_templates/cuda/gemm_sp_sm80.h:81-85
Timestamp: 2025-09-15T10:51:06.985Z
Learning: In CUTLASS tensor operation layouts, crosswise constants should be computed using sizeof(T) (bytes), not cutlass::sizeof_bits<T>::value (bits). This is the established pattern in the official CUTLASS codebase, as seen in default_mma_core_sparse_sm80.h.
Applied to files:
src/tl_templates/cuda/atomic.h
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Quick Lint
🔇 Additional comments (2)
src/tl_templates/cuda/debug.h (1)
3-5: LGTM! FP8 architecture guards properly implemented.

The conditional compilation guards correctly restrict FP8 type support to architectures >= 890 (SM 8.9+), preventing compilation errors on older GPUs that lack FP8 hardware support. The pattern is consistently applied to the header include and all FP8 specializations.
Also applies to: 123-144, 251-273
src/tl_templates/cuda/atomic.h (1)
15-19: LGTM! Clear runtime diagnostic for unsupported paths.

The `TL_NOT_IMPLEMENTED()` macro provides clear diagnostics when unsupported atomic operations are invoked on older CUDA versions, using a breakpoint to halt execution for debugging.
```cpp
#if CUDART_VERSION >= 11080
  cuda::atomic_ref<NT1, cuda::thread_scope_device> aref(*address);
  aref.fetch_max(cuda_cast<NT1>(val), cuda::memory_order(memory_order));
#else
  TL_NOT_IMPLEMENTED();
#endif
```
🧩 Analysis chain
Verify CUDA version requirement matches PR objectives.
The code gates atomic operations behind CUDART_VERSION >= 11080 (CUDA 11.8), but the PR title indicates support for CUDA 11.3 (version 11030). Please confirm whether CUDA 11.3 lacks cuda::atomic_ref support, or if the version check should be lowered to 11030.
Also applies to: 100-106, 163-169, 240-245, 318-324, 650-655, 661-666
🌐 Web query:
What CUDA version introduced cuda::atomic_ref support?
💡 Result:
cuda::atomic_ref was introduced in CUDA Toolkit 11.6. [1][2]
Sources:
[1] CUDA Core Compute Libraries — Synchronization Primitives (cuda::atomic_ref listed as CUDA 11.6).
[2] libcu++ 1.7.0 release notes (libcu++ 1.7.0 / CUDA Toolkit 11.6 adds atomic_ref).
Confirm PR target version—CUDA 11.3 cannot support cuda::atomic_ref operations.
The cuda::atomic_ref API was introduced in CUDA Toolkit 11.6, not 11.3. The version gates at CUDART_VERSION >= 11080 (CUDA 11.8) are unnecessarily restrictive, but lowering them to 11030 (CUDA 11.3) will not work—the API didn't exist until 11.6.
Either the PR target version should be CUDA 11.6+, or alternative implementations must be provided for CUDA 11.3-11.5. The current TL_NOT_IMPLEMENTED() fallback would fail at runtime for CUDA 11.3 users.
Applies to: lines 70-75, 100-106, 163-169, 240-245, 318-324, 650-655, 661-666
🤖 Prompt for AI Agents
In src/tl_templates/cuda/atomic.h around lines 70-75, 100-106, 163-169, 240-245,
318-324, 650-655 and 661-666 the code gates use cuda::atomic_ref which was
introduced in CUDA 11.6; update the CUDART_VERSION checks to >= 11060 if we
intend to require CUDA 11.6+ and document that PR target or CI must use 11.6+,
otherwise replace the TL_NOT_IMPLEMENTED() fallbacks with a portable atomic
implementation (e.g. an atomicCAS loop or intrinsics) for CUDA 11.3–11.5 to
perform the same operations, and ensure all version guards and documentation
reflect the chosen approach.
```cpp
#if CUDART_VERSION >= 11080
  cuda::atomic_ref<NT1, cuda::thread_scope_device> aref(*address);
  aref.fetch_min(cuda_cast<NT1>(val), cuda::memory_order(memory_order));
  return static_cast<T1>(
      aref.fetch_min(cuda_cast<NT1>(val), cuda::memory_order(memory_order)));
#else
  TL_NOT_IMPLEMENTED();
#endif
```
Critical: Return statement in void function.
AtomicMin is declared as returning void (line 111), but lines 134-135 contain a return statement with a value. This will cause a compilation error.
Apply this diff to remove the erroneous return statement:

```diff
 #if CUDART_VERSION >= 11080
   cuda::atomic_ref<NT1, cuda::thread_scope_device> aref(*address);
-  return static_cast<T1>(
-      aref.fetch_min(cuda_cast<NT1>(val), cuda::memory_order(memory_order)));
+  aref.fetch_min(cuda_cast<NT1>(val), cuda::memory_order(memory_order));
 #else
   TL_NOT_IMPLEMENTED();
 #endif
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```diff
 #if CUDART_VERSION >= 11080
   cuda::atomic_ref<NT1, cuda::thread_scope_device> aref(*address);
   aref.fetch_min(cuda_cast<NT1>(val), cuda::memory_order(memory_order));
-  return static_cast<T1>(
-      aref.fetch_min(cuda_cast<NT1>(val), cuda::memory_order(memory_order)));
 #else
   TL_NOT_IMPLEMENTED();
 #endif
```
🤖 Prompt for AI Agents
In src/tl_templates/cuda/atomic.h around lines 132 to 138, the AtomicMin
implementation is in a function declared void but contains a returned value;
remove the erroneous return so the function does not return a value. Replace the
line returning static_cast<T1>(aref.fetch_min(...)) with a call to
aref.fetch_min(cuda_cast<NT1>(val), cuda::memory_order(memory_order)) and
discard the result (or cast val beforehand) so the operation executes without
returning anything; keep the TL_NOT_IMPLEMENTED() fallback as-is.
This pull request introduces conditional compilation and improved error handling for CUDA atomic operations and debug utilities, ensuring compatibility with different CUDA versions and architectures. The main changes add checks for CUDA version and architecture before using certain features, and provide clear error messages when unsupported features are accessed.
CUDA atomic operations compatibility and error handling:
- Added a `TL_NOT_IMPLEMENTED()` macro in `atomic.h` to print a message and trigger a breakpoint when an unsupported atomic operation is called, improving debuggability.
- Wrapped the `cuda::atomic_ref`-based atomic operations (`AtomicMax`, `AtomicMaxRet`, `AtomicMin`, `AtomicMinRet`, `AtomicAdd`, `AtomicAddRet`, `AtomicLoad`, `AtomicStore`) in `#if CUDART_VERSION >= 11080` checks, calling `TL_NOT_IMPLEMENTED()` if the CUDA version is too old. This prevents compilation or runtime errors on unsupported CUDA versions. [1] [2] [3] [4] [5] [6] [7]

Debug utilities and architecture checks:
- Wrapped `fp8_e4_t` and `fp8_e5_t` debug printing and buffer value specializations in `#if __CUDA_ARCH_LIST__ >= 890` checks, and included `cuda_fp8.h` only when this architecture is present. This ensures these features are only available on supported architectures. [1] [2] [3] [4] [5]
- Removed the unconditional include of `cuda_fp8.h` from `gemm_mma.h`; it is now only included when needed by architecture checks elsewhere.