Collaborator

@LJC00118 LJC00118 commented Oct 27, 2025

Summary by CodeRabbit

  • New Features
    • Added new low-level CUDA data-packing utilities to the codebase.
  • Refactor
    • Optimized internal CUDA template utilities for improved performance and code efficiency.

Contributor

coderabbitai bot commented Oct 27, 2025

Walkthrough

Two small packing helpers were added to the CUDA common header: make_int2 packs eight signed chars into an int2 by composing two 32-bit lanes with make_int; make_longlong4 packs eight ints into a longlong4 by writing each 64-bit lane as an int2 built with CUDA's built-in make_int2(int, int).

Changes

Cohort / File(s) Summary
CUDA utility functions
src/tl_templates/cuda/common.h
Added TL_DEVICE int2 make_int2(signed char x0, signed char x1, signed char x2, signed char x3, signed char y0, signed char y1, signed char y2, signed char y3) to pack 8 signed chars into an int2. Added TL_DEVICE longlong4 make_longlong4(int x0, int x1, int y0, int y1, int z0, int z1, int w0, int w1) to pack 8 ints into a longlong4, storing one built-in make_int2(int, int) pair per 64-bit lane. Minor comment punctuation edit.
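
The eight-char packing described above can be sketched on the host as follows. This is a minimal illustration, not the code from common.h: Int2Like, pack4, and pack8 are hypothetical stand-ins for int2, make_int, and the new make_int2, and the byte order (x0 in the least significant byte) is an assumption that the real make_int may not share.

```cpp
#include <cstdint>

// Hypothetical host-side stand-in for CUDA's int2, so the sketch builds without nvcc.
struct Int2Like {
  int x, y;
};

// Pack four signed chars into one 32-bit lane.
// Assumed layout: x0 lands in the least significant byte.
inline int pack4(signed char x0, signed char x1, signed char x2, signed char x3) {
  // Go through uint8_t first so negative chars do not sign-extend into other bytes.
  std::uint32_t bits =
      static_cast<std::uint32_t>(static_cast<std::uint8_t>(x0)) |
      (static_cast<std::uint32_t>(static_cast<std::uint8_t>(x1)) << 8) |
      (static_cast<std::uint32_t>(static_cast<std::uint8_t>(x2)) << 16) |
      (static_cast<std::uint32_t>(static_cast<std::uint8_t>(x3)) << 24);
  return static_cast<int>(bits);
}

// Pack eight signed chars into two 32-bit lanes, one pack4 call per lane.
inline Int2Like pack8(signed char x0, signed char x1, signed char x2, signed char x3,
                      signed char y0, signed char y1, signed char y2, signed char y3) {
  return Int2Like{pack4(x0, x1, x2, x3), pack4(y0, y1, y2, y3)};
}
```

Masking through an unsigned byte is the part most worth checking in review: without it, a negative char would overwrite the higher bytes of its lane.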

Sequence Diagram(s)

sequenceDiagram
    participant Caller
    participant make_int2 as make_int2 / make_int
    participant make_longlong4 as make_longlong4 / built-in make_int2

    Note over Caller,make_int2: Packing 8 signed chars -> int2
    Caller->>make_int2: call with x0..x3, y0..y3
    make_int2-->>make_int2: compose laneA = make_int(x0,x1,x2,x3)
    make_int2-->>make_int2: compose laneB = make_int(y0,y1,y2,y3)
    make_int2-->>Caller: return int2(laneA, laneB)

    Note over Caller,make_longlong4: Packing 8 ints -> longlong4
    Caller->>make_longlong4: call with x0,x1,y0,y1,z0,z1,w0,w1
    make_longlong4-->>make_longlong4: store make_int2(x0,x1) into lane x
    make_longlong4-->>make_longlong4: store make_int2(y0,y1) into lane y
    make_longlong4-->>make_longlong4: store make_int2(z0,z1) and make_int2(w0,w1) into lanes z and w
    make_longlong4-->>Caller: return longlong4

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

  • Single header file with two small, self-contained packing helpers.
  • Review focus:
    • Correct parameter ordering and signedness for make_int2.
    • Proper construction and alignment when composing longlong4 (pointer casts/assignments).
    • Conformance with TL_DEVICE and CUDA type usage.

Poem

🐰 I nibble bytes in orderly rows,
Eight little hops where each lane goes.
Pack them tight, lane by lane,
CUDA carrots in memory, neat and plain.
Hoppity-hop — the buffers glow!

Pre-merge checks and finishing touches

✅ Passed checks (3 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title Check ✅ Passed The pull request title "Add int2 and longlong4 pack functions" directly and accurately describes the main changes in the changeset. The two new functions make_int2 and make_longlong4 are the primary additions to the codebase, and the title explicitly names both. The title is concise (37 characters, 6 words), clear, and free of vague terminology or unnecessary noise. A teammate reviewing the PR history would immediately understand that this changeset introduces new packing functions for these specific types.
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.

@github-actions

👋 Hi! Thank you for contributing to the TileLang project.

Please remember to run pre-commit run --all-files in the root directory of the project to ensure your changes are properly linted and formatted. This will help ensure your contribution passes the format check.

We appreciate you taking this step! Our team will review your contribution, and we look forward to your awesome work! 🚀

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

🧹 Nitpick comments (1)
src/tl_templates/cuda/common.h (1)

127-136: Consider a helper function for consistency.

The implementation uses type-punning via pointer casting (*((int2*)&result.x)), which technically violates strict aliasing rules. While this pattern works in CUDA device code, it's inconsistent with the existing make_int4 function (lines 113-125), which directly assigns to fields using a helper.

Consider introducing a make_longlong helper for consistency:

+// Pack two int values.
+TL_DEVICE long long make_longlong(int x0, int x1) {
+  long long result;
+  *((int2*)&result) = make_int2(x0, x1);
+  return result;
+}
+
 // Pack eight int values.
 TL_DEVICE longlong4 make_longlong4(int x0, int x1, int y0, int y1,
                                    int z0, int z1, int w0, int w1) {
   longlong4 result;
-  *((int2*)&result.x) = make_int2(x0, x1);
-  *((int2*)&result.y) = make_int2(y0, y1);
-  *((int2*)&result.z) = make_int2(z0, z1);
-  *((int2*)&result.w) = make_int2(w0, w1);
+  result.x = make_longlong(x0, x1);
+  result.y = make_longlong(y0, y1);
+  result.z = make_longlong(z0, z1);
+  result.w = make_longlong(w0, w1);
   return result;
 }

This approach maintains the same packing logic while matching the pattern used throughout the file.
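
Note that the suggested make_longlong helper still writes through an int2 pointer, so it centralizes the type punning rather than removing it. A pun-free variant could build each 64-bit lane with shifts instead. The sketch below is a host-side illustration under stated assumptions: LongLong4Like, pack_two_ints, and pack_eight_ints are hypothetical names, and placing the first int in the low 32 bits assumes the little-endian layout that the pointer-cast version produces on CUDA devices.

```cpp
#include <cstdint>

// Hypothetical host-side stand-in for CUDA's longlong4.
struct LongLong4Like {
  long long x, y, z, w;
};

// Combine two ints into one 64-bit lane without pointer casts.
// Assumption: `lo` occupies the low 32 bits, matching an int2 {lo, hi}
// stored at the same address on a little-endian target.
inline long long pack_two_ints(int lo, int hi) {
  std::uint64_t bits =
      static_cast<std::uint64_t>(static_cast<std::uint32_t>(lo)) |
      (static_cast<std::uint64_t>(static_cast<std::uint32_t>(hi)) << 32);
  return static_cast<long long>(bits);
}

// Pack eight ints into four 64-bit lanes, two ints per lane.
inline LongLong4Like pack_eight_ints(int x0, int x1, int y0, int y1,
                                     int z0, int z1, int w0, int w1) {
  return LongLong4Like{pack_two_ints(x0, x1), pack_two_ints(y0, y1),
                       pack_two_ints(z0, z1), pack_two_ints(w0, w1)};
}
```

Whether this compiles to the same device code as the pointer-cast version would need to be checked with the project's compiler; the sketch only shows that the packing can be written without aliasing concerns.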

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 5475f8e and 886ec09.

📒 Files selected for processing (1)
  • src/tl_templates/cuda/common.h (2 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Quick Lint
🔇 Additional comments (2)
src/tl_templates/cuda/common.h (2)

96-100: LGTM! Clear documentation.

The comment clarification improves readability without affecting functionality.


102-110: LGTM! Consistent pattern and good code reuse.

The implementation correctly packs eight signed chars into an int2 by reusing the existing make_int helper, maintaining consistency with the make_int4 pattern.

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

🧹 Nitpick comments (2)
src/tl_templates/cuda/common.h (2)

102-110: Consider renaming to avoid overload confusion with CUDA's built-in make_int2.

The implementation correctly packs 8 signed chars into an int2 by reusing make_int for each lane. However, CUDA already provides a built-in make_int2(int x, int y) function. While these are technically valid overloads (different signatures), this creates naming ambiguity:

  • In make_longlong4 (lines 131-134), calls like make_int2(x0, x1) with int arguments resolve to the CUDA built-in, not this custom function.
  • Developers may be confused about which make_int2 is being invoked in different contexts.

Consider either:

  1. Renaming to make_int2_from_chars or similar to clarify intent, or
  2. Adding a comment explaining the overload and when each version is used.
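
The overload behavior described in this comment can be reproduced on the host with stand-ins. In the sketch below, Int2Like and make_int2_like are hypothetical names used so the example builds without CUDA headers; the two-argument version mimics the built-in make_int2(int, int) and the eight-argument version mimics the PR's helper, with an assumed byte order.

```cpp
#include <cstdint>

// Hypothetical stand-in for CUDA's int2.
struct Int2Like {
  int x, y;
};

// Stand-in for CUDA's built-in make_int2(int, int): stores both ints unchanged.
inline Int2Like make_int2_like(int x, int y) { return Int2Like{x, y}; }

// Helper for the eight-argument overload (assumed byte order: first char lowest).
inline int pack4_like(signed char a, signed char b, signed char c, signed char d) {
  std::uint32_t bits =
      static_cast<std::uint32_t>(static_cast<std::uint8_t>(a)) |
      (static_cast<std::uint32_t>(static_cast<std::uint8_t>(b)) << 8) |
      (static_cast<std::uint32_t>(static_cast<std::uint8_t>(c)) << 16) |
      (static_cast<std::uint32_t>(static_cast<std::uint8_t>(d)) << 24);
  return static_cast<int>(bits);
}

// Stand-in for the PR's eight-argument make_int2: packs chars into two lanes.
inline Int2Like make_int2_like(signed char x0, signed char x1, signed char x2,
                               signed char x3, signed char y0, signed char y1,
                               signed char y2, signed char y3) {
  return Int2Like{pack4_like(x0, x1, x2, x3), pack4_like(y0, y1, y2, y3)};
}
```

Because the two overloads differ in the number of parameters, a two-argument call can only select the (int, int) version and an eight-argument call can only select the packing version, so the compiler never has to rank them against each other. The risk raised in the review is readability, not ambiguity.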

127-136: Logic is correct, but consider documenting the reliance on CUDA's built-in make_int2.

The implementation correctly packs 8 ints into a longlong4:

  • The pointer cast *((int2 *)&result.x) works in practice because long long (64 bits) and int2 (2×32 bits) have the same size and 8-byte alignment, although it is still type punning in the strict-aliasing sense.
  • Calls to make_int2(x0, x1) correctly resolve to CUDA's built-in make_int2(int, int), not the custom 8-parameter version defined above.

However, note the design inconsistency: make_int, make_int4, and the new custom make_int2 all pack signed chars, while make_longlong4 packs ints. This breaks the established pattern and may confuse maintainers who expect all make_* functions to pack the same base type.

Consider adding a comment clarifying that this function uses CUDA's built-in make_int2 for composition, or documenting why this function packs a different type than the others.
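
If the size assumption behind the cast is worth documenting, it could be stated as a compile-time check next to the function. The following is a host-side sketch only: Int2Like is a hypothetical stand-in declared with the 8-byte alignment that CUDA gives int2, and lane_from_pair and pair_from_lane are illustrative helpers that are not part of the PR.

```cpp
#include <cstring>

// Hypothetical stand-in for CUDA's int2, declared with 8-byte alignment.
struct alignas(8) Int2Like {
  int x, y;
};

// The lane write only makes sense if one int pair fills exactly one long long.
static_assert(sizeof(Int2Like) == sizeof(long long),
              "an int pair must be the same size as a 64-bit lane");

// Copy an int pair into a 64-bit lane byte for byte, without pointer casts.
inline long long lane_from_pair(Int2Like p) {
  long long out;
  std::memcpy(&out, &p, sizeof out);
  return out;
}

// Inverse of lane_from_pair: recover the int pair from a 64-bit lane.
inline Int2Like pair_from_lane(long long v) {
  Int2Like p;
  std::memcpy(&p, &v, sizeof p);
  return p;
}
```

A static_assert of this kind turns the reviewer's size argument into something the compiler enforces, and the memcpy form gives the same byte layout as the cast on any target.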

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 886ec09 and 4bebd0d.

📒 Files selected for processing (1)
  • src/tl_templates/cuda/common.h (2 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: Test for Python 3.12 with Metal (on macos-latest)
  • GitHub Check: Test for Python 3.12 with ROCm-6.3 (on self-hosted-amd)
🔇 Additional comments (1)
src/tl_templates/cuda/common.h (1)

96-96: LGTM - Minor style consistency improvement.

The added period makes the comment consistent with other function comments in the file.

RubiaCx pushed a commit to RubiaCx/tilelang that referenced this pull request Nov 24, 2025
* Remove an incorrect check

* add fp8 pack function

* code lint

* minor fix

* minor fix

* minor fix

* Minor fix

* Minor fix

* add pack function

* code lint

* code lint