Add int2 and longlong4 pack functions #1129

LJC00118 · 2025-10-27T07:52:07Z

Summary by CodeRabbit

New Features
- Added new low-level CUDA data-packing utilities to the codebase.
Refactor
- Optimized internal CUDA template utilities for improved performance and code efficiency.

coderabbitai · 2025-10-27T07:52:20Z

Walkthrough

Two small packing helpers were added to the CUDA common header: make_int2 packs eight signed chars into an int2 by composing two 32-bit lanes with the existing make_int helper; make_longlong4 packs eight ints into a longlong4 by composing four 64-bit lanes, each written as an int2 built with CUDA's built-in make_int2(int, int).

Changes

Cohort / File(s)	Summary
CUDA utility functions `src/tl_templates/cuda/common.h`	Added `TL_DEVICE int2 make_int2(signed char x0, signed char x1, signed char x2, signed char x3, signed char y0, signed char y1, signed char y2, signed char y3)` to pack 8 signed chars into an `int2`. Added `TL_DEVICE longlong4 make_longlong4(int x0, int x1, int y0, int y1, int z0, int z1, int w0, int w1)` to pack 8 ints into a `longlong4`, storing each pair into a 64-bit lane as an `int2` built with CUDA's built-in `make_int2(int, int)`. Minor comment punctuation edit.
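
For readers without the diff at hand, the char-packing half of the summary can be sketched on the host as below. This is an illustrative sketch only, not the code from `common.h`: `Int2`, `pack_chars4`, `pack_chars8` and `pack_example` are hypothetical names, and the byte order (first argument in the lowest byte) is an assumption about how `make_int` lays out its inputs.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Hypothetical host-side stand-in for CUDA's int2 (two 32-bit lanes).
struct Int2 {
  int32_t x;
  int32_t y;
};

// Pack four signed chars into one 32-bit lane.
// Assumed byte order: x0 lands in the lowest byte, x3 in the highest.
inline int32_t pack_chars4(signed char x0, signed char x1, signed char x2,
                           signed char x3) {
  uint32_t bits = static_cast<uint32_t>(static_cast<uint8_t>(x0)) |
                  (static_cast<uint32_t>(static_cast<uint8_t>(x1)) << 8) |
                  (static_cast<uint32_t>(static_cast<uint8_t>(x2)) << 16) |
                  (static_cast<uint32_t>(static_cast<uint8_t>(x3)) << 24);
  int32_t out;
  std::memcpy(&out, &bits, sizeof(out));  // reinterpret the bits as signed
  return out;
}

// Pack eight signed chars into two lanes, mirroring the shape of the new helper.
inline Int2 pack_chars8(signed char x0, signed char x1, signed char x2,
                        signed char x3, signed char y0, signed char y1,
                        signed char y2, signed char y3) {
  Int2 r;
  r.x = pack_chars4(x0, x1, x2, x3);
  r.y = pack_chars4(y0, y1, y2, y3);
  return r;
}

// Small usage example: negative inputs occupy a single byte after masking.
inline void pack_example() {
  Int2 r = pack_chars8(1, 2, 3, 4, -1, 0, 0, 0);
  assert(r.x == 0x04030201);
  assert(r.y == 0xFF);
}
```

Converting each input to an unsigned byte before shifting keeps a negative char from sign-extending into the neighbouring bytes, which is the signedness point the review focus list calls out.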

Sequence Diagram(s)

sequenceDiagram
    participant Caller
    participant make_int2 as make_int2 / make_int
    participant make_longlong4 as make_longlong4 / make_int2

    Note over Caller,make_int2: Packing 8 signed chars -> int2
    Caller->>make_int2: call with x0..x3, y0..y3
    make_int2-->>make_int2: compose laneA = make_int(x0,x1,x2,x3)
    make_int2-->>make_int2: compose laneB = make_int(y0,y1,y2,y3)
    make_int2-->>Caller: return int2(laneA, laneB)

    Note over Caller,make_longlong4: Packing 8 ints -> longlong4
    Caller->>make_longlong4: call with x0,x1,y0,y1,z0,z1,w0,w1
    make_longlong4-->>make_longlong4: store make_int2(x0, x1) into lane x and make_int2(y0, y1) into lane y
    make_longlong4-->>make_longlong4: store make_int2(z0, z1) into lane z and make_int2(w0, w1) into lane w
    make_longlong4-->>make_longlong4: each int2 is written through an int2 pointer cast onto the 64-bit field
    make_longlong4-->>Caller: return longlong4

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

Single header file with two small, self-contained packing helpers.
Review focus:
- Correct parameter ordering and signedness for make_int2.
- Proper construction and alignment when composing longlong4 (pointer casts/assignments).
- Conformance with TL_DEVICE and CUDA type usage.

Poem

🐰 I nibble bytes in orderly rows,
Eight little hops where each lane goes.
Pack them tight, lane by lane,
CUDA carrots in memory, neat and plain.
Hoppity-hop — the buffers glow!

Pre-merge checks and finishing touches

✅ Passed checks (3 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title Check	✅ Passed	The pull request title "Add int2 and longlong4 pack functions" directly and accurately describes the main changes in the changeset. The two new functions `make_int2` and `make_longlong4` are the primary additions to the codebase, and the title explicitly names both. The title is concise (37 characters, 6 words), clear, and free of vague terminology or unnecessary noise. A teammate reviewing the PR history would immediately understand that this changeset introduces new packing functions for these specific types.
Docstring Coverage	✅ Passed	Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.


github-actions · 2025-10-27T07:52:22Z

👋 Hi! Thank you for contributing to the TileLang project.

Please remember to run `pre-commit run --all-files` in the root directory of the project to ensure your changes are properly linted and formatted. This will help ensure your contribution passes the format check.

We appreciate you taking this step! Our team will review your contribution, and we look forward to your awesome work! 🚀

coderabbitai

Actionable comments posted: 0

🧹 Nitpick comments (1)

src/tl_templates/cuda/common.h (1)
127-136: Consider a helper function for consistency.

The implementation uses type-punning via pointer casting (*((int2*)&result.x)), which technically violates strict aliasing rules. While this pattern works in CUDA device code, it's inconsistent with the existing make_int4 function (lines 113-125), which directly assigns to fields using a helper.

Consider introducing a make_longlong helper for consistency:
+// Pack two int values.
+TL_DEVICE long long make_longlong(int x0, int x1) {
+  long long result;
+  *((int2*)&result) = make_int2(x0, x1);
+  return result;
+}
+
 // Pack eight int values.
 TL_DEVICE longlong4 make_longlong4(int x0, int x1, int y0, int y1,
                                    int z0, int z1, int w0, int w1) {
   longlong4 result;
-  *((int2*)&result.x) = make_int2(x0, x1);
-  *((int2*)&result.y) = make_int2(y0, y1);
-  *((int2*)&result.z) = make_int2(z0, z1);
-  *((int2*)&result.w) = make_int2(w0, w1);
+  result.x = make_longlong(x0, x1);
+  result.y = make_longlong(y0, y1);
+  result.z = make_longlong(z0, z1);
+  result.w = make_longlong(w0, w1);
   return result;
 }
This approach maintains the same packing logic while matching the pattern used throughout the file.
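
If the aliasing concern itself is the priority, one possible variant of the suggested helper copies bytes with `std::memcpy` instead of casting pointers. The sketch below is host-side and hedged: `LongLong4`, `pack_ints2`, `pack_ints8` and `pack_ints_example` are hypothetical names, not part of the PR, and whether a `memcpy` form fits the rest of `common.h` is a judgment for the maintainers.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Hypothetical host-side stand-in for CUDA's longlong4 (four 64-bit lanes).
struct LongLong4 {
  long long x, y, z, w;
};

// Pack two 32-bit ints into one 64-bit lane without a pointer cast.
// The first int occupies the lower-addressed four bytes, as an int2 store would.
inline long long pack_ints2(int32_t first, int32_t second) {
  int32_t lanes[2] = {first, second};
  long long out;
  static_assert(sizeof(out) == sizeof(lanes), "two ints must fill one lane");
  std::memcpy(&out, lanes, sizeof(out));
  return out;
}

// Pack eight ints into four 64-bit lanes by direct field assignment.
inline LongLong4 pack_ints8(int32_t x0, int32_t x1, int32_t y0, int32_t y1,
                            int32_t z0, int32_t z1, int32_t w0, int32_t w1) {
  LongLong4 r;
  r.x = pack_ints2(x0, x1);
  r.y = pack_ints2(y0, y1);
  r.z = pack_ints2(z0, z1);
  r.w = pack_ints2(w0, w1);
  return r;
}

// Small usage example: copy a lane back out and compare with the inputs.
inline void pack_ints_example() {
  long long lane = pack_ints2(7, -3);
  int32_t back[2];
  std::memcpy(back, &lane, sizeof(lane));
  assert(back[0] == 7 && back[1] == -3);
}
```

Compilers commonly lower a fixed-size `memcpy` like this to plain register moves, though that should be checked against the generated device code before relying on it.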

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 5475f8e and 886ec09.

📒 Files selected for processing (1)

src/tl_templates/cuda/common.h (2 hunks)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)

GitHub Check: Quick Lint

🔇 Additional comments (2)

src/tl_templates/cuda/common.h (2)

96-100: LGTM! Clear documentation.

The comment clarification improves readability without affecting functionality.

102-110: LGTM! Consistent pattern and good code reuse.

The implementation correctly packs eight signed chars into an int2 by reusing the existing make_int helper, maintaining consistency with the make_int4 pattern.

coderabbitai

Actionable comments posted: 0

🧹 Nitpick comments (2)

src/tl_templates/cuda/common.h (2)

102-110: Consider renaming to avoid overload confusion with CUDA's built-in make_int2.

The implementation correctly packs 8 signed chars into an int2 by reusing make_int for each lane. However, CUDA already provides a built-in make_int2(int x, int y) function. While these are technically valid overloads (different signatures), this creates naming ambiguity:

In make_longlong4 (lines 131-134), calls like make_int2(x0, x1) with int arguments resolve to the CUDA built-in, not this custom function.

Developers may be confused about which make_int2 is being invoked in different contexts.

Consider either:

Renaming to make_int2_from_chars or similar to clarify intent, or

Adding a comment explaining the overload and when each version is used.
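
The overload behaviour described in this comment can be reproduced on the host with stand-ins. In the sketch below everything inside `namespace sketch` is hypothetical: the two-argument `make_int2` imitates the CUDA built-in, the eight-argument one imitates the PR's helper, and the byte order in `make_int` is an assumption.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

namespace sketch {

struct Int2 {
  int32_t x;
  int32_t y;
};

// Stand-in for CUDA's built-in make_int2(int, int).
inline Int2 make_int2(int x, int y) { return Int2{x, y}; }

// Stand-in for a four-char packer; lowest byte first is an assumption.
inline int32_t make_int(signed char x0, signed char x1, signed char x2,
                        signed char x3) {
  uint32_t bits = static_cast<uint32_t>(static_cast<uint8_t>(x0)) |
                  (static_cast<uint32_t>(static_cast<uint8_t>(x1)) << 8) |
                  (static_cast<uint32_t>(static_cast<uint8_t>(x2)) << 16) |
                  (static_cast<uint32_t>(static_cast<uint8_t>(x3)) << 24);
  int32_t out;
  std::memcpy(&out, &bits, sizeof(out));
  return out;
}

// Stand-in for the PR's eight-char overload with the same name.
inline Int2 make_int2(signed char x0, signed char x1, signed char x2,
                      signed char x3, signed char y0, signed char y1,
                      signed char y2, signed char y3) {
  return Int2{make_int(x0, x1, x2, x3), make_int(y0, y1, y2, y3)};
}

}  // namespace sketch

// The argument count alone decides which overload is chosen.
inline void overload_example() {
  int a = 10, b = 20;
  sketch::Int2 two = sketch::make_int2(a, b);  // two-argument version
  assert(two.x == 10 && two.y == 20);

  sketch::Int2 eight = sketch::make_int2(1, 0, 0, 0, 2, 0, 0, 0);  // packer
  assert(eight.x == 1 && eight.y == 2);
}
```

Because the two signatures differ in arity, the compiler never sees an ambiguous call; the confusion the comment warns about is on the reader's side, which is why a rename or a short comment is enough.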

127-136: Logic is correct, but consider documenting the reliance on CUDA's built-in make_int2.

The implementation correctly packs 8 ints into a longlong4:

The pointer cast *((int2 *)&result.x) is safe because long long (64 bits) and int2 (2×32 bits) have matching sizes.

Calls to make_int2(x0, x1) correctly resolve to CUDA's built-in make_int2(int, int), not the custom 8-parameter version defined above.

However, note the design inconsistency: make_int, make_int4, and the new custom make_int2 all pack signed chars, while make_longlong4 packs ints. This breaks the established pattern and may confuse maintainers who expect all make_* functions to pack the same base type.

Consider adding a comment clarifying that this function uses CUDA's built-in make_int2 for composition, or documenting why this function packs a different type than the others.
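
The size argument above could also be written down as compile-time checks. The sketch below uses hypothetical host-side stand-ins (`Int2`, `LongLong4`, `lanes_are_int2_compatible`); the `alignas` values are assumptions about the CUDA vector types rather than something taken from the PR.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical host-side stand-ins; the alignments are assumed, not quoted.
struct alignas(8) Int2 {
  int32_t x, y;
};
struct alignas(16) LongLong4 {
  long long x, y, z, w;
};

// An Int2 must cover exactly one 64-bit lane.
static_assert(sizeof(Int2) == sizeof(long long),
              "Int2 and long long differ in size");
static_assert(sizeof(LongLong4) == 4 * sizeof(long long),
              "LongLong4 has unexpected padding");

// Every lane must start at an offset where an Int2 may legally live.
static_assert(offsetof(LongLong4, x) % alignof(Int2) == 0, "lane x misaligned");
static_assert(offsetof(LongLong4, y) % alignof(Int2) == 0, "lane y misaligned");
static_assert(offsetof(LongLong4, z) % alignof(Int2) == 0, "lane z misaligned");
static_assert(offsetof(LongLong4, w) % alignof(Int2) == 0, "lane w misaligned");

// Returns true when this translation unit compiled, i.e. all checks held.
constexpr bool lanes_are_int2_compatible() { return true; }
```

Checks like these cover size and alignment only. They do not by themselves settle the strict-aliasing question raised in the earlier nitpick, which concerns how the store is expressed rather than how the types are laid out.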

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 886ec09 and 4bebd0d.

📒 Files selected for processing (1)

src/tl_templates/cuda/common.h (2 hunks)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)

GitHub Check: Test for Python 3.12 with Metal (on macos-latest)
GitHub Check: Test for Python 3.12 with ROCm-6.3 (on self-hosted-amd)

🔇 Additional comments (1)

src/tl_templates/cuda/common.h (1)

96-96: LGTM - Minor style consistency improvement.

The added period makes the comment consistent with other function comments in the file.

* Remove an incorrect check
* add fp8 pack function
* code lint
* minor fix
* minor fix
* minor fix
* Minor fix
* Minor fix
* add pack function
* code lint
* code lint

LJC00118 added 14 commits September 26, 2025 17:23

- 1d2ff3a Remove an incorrect check
- c6b34ef Merge branch 'main' into myfork
- fe15b64 add fp8 pack function
- 6a47b0e code lint
- 6449423 Merge remote-tracking branch 'origin/main' into myfork
- e2a1fee minor fix
- 2be0c5e minor fix
- 22f1b06 minor fix
- c1fa7cf Minor fix
- f0e4782 Merge remote-tracking branch 'origin/main' into myfork
- d3d643a Minor fix
- 7a6f689 Merge remote-tracking branch 'origin/main' into myfork
- 84a8898 add pack function
- 886ec09 code lint

coderabbitai bot reviewed Oct 27, 2025

- 4bebd0d code lint

coderabbitai bot reviewed Oct 27, 2025

LeiWang1999 approved these changes Oct 27, 2025

LeiWang1999 merged commit 4c9da81 into tile-ai:main Oct 27, 2025
6 checks passed

This was referenced Nov 5, 2025

[Feat] Add A Pass to Handle Negative Index #1192

Merged

[Fix] Fix buffer re-import typo in tilelang.languge #1214

Merged

[Fix] Fix a type that make wrong T.macro backtrace #1234

Merged

kurisu6912 mentioned this pull request Nov 12, 2025

[Language] Add type stubs for tir op #1239

Merged

This was referenced Nov 21, 2025

[Feat] Add missing support for uint32x2, add unsigned implicit cast in bitwise op, add T.Ref as macro annotation #1302

Closed

[Fix] Remove unused let_bindings_ in CodeGenC to fix #1300 #1305

Merged

[Fix] Fix frame scope error in T.macro #1308

Merged

coderabbitai bot mentioned this pull request Nov 24, 2025

[Language] support T.gemm_sp_v2 on sm80 and sm89 #1056

Merged

13 tasks
