[TRTLLM-6445] feat: Enable AllReduce-associated fusion patterns in Llama3/4. #6205

Open

hyukn wants to merge 1 commit into main from feat/llama_ar_fusion

Conversation

hyukn
Collaborator

@hyukn hyukn commented Jul 20, 2025

Enable AllReduce-associated fusion patterns with fp4 and fp8 quantization in Llama3/4.

Summary by CodeRabbit

  • New Features

    • Added support for enabling or disabling fusion optimizations via environment variables.
    • Introduced quantization-aware fusion for improved performance with quantized models.
    • Enabled cross-layer fusion by linking normalization and attention modules between layers.
  • Improvements

    • Unified and streamlined logic for controlling fusion and all-reduce operations, enhancing efficiency.
    • Removed redundant normalization on final outputs for improved model performance.
    • Optimized CUDA kernel launch configuration to improve GPU execution efficiency.


coderabbitai bot commented Jul 20, 2025

📝 Walkthrough

"""

Walkthrough

The changes refactor fusion and all-reduce logic in Llama decoder layers, introducing environment-variable-based fusion enablement, quantization-aware fusion, and cross-layer fusion via new attributes. The code unifies fusion and all-reduce control, adds quantization support, links normalization and attention modules across layers for enhanced fusion capabilities, and adds a CUDA kernel launch bounds attribute.

Changes

  • tensorrt_llm/_torch/models/modeling_llama.py: Refactored Llama decoder layers to use environment-variable-controlled fusion flags; unified all-reduce logic; added quantization-aware fusion; introduced cross-layer fusion by linking normalization and attention modules; added new attributes and methods for fusion and weight loading; removed a redundant normalization call; updated tensor parallel checks.
  • cpp/tensorrt_llm/kernels/communicationKernels/allReduceFusionKernels.cu: Added a __launch_bounds__(1024) attribute to the allreduce_fusion_kernel_twoshot_sync CUDA kernel for optimization.

Sequence Diagram(s)

sequenceDiagram
    participant User
    participant LlamaForCausalLM
    participant DecoderLayer
    participant NextDecoderLayer

    User->>LlamaForCausalLM: load_weights(weights)
    LlamaForCausalLM->>DecoderLayer: set next_layer_layernorm, next_attn (from NextDecoderLayer)
    Note right of DecoderLayer: Enables cross-layer fusion

    User->>DecoderLayer: forward(input)
    DecoderLayer->>DecoderLayer: Check enable_fusion (from env)
    DecoderLayer->>DecoderLayer: Set fusion and all-reduce flags
    DecoderLayer->>DecoderLayer: If quantized, use quantization-aware fusion
    DecoderLayer->>DecoderLayer: Perform forward pass with fusion logic

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Suggested reviewers

  • nv-yilinf
    """



📜 Recent review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between d19035f and 1289a0e.

📒 Files selected for processing (2)
  • cpp/tensorrt_llm/kernels/communicationKernels/allReduceFusionKernels.cu (1 hunks)
  • tensorrt_llm/_torch/models/modeling_llama.py (14 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • cpp/tensorrt_llm/kernels/communicationKernels/allReduceFusionKernels.cu
🧰 Additional context used
🧠 Learnings (1)
tensorrt_llm/_torch/models/modeling_llama.py (1)

Learnt from: yechank-nvidia
PR: #6254
File: tensorrt_llm/_torch/pyexecutor/model_engine.py:1201-1204
Timestamp: 2025-07-22T09:22:14.726Z
Learning: In TensorRT-LLM's multimodal processing pipeline, shared tensor recovery using from_shared_tensor() is only needed during the context phase. Generation requests reuse the already-recovered tensor data and only need to call strip_for_generation() to remove unnecessary multimodal data while preserving the recovered tensors. This avoids redundant tensor recovery operations during generation.

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check
🔇 Additional comments (22)
tensorrt_llm/_torch/models/modeling_llama.py (22)

2-2: LGTM: Import addition for environment variable access

The os import is appropriately added to support the new environment variable-based fusion control.


341-341: LGTM: Improved tensor parallel condition check

The change from tp_size > 1 to has_tp() is more semantically correct and consistent with the codebase patterns.


394-404: LGTM: Quantization-aware fusion operation setup

The logic correctly sets up fusion operations based on quantization mode (NVFP4, FP8, or default). The commented FP8 section suggests future enablement is planned, which is appropriate for this PR's scope.
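For illustration, the selection reduces to a small branch on the layer's quantization mode; a minimal sketch with a stand-in enum (the member names mirror what this review describes and are not verified against the diff):

from enum import Enum, auto

class AllReduceFusionOp(Enum):  # stand-in for the real TensorRT-LLM enum
    RESIDUAL_RMS_NORM = auto()
    RESIDUAL_RMS_NORM_QUANT_FP8 = auto()
    RESIDUAL_RMS_NORM_QUANT_NVFP4 = auto()

def pick_fusion_op(is_nvfp4: bool, is_fp8: bool) -> AllReduceFusionOp:
    """Choose the fused all-reduce variant from the quantization mode."""
    if is_nvfp4:
        return AllReduceFusionOp.RESIDUAL_RMS_NORM_QUANT_NVFP4
    if is_fp8:
        # The FP8 path is prepared but still commented out in the PR.
        return AllReduceFusionOp.RESIDUAL_RMS_NORM_QUANT_FP8
    return AllReduceFusionOp.RESIDUAL_RMS_NORM

print(pick_fusion_op(is_nvfp4=True, is_fp8=False))  # NVFP4 fused norm + quant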


417-420: LGTM: Unified fusion control logic

The fusion configuration now properly considers tensor parallelism, attention data parallelism, and the environment-controlled fusion flag. The logic is correctly applied to both MLP and MOE fusion scenarios.

Also applies to: 433-436


449-450: LGTM: Cross-layer fusion capability

The addition of next_layer_layernorm and next_attn attributes enables cross-layer fusion by linking layers' normalization and attention modules. This supports the enhanced fusion patterns mentioned in the PR objectives.


454-461: LGTM: Consolidated allreduce control flags

The boolean flags disable_attn_allreduce and disable_feed_forward_allreduce effectively consolidate the logic for controlling allreduce operations during attention and feed-forward phases. This simplifies the forward method implementation.
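In sketch form, the consolidation amounts to two booleans derived once at init and reused in forward(); the exact composition below is an assumption based on this and later comments:

def compute_allreduce_flags(pre_fusion: bool, post_fusion: bool,
                            tp_size: int, enable_attention_dp: bool):
    """Collapse the per-phase conditions into two flags checked in forward()."""
    disable_attn_allreduce = pre_fusion or tp_size == 1 or enable_attention_dp
    disable_feed_forward_allreduce = post_fusion or tp_size == 1 or enable_attention_dp
    return disable_attn_allreduce, disable_feed_forward_allreduce

print(compute_allreduce_flags(True, False, tp_size=8, enable_attention_dp=False))
# (True, False): the attention all-reduce is subsumed by the fused kernel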


492-494: LGTM: Simplified allreduce parameter usage

The use of the consolidated disable_attn_allreduce flag cleanly controls allreduce behavior in the attention layer.


497-516: LGTM: Enhanced fusion with quantization support

The pre-fusion logic correctly handles quantization-aware fusion by:

  1. Setting appropriate scales for NVFP4/FP8 quantization
  2. Using the correct fusion operation based on quantization mode
  3. Properly unpacking fusion outputs into quantized tensor wrappers when needed

This implementation aligns with the PR's goal of enabling fusion patterns with fp4/fp8 quantization.
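A schematic of that unpacking, under assumptions: the keyword arguments stand in for AllReduceParams fields and make_fp4_tensor stands in for the Fp4QuantizedTensor wrapper named in this review.

def fused_allreduce_prenorm(all_reduce, hidden_states, residual, norm_weight,
                            scale, fusion_op, is_nvfp4, make_fp4_tensor):
    """Run the fused all-reduce + RMSNorm (+ optional NVFP4 quant) and unpack it."""
    outputs = all_reduce(hidden_states, fusion_op=fusion_op, residual=residual,
                         norm_weight=norm_weight, scale=scale)
    if is_nvfp4:
        # NVFP4 path returns (fp4 activations, block scale factors, residual);
        # the first two are wrapped so downstream GEMMs see a quantized tensor.
        act_fp4, act_sf, residual = outputs
        hidden_states = make_fp4_tensor(act_fp4, act_sf)
    else:
        hidden_states, residual = outputs
    return hidden_states, residual

# Stub kernel just to exercise the unpacking:
stub_kernel = lambda x, **kw: (x, "block_scales", kw["residual"])
print(fused_allreduce_prenorm(stub_kernel, "acts", "res", "w", 1.0,
                              "RESIDUAL_RMS_NORM_QUANT_NVFP4", True,
                              lambda a, s: (a, s)))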


522-526: LGTM: Speculative execution fusion control

The logic correctly disables fusion for layers captured by speculative metadata and updates the allreduce control flag accordingly. This ensures compatibility with speculative execution scenarios.
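Roughly, the guard looks like the sketch below; needs_capture is a placeholder name, since the review only indicates that spec_metadata is probed defensively with hasattr:

def maybe_disable_post_fusion(fusion_flags, spec_metadata, layer_idx,
                              tp_size, enable_attention_dp):
    """Disable post-fusion for layers whose hidden states are captured for
    speculative decoding, then re-derive the consolidated all-reduce flag."""
    if spec_metadata is not None and hasattr(spec_metadata, "needs_capture"):
        if spec_metadata.needs_capture(layer_idx):
            fusion_flags["POST_FUSION"] = False
    return fusion_flags["POST_FUSION"] or tp_size == 1 or enable_attention_dp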


532-534: LGTM: Consistent allreduce control in feed-forward

The feed-forward layer uses the same consolidated allreduce control pattern as the attention layer, maintaining consistency.


548-589: LGTM: Comprehensive post-fusion implementation

The post-fusion logic correctly:

  1. Determines the appropriate scale based on the next layer's quantization requirements
  2. Handles both cutlass min-latency mode and regular mode
  3. Uses the correct fusion operation based on quantization mode
  4. Properly unpacks fusion outputs for NVFP4 quantization

The implementation supports the cross-layer fusion capability introduced by the next_layer_layernorm and next_attn attributes.
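As a sketch of just the scale-selection step (the qkv_proj.input_scale attribute path is an inference from this description, not confirmed by the diff):

def next_layer_scale(next_attn, is_nvfp4, is_fp8):
    """Return the quantization scale the fused kernel should apply so its output
    lands directly in the next layer's attention input format; None disables it."""
    if next_attn is not None and (is_nvfp4 or is_fp8):
        # Assumed attribute path into the next layer's fused QKV projection.
        return next_attn.qkv_proj.input_scale
    return None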


607-614: LGTM: Consistent variable setup

The LlamaDecoderLayer now follows the same pattern as Llama4DecoderLayer for setting up mapping, attention DP, and quantization flags. This maintains consistency across decoder implementations.


637-641: LGTM: Consistent AllReduce and cross-layer setup

The AllReduce initialization and cross-layer fusion attributes are consistently implemented across both decoder layer types.


647-661: LGTM: Unified fusion control and operation setup

The LlamaDecoderLayer implements the same environment-controlled fusion logic and quantization-aware fusion operations as Llama4DecoderLayer, maintaining consistency.


663-668: LGTM: Consistent allreduce control flags

The consolidated allreduce control flags follow the same pattern as Llama4DecoderLayer but use the appropriate variable names (disable_mlp_allreduce vs disable_feed_forward_allreduce).


689-690: LGTM: Consistent attention allreduce control

The attention layer uses the consolidated allreduce control flag consistently with the Llama4DecoderLayer implementation.


694-715: LGTM: Pre-MLP fusion implementation

The pre-MLP fusion logic correctly mirrors the Llama4DecoderLayer implementation:

  1. Proper scale handling for quantization
  2. Correct fusion operation usage
  3. Appropriate output unpacking for NVFP4

The implementation is consistent and correct.


717-724: LGTM: Speculative execution compatibility

The speculative metadata handling correctly disables post-MLP fusion and updates the allreduce control flag. The hasattr check ensures compatibility with different spec_metadata implementations.


726-731: LGTM: MLP allreduce control

The MLP layer correctly uses the consolidated allreduce control flag.


741-762: LGTM: Post-MLP fusion implementation

The post-MLP fusion logic correctly implements cross-layer fusion:

  1. Proper scale determination for the next layer
  2. Correct fusion operation usage
  3. Appropriate output unpacking for quantization
  4. Fallback to next layer normalization when fusion isn't enabled

This completes the cross-layer fusion capability for LlamaDecoderLayer.


886-886: LGTM: Consistent tensor parallel check

The change to has_tp() maintains consistency with other parts of the codebase.


946-956: LGTM: Cross-layer fusion setup method

The new load_weights method correctly sets up the cross-layer fusion by:

  1. Linking each layer's next_layer_layernorm to the next layer's input normalization
  2. Linking each layer's next_attn to the next layer's attention module
  3. Properly handling the final layer by linking to the model's final normalization

This enables the cross-layer fusion patterns that were implemented in the decoder layers.
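Putting the three points together, a minimal sketch of the linking pass (container and attribute names follow the review; the demo objects are stand-ins):

from types import SimpleNamespace

def link_layers_for_fusion(layers, final_norm):
    """Point each decoder layer at the next layer's input layernorm and attention
    module; the last layer points at the model's final norm instead."""
    for idx, layer in enumerate(layers):
        if idx == len(layers) - 1:
            layer.next_layer_layernorm = final_norm
            layer.next_attn = None
        else:
            layer.next_layer_layernorm = layers[idx + 1].input_layernorm
            layer.next_attn = layers[idx + 1].self_attn

layers = [SimpleNamespace(input_layernorm=f"ln{i}", self_attn=f"attn{i}")
          for i in range(3)]
link_layers_for_fusion(layers, final_norm="model.norm")
print(layers[0].next_layer_layernorm, layers[-1].next_layer_layernorm)  # ln1 model.norm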




@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

🔭 Outside diff range comments (1)
cpp/tensorrt_llm/thop/allreduceOp.cpp (1)

416-425: Remove dead code or clarify the forced oneshot override

The unconditional allreduce_fusion_params.use_oneshot = true; (line 425) turns the preceding branch (lines 417–423) into dead code and makes the TWOSHOT validation block (lines 427–432) unreachable. Please either:

  • Remove the dead code (the conditional that sets use_oneshot based on strategy/seq_len and the subsequent TWOSHOT check), if this override is permanent.
  • Or, add a clear comment explaining why oneshot is being forced, when/under what conditions it will be revisited, and disable or gate the override accordingly.

Locations to update:
• cpp/tensorrt_llm/thop/allreduceOp.cpp, around lines 417–432

Suggested diff:

-        // Determine if using oneshot or twoshot allreduce kernel
-        if (strategy == AllReduceStrategyType::MIN_LATENCY)
-        {
-            allreduce_fusion_params.use_oneshot = seq_len <= tensorrt_llm::kernels::ar_fusion::kOneShotMaxToken;
-        }
-        else
-        {
-            allreduce_fusion_params.use_oneshot = strategy == AllReduceStrategyType::ONESHOT;
-        }
-        // Force use oneshot
-        allreduce_fusion_params.use_oneshot = true;
-
-        // Check for some kernel constraints if using TWOSHOT kernel
-        if (!allreduce_fusion_params.use_oneshot)
-        {
-            TORCH_CHECK(input.size(0) >= static_cast<int64_t>(tp_size),
-                "Sequence length must be greater than or equal to TP size");
-        }
+        // Force use oneshot kernel for all fusion patterns.
+        // TODO: Remove this override or restore conditional logic after benchmarking with fp4/fp8.
+        allreduce_fusion_params.use_oneshot = true;
🧹 Nitpick comments (1)
tensorrt_llm/_torch/models/modeling_llama.py (1)

626-629: Consider unifying fusion configuration between decoder classes

LlamaDecoderLayer uses instance attributes (self.PRE_MLP_FUSION, self.POST_MLP_FUSION) while Llama4DecoderLayer uses self.fusion_config.PRE_MLP_FUSION. Consider using a consistent approach across both classes for better maintainability.

+        self.fusion_config = EagerFusionConfig()
-        self.PRE_MLP_FUSION = self.mapping.has_tp(
+        self.fusion_config.PRE_MLP_FUSION = self.mapping.has_tp(
         ) and not self.enable_attention_dp and self.enable_fusion
-        self.POST_MLP_FUSION = self.mapping.has_tp() and self.enable_fusion
+        self.fusion_config.POST_MLP_FUSION = self.mapping.has_tp() and self.enable_fusion

Then update the usage in the forward method accordingly.

📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 5300a99 and c10ac0b.

📒 Files selected for processing (2)
  • cpp/tensorrt_llm/thop/allreduceOp.cpp (1 hunks)
  • tensorrt_llm/_torch/models/modeling_llama.py (15 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check
🔇 Additional comments (5)
cpp/tensorrt_llm/thop/allreduceOp.cpp (1)

424-425: Verify performance impact of forcing oneshot kernel

Forcing the oneshot kernel for all cases might not be optimal, especially for longer sequences where the twoshot kernel could be more efficient. The PR aims to enable fusion patterns for fp4/fp8 quantization, but it's unclear if oneshot is the best choice for all scenarios.

Could you clarify:

  1. Is this change temporary for testing or permanent?
  2. Have you benchmarked the performance impact for various sequence lengths?
  3. Should this be configurable based on quantization type (fp4/fp8)?
tensorrt_llm/_torch/models/modeling_llama.py (4)

341-343: Ensure consistent AllReduce behavior with attention DP

The condition for performing AllReduce checks both enable_attention_dp and has_tp(). When attention data parallelism is enabled, the AllReduce is skipped. Please verify this is the intended behavior for the fusion patterns with fp4/fp8 quantization.
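In sketch form, the gate in question is simply the following (has_tp() is confirmed as a real Mapping method elsewhere in this review; the wrapper function is illustrative):

def should_allreduce(has_tp: bool, enable_attention_dp: bool) -> bool:
    """All-reduce only when tensor parallelism is active and attention data
    parallelism is not handling the partitioning instead."""
    return has_tp and not enable_attention_dp

print(should_allreduce(True, False))  # True: perform the all-reduce
print(should_allreduce(True, True))   # False: skipped under attention DP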


482-492: Consistent handling of quantization-aware fusion outputs

The quantization-aware fusion code properly handles NVFP4 outputs by unpacking them into Fp4QuantizedTensor objects. The implementation looks correct for both pre-fusion and post-fusion cases.

Good job on maintaining consistency across different fusion points!

Also applies to: 544-555, 673-691, 703-721


913-924: Well-designed cross-layer fusion setup

The load_weights method properly sets up cross-layer references (next_layer_layernorm and next_attn) to enable fusion across decoder layers. This is a clean approach to enable cross-layer optimizations.

The special handling for the last layer (using model.norm) is also correct.

Also applies to: 1051-1058


853-853: Mapping.has_tp() is a defined method—no change needed

The Mapping class in tensorrt_llm/mapping.py (around line 387) implements def has_tp(self):…, so invoking model_config.mapping.has_tp() is the intended API. You can safely ignore the suggestion to switch to a property-based check.

Likely an incorrect or invalid review comment.

@hyukn hyukn force-pushed the feat/llama_ar_fusion branch 2 times, most recently from 4b557d0 to b73390b on July 20, 2025 17:03
@hyukn
Collaborator Author

hyukn commented Jul 20, 2025

/bot run --add-multi-gpu-test --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #12388 [ run ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #12388 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #9207 completed with status: 'FAILURE'

@hyukn hyukn force-pushed the feat/llama_ar_fusion branch 2 times, most recently from 3ca263d to 74d7950 on July 21, 2025 09:48
@hyukn
Collaborator Author

hyukn commented Jul 21, 2025

/bot run --add-multi-gpu-test --disable-fail-fast


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

♻️ Duplicate comments (1)
tensorrt_llm/_torch/models/modeling_llama.py (1)

688-697: Similar code duplication issue as in Llama4DecoderLayer.

The fusion op assignment logic here has the same duplication pattern as in Llama4DecoderLayer. The same refactoring suggestion applies.

🧹 Nitpick comments (1)
tensorrt_llm/_torch/models/modeling_llama.py (1)

699-700: Document the reason for disabling POST_MLP_FUSION.

The TODO comment indicates POST_MLP_FUSION is disabled due to accuracy drop, but lacks specificity about the issue or timeline for resolution.

Consider expanding the comment to include:

  • What kind of accuracy drop was observed
  • Under what conditions it occurs
  • Timeline or conditions for re-enabling
  • Alternative approaches being considered
-        # TODO: Disable this to avoid large accuracy drop
+        # TODO: POST_MLP_FUSION disabled due to accuracy degradation in specific workloads
+        # Re-enable after investigating root cause and implementing fixes
         self.POST_MLP_FUSION = False
📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between b73390b and 74d7950.

📒 Files selected for processing (1)
  • tensorrt_llm/_torch/models/modeling_llama.py (14 hunks)
🔇 Additional comments (14)
tensorrt_llm/_torch/models/modeling_llama.py (14)

2-2: LGTM: Import addition is necessary for environment variable access.

The os import is correctly added to support the new environment variable-based fusion control functionality.


341-341: LGTM: Improved abstraction for tensor parallel condition.

Using self.mapping.has_tp() instead of direct tensor parallel size comparison provides better encapsulation and consistency with the rest of the codebase.


388-390: LGTM: Environment variable control for fusion enablement.

The environment variable TRTLLM_LLAMA_EAGER_FUSION_DISABLED provides appropriate runtime control for fusion behavior. The naming is model-specific and clear in intent.
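For reference, the gate boils down to a one-line check; a minimal standalone sketch (only the variable name and default come from the review, the class wrapper is a stand-in):

import os

class _DecoderLayerSketch:
    def __init__(self):
        # Fusion stays on unless TRTLLM_LLAMA_EAGER_FUSION_DISABLED is set to
        # anything other than "0".
        self.enable_fusion = os.environ.get(
            "TRTLLM_LLAMA_EAGER_FUSION_DISABLED", "0") == "0"

print(_DecoderLayerSketch().enable_fusion)  # True unless the variable is exported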


473-481: LGTM: Consolidated allreduce disable logic.

The boolean flags disable_attn_allreduce and disable_feed_forward_allreduce properly consolidate the conditions for disabling allreduce operations, making the logic clearer and more maintainable.


511-513: LGTM: Proper usage of consolidated allreduce disable flags.

The forward method correctly uses disable_attn_allreduce and disable_feed_forward_allreduce flags to control AllReduce operations, maintaining consistency with the initialization logic.

Also applies to: 557-559


516-520: LGTM: Quantization-aware scale extraction.

The scale extraction logic properly handles both NVFP4 and FP8 quantization modes, providing the necessary scale information for fusion operations.


547-552: LGTM: Proper handling of NVFP4 quantized outputs.

The code correctly unpacks NVFP4 quantized outputs into Fp4QuantizedTensor objects, maintaining the quantization state through the computation pipeline.


570-628: LGTM: Cross-layer fusion implementation.

The cross-layer fusion logic using next_layer_layernorm and next_attn attributes enables advanced fusion patterns. The implementation handles both normal and cutlass min-latency modes appropriately, with proper quantization support.


642-650: LGTM: Proper initialization of new member variables.

The addition of mapping, quantization flags, and next layer references follows the same pattern as Llama4DecoderLayer and provides necessary context for fusion operations.

Also applies to: 672-676


682-687: LGTM: Consistent fusion control implementation.

The environment variable-based fusion control and flag initialization follows the same pattern as Llama4DecoderLayer, maintaining consistency across model variants.


728-730: LGTM: Consistent allreduce disable flag usage.

The forward method correctly uses the consolidated disable flags for controlling AllReduce operations in both attention and MLP sections.

Also applies to: 758-760


733-755: LGTM: Quantization-aware PRE_MLP_FUSION implementation.

The PRE_MLP_FUSION logic properly handles scale extraction for NVFP4 quantization and correctly unpacks fusion outputs into appropriate tensor formats.


913-913: LGTM: Consistent tensor parallel condition abstraction.

Same improvement as seen elsewhere - using has_tp() instead of direct size comparison provides better encapsulation.


973-984: LGTM: Essential cross-layer fusion setup.

The load_weights method correctly establishes the cross-layer references needed for fusion patterns. The logic properly handles both intermediate layers (linking to next layer's components) and the final layer (linking to the model's norm).

This enables the cross-layer fusion capabilities referenced in the forward methods of the decoder layers.

@tensorrt-cicd
Collaborator

PR_Github #12439 [ run ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #12439 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #9251 completed with status: 'FAILURE'

@hyukn hyukn marked this pull request as ready for review July 22, 2025 06:10
@hyukn hyukn requested a review from a team as a code owner July 22, 2025 06:10
@hyukn hyukn force-pushed the feat/llama_ar_fusion branch from 74d7950 to a4d3136 on July 22, 2025 06:12
@hyukn
Collaborator Author

hyukn commented Jul 22, 2025

/bot run --add-multi-gpu-test --disable-fail-fast


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

♻️ Duplicate comments (3)
tensorrt_llm/_torch/models/modeling_llama.py (3)

391-400: Note: Fusion op assignment logic is still duplicated.

This code segment still has the same duplication issue mentioned in the past review comment. The fusion op assignment logic is repeated between this section and the LlamaDecoderLayer class (lines 649-658). The previous suggestion to consolidate this into shared variables still applies.


643-644: Consistent environment variable naming needed.

Same issue as in Llama4DecoderLayer - the environment variable name should be consistent across both classes.


649-658: Code duplication: Fusion op assignment logic repeated.

This fusion op assignment logic is duplicated from the Llama4DecoderLayer class (lines 391-400). The previous review suggestion to consolidate this into shared variables still applies to reduce code duplication.

🧹 Nitpick comments (3)
tensorrt_llm/_torch/models/modeling_llama.py (3)

388-390: Consider renaming the environment variable for clarity.

The environment variable TRTLLM_LLAMA_EAGER_FUSION_DISABLED is used for both Llama and Llama4 models, which might be confusing. Consider using a more generic name like TRTLLM_EAGER_FUSION_DISABLED or model-specific names.

-        self.enable_fusion = os.environ.get(
-            "TRTLLM_LLAMA_EAGER_FUSION_DISABLED", "0") == "0"
+        self.enable_fusion = os.environ.get(
+            "TRTLLM_EAGER_FUSION_DISABLED", "0") == "0"

461-461: Remove debug print statement.

The debug print statement should be removed before merging to production.

-        print(f"init Llama4DecoderLayer")

579-579: Remove debug print statements.

Debug print statements should be removed before production deployment.

-            print(f"{self.layer_idx}, {self.next_layer_layernorm}")
-        print(f"in forward")

Also applies to: 583-583

📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 74d7950 and a4d3136.

📒 Files selected for processing (1)
  • tensorrt_llm/_torch/models/modeling_llama.py (14 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check
🔇 Additional comments (9)
tensorrt_llm/_torch/models/modeling_llama.py (9)

2-2: LGTM - Import addition for environment variable support.

The os import is correctly added to support the environment variable-based fusion control introduced later in the code.


341-341: LGTM - Consistent use of mapping API.

The change from tp_size > 1 to has_tp() improves consistency with the mapping API usage pattern throughout the codebase.


452-459: LGTM - Well-structured all-reduce disable flags.

The consolidation of all-reduce disable logic into boolean flags (disable_attn_allreduce and disable_feed_forward_allreduce) improves code clarity and maintainability by centralizing the conditions.


517-521: LGTM - Proper quantization handling in fusion.

The quantization-aware fusion logic correctly unpacks the fusion output into separate components (fp4 tensor, scale factor, and residual) when NVFP4 quantization is enabled, and wraps them appropriately.

Also applies to: 584-588


660-661: Clarify the accuracy drop issue.

The TODO comment mentions disabling fusion "to avoid large accuracy drop" but doesn't provide details about the cause or planned resolution. This could impact performance benefits.

Can you provide more context about this accuracy drop issue? Is this a temporary workaround, and what's the timeline for fixing it?


663-668: LGTM - Consistent disable flags pattern.

The all-reduce disable flags follow the same well-structured pattern as the Llama4DecoderLayer, improving code consistency and maintainability.


694-722: LGTM - Comprehensive fusion logic with quantization support.

The pre-MLP and post-MLP fusion logic properly handles both regular and quantized (NVFP4) cases, correctly unpacking fusion outputs and creating appropriate tensor wrappers. The cross-layer fusion setup is also well-implemented.

Also applies to: 731-752


874-874: LGTM - Consistent mapping API usage.

The change to use has_tp() instead of tp_size > 1 maintains consistency with the mapping API pattern used throughout the codebase.


934-944: LGTM - Well-implemented cross-layer fusion setup.

The load_weights method correctly establishes cross-layer references needed for AllReduce fusion patterns. The logic properly handles the last layer (linking to final norm) and intermediate layers (linking to next layer's input normalization and attention modules).

@tensorrt-cicd
Collaborator

PR_Github #12528 [ run ] triggered by Bot

@hyukn hyukn requested a review from yilin-void July 22, 2025 07:12
@hyukn hyukn force-pushed the feat/llama_ar_fusion branch from a4d3136 to d84c4ce on July 22, 2025 09:20
@tensorrt-cicd
Collaborator

PR_Github #12528 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #9313 completed with status: 'FAILURE'

@hyukn
Collaborator Author

hyukn commented Jul 22, 2025

/bot run --add-multi-gpu-test --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #12579 [ run ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #12579 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #9357 completed with status: 'FAILURE'

@hyukn
Collaborator Author

hyukn commented Jul 23, 2025

/bot run --only-multi-gpu-test --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #12630 [ run ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #12630 [ run ] completed with state FAILURE

@hyukn
Collaborator Author

hyukn commented Jul 23, 2025

/bot run --only-multi-gpu-test --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #12635 [ run ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #12635 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #9391 (Partly Tested) completed with status: 'FAILURE'

@hyukn hyukn force-pushed the feat/llama_ar_fusion branch from d84c4ce to 36c9c91 on July 24, 2025 06:48

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

♻️ Duplicate comments (1)
tensorrt_llm/_torch/models/modeling_llama.py (1)

388-389: Environment variable name inconsistency.

The environment variable name TRTLLM_LLAMA_EAGER_FUSION_DISABLED differs from the past review comments which referenced TRTLLM_DEEPSEEK_EAGER_FUSION_DISABLED. Ensure consistent naming across the codebase.

🧹 Nitpick comments (2)
tensorrt_llm/_torch/models/modeling_llama.py (2)

461-461: Remove debug print statement.

This debug print should be removed before merging to production.

-        print(f"init Llama4DecoderLayer")

586-586: Remove debug print statements.

These debug prints should be removed before merging to production.

-            print(f"{self.layer_idx}, {self.next_layer_layernorm}")
-        print(f"in forward")

Also applies to: 590-590

📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between d84c4ce and 36c9c91.

📒 Files selected for processing (2)
  • cpp/tensorrt_llm/kernels/communicationKernels/allReduceFusionKernels.cu (1 hunks)
  • tensorrt_llm/_torch/models/modeling_llama.py (14 hunks)
✅ Files skipped from review due to trivial changes (1)
  • cpp/tensorrt_llm/kernels/communicationKernels/allReduceFusionKernels.cu
🧰 Additional context used
🧠 Learnings (1)
tensorrt_llm/_torch/models/modeling_llama.py (1)

Learnt from: yechank-nvidia
PR: #6254
File: tensorrt_llm/_torch/pyexecutor/model_engine.py:1201-1204
Timestamp: 2025-07-22T09:22:14.726Z
Learning: In TensorRT-LLM's multimodal processing pipeline, shared tensor recovery using from_shared_tensor() is only needed during the context phase. Generation requests reuse the already-recovered tensor data and only need to call strip_for_generation() to remove unnecessary multimodal data while preserving the recovered tensors. This avoids redundant tensor recovery operations during generation.

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check
🔇 Additional comments (16)
tensorrt_llm/_torch/models/modeling_llama.py (16)

2-2: LGTM: Import addition for environment variable support.

The os import is correctly added to support the environment variable-based fusion control introduced in the decoder layers.


341-341: LGTM: Consistent tensor parallel check.

Using has_tp() instead of tp_size > 1 maintains consistency with other tensor parallel checks throughout the codebase.


391-400: LGTM: Proper quantization-aware fusion operation setup.

The fusion operations are correctly configured based on quantization mode, with NVFP4 support implemented and FP8 support prepared for future enablement.


414-417: LGTM: Correct fusion flag configuration.

The fusion flags properly incorporate tensor parallelism presence, attention data parallelism state, and fusion enablement, ensuring fusion is only enabled when appropriate.

Also applies to: 431-434


452-459: LGTM: Well-designed consolidated all-reduce control.

The disable flags correctly consolidate fusion configuration, tensor parallel size, and attention DP state into clear boolean flags that control all-reduce behavior throughout the forward pass.


492-494: LGTM: Proper all-reduce control in attention.

The all-reduce parameters correctly use the consolidated disable flag to control all-reduce behavior during the attention phase.


497-510: LGTM: Correct quantization-aware fusion implementation.

The scale extraction for quantized modes and NVFP4 tensor unpacking are implemented correctly, properly handling the quantized tensor format returned by fusion operations.

Also applies to: 517-521


524-528: LGTM: Proper speculative metadata handling.

The fusion disabling logic for layers captured by speculative metadata is correctly implemented to avoid interference with speculative execution.


547-582: LGTM: Well-implemented cross-layer fusion.

The cross-layer fusion logic correctly references the next layer's normalization and attention modules, with proper quantization-aware scale handling for both normal and min-latency modes.


610-617: LGTM: Necessary attributes for fusion logic.

The additional attributes for mapping, attention DP, and quantization flags are correctly added to support the fusion implementation.


640-643: LGTM: Proper initialization of cross-layer fusion attributes.

The AllReduce initialization and cross-layer reference attributes are correctly set up to enable fusion between adjacent layers.


650-676: LGTM: Consistent fusion control implementation.

The environment variable-based fusion control and consolidated disable flags follow the same correct pattern as Llama4DecoderLayer, ensuring consistent behavior across model variants.


697-698: LGTM: Consistent fusion implementation for LlamaDecoderLayer.

The fusion logic correctly mirrors the Llama4DecoderLayer implementation, with proper quantization handling, speculative metadata support, and cross-layer fusion capabilities.

Also applies to: 702-723, 731-736, 745-766


888-888: LGTM: Consistent tensor parallel check.

Using has_tp() instead of tp_size > 1 maintains consistency with the tensor parallel checks used throughout the fusion logic.


948-958: LGTM: Proper cross-layer fusion setup.

The load_weights method correctly establishes cross-layer references by linking each decoder layer to the next layer's normalization and attention modules, with proper handling of the final layer boundary condition.


1086-1093: LGTM: Consistent cross-layer fusion setup for Llama4.

The cross-layer fusion setup in the Llama4 conditional generation model correctly mirrors the implementation in LlamaForCausalLM, ensuring consistent fusion behavior across model variants.

@hyukn hyukn force-pushed the feat/llama_ar_fusion branch from 36c9c91 to 3c9e0ed on July 24, 2025 06:51
@hyukn
Collaborator Author

hyukn commented Jul 24, 2025

/bot run --add-multi-gpu-test --disable-fail-fast

@hyukn hyukn force-pushed the feat/llama_ar_fusion branch from 3c9e0ed to d19035f on July 24, 2025 06:55
@tensorrt-cicd
Collaborator

PR_Github #12818 [ run ] triggered by Bot

@hyukn
Collaborator Author

hyukn commented Jul 24, 2025

/bot run --add-multi-gpu-test --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #12822 [ run ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #12818 [ run ] completed with state ABORTED

@tensorrt-cicd
Collaborator

PR_Github #12822 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #9556 completed with status: 'FAILURE'

…ama3/4.

Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>
@hyukn hyukn force-pushed the feat/llama_ar_fusion branch from d19035f to 1289a0e on July 24, 2025 13:49
@hyukn
Collaborator Author

hyukn commented Jul 24, 2025

/bot run --add-multi-gpu-test --disable-fail-fast

@coderabbitai coderabbitai bot requested a review from nv-yilinf July 24, 2025 13:49
@tensorrt-cicd
Collaborator

PR_Github #12863 [ run ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #12863 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #9587 completed with status: 'FAILURE'
