
Conversation


@juju812 juju812 commented Nov 24, 2025

📌 Description

This PR fixes a memory leak caused by the AutoTuner's LRU cache in combination with dynamically created, lambda-based TuningConfig objects.

🔍 Related Issues

#2139

🚀 Pull Request Checklist

Thank you for contributing to FlashInfer! Before we review your pull request, please make sure the following items are complete.

✅ Pre-commit Checks

  • I have installed pre-commit by running pip install pre-commit (or used your preferred method).
  • I have installed the hooks with pre-commit install.
  • I have run the hooks manually with pre-commit run --all-files and fixed any reported issues.

If you are unsure about how to set up pre-commit, see the pre-commit documentation.

🧪 Tests

  • Tests have been added or updated as needed.
  • All tests are passing (unittest, etc.).

Reviewer Notes

Summary by CodeRabbit

  • Performance
    • Reduced autotuner overhead by caching runner parameter names to avoid repeated signature inspection during profiling, speeding up tuning runs.
  • New Features
    • Centralized reusable tuning presets for mixed-precision GEMM (FP8/FP4) to improve autotuning and execution efficiency.


Copilot AI review requested due to automatic review settings November 24, 2025 10:04

coderabbitai bot commented Nov 24, 2025

Warning

Rate limit exceeded

@yzh119 has exceeded the limit for the number of commits or files that can be reviewed per hour. Please wait 23 minutes and 16 seconds before requesting another review.

⌛ How to resolve this issue?

After the wait time has elapsed, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout.

Please see our FAQ for further information.

📥 Commits

Reviewing files that changed from the base of the PR and between 79a7ae1 and 79a3721.

📒 Files selected for processing (1)
  • flashinfer/gemm/gemm_base.py (4 hunks)

Walkthrough

Precompute a mapping from each runner to its forward parameter-name set for reuse during profiling; consolidate GEMM tuning by adding module-level tuning configs and a small helper, and refactor fp8_gemm_sm100 and mm_fp4 to use those shared configs.

Changes

Cohort / File(s) / Change summary:
  • Profiling Loop Optimization (flashinfer/autotuner.py): Add runner_arg_names_map, which maps each runner to the set of parameter names from r.forward; reuse this map inside the profiling loop instead of repeatedly calling inspect.signature(...).parameters. Invocation now uses the cached set to check for do_preparation and to build **kwargs.
  • GEMM Tuning Configuration Consolidation (flashinfer/gemm/gemm_base.py): Introduce module-level tuning configs _FP8_GEMM_SM100_TUNING_CONFIG, _MM_FP4_TUNING_CONFIG_8x4, and _MM_FP4_TUNING_CONFIG_128x4; add helper _pad_up(x, y); refactor fp8_gemm_sm100() and mm_fp4() to select and use these shared configs instead of constructing tuning configs inline (a simplified sketch follows below).
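A minimal, self-contained sketch of the consolidation pattern. The _TuningConfigStandIn type and its fields below are illustrative stand-ins, not flashinfer's real TuningConfig API; only the presumed _pad_up behavior and the module-level-constant idea come from the change summary above.

from dataclasses import dataclass
from typing import Callable, Tuple


def _pad_up(x: int, y: int) -> int:
    # Presumed behavior of the new helper: round x up to the next multiple of y.
    return ((x + y - 1) // y) * y


@dataclass(frozen=True)
class _TuningConfigStandIn:
    # Hypothetical stand-in; the real TuningConfig carries dynamic-tensor and
    # constraint specs whose callables end up influencing the autotuner cache.
    constraint_fns: Tuple[Callable[[int], int], ...] = ()


# Module-level constants: the lambdas are created exactly once, so every call
# sees the same callable identities and the autotuner's LRU cache stops growing.
_MM_FP4_TUNING_CONFIG_8x4 = _TuningConfigStandIn(constraint_fns=(lambda m: _pad_up(m, 8),))
_MM_FP4_TUNING_CONFIG_128x4 = _TuningConfigStandIn(constraint_fns=(lambda m: _pad_up(m, 128),))


def mm_fp4_like(m: int, use_128x4_layout: bool) -> _TuningConfigStandIn:
    # Inside the op, select a shared constant instead of building a fresh
    # config (with fresh lambdas) on every invocation.
    return _MM_FP4_TUNING_CONFIG_128x4 if use_128x4_layout else _MM_FP4_TUNING_CONFIG_8x4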

Sequence Diagram(s)

sequenceDiagram
  participant Autotuner
  participant Profiler
  participant Runner as Runner(s)

  rect rgb(221,235,247)
    note right of Autotuner: setup
    Autotuner->>Autotuner: build runner_arg_names_map (runner -> param-name set)
  end

  rect rgb(247,246,221)
    note right of Profiler: profiling loop
    loop each profiling iteration
      Profiler->>Autotuner: request param-name set for runner
      Autotuner-->>Profiler: return cached param-name set
      Profiler->>Runner: invoke Runner.forward(**kwargs) using cached set (check do_preparation)
      Runner-->>Profiler: return result
    end
  end

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

  • Files/areas to review closely:
    • correctness and constraints in the new tuning config constants in gemm_base.py
    • that mm_fp4 maintains prior layout/selection semantics when switching to the module-level configs
    • edge cases and input validation for _pad_up(x, y)
    • ensuring autotuner.py cached map covers all runner variants and preserves prior invocation behavior

Poem

🐇 I hopped through code at dawn’s first light,
I cached the names to speed the flight.
Tuning gathered in a single den,
Neat configs now ready — time to zen.
A tiny hop, a faster run — hooray!

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
  • Docstring Coverage (⚠️ Warning): Docstring coverage is 50.00%, which is insufficient; the required threshold is 80.00%. Resolution: run @coderabbitai generate docstrings to improve docstring coverage.

✅ Passed checks (2 passed)
  • Title check (✅ Passed): The title directly addresses the main purpose of the PR: fixing a memory leak via global TuningConfig instead of dynamic lambdas in AutoTuner.
  • Description check (✅ Passed): The description includes all required sections: a brief explanation of the memory leak fix, the related issue link, and completed pre-commit and test checklists.

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.


Comment @coderabbitai help to get the list of available commands and usage tips.

@gemini-code-assist
Contributor

Summary of Changes

Hello @juju812, I'm Gemini Code Assist [1]! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request resolves a memory leak in the AutoTuner's LRU cache. The leak was caused by the continuous dynamic creation of TuningConfig objects, whose embedded lambdas produced a new cache key on every call; the PR also removes repeated signature introspection for runner argument names. The solution promotes these TuningConfig objects to global constants and pre-computes argument names, thereby reducing object churn and improving memory usage and performance.

Highlights

  • Memory Leak Fix: Addresses a memory leak caused by the AutoTuner's LRU cache and the dynamic creation of TuningConfig objects.
  • Global TuningConfig: Refactors TuningConfig instances for fp8_gemm_sm100 and mm_fp4 operations into global constants, preventing their repeated dynamic instantiation.
  • Performance Optimization: Optimizes the AutoTuner.choose_one method by pre-computing runner argument names, reducing the overhead of repeated inspect.signature calls within a loop.
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature / Command / Description:
  • Code Review: /gemini review performs a code review for the current pull request in its current state.
  • Pull Request Summary: /gemini summary provides a summary of the current pull request in its current state.
  • Comment: @gemini-code-assist responds in comments when explicitly tagged, both in pull request comments and review comments.
  • Help: /gemini help displays a list of available commands.

Customization

To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

Copilot finished reviewing on behalf of juju812 November 24, 2025 10:06

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request effectively resolves a memory leak associated with the AutoTuner's LRU cache. The root cause, which was the dynamic creation of TuningConfig instances with lambda functions on each call, is correctly addressed by refactoring these configurations into module-level constants. This ensures stable cache keys and prevents unbounded cache growth. Additionally, the change in autotuner.py to pre-compute runner argument names is a welcome performance optimization that avoids repeated calls to inspect.signature. The changes are well-implemented, targeted, and significantly improve the memory efficiency and performance of the autotuning mechanism. The code quality is high, and I have no further suggestions.


Copilot AI left a comment


Pull request overview

This PR fixes a memory leak caused by the AutoTuner's LRU cache when used with dynamically created TuningConfig objects containing lambda functions. The fix moves TuningConfig objects from being created inside functions to module-level global constants, ensuring lambda functions have consistent object identities for proper cache key generation.

  • Extracted TuningConfig objects to module-level constants to prevent dynamic lambda creation
  • Added _pad_up helper function at module level for use in global configs
  • Optimized autotuner by pre-computing runner argument names outside the profiling loop

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.

File / Description:
  • flashinfer/gemm/gemm_base.py: Moved the TuningConfig objects for fp8_gemm_sm100 and mm_fp4 to global module-level constants, and extracted the _pad_up helper function to support the global configs.
  • flashinfer/autotuner.py: Optimized the choose_one method by pre-computing runner argument names outside the profiling loop to avoid redundant inspect.signature calls.



@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Nitpick comments (3)
flashinfer/gemm/gemm_base.py (2)

2022-2063: MM FP4 tuning configs: indices and padding assumptions are consistent

The MM FP4 configs appear internally consistent:

  • Dynamic spec uses input 0 (a) dim 0 (M), matching the mm_fp4 docstring.
  • ConstraintSpecs target input 2 (a_descale) dim 0 and input 6 (out) dim 0, which matches the inputs list layout in mm_fp4.
  • _pad_up(…, 8) vs _pad_up(…, 128) for the 8x4 vs 128x4 scale‑factor layouts matches the documented layouts and keeps out’s M unpadded.

Defining these as global TuningConfigs is a good fix for the lambda‑based config churn that was polluting the LRU cache.

If you expect mm_fp4’s input ordering to evolve, consider adding small named constants for the tensor indices (e.g. A_TENSOR_IDX = 0, A_DESCALE_IDX = 2, OUT_TENSOR_IDX = 6) and using them in the configs to reduce future drift risk, but this is purely optional.

Also applies to: 2185-2187
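If that optional hardening were adopted, it could look roughly like this (the names are the reviewer's hypothetical suggestions, not existing identifiers):

# Tensor positions in mm_fp4's inputs list, named once so the spec indices
# cannot silently drift if the input ordering ever changes.
A_TENSOR_IDX = 0       # a
A_DESCALE_IDX = 2      # a_descale
OUT_TENSOR_IDX = 6     # out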


692-721: Unbounded LRU on _find_nearest_profile may still cause growth with highly variable shapes

Even after switching to global TuningConfig instances, _find_nearest_profile is cached with @lru_cache(maxsize=None) and keyed by (shapes, tuning_config). If an application feeds in highly diverse shapes (e.g., many different M/N combinations), this cache can still grow without bound over time.

That’s orthogonal to the lambda‑allocation fix in this PR but still relevant to memory usage. If you want to fully harden against long‑running workloads with varying shapes, consider giving this cache a bounded maxsize or adding an explicit invalidation/aging strategy.
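One possible shape of that hardening, as a sketch only (the maxsize value is an arbitrary placeholder and the body is elided; this is not the actual flashinfer implementation):

from functools import lru_cache

@lru_cache(maxsize=4096)  # bounded: old (shapes, tuning_config) entries get evicted
def _find_nearest_profile(shapes, tuning_config):
    ...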

flashinfer/autotuner.py (1)

461-467: Precomputing forward-arg names per runner is a safe micro-optimization

Moving the inspect.signature(r.forward) call out of the inner tactic loop and caching param names per runner in runner_arg_names_map is behaviorally equivalent to the prior logic and reduces profiling overhead, especially when there are many tactics. Using the precomputed set to gate the do_preparation call is straightforward and keeps the “only if the runner declares it” contract intact.

If you later find this overhead still noticeable when repeatedly tuning the same runners, you could cache the arg-name set on the runner class or instance itself, but that’s a nice-to-have rather than necessary for this PR.

Also applies to: 480-483
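A simplified sketch of the hoisting pattern described here (the runner/tactic plumbing is condensed and hypothetical; the real logic lives in AutoTuner.choose_one):

import inspect

def profile_all(runners, tactics, kwargs):
    # One signature inspection per runner, instead of one per (runner, tactic).
    runner_arg_names_map = {
        r: set(inspect.signature(r.forward).parameters) for r in runners
    }
    for r in runners:
        arg_names = runner_arg_names_map[r]
        for tactic in tactics:
            # Forward only the kwargs this runner's forward() accepts; the same
            # cached set also gates any do_preparation handling.
            r.forward(**{k: v for k, v in kwargs.items() if k in arg_names})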

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between ecd4ef1 and f6626da.

📒 Files selected for processing (2)
  • flashinfer/autotuner.py (2 hunks)
  • flashinfer/gemm/gemm_base.py (4 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
flashinfer/autotuner.py (1)
flashinfer/fused_moe/core.py (2)
  • forward (427-461)
  • forward (1030-1190)
🔇 Additional comments (1)
flashinfer/gemm/gemm_base.py (1)

359-373: Global FP8 SM100 tuning config wiring looks correct

The new _FP8_GEMM_SM100_TUNING_CONFIG lines up with fp8_gemm_sm100’s inputs (index 0 = a, 4 = out), and using dim -2 consistently targets the M/token dimension for both 2D and 3D cases. Centralizing this as a module‑level TuningConfig also avoids per‑call lambda/config allocation and plays nicely with the AutoTuner cache design. No changes needed here.

Also applies to: 397-403
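A quick illustration of the dim -2 point (the shapes are made-up examples):

shape_2d = (128, 256)      # (M, K)
shape_3d = (4, 128, 256)   # (B, M, K)
assert shape_2d[-2] == shape_3d[-2] == 128  # dim -2 selects M in both layouts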


yzh119 commented Nov 24, 2025

/bot run

@flashinfer-bot
Collaborator

GitLab MR !163 has been created, and the CI pipeline #39090975 is currently running. I'll report back once the pipeline job completes.


@yzh119 yzh119 left a comment


@juju812 thanks for taking the time to investigate the OOM issue; the solution looks reasonable to me (creating a fixed set of TUNING_CONFIGs).

Left one comment.

cc @aleozlx @nvmbreughe for second look.

@flashinfer-bot
Collaborator

[FAILED] Pipeline #39090975: 13/18 passed

…ache and dynamic lambda TuningConfig

Pre-compute runner arg names to avoid calling inspect.signature in the loop
@yzh119 yzh119 enabled auto-merge (squash) November 25, 2025 07:04
@yzh119 yzh119 disabled auto-merge November 25, 2025 19:17
@yzh119 yzh119 merged commit d0d99d2 into flashinfer-ai:main Nov 25, 2025
4 checks passed