
Conversation


@juju812 juju812 commented Nov 24, 2025

📌 Description

This PR fixes a memory leak caused by the AutoTuner's LRU cache in combination with dynamically created, lambda-based TuningConfig objects.

🔍 Related Issues

#2139

🚀 Pull Request Checklist

Thank you for contributing to FlashInfer! Before we review your pull request, please make sure the following items are complete.

✅ Pre-commit Checks

  • I have installed pre-commit by running pip install pre-commit (or used your preferred method).
  • I have installed the hooks with pre-commit install.
  • I have run the hooks manually with pre-commit run --all-files and fixed any reported issues.

If you are unsure about how to set up pre-commit, see the pre-commit documentation.

🧪 Tests

  • Tests have been added or updated as needed.
  • All tests are passing (unittest, etc.).

Reviewer Notes

Summary by CodeRabbit

  • Performance
    • Reduced autotuner overhead by caching runner parameter names to avoid repeated signature inspection during profiling, speeding up tuning runs.
  • New Features
    • Centralized reusable tuning presets for mixed-precision GEMM (FP8/FP4) to improve autotuning and execution efficiency.


Copilot AI review requested due to automatic review settings November 24, 2025 10:04

coderabbitai bot commented Nov 24, 2025

Warning

Rate limit exceeded

@yzh119 has exceeded the limit for the number of commits or files that can be reviewed per hour. Please wait 23 minutes and 16 seconds before requesting another review.

⌛ How to resolve this issue?

After the wait time has elapsed, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout.

Please see our FAQ for further information.

📥 Commits

Reviewing files that changed from the base of the PR and between 79a7ae1 and 79a3721.

📒 Files selected for processing (1)
  • flashinfer/gemm/gemm_base.py (4 hunks)

Walkthrough

Precompute a mapping from each runner to its forward parameter-name set for reuse during profiling; consolidate GEMM tuning by adding module-level tuning configs and a small helper, and refactor fp8_gemm_sm100 and mm_fp4 to use those shared configs.

Changes

Cohort / File(s) / Change summary:
  • Profiling Loop Optimization (flashinfer/autotuner.py): Add runner_arg_names_map, which maps each runner to the set of parameter names from r.forward; reuse this map inside the profiling loop instead of repeatedly calling inspect.signature(...).parameters. Invocation now uses the cached set to check for do_preparation and to build **kwargs.
  • GEMM Tuning Configuration Consolidation (flashinfer/gemm/gemm_base.py): Introduce module-level tuning configs _FP8_GEMM_SM100_TUNING_CONFIG, _MM_FP4_TUNING_CONFIG_8x4, and _MM_FP4_TUNING_CONFIG_128x4; add helper _pad_up(x, y); refactor fp8_gemm_sm100() and mm_fp4() to select and use these shared configs instead of constructing tuning configs inline (a simplified sketch follows below).
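A minimal, self-contained sketch of the consolidation pattern. The _TuningConfigStandIn type and its fields below are illustrative stand-ins, not flashinfer's real TuningConfig API; only the presumed _pad_up behavior and the module-level-constant idea come from the change summary above.

from dataclasses import dataclass
from typing import Callable, Tuple


def _pad_up(x: int, y: int) -> int:
    # Presumed behavior of the new helper: round x up to the next multiple of y.
    return ((x + y - 1) // y) * y


@dataclass(frozen=True)
class _TuningConfigStandIn:
    # Hypothetical stand-in; the real TuningConfig carries dynamic-tensor and
    # constraint specs whose callables end up influencing the autotuner cache.
    constraint_fns: Tuple[Callable[[int], int], ...] = ()


# Module-level constants: the lambdas are created exactly once, so every call
# sees the same callable identities and the autotuner's LRU cache stops growing.
_MM_FP4_TUNING_CONFIG_8x4 = _TuningConfigStandIn(constraint_fns=(lambda m: _pad_up(m, 8),))
_MM_FP4_TUNING_CONFIG_128x4 = _TuningConfigStandIn(constraint_fns=(lambda m: _pad_up(m, 128),))


def mm_fp4_like(m: int, use_128x4_layout: bool) -> _TuningConfigStandIn:
    # Inside the op, select a shared constant instead of building a fresh
    # config (with fresh lambdas) on every invocation.
    return _MM_FP4_TUNING_CONFIG_128x4 if use_128x4_layout else _MM_FP4_TUNING_CONFIG_8x4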

Sequence Diagram(s)

sequenceDiagram
  participant Autotuner
  participant Profiler
  participant Runner as Runner(s)

  rect rgb(221,235,247)
    note right of Autotuner: setup
    Autotuner->>Autotuner: build runner_arg_names_map (runner -> param-name set)
  end

  rect rgb(247,246,221)
    note right of Profiler: profiling loop
    loop each profiling iteration
      Profiler->>Autotuner: request param-name set for runner
      Autotuner-->>Profiler: return cached param-name set
      Profiler->>Runner: invoke Runner.forward(**kwargs) using cached set (check do_preparation)
      Runner-->>Profiler: return result
    end
  end

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

  • Files/areas to review closely:
    • correctness and constraints in the new tuning config constants in gemm_base.py
    • that mm_fp4 maintains prior layout/selection semantics when switching to the module-level configs
    • edge cases and input validation for _pad_up(x, y)
    • ensuring autotuner.py cached map covers all runner variants and preserves prior invocation behavior

Poem

🐇 I hopped through code at dawn’s first light,
I cached the names to speed the flight.
Tuning gathered in a single den,
Neat configs now ready — time to zen.
A tiny hop, a faster run — hooray!

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
  • Docstring Coverage (⚠️ Warning): Docstring coverage is 50.00%, which is insufficient; the required threshold is 80.00%. Resolution: run @coderabbitai generate docstrings to improve docstring coverage.

✅ Passed checks (2 passed)
  • Title check (✅ Passed): The title directly addresses the main purpose of the PR: fixing a memory leak via global TuningConfig instead of dynamic lambdas in AutoTuner.
  • Description check (✅ Passed): The description includes all required sections: a brief explanation of the memory leak fix, the related issue link, and completed pre-commit and test checklists.

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.


Comment @coderabbitai help to get the list of available commands and usage tips.

@gemini-code-assist
Contributor

Summary of Changes

Hello @juju812, I'm Gemini Code Assist [1]! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request resolves a memory leak in the AutoTuner's LRU cache. The leak was caused by the continuous dynamic creation of TuningConfig objects, whose embedded lambdas produced a new cache key on every call; the PR also removes repeated signature introspection for runner argument names. The solution promotes these TuningConfig objects to global constants and pre-computes argument names, thereby reducing object churn and improving memory usage and performance.

Highlights

  • Memory Leak Fix: Addresses a memory leak caused by the AutoTuner's LRU cache and the dynamic creation of TuningConfig objects.
  • Global TuningConfig: Refactors TuningConfig instances for fp8_gemm_sm100 and mm_fp4 operations into global constants, preventing their repeated dynamic instantiation.
  • Performance Optimization: Optimizes the AutoTuner.choose_one method by pre-computing runner argument names, reducing the overhead of repeated inspect.signature calls within a loop.
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature / Command / Description:
  • Code Review: /gemini review performs a code review for the current pull request in its current state.
  • Pull Request Summary: /gemini summary provides a summary of the current pull request in its current state.
  • Comment: @gemini-code-assist responds in comments when explicitly tagged, both in pull request comments and review comments.
  • Help: /gemini help displays a list of available commands.

Customization

To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

Copilot finished reviewing on behalf of juju812 November 24, 2025 10:06

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request effectively resolves a memory leak associated with the AutoTuner's LRU cache. The root cause, which was the dynamic creation of TuningConfig instances with lambda functions on each call, is correctly addressed by refactoring these configurations into module-level constants. This ensures stable cache keys and prevents unbounded cache growth. Additionally, the change in autotuner.py to pre-compute runner argument names is a welcome performance optimization that avoids repeated calls to inspect.signature. The changes are well-implemented, targeted, and significantly improve the memory efficiency and performance of the autotuning mechanism. The code quality is high, and I have no further suggestions.


Copilot AI left a comment


Pull request overview

This PR fixes a memory leak caused by the AutoTuner's LRU cache when used with dynamically created TuningConfig objects containing lambda functions. The fix moves TuningConfig objects from being created inside functions to module-level global constants, ensuring lambda functions have consistent object identities for proper cache key generation.

  • Extracted TuningConfig objects to module-level constants to prevent dynamic lambda creation
  • Added _pad_up helper function at module level for use in global configs
  • Optimized autotuner by pre-computing runner argument names outside the profiling loop

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.

File / Description:
  • flashinfer/gemm/gemm_base.py: Moved the TuningConfig objects for fp8_gemm_sm100 and mm_fp4 to global module-level constants, and extracted the _pad_up helper function to support the global configs.
  • flashinfer/autotuner.py: Optimized the choose_one method by pre-computing runner argument names outside the profiling loop to avoid redundant inspect.signature calls.



@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Nitpick comments (3)
flashinfer/gemm/gemm_base.py (2)

2022-2063: MM FP4 tuning configs: indices and padding assumptions are consistent

The MM FP4 configs appear internally consistent:

  • Dynamic spec uses input 0 (a) dim 0 (M), matching the mm_fp4 docstring.
  • ConstraintSpecs target input 2 (a_descale) dim 0 and input 6 (out) dim 0, which matches the inputs list layout in mm_fp4.
  • _pad_up(…, 8) vs _pad_up(…, 128) for the 8x4 vs 128x4 scale‑factor layouts matches the documented layouts and keeps out’s M unpadded.

Defining these as global TuningConfigs is a good fix for the lambda‑based config churn that was polluting the LRU cache.

If you expect mm_fp4’s input ordering to evolve, consider adding small named constants for the tensor indices (e.g. A_TENSOR_IDX = 0, A_DESCALE_IDX = 2, OUT_TENSOR_IDX = 6) and using them in the configs to reduce future drift risk, but this is purely optional.

Also applies to: 2185-2187
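If that optional hardening were adopted, it could look roughly like this (the names are the reviewer's hypothetical suggestions, not existing identifiers):

# Tensor positions in mm_fp4's inputs list, named once so the spec indices
# cannot silently drift if the input ordering ever changes.
A_TENSOR_IDX = 0       # a
A_DESCALE_IDX = 2      # a_descale
OUT_TENSOR_IDX = 6     # out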


692-721: Unbounded LRU on _find_nearest_profile may still cause growth with highly variable shapes

Even after switching to global TuningConfig instances, _find_nearest_profile is cached with @lru_cache(maxsize=None) and keyed by (shapes, tuning_config). If an application feeds in highly diverse shapes (e.g., many different M/N combinations), this cache can still grow without bound over time.

That’s orthogonal to the lambda‑allocation fix in this PR but still relevant to memory usage. If you want to fully harden against long‑running workloads with varying shapes, consider giving this cache a bounded maxsize or adding an explicit invalidation/aging strategy.
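One possible shape of that hardening, as a sketch only (the maxsize value is an arbitrary placeholder and the body is elided; this is not the actual flashinfer implementation):

from functools import lru_cache

@lru_cache(maxsize=4096)  # bounded: old (shapes, tuning_config) entries get evicted
def _find_nearest_profile(shapes, tuning_config):
    ...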

flashinfer/autotuner.py (1)

461-467: Precomputing forward-arg names per runner is a safe micro-optimization

Moving the inspect.signature(r.forward) call out of the inner tactic loop and caching param names per runner in runner_arg_names_map is behaviorally equivalent to the prior logic and reduces profiling overhead, especially when there are many tactics. Using the precomputed set to gate the do_preparation call is straightforward and keeps the “only if the runner declares it” contract intact.

If you later find this overhead still noticeable when repeatedly tuning the same runners, you could cache the arg-name set on the runner class or instance itself, but that’s a nice-to-have rather than necessary for this PR.

Also applies to: 480-483
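A simplified sketch of the hoisting pattern described here (the runner/tactic plumbing is condensed and hypothetical; the real logic lives in AutoTuner.choose_one):

import inspect

def profile_all(runners, tactics, kwargs):
    # One signature inspection per runner, instead of one per (runner, tactic).
    runner_arg_names_map = {
        r: set(inspect.signature(r.forward).parameters) for r in runners
    }
    for r in runners:
        arg_names = runner_arg_names_map[r]
        for tactic in tactics:
            # Forward only the kwargs this runner's forward() accepts; the same
            # cached set also gates any do_preparation handling.
            r.forward(**{k: v for k, v in kwargs.items() if k in arg_names})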

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between ecd4ef1 and f6626da.

📒 Files selected for processing (2)
  • flashinfer/autotuner.py (2 hunks)
  • flashinfer/gemm/gemm_base.py (4 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
flashinfer/autotuner.py (1)
flashinfer/fused_moe/core.py (2)
  • forward (427-461)
  • forward (1030-1190)
🔇 Additional comments (1)
flashinfer/gemm/gemm_base.py (1)

359-373: Global FP8 SM100 tuning config wiring looks correct

The new _FP8_GEMM_SM100_TUNING_CONFIG lines up with fp8_gemm_sm100’s inputs (index 0 = a, 4 = out), and using dim -2 consistently targets the M/token dimension for both 2D and 3D cases. Centralizing this as a module‑level TuningConfig also avoids per‑call lambda/config allocation and plays nicely with the AutoTuner cache design. No changes needed here.

Also applies to: 397-403
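A quick illustration of the dim -2 point (the shapes are made-up examples):

shape_2d = (128, 256)      # (M, K)
shape_3d = (4, 128, 256)   # (B, M, K)
assert shape_2d[-2] == shape_3d[-2] == 128  # dim -2 selects M in both layouts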


yzh119 commented Nov 24, 2025

/bot run

@flashinfer-bot
Collaborator

GitLab MR !163 has been created, and the CI pipeline #39090975 is currently running. I'll report back once the pipeline job completes.


@yzh119 yzh119 left a comment


@juju812 thanks for taking the time to investigate the OOM issue; the solution looks reasonable to me (creating a fixed set of TUNING_CONFIGs).

Left one comment.

cc @aleozlx @nvmbreughe for second look.

@flashinfer-bot
Collaborator

[FAILED] Pipeline #39090975: 13/18 passed

…ache and dynamic lambda TuningConfig

Pre-compute runner arg names to avoid calling inspect.signature in the loop
@yzh119 yzh119 enabled auto-merge (squash) November 25, 2025 07:04
@yzh119 yzh119 disabled auto-merge November 25, 2025 19:17
@yzh119 yzh119 merged commit d0d99d2 into flashinfer-ai:main Nov 25, 2025
4 checks passed