Use global TuningConfig to fix memory leak caused by AutoTuner LRU cache and dynamic lambda TuningConfig #2140
Conversation
Walkthrough

Precompute a mapping from each runner to its forward-parameter-name set, and reuse that cached set during profiling instead of re-inspecting each runner's signature on every iteration.
Sequence Diagram(s)

sequenceDiagram
participant Autotuner
participant Profiler
participant Runner as Runner(s)
rect rgb(221,235,247)
note right of Autotuner: setup
Autotuner->>Autotuner: build runner_arg_names_map (runner -> param-name set)
end
rect rgb(247,246,221)
note right of Profiler: profiling loop
loop each profiling iteration
Profiler->>Autotuner: request param-name set for runner
Autotuner-->>Profiler: return cached param-name set
Profiler->>Runner: invoke Runner.forward(**kwargs) using cached set (check do_preparation)
Runner-->>Profiler: return result
end
end
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes
Pre-merge checks and finishing touches: ❌ Failed checks (1 warning) · ✅ Passed checks (2 passed)
Summary of Changes

Hello @juju812, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request resolves a memory leak bug in the AutoTuner's LRU cache.
Code Review
This pull request effectively resolves a memory leak associated with the AutoTuner's LRU cache. The root cause, which was the dynamic creation of TuningConfig instances with lambda functions on each call, is correctly addressed by refactoring these configurations into module-level constants. This ensures stable cache keys and prevents unbounded cache growth. Additionally, the change in autotuner.py to pre-compute runner argument names is a welcome performance optimization that avoids repeated calls to inspect.signature. The changes are well-implemented, targeted, and significantly improve the memory efficiency and performance of the autotuning mechanism. The code quality is high, and I have no further suggestions.
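To make the cache-keying failure concrete, here is a minimal, self-contained sketch (the `find_profile` and `make_config` names are hypothetical stand-ins, not FlashInfer's actual code) showing how per-call lambdas defeat `functools.lru_cache`: every call produces a distinct key, so the cache only ever grows.

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def find_profile(shapes, tuning_config):
    # Keyed on (shapes, tuning_config); lambdas hash and compare by identity,
    # so a fresh lambda means a fresh cache entry.
    return len(shapes)

def make_config():
    # Stand-in for building a config inside the hot path:
    # the lambda is a brand-new object on every call.
    return (lambda m: (m + 127) // 128 * 128,)

for _ in range(3):
    find_profile((1024, 512), make_config())

# Three distinct entries for identical shapes -> unbounded growth over time.
print(find_profile.cache_info().currsize)  # 3
```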
Pull request overview
This PR fixes a memory leak caused by the AutoTuner's LRU cache when used with dynamically created TuningConfig objects containing lambda functions. The fix moves TuningConfig objects from being created inside functions to module-level global constants, ensuring lambda functions have consistent object identities for proper cache key generation.
- Extracted TuningConfig objects to module-level constants to prevent dynamic lambda creation (see the sketch after this list)
- Added `_pad_up` helper function at module level for use in the global configs
- Optimized the autotuner by pre-computing runner argument names outside the profiling loop
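A simplified sketch of the before/after refactor pattern, assuming conventional round-up semantics for `_pad_up` and using a plain tuple as a stand-in for the real `TuningConfig`:

```python
def _pad_up(x: int, multiple: int) -> int:
    # Assumed semantics: round x up to the nearest multiple.
    return ((x + multiple - 1) // multiple) * multiple

# Before (leaky): built inside the function, so the lambda -- and the config
# wrapping it -- is a new object on every call, producing a new cache key.
def mm_fp4_before(m: int):
    config = ("mm_fp4", lambda v: _pad_up(v, 128))  # fresh objects per call
    return config

# After (fixed): one module-level constant shared by all calls, giving the
# AutoTuner cache a stable key.
_MM_FP4_TUNING_CONFIG = ("mm_fp4", lambda v: _pad_up(v, 128))

def mm_fp4_after(m: int):
    return _MM_FP4_TUNING_CONFIG  # same object every time
```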
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| flashinfer/gemm/gemm_base.py | Moved TuningConfig objects for fp8_gemm_sm100 and mm_fp4 functions to global module-level constants, and extracted _pad_up helper function to support the global configs |
| flashinfer/autotuner.py | Optimized the choose_one method by pre-computing runner argument names outside the profiling loop to avoid redundant inspect.signature calls |
Actionable comments posted: 0
🧹 Nitpick comments (3)
flashinfer/gemm/gemm_base.py (2)

2022-2063: MM FP4 tuning configs: indices and padding assumptions are consistent

The MM FP4 configs appear internally consistent:

- The dynamic spec uses input 0 (`a`) dim 0 (M), matching the `mm_fp4` docstring.
- ConstraintSpecs target input 2 (`a_descale`) dim 0 and input 6 (`out`) dim 0, which matches the `inputs` list layout in `mm_fp4`.
- `_pad_up(…, 8)` vs `_pad_up(…, 128)` for the 8x4 vs 128x4 scale-factor layouts matches the documented layouts and keeps `out`'s M unpadded.

Defining these as global `TuningConfig`s is a good fix for the lambda-based config churn that was polluting the LRU cache.

If you expect `mm_fp4`'s input ordering to evolve, consider adding small named constants for the tensor indices (e.g. `A_TENSOR_IDX = 0`, `A_DESCALE_IDX = 2`, `OUT_TENSOR_IDX = 6`) and using them in the configs to reduce future drift risk, but this is purely optional.

Also applies to: 2185-2187
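If that optional suggestion were adopted, it might look roughly like this (constant names taken from the comment above; the values are the `mm_fp4` input indices it cites):

```python
# Named indices into mm_fp4's inputs list, so the tuning configs don't
# silently drift if the argument ordering ever changes.
A_TENSOR_IDX = 0       # `a`: dynamic M dimension
A_DESCALE_IDX = 2      # `a_descale`: padded M dimension
OUT_TENSOR_IDX = 6     # `out`: M left unpadded
```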
692-721: Unbounded LRU on `_find_nearest_profile` may still cause growth with highly variable shapes

Even after switching to global `TuningConfig` instances, `_find_nearest_profile` is cached with `@lru_cache(maxsize=None)` and keyed by `(shapes, tuning_config)`. If an application feeds in highly diverse shapes (e.g., many different M/N combinations), this cache can still grow without bound over time.

That's orthogonal to the lambda-allocation fix in this PR but still relevant to memory usage. If you want to fully harden against long-running workloads with varying shapes, consider giving this cache a bounded `maxsize` or adding an explicit invalidation/aging strategy.
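A bounded variant of that cache could be as small as the following sketch (the `maxsize` value is purely illustrative, not a tuned recommendation):

```python
from functools import lru_cache

@lru_cache(maxsize=4096)  # evict least-recently-used profiles instead of growing forever
def _find_nearest_profile(shapes, tuning_config):
    ...  # existing lookup logic, unchanged
```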
flashinfer/autotuner.py (1)
461-467: Precomputing forward-arg names per runner is a safe micro-optimization

Moving the `inspect.signature(r.forward)` call out of the inner tactic loop and caching param names per runner in `runner_arg_names_map` is behaviorally equivalent to the prior logic and reduces profiling overhead, especially when there are many tactics. Using the precomputed set to gate the `do_preparation` call is straightforward and keeps the "only if the runner declares it" contract intact.

If you later find this overhead still noticeable when repeatedly tuning the same runners, you could cache the arg-name set on the runner class or instance itself, but that's a nice-to-have rather than necessary for this PR.
Also applies to: 480-483
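A self-contained sketch of the precompute-then-reuse pattern, using a hypothetical `DummyRunner` in place of FlashInfer's real runner objects:

```python
import inspect

class DummyRunner:
    """Stand-in for a tuner runner; only the forward() signature matters here."""
    def forward(self, x, tactic=-1, do_preparation=False):
        return x

runners = [DummyRunner(), DummyRunner()]

# Computed once, before profiling: runner -> set of forward() parameter names.
runner_arg_names_map = {r: set(inspect.signature(r.forward).parameters) for r in runners}

for tactic in range(3):                     # stand-in profiling loop
    for r in runners:
        kwargs = {"x": tactic, "tactic": tactic}
        # Gate do_preparation on the cached set instead of re-inspecting.
        if "do_preparation" in runner_arg_names_map[r]:
            kwargs["do_preparation"] = (tactic == 0)
        r.forward(**kwargs)
```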
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
- flashinfer/autotuner.py (2 hunks)
- flashinfer/gemm/gemm_base.py (4 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
flashinfer/autotuner.py (1)
flashinfer/fused_moe/core.py (2)
`forward` (427-461), `forward` (1030-1190)
🔇 Additional comments (1)
flashinfer/gemm/gemm_base.py (1)
359-373: Global FP8 SM100 tuning config wiring looks correct

The new `_FP8_GEMM_SM100_TUNING_CONFIG` lines up with `fp8_gemm_sm100`'s inputs (index 0 = `a`, 4 = `out`), and using dim `-2` consistently targets the M/token dimension for both 2D and 3D cases. Centralizing this as a module-level `TuningConfig` also avoids per-call lambda/config allocation and plays nicely with the `AutoTuner` cache design. No changes needed here.

Also applies to: 397-403
/bot run
yzh119 left a comment
@juju812 thanks for your time investigating the OOM issue; the solution (creating a fixed set of TUNING_CONFIGs) looks reasonable to me.
Left one comment.
cc @aleozlx @nvmbreughe for second look.
[FAILED] Pipeline #39090975: 13/18 passed
…ache and dynamic lambda TuningConfig
Pre-compute runner arg names to avoid calling inspect.signature in the loop
📌 Description
This PR fixes a memory leak caused by the AutoTuner LRU cache combined with dynamically created lambda-based TuningConfig objects.
🔍 Related Issues
#2139
🚀 Pull Request Checklist
Thank you for contributing to FlashInfer! Before we review your pull request, please make sure the following items are complete.
✅ Pre-commit Checks
- Installed `pre-commit` by running `pip install pre-commit` (or used your preferred method).
- Installed the hooks with `pre-commit install`.
- Ran `pre-commit run --all-files` and fixed any reported issues.

🧪 Tests
- Tests are passing (`unittest`, etc.).

Reviewer Notes