
Conversation

@lsy323 lsy323 (Collaborator) commented Jun 3, 2025

tests/tpu/test_compilation.py checks the FX graph dumps from torch.compile. torch.compile used to generate dump files named like __compiled_fn_15.Captured_Graph.0.py, but with the latest PyTorch nightly this dump is no longer generated. Instead, there are dumps like __compiled_fn_13.Forward_graph.0.py, which contain the captured compiled code for the forward function of the compiled nn.Module. Let's check for __compiled_fn_*.Forward_graph.*.py in the test instead.
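For context, a minimal sketch of the updated check, assuming torch.compile's debug dumps were written to a `temp_dir` (the directory path and variable names are illustrative, not the exact test code):

```python
import glob
import os

temp_dir = "/tmp/torch_compile_debug_dump"  # hypothetical dump directory

# Newer PyTorch nightlies no longer emit __compiled_fn_*.Captured_Graph.*.py;
# the forward graph of the compiled nn.Module is dumped as
# __compiled_fn_*.Forward_graph.*.py instead.
compiled_fns = sorted(
    glob.glob(os.path.join(temp_dir, "__compiled_fn*Forward_graph*.py")))
assert compiled_fns, "expected at least one Forward_graph dump"
```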

Signed-off-by: Siyuan Liu <lsiyuan@google.com>

github-actions bot commented Jun 3, 2025

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, they only run the fastcheck CI, which runs a small, essential subset of tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build on the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either: Add ready label to the PR or enable auto-merge.

🚀

@gemini-code-assist gemini-code-assist bot (Contributor) left a comment

Hello @lsy323, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

Summary of Changes

Hello! Gemini here, providing a summary of this pull request to help everyone get up to speed quickly.

This pull request, authored by @lsy323, addresses a breaking change in the tests/tpu/test_compilation.py test file. The test relies on finding specific dump files generated by torch.compile to verify the compiled FX graphs. However, recent PyTorch nightly builds have changed the naming convention for these dump files.

The PR updates the test to look for the new file name pattern (__compiled_fn*Forward_graph*.py) instead of the old one (__compiled_fn*Captured*.py), ensuring the test can correctly locate and analyze the compiled graph dumps generated by the latest PyTorch versions. Additionally, a breakpoint() was added, likely for debugging purposes during development.

Highlights

  • Test Fix: Updates the tests/tpu/test_compilation.py test to adapt to a change in torch.compile's dump file naming convention in recent PyTorch nightlies.
  • File Pattern Update: Changes the glob pattern used to find compiled graph dump files from __compiled_fn*Captured*.py to __compiled_fn*Forward_graph*.py.
  • Debugging Aid: Adds a breakpoint() within the test_compilation function for easier debugging.

Changelog

  • tests/tpu/test_compilation.py
    • Updated the glob pattern on line 69 to search for __compiled_fn*Forward_graph*.py files.
    • Added a breakpoint() on line 75.
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page; here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point in your pull request by creating an issue comment (i.e., a comment on the pull request page) using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands.

| Feature | Command | Description |
| --- | --- | --- |
| Code Review | `/gemini review` | Performs a code review for the current pull request in its current state. |
| Pull Request Summary | `/gemini summary` | Provides a summary of the current pull request in its current state. |
| Comment | `@gemini-code-assist` | Responds in comments when explicitly tagged, both in issue comments and review comments. |
| Help | `/gemini help` | Displays a list of available commands. |

Customization

To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist is currently in preview and may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments to provide feedback.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.


Test fails, name changed,
Glob pattern now updated,
Green checkmark returns.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

@mergify mergify bot added the tpu Related to Google TPUs label Jun 3, 2025
@gemini-code-assist gemini-code-assist bot (Contributor) left a comment

Code Review

The pull request updates the test to check for the correct FX graph dump file name. The change seems straightforward and necessary due to updates in PyTorch nightly. However, I have a few suggestions to improve the code's clarity and robustness.

Summary of Findings

  • Unintentional breakpoint: The breakpoint() call should be removed before merging to prevent accidental pausing of test execution.
  • Lack of context for filename change: Adding a comment explaining the reason for the filename change would improve code maintainability.

Merge Readiness

The pull request addresses an important issue with the test, but the leftover breakpoint() call is a significant concern. I recommend removing the breakpoint and adding a clarifying comment before merging. I am unable to approve this pull request, and recommend that others review and approve this code before merging. At a minimum, the critical and high severity comments should be addressed before merging.

Comment on lines 68 to 69:

```diff
 compiled_fns = sorted(glob.glob(
-    os.path.join(temp_dir, "__compiled_fn*Captured*.py")),
+    os.path.join(temp_dir, "__compiled_fn*Forward_graph*.py")),
```
@gemini-code-assist gemini-code-assist bot (Contributor) commented:

Severity: medium

Consider adding a comment explaining why __compiled_fn*Forward_graph*.py is now being checked instead of __compiled_fn*Captured*.py. This will help future developers understand the change and its context. It would be useful to add a brief explanation of what the Forward_graph represents.
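A sketch of the explanatory comment Gemini is asking for (the wording is illustrative, not the exact text added in the PR; `temp_dir` and the `glob`/`os` imports come from the existing test):

```python
# PyTorch nightly renamed torch.compile's debug dumps: *Captured_Graph*
# files are no longer generated, and the captured forward code of the
# compiled nn.Module is now dumped as __compiled_fn_*.Forward_graph.*.py.
compiled_fns = sorted(
    glob.glob(os.path.join(temp_dir, "__compiled_fn*Forward_graph*.py")))
```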

@lsy323 lsy323 (Collaborator, Author) replied:

Thanks Gemini, added a comment.

lsy323 added 3 commits June 3, 2025 22:09
Signed-off-by: Siyuan Liu <lsiyuan@google.com>
Signed-off-by: Siyuan Liu <lsiyuan@google.com>
Signed-off-by: Siyuan Liu <lsiyuan@google.com>
@lsy323 lsy323 added the ready ONLY add when PR is ready to merge/full CI is needed label Jun 3, 2025
@yaochengji yaochengji (Collaborator) left a comment:

LGTM, thanks for fixing this!

@mgoin mgoin (Member) left a comment:

Thanks!

@lsy323 lsy323 (Collaborator, Author) commented Jun 4, 2025

I hit the following error for the test tests/v1/tpu/test_spmd_model_weight_loading.py::test_tpu_model_loader[meta-llama/Llama-3.1-8B-Instruct]:

```
Error: Failed to close writer for Some("/root/.cache/huggingface/xet/https___cas_serv-tGqkUaZf_CBPHQ6h/chunk-cache/j3/j3UYm-qsYwQwAkBZmyeVyr8uxV-jqjzqNFOBB7WaSIxkZWZhdWx0/GQAAACcEAABGUucDAAAAADjAAgM="): No space left on device (os error 28)
```

Let me skip this test for now, until we have a larger disk for the TPU CI runner.
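For reference, a minimal sketch of skipping a test with pytest (the decorator placement and reason wording are illustrative, not the exact change made here):

```python
import pytest

@pytest.mark.skip(reason="TPU CI runner runs out of disk downloading the "
                  "model; re-enable once the runner has a larger disk")
def test_tpu_model_loader():
    ...
```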

lsy323 added 2 commits June 4, 2025 04:09
Signed-off-by: Siyuan Liu <lsiyuan@google.com>
Signed-off-by: Siyuan Liu <lsiyuan@google.com>
@lsy323
Copy link
Collaborator Author

lsy323 commented Jun 4, 2025

Let me merge #19115 with this PR. These two PRs should fix all the issues now.

@lsy323 lsy323 mentioned this pull request Jun 4, 2025
Signed-off-by: Siyuan Liu <lsiyuan@google.com>

mergify bot commented Jun 4, 2025

This pull request has merge conflicts that must be resolved before it can be merged. Please rebase the PR, @lsy323.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Jun 4, 2025
lsy323 added 2 commits June 4, 2025 16:54
Signed-off-by: Siyuan Liu <lsiyuan@google.com>
Signed-off-by: Siyuan Liu <lsiyuan@google.com>
@mergify mergify bot removed the needs-rebase label Jun 4, 2025
@lsy323 lsy323 (Collaborator, Author) commented Jun 4, 2025

Hi @sarckk, thank you for supporting cross-layer KV sharing for the TPU backend. I found that the tests added in the original PR are failing on TPU CI (CI log). TPU CI has been soft-failing, so it shows a green check mark even when it fails (I know, it's confusing!). I will disable the tests in this PR to bring TPU CI back to green. Thanks!

@sarckk sarckk (Collaborator) commented Jun 4, 2025

@lsy323 thanks for the heads up! sorry for breaking it, I'll put up a fix soon.

@lsy323 lsy323 (Collaborator, Author) commented Jun 4, 2025

> @lsy323 thanks for the heads up! sorry for breaking it, I'll put up a fix soon.

Hi @sarckk, thank you for taking a look! If you don't have a TPU VM on hand, let me know; I'd be happy to test it out for you.

The error I got locally is as follows. It seems there are two issues:

  1. vllm_config is not set correctly in the test.
  2. The max seq len set in the test is too long for the TPU; the TPU CI runner has only 16 GB of HBM.
```
FAILED tests/v1/tpu/worker/test_tpu_model_runner.py::test_init_kv_cache_with_kv_sharing_invalid_target_layer_order - AttributeError: 'NoneType' object has no attribute 'dtype'
FAILED tests/v1/tpu/worker/test_tpu_model_runner.py::test_init_kv_cache_with_kv_sharing_target_layer_not_exist - AttributeError: 'NoneType' object has no attribute 'dtype'
FAILED tests/v1/tpu/worker/test_tpu_model_runner.py::test_init_kv_cache_with_kv_sharing_target_same_as_current - AttributeError: 'NoneType' object has no attribute 'dtype'
FAILED tests/v1/tpu/worker/test_tpu_model_runner.py::test_init_kv_cache_without_kv_sharing - ValueError: To serve at least one request with the models's max seq len (3000000), (45.78 GiB KV cache is needed, which is larger than the available KV ...
FAILED tests/v1/tpu/worker/test_tpu_model_runner.py::test_init_kv_cache_with_kv_sharing_valid - ValueError: To serve at least one request with the models's max seq len (3000000), (22.89 GiB KV cache is needed, which is larger than the available KV ...
```
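For intuition on the second issue: the per-request KV cache grows linearly with max seq len. A back-of-the-envelope sketch (the layer/head/dim numbers below are placeholders back-solved to reproduce the ~45.78 GiB figure from the log, not the test model's actual config):

```python
def kv_cache_bytes(seq_len: int, num_layers: int, num_kv_heads: int,
                   head_dim: int, dtype_bytes: int = 2) -> int:
    # 2x for keys and values, per layer, per KV head, per token
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes * seq_len

# Placeholder config: 16 layers, 4 KV heads, head_dim 64, bf16 (2 bytes)
size = kv_cache_bytes(3_000_000, num_layers=16, num_kv_heads=4, head_dim=64)
print(f"{size / 2**30:.2f} GiB")  # -> 45.78 GiB, far beyond 16 GB of HBM
```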

@sarckk sarckk (Collaborator) commented Jun 4, 2025

@lsy323 yup I realized I don't have access to a TPU to test, but here's my attempt at a fix: #19155. Could you try it out locally and let me know if it fixes the tests? thanks!

@mgoin mgoin merged commit 7ee2590 into vllm-project:main Jun 4, 2025
46 checks passed
@lsy323 lsy323 (Collaborator, Author) commented Jun 4, 2025

> @lsy323 yup I realized I don't have access to a TPU to test, but here's my attempt at a fix: #19155. Could you try it out locally and let me know if it fixes the tests? thanks!

Thanks @sarckk let me give it a try and follow up in the new PR!

@lsy323 lsy323 deleted the lsiyuan/fix-compilation-test branch June 4, 2025 20:13
leoli1208 pushed a commit to leoli1208/vllm that referenced this pull request Jul 22, 2025

Labels

  • ci/build
  • ready (ONLY add when PR is ready to merge/full CI is needed)
  • tpu (Related to Google TPUs)
  • v1
