
Conversation

@noooop (Collaborator) commented Sep 28, 2025

Improve all pooling task

These PRs mostly conflict with each other, so grouping them into a series gives reviewers a clearer picture of what has changed and what still needs to be done afterwards.

Purpose

Fix some minor issues found during the implementation of this series to keep subsequent PRs cleaner.

  • add embedding_size
  • add VLLM_CI_ENFORCE_EAGER & get_vllm_extra_kwargs (see the sketch after this list)
  • ner.py -> ner_client.py
  • fix gpt2 seq cls & add test_head_dtype.py
  • clean GritLM
  • fix skip_global_cleanup
  • fix test_splade_pooler
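
For the VLLM_CI_ENFORCE_EAGER / get_vllm_extra_kwargs item above, here is a rough sketch of how such a helper could behave. The name get_vllm_extra_kwargs matches the bullet, but the signature and exact semantics below are assumptions for illustration, not the code in this PR:

import os

def get_vllm_extra_kwargs(extra_kwargs: dict | None = None) -> dict:
    # Hypothetical sketch: when VLLM_CI_ENFORCE_EAGER is set in the CI
    # environment, force enforce_eager=True so test models skip
    # torch.compile and start faster.
    extra_kwargs = dict(extra_kwargs or {})
    if os.environ.get("VLLM_CI_ENFORCE_EAGER", "0") == "1":
        extra_kwargs["enforce_eager"] = True
    return extra_kwargs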

Test Plan

Keep CI green.

Test Result


Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing the test command.
  • The test results, such as pasting the results comparison before and after, or e2e results.
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

mergify bot commented Sep 28, 2025

This pull request has merge conflicts that must be resolved before it can be merged. Please rebase the PR, @noooop.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Sep 28, 2025
@noooop noooop closed this Sep 28, 2025
@noooop noooop reopened this Sep 28, 2025
@mergify mergify bot removed the needs-rebase label Sep 28, 2025
@noooop noooop force-pushed the embed_e2e branch 2 times, most recently from f13cd39 to d9cd590 Compare September 28, 2025 05:20
@noooop noooop changed the title [Performance] Embedding Models E2E Performance Optimization [Model][0/N] Improve all pooling task | clean up Oct 11, 2025
mergify bot commented Oct 11, 2025

This pull request has merge conflicts that must be resolved before it can be merged. Please rebase the PR, @noooop.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

Signed-off-by: wang.yuqi <noooop@126.com>
Signed-off-by: wang.yuqi <noooop@126.com>
@noooop noooop reopened this Oct 11, 2025
@mergify mergify bot removed the needs-rebase label Oct 11, 2025
Signed-off-by: wang.yuqi <noooop@126.com>
mergify bot commented Oct 11, 2025

Documentation preview: https://vllm--25817.org.readthedocs.build/en/25817/

@mergify mergify bot added the documentation Improvements or additions to documentation label Oct 11, 2025
Signed-off-by: wang.yuqi <noooop@126.com>
Signed-off-by: wang.yuqi <noooop@126.com>
Signed-off-by: wang.yuqi <noooop@126.com>
@noooop noooop marked this pull request as ready for review October 11, 2025 08:58
@noooop (Collaborator, Author) commented Oct 11, 2025

cc @DarkLight1337

Ready for review

@noooop noooop added the ready ONLY add when PR is ready to merge/full CI is needed label Oct 13, 2025
Signed-off-by: wang.yuqi <noooop@126.com>
@DarkLight1337 (Member) commented:

The pooling tests are still failing

Signed-off-by: wang.yuqi <noooop@126.com>
Signed-off-by: wang.yuqi <noooop@126.com>
@noooop (Collaborator, Author) commented Oct 13, 2025

The pooling tests are still failing

(EngineCore_DP0 pid=6682) INFO 10-13 15:07:35 [cuda.py:421] Using FlexAttention backend for head_size=8 on V1 engine.

......

(EngineCore_DP0 pid=6682)   File "/share/PycharmProjects/noooop_vllm5/vllm/v1/attention/backends/flex_attention.py", line 843, in forward
(EngineCore_DP0 pid=6682)     out = flex_attention_compiled(
(EngineCore_DP0 pid=6682)           ^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=6682)   File "/share/anaconda3/envs/noooop_vllm5/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py", line 749, in compile_wrapper
(EngineCore_DP0 pid=6682)     raise e.remove_dynamo_frames() from None  # see TORCHDYNAMO_VERBOSE=1
(EngineCore_DP0 pid=6682)     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=6682)   File "/share/anaconda3/envs/noooop_vllm5/lib/python3.12/site-packages/torch/_inductor/compile_fx.py", line 923, in _compile_fx_inner
(EngineCore_DP0 pid=6682)     raise InductorError(e, currentframe()).with_traceback(
(EngineCore_DP0 pid=6682)   File "/share/anaconda3/envs/noooop_vllm5/lib/python3.12/site-packages/torch/_inductor/compile_fx.py", line 907, in _compile_fx_inner
(EngineCore_DP0 pid=6682)     mb_compiled_graph = fx_codegen_and_compile(
(EngineCore_DP0 pid=6682)                         ^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=6682)   File "/share/anaconda3/envs/noooop_vllm5/lib/python3.12/site-packages/torch/_inductor/compile_fx.py", line 1578, in fx_codegen_and_compile
(EngineCore_DP0 pid=6682)     return scheme.codegen_and_compile(gm, example_inputs, inputs_to_check, graph_kwargs)
(EngineCore_DP0 pid=6682)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=6682)   File "/share/anaconda3/envs/noooop_vllm5/lib/python3.12/site-packages/torch/_inductor/compile_fx.py", line 1377, in codegen_and_compile
(EngineCore_DP0 pid=6682)     graph.run(*example_inputs)
(EngineCore_DP0 pid=6682)   File "/share/anaconda3/envs/noooop_vllm5/lib/python3.12/site-packages/torch/_inductor/graph.py", line 921, in run
(EngineCore_DP0 pid=6682)     return super().run(*args)
(EngineCore_DP0 pid=6682)            ^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=6682)   File "/share/anaconda3/envs/noooop_vllm5/lib/python3.12/site-packages/torch/fx/interpreter.py", line 173, in run
(EngineCore_DP0 pid=6682)     self.env[node] = self.run_node(node)
(EngineCore_DP0 pid=6682)                      ^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=6682)   File "/share/anaconda3/envs/noooop_vllm5/lib/python3.12/site-packages/torch/_inductor/graph.py", line 1599, in run_node
(EngineCore_DP0 pid=6682)     result = super().run_node(n)
(EngineCore_DP0 pid=6682)              ^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=6682)   File "/share/anaconda3/envs/noooop_vllm5/lib/python3.12/site-packages/torch/fx/interpreter.py", line 242, in run_node
(EngineCore_DP0 pid=6682)     return getattr(self, n.op)(n.target, args, kwargs)
(EngineCore_DP0 pid=6682)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=6682)   File "/share/anaconda3/envs/noooop_vllm5/lib/python3.12/site-packages/torch/_inductor/graph.py", line 1268, in call_function
(EngineCore_DP0 pid=6682)     raise LoweringException(e, target, args, kwargs).with_traceback(
(EngineCore_DP0 pid=6682)   File "/share/anaconda3/envs/noooop_vllm5/lib/python3.12/site-packages/torch/_inductor/graph.py", line 1258, in call_function
(EngineCore_DP0 pid=6682)     out = lowerings[target](*args, **kwargs)  # type: ignore[index]
(EngineCore_DP0 pid=6682)           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=6682)   File "/share/anaconda3/envs/noooop_vllm5/lib/python3.12/site-packages/torch/_inductor/lowering.py", line 446, in wrapped
(EngineCore_DP0 pid=6682)     out = decomp_fn(*args, **kwargs)
(EngineCore_DP0 pid=6682)           ^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=6682)   File "/share/anaconda3/envs/noooop_vllm5/lib/python3.12/site-packages/torch/_inductor/kernel/flex_attention.py", line 1311, in flex_attention
(EngineCore_DP0 pid=6682)     raise NotImplementedError(
(EngineCore_DP0 pid=6682) torch._inductor.exc.InductorError: LoweringException: NotImplementedError: NYI: embedding dimension of the query, key, and value must be at least 16 but got E=8 and Ev=8
(EngineCore_DP0 pid=6682)   target: flex_attention
(EngineCore_DP0 pid=6682)   args[0]: TensorBox(StorageBox(
(EngineCore_DP0 pid=6682)     InputBuffer(name='arg0_1', layout=FixedLayout('cuda:0', torch.float16, size=[1, 4, 12, 8], stride=[1152, 8, 96, 1]))
(EngineCore_DP0 pid=6682)   ))
(EngineCore_DP0 pid=6682)   args[1]: TensorBox(StorageBox(
(EngineCore_DP0 pid=6682)     InputBuffer(name='arg1_1', layout=FixedLayout('cuda:0', torch.float16, size=[1, 4, 12, 8], stride=[1152, 8, 96, 1]))
(EngineCore_DP0 pid=6682)   ))
(EngineCore_DP0 pid=6682)   args[2]: TensorBox(StorageBox(
(EngineCore_DP0 pid=6682)     InputBuffer(name='arg2_1', layout=FixedLayout('cuda:0', torch.float16, size=[1, 4, 12, 8], stride=[1152, 8, 96, 1]))
(EngineCore_DP0 pid=6682)   ))
(EngineCore_DP0 pid=6682)   args[3]: Subgraph(name='sdpa_score0', graph_module=<lambda>(), graph=None)
(EngineCore_DP0 pid=6682)   args[4]: (12, 12, TensorBox(StorageBox(
(EngineCore_DP0 pid=6682)     InputBuffer(name='arg4_1', layout=FixedLayout('cuda:0', torch.int32, size=[1, 1, 1], stride=[1, 1, 1]))
(EngineCore_DP0 pid=6682)   )), TensorBox(StorageBox(
(EngineCore_DP0 pid=6682)     InputBuffer(name='arg3_1', layout=FixedLayout('cuda:0', torch.int32, size=[1, 1, 1, 1], stride=[1, 1, 1, 1]))
(EngineCore_DP0 pid=6682)   )), TensorBox(StorageBox(
(EngineCore_DP0 pid=6682)     InputBuffer(name='arg6_1', layout=FixedLayout('cuda:0', torch.int32, size=[1, 1, 1], stride=[1, 1, 1]))
(EngineCore_DP0 pid=6682)   )), TensorBox(StorageBox(
(EngineCore_DP0 pid=6682)     InputBuffer(name='arg7_1', layout=FixedLayout('cuda:0', torch.int32, size=[1, 1, 1, 1], stride=[1, 1, 1, 1]))
(EngineCore_DP0 pid=6682)   )), TensorBox(StorageBox(
(EngineCore_DP0 pid=6682)     InputBuffer(name='arg8_1', layout=FixedLayout('cuda:0', torch.int32, size=[1, 1, 1], stride=[1, 1, 1]))
(EngineCore_DP0 pid=6682)   )), TensorBox(StorageBox(
(EngineCore_DP0 pid=6682)     InputBuffer(name='arg9_1', layout=FixedLayout('cuda:0', torch.int32, size=[1, 1, 1, 1], stride=[1, 1, 1, 1]))
(EngineCore_DP0 pid=6682)   )), TensorBox(StorageBox(
(EngineCore_DP0 pid=6682)     InputBuffer(name='arg10_1', layout=FixedLayout('cuda:0', torch.int32, size=[1, 1, 1], stride=[1, 1, 1]))
(EngineCore_DP0 pid=6682)   )), TensorBox(StorageBox(
(EngineCore_DP0 pid=6682)     InputBuffer(name='arg11_1', layout=FixedLayout('cuda:0', torch.int32, size=[1, 1, 1, 1], stride=[1, 1, 1, 1]))
(EngineCore_DP0 pid=6682)   )), 128, 128, Subgraph(name='sdpa_mask0', graph_module=<lambda>(), graph=None))
(EngineCore_DP0 pid=6682)   args[5]: 0.3535533905932738
(EngineCore_DP0 pid=6682)   args[6]: {'FORCE_USE_FLEX_ATTENTION': True, 'BLOCK_M': 32, 'BLOCK_N': 32, 'PRESCALE_QK': False, 'ROWS_GUARANTEED_SAFE': False, 'BLOCKS_ARE_CONTIGUOUS': False, 'WRITE_DQ': True, 'OUTPUT_LOGSUMEXP': False}
(EngineCore_DP0 pid=6682)   args[7]: ()
(EngineCore_DP0 pid=6682)   args[8]: (TensorBox(StorageBox(
(EngineCore_DP0 pid=6682)     InputBuffer(name='arg5_1', layout=FixedLayout('cuda:0', torch.int32, size=[12], stride=[1]))
(EngineCore_DP0 pid=6682)   )),)
(EngineCore_DP0 pid=6682)
(EngineCore_DP0 pid=6682) Set TORCHDYNAMO_VERBOSE=1 for the internal stack trace (please do this especially if you're reporting a bug to PyTorch). For even more developer context, set TORCH_LOGS="+dynamo"
(EngineCore_DP0 pid=6682)
sys:1: DeprecationWarning: builtin type swigvarlink has no __module__ attribute

After some investigation, I will delete test_bert_splade_sparse_embed_smoke: vLLM most likely cannot support hf-internal-testing/tiny-random-bert, because its head_size=8 is below the minimum of 16 required by the FlexAttention backend.
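
As a minimal sketch of the alternative, a pooling test could guard against such tiny head sizes instead of relying on this model; maybe_skip_tiny_head and the constant below are illustrative names only, not code from this PR:

import pytest

# FlexAttention's inductor lowering rejects head sizes below 16
# ("NYI: embedding dimension ... must be at least 16 but got E=8"),
# so skip models such as hf-internal-testing/tiny-random-bert (head_size=8).
MIN_FLEX_ATTENTION_HEAD_SIZE = 16

def maybe_skip_tiny_head(head_size: int) -> None:
    if head_size < MIN_FLEX_ATTENTION_HEAD_SIZE:
        pytest.skip(f"head_size={head_size} is below FlexAttention's "
                    f"minimum of {MIN_FLEX_ATTENTION_HEAD_SIZE}")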

@DarkLight1337 (Member) commented:

cc @gjgjos

@DarkLight1337 DarkLight1337 enabled auto-merge (squash) October 13, 2025 08:24
@DarkLight1337 DarkLight1337 merged commit 767c3ab into vllm-project:main Oct 13, 2025
58 checks passed
@noooop noooop deleted the embed_e2e branch October 13, 2025 08:45
bbeckca pushed a commit to bbeckca/vllm that referenced this pull request Oct 13, 2025
1994 pushed a commit to 1994/vllm that referenced this pull request Oct 14, 2025
Signed-off-by: wang.yuqi <noooop@126.com>
Signed-off-by: 1994 <1994@users.noreply.github.com>
Dhruvilbhatt pushed a commit to Dhruvilbhatt/vllm that referenced this pull request Oct 14, 2025
Signed-off-by: wang.yuqi <noooop@126.com>
Signed-off-by: Dhruvil Bhatt <bhattdbh@amazon.com>
bbartels pushed a commit to bbartels/vllm that referenced this pull request Oct 16, 2025
Signed-off-by: wang.yuqi <noooop@126.com>
Signed-off-by: bbartels <benjamin@bartels.dev>
lywa1998 pushed a commit to lywa1998/vllm that referenced this pull request Oct 20, 2025
alhridoy pushed a commit to alhridoy/vllm that referenced this pull request Oct 24, 2025
xuebwang-amd pushed a commit to xuebwang-amd/vllm that referenced this pull request Oct 24, 2025
Signed-off-by: wang.yuqi <noooop@126.com>
Signed-off-by: xuebwang-amd <xuebwang@amd.com>
xuebwang-amd pushed a commit to xuebwang-amd/vllm that referenced this pull request Oct 24, 2025
Signed-off-by: wang.yuqi <noooop@126.com>
Signed-off-by: xuebwang-amd <xuebwang@amd.com>
0xrushi pushed a commit to 0xrushi/vllm that referenced this pull request Oct 26, 2025
Signed-off-by: wang.yuqi <noooop@126.com>
Signed-off-by: 0xrushi <6279035+0xrushi@users.noreply.github.com>
0xrushi pushed a commit to 0xrushi/vllm that referenced this pull request Oct 26, 2025
Signed-off-by: wang.yuqi <noooop@126.com>
Signed-off-by: 0xrushi <6279035+0xrushi@users.noreply.github.com>

Labels

  • documentation: Improvements or additions to documentation
  • ready: ONLY add when PR is ready to merge/full CI is needed
