
Conversation

@robertgshaw2-redhat (Collaborator) commented Mar 11, 2025

SUMMARY:

  • Enable v1/entrypoints tests

Signed-off-by: rshaw@neuralmagic.com <rshaw@neuralmagic.com>
@github-actions (bot)

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, covering a small, essential subset of tests to catch errors quickly. You can run additional CI tests on top of these by going to your fastcheck build in the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

🚀

rshaw@neuralmagic.com added 2 commits March 11, 2025 14:26
@russellb (Member) left a comment

lgtm thanks!

I wish I knew why spawn was required, but if this makes it work for now, fine with me

@russellb enabled auto-merge (squash) March 11, 2025 15:02
@github-actions bot added the ready label (ONLY add when PR is ready to merge/full CI is needed) Mar 11, 2025
@mergify (bot) commented Mar 11, 2025

This pull request has merge conflicts that must be resolved before it can be merged. Please rebase the PR, @robertgshaw2-redhat.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify bot added the needs-rebase label Mar 11, 2025
@robertgshaw2-redhat (Collaborator, Author)

> lgtm thanks!
>
> I wish I knew why spawn was required, but if this makes it work for now, fine with me

I think it was because I ran on TP>1 locally. Will remove

@mergify bot removed the needs-rebase label Mar 11, 2025
@russellb (Member)

> > lgtm thanks!
> > I wish I knew why spawn was required, but if this makes it work for now, fine with me
>
> I think it was because I ran on TP>1 locally. Will remove

@markmc saw failures with TP=1 I think, but I wasn't able to reproduce

@markmc (Member) commented Mar 11, 2025

See #14579 for the failure

@russellb disabled auto-merge March 11, 2025 15:46
@markmc (Member) commented Mar 11, 2025

#14512 is required to fix v1/entrypoints/openai/test_completion.py

@markmc added this to the v0.8.0 milestone Mar 13, 2025
@markmc (Member) commented Mar 13, 2025

> #14512 is required to fix v1/entrypoints/openai/test_completion.py

Merged now, so you can rebase onto it.

@markmc (Member) commented Mar 13, 2025

> See #14579 for the failure

Still expecting tests/v1/entrypoints/llm/test_struct_output_generate.py to fail with:

RuntimeError: Cannot re-initialize CUDA in forked subprocess. To use CUDA with multiprocessing, you must use the 'spawn' start method

Reproducer:

VLLM_USE_V1=1 pytest -s -v 'tests/v1/entrypoints/llm/test_struct_output_generate.py::test_guided_grammar_ebnf[xgrammar]' 'tests/v1/entrypoints/llm/test_struct_output_generate.py::test_guided_grammar_lark[xgrammar]'
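
For context, the RuntimeError above is the standard CUDA-with-fork limitation: once CUDA has been initialized in the parent process, a forked child cannot re-initialize it, whereas a spawned child starts a fresh interpreter and can. The snippet below is a minimal, self-contained sketch of that behavior (it assumes a CUDA-capable machine and is not vLLM code); changing the start method from "spawn" to "fork" reproduces the same error, which is also why forcing spawn (e.g. via VLLM_WORKER_MULTIPROC_METHOD=spawn) sidesteps the failure even when the root cause of the early CUDA initialization isn't clear yet.

```python
# Minimal illustration (a sketch, not vLLM code) of the error above: CUDA that
# was initialized in the parent cannot be re-initialized in a forked child,
# while a spawned child gets a fresh interpreter and initializes CUDA cleanly.
import torch
import torch.multiprocessing as mp


def use_cuda_in_child():
    # With the "fork" start method this raises:
    #   RuntimeError: Cannot re-initialize CUDA in forked subprocess. ...
    # With "spawn" it succeeds.
    torch.zeros(1, device="cuda")


if __name__ == "__main__":
    torch.zeros(1, device="cuda")          # initialize CUDA in the parent (as pytest collection can)
    ctx = mp.get_context("spawn")          # change to "fork" to reproduce the failure
    p = ctx.Process(target=use_cuda_in_child)
    p.start()
    p.join()
    print("child exit code:", p.exitcode)  # 0 with spawn, non-zero with fork
```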

@DarkLight1337 (Member)

I get this error now:

tests/v1/entrypoints/llm/test_struct_output_generate.py::test_guided_json_completion[Qwen/Qwen2.5-1.5B-Instruct-xgrammar]
INFO 03-14 16:35:43 [__init__.py:32] name=register_dummy_model, value=vllm_add_dummy_model:register
INFO 03-14 16:35:43 [__init__.py:34] all available plugins for group vllm.general_plugins will be loaded.
INFO 03-14 16:35:43 [__init__.py:36] set environment variable VLLM_PLUGINS to control which plugins to load.
INFO 03-14 16:35:43 [__init__.py:44] plugin register_dummy_model loaded.
WARNING 03-14 16:35:43 [arg_utils.py:1478] Setting max_num_batched_tokens to 8192 for LLM_CLASS usage context.
INFO 03-14 16:35:57 [config.py:581] This model supports multiple tasks: {'generate', 'score', 'classify', 'reward', 'embed'}. Defaulting to 'generate'.
INFO 03-14 16:35:57 [config.py:1671] Chunked prefill is enabled with max_num_batched_tokens=8192.
INFO 03-14 16:35:58 [core.py:53] Initializing a V1 LLM engine (v0.7.4.dev424+ga4c924b00.d20250312) with config: model='Qwen/Qwen2.5-1.5B-Instruct', speculative_config=None, tokenizer='Qwen/Qwen2.5-1.5B-Instruct', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=1024, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto,  device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='xgrammar', reasoning_backend=None), observability_config=ObservabilityConfig(show_hidden_metrics=False, otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=None, served_model_name=Qwen/Qwen2.5-1.5B-Instruct, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=True, chunked_prefill_enabled=True, use_async_output_proc=True, disable_mm_preprocessor_cache=False, mm_processor_kwargs=None, pooler_config=None, compilation_config={"level":3,"custom_ops":["none"],"splitting_ops":["vllm.unified_attention","vllm.unified_attention_with_output"],"use_inductor":true,"compile_sizes":[],"use_cudagraph":true,"cudagraph_num_of_warmups":1,"cudagraph_capture_sizes":[512,504,496,488,480,472,464,456,448,440,432,424,416,408,400,392,384,376,368,360,352,344,336,328,320,312,304,296,288,280,272,264,256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"max_capture_size":512}
WARNING 03-14 16:36:00 [utils.py:2304] Methods determine_num_available_blocks,device_config,get_cache_block_size_bytes,initialize_cache not implemented in <vllm.v1.worker.gpu_worker.Worker object at 0x7f72a0ab2c70>
INFO 03-14 16:36:02 [parallel_state.py:948] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0
INFO 03-14 16:36:02 [cuda.py:215] Using Flash Attention backend on V1 engine.
INFO 03-14 16:36:02 [gpu_model_runner.py:1112] Starting to load model Qwen/Qwen2.5-1.5B-Instruct...
WARNING 03-14 16:36:02 [topk_topp_sampler.py:46] FlashInfer is not available. Falling back to the PyTorch-native implementation of top-p & top-k sampling. For the best performance, please install FlashInfer.
INFO 03-14 16:36:03 [weight_utils.py:257] Using model weights format ['*.safetensors']
INFO 03-14 16:36:03 [weight_utils.py:307] No model.safetensors.index.json found in remote.
Loading safetensors checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:01<00:00,  1.49s/it]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:01<00:00,  1.49s/it]

INFO 03-14 16:36:05 [loader.py:429] Loading weights took 1.62 seconds
INFO 03-14 16:36:05 [gpu_model_runner.py:1124] Model loading took 2.8875 GB and 3.048408 seconds
INFO 03-14 16:36:20 [backends.py:409] Using cache directory: /home/cyrus/.cache/vllm/torch_compile_cache/96a4ca999c/rank_0_0 for vLLM's torch.compile
INFO 03-14 16:36:20 [backends.py:419] Dynamo bytecode transform time: 14.99 s
INFO 03-14 16:36:21 [backends.py:115] Directly load the compiled graph for shape None from the cache
INFO 03-14 16:36:33 [monitor.py:33] torch.compile takes 14.99 s in total
INFO 03-14 16:36:34 [kv_cache_utils.py:537] GPU KV cache size: 409,856 tokens
INFO 03-14 16:36:34 [kv_cache_utils.py:540] Maximum concurrency for 1,024 tokens per request: 400.25x
INFO 03-14 16:37:00 [gpu_model_runner.py:1434] Graph capturing finished in 26 secs, took 1.42 GiB
INFO 03-14 16:37:00 [core.py:138] init engine (profile, create kv cache, warmup model) took 54.44 seconds
Processed prompts:   0%|                                                                                                                      | 0/2 [00:00<?, ?it/s, est. speed input: 0.00 toks/s, output: 0.00 toks/s]ERROR 03-14 16:37:04 [core.py:337] EngineCore hit an exception: Traceback (most recent call last):
ERROR 03-14 16:37:04 [core.py:337]   File "/home/cyrus/vllm/vllm/v1/engine/core.py", line 330, in run_engine_core
ERROR 03-14 16:37:04 [core.py:337]     engine_core.run_busy_loop()
ERROR 03-14 16:37:04 [core.py:337]   File "/home/cyrus/vllm/vllm/v1/engine/core.py", line 364, in run_busy_loop
ERROR 03-14 16:37:04 [core.py:337]     outputs = step_fn()
ERROR 03-14 16:37:04 [core.py:337]   File "/home/cyrus/vllm/vllm/v1/engine/core.py", line 193, in step
ERROR 03-14 16:37:04 [core.py:337]     engine_core_outputs = self.scheduler.update_from_output(
ERROR 03-14 16:37:04 [core.py:337]   File "/home/cyrus/vllm/vllm/v1/core/scheduler.py", line 621, in update_from_output
ERROR 03-14 16:37:04 [core.py:337]     request.structured_output_request.grammar.accept_tokens(  # type: ignore[union-attr]
ERROR 03-14 16:37:04 [core.py:337]   File "/home/cyrus/vllm/vllm/v1/structured_output/grammar.py", line 57, in accept_tokens
ERROR 03-14 16:37:04 [core.py:337]     if not self.matcher.accept_token(token):
ERROR 03-14 16:37:04 [core.py:337]   File "/home/cyrus/miniconda3/envs/vllm/lib/python3.9/site-packages/xgrammar/matcher.py", line 220, in accept_token
ERROR 03-14 16:37:04 [core.py:337]     return self._handle.accept_token(token_id, debug_print)
ERROR 03-14 16:37:04 [core.py:337] RuntimeError: [16:37:04] /project/cpp/grammar_matcher.cc:362: Check failed: (token_id >= 0 && token_id < tokenizer_info_.GetVocabSize()) is false: Invalid token id 151850 for GrammarMatcher
ERROR 03-14 16:37:04 [core.py:337] 
ERROR 03-14 16:37:04 [core.py:337] 
CRITICAL 03-14 16:37:04 [core_client.py:260] Got fatal signal from worker processes, shutting down. See stack trace above for root cause issue.
Killed
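
The failed check in the traceback (token_id < tokenizer_info_.GetVocabSize()) points at a vocabulary-size mismatch: the grammar matcher is built from the tokenizer's vocabulary, while sampling can emit ids up to the model config's vocab_size, which for Qwen2.5 appears to be padded larger (consistent with id 151850 being rejected). The sketch below (an illustration, not part of this PR) shows one way to inspect that gap locally.

```python
# Sketch (not part of this PR) to inspect the gap behind "Invalid token id
# 151850": compare the tokenizer's vocabulary with the model config's
# vocab_size, which is the actual upper bound on sampled token ids.
from transformers import AutoConfig, AutoTokenizer

model_id = "Qwen/Qwen2.5-1.5B-Instruct"
config = AutoConfig.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)

print("config.vocab_size:", config.vocab_size)  # upper bound on sampled ids
print("len(tokenizer):   ", len(tokenizer))     # what the grammar matcher was built from
# If config.vocab_size > len(tokenizer), any sampled id in that gap (e.g.
# 151850 here) fails GrammarMatcher's bounds check as seen in the traceback.
```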

@DarkLight1337 (Member)

This is with @russellb's latest commits merged in

@russellb (Member)

> This is with @russellb's latest commits merged in

I'm working on this over here: #14619

still trying to fully understand the problem, but I think the tests are passing there now (they are locally, still running in CI)

@russellb (Member)

> > This is with @russellb's latest commits merged in
>
> I'm working on this over here: #14619
>
> still trying to fully understand the problem, but I think the tests are passing there now (they are locally, still running in CI)

sorry, I'm mixing up PRs ... here: #14832

@russellb (Member)

and I think the 2 PRs are duplicates?

@DarkLight1337 removed this from the v0.8.0 milestone Mar 15, 2025
@robertgshaw2-redhat deleted the enable-v1-tests branch March 24, 2025 18:04
