
Conversation

@robertgshaw2-redhat (Collaborator) commented Mar 11, 2025

SUMMARY:

  • Enable v1/entrypoints tests

Signed-off-by: rshaw@neuralmagic.com <rshaw@neuralmagic.com>
@github-actions (bot)

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, covering a small, essential subset of tests to catch errors quickly. You can run additional CI tests on top of these by going to your fastcheck build in the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

🚀

rshaw@neuralmagic.com added 2 commits March 11, 2025 14:26
@russellb (Member) left a comment

lgtm thanks!

I wish I knew why spawn was required, but if this makes it work for now, fine with me

@russellb enabled auto-merge (squash) March 11, 2025 15:02
@github-actions bot added the ready label (ONLY add when PR is ready to merge/full CI is needed) Mar 11, 2025
@mergify (bot) commented Mar 11, 2025

This pull request has merge conflicts that must be resolved before it can be merged. Please rebase the PR, @robertgshaw2-redhat.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify bot added the needs-rebase label Mar 11, 2025
@robertgshaw2-redhat (Collaborator, Author)

> lgtm thanks!
>
> I wish I knew why spawn was required, but if this makes it work for now, fine with me

I think it was because I ran on TP>1 locally. Will remove

@mergify bot removed the needs-rebase label Mar 11, 2025
@russellb (Member)

> > lgtm thanks!
> > I wish I knew why spawn was required, but if this makes it work for now, fine with me
>
> I think it was because I ran on TP>1 locally. Will remove

@markmc saw failures with TP=1 I think, but I wasn't able to reproduce

@markmc (Member) commented Mar 11, 2025

See #14579 for the failure

@russellb disabled auto-merge March 11, 2025 15:46
@markmc (Member) commented Mar 11, 2025

#14512 is required to fix v1/entrypoints/openai/test_completion.py

@markmc added this to the v0.8.0 milestone Mar 13, 2025
@markmc (Member) commented Mar 13, 2025

> #14512 is required to fix v1/entrypoints/openai/test_completion.py

Merged now, so you can rebase onto it.

@markmc (Member) commented Mar 13, 2025

> See #14579 for the failure

Still expecting tests/v1/entrypoints/llm/test_struct_output_generate.py to fail with:

RuntimeError: Cannot re-initialize CUDA in forked subprocess. To use CUDA with multiprocessing, you must use the 'spawn' start method

Reproducer:

VLLM_USE_V1=1 pytest -s -v 'tests/v1/entrypoints/llm/test_struct_output_generate.py::test_guided_grammar_ebnf[xgrammar]' 'tests/v1/entrypoints/llm/test_struct_output_generate.py::test_guided_grammar_lark[xgrammar]'
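
For context, the RuntimeError above is the standard CUDA-with-fork limitation: once CUDA has been initialized in the parent process, a forked child cannot re-initialize it, whereas a spawned child starts a fresh interpreter and can. The snippet below is a minimal, self-contained sketch of that behavior (it assumes a CUDA-capable machine and is not vLLM code); changing the start method from "spawn" to "fork" reproduces the same error, which is also why forcing spawn (e.g. via VLLM_WORKER_MULTIPROC_METHOD=spawn) sidesteps the failure even when the root cause of the early CUDA initialization isn't clear yet.

```python
# Minimal illustration (a sketch, not vLLM code) of the error above: CUDA that
# was initialized in the parent cannot be re-initialized in a forked child,
# while a spawned child gets a fresh interpreter and initializes CUDA cleanly.
import torch
import torch.multiprocessing as mp


def use_cuda_in_child():
    # With the "fork" start method this raises:
    #   RuntimeError: Cannot re-initialize CUDA in forked subprocess. ...
    # With "spawn" it succeeds.
    torch.zeros(1, device="cuda")


if __name__ == "__main__":
    torch.zeros(1, device="cuda")          # initialize CUDA in the parent (as pytest collection can)
    ctx = mp.get_context("spawn")          # change to "fork" to reproduce the failure
    p = ctx.Process(target=use_cuda_in_child)
    p.start()
    p.join()
    print("child exit code:", p.exitcode)  # 0 with spawn, non-zero with fork
```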

@DarkLight1337 (Member)

I get this error now:

tests/v1/entrypoints/llm/test_struct_output_generate.py::test_guided_json_completion[Qwen/Qwen2.5-1.5B-Instruct-xgrammar]
INFO 03-14 16:35:43 [__init__.py:32] name=register_dummy_model, value=vllm_add_dummy_model:register
INFO 03-14 16:35:43 [__init__.py:34] all available plugins for group vllm.general_plugins will be loaded.
INFO 03-14 16:35:43 [__init__.py:36] set environment variable VLLM_PLUGINS to control which plugins to load.
INFO 03-14 16:35:43 [__init__.py:44] plugin register_dummy_model loaded.
WARNING 03-14 16:35:43 [arg_utils.py:1478] Setting max_num_batched_tokens to 8192 for LLM_CLASS usage context.
INFO 03-14 16:35:57 [config.py:581] This model supports multiple tasks: {'generate', 'score', 'classify', 'reward', 'embed'}. Defaulting to 'generate'.
INFO 03-14 16:35:57 [config.py:1671] Chunked prefill is enabled with max_num_batched_tokens=8192.
INFO 03-14 16:35:58 [core.py:53] Initializing a V1 LLM engine (v0.7.4.dev424+ga4c924b00.d20250312) with config: model='Qwen/Qwen2.5-1.5B-Instruct', speculative_config=None, tokenizer='Qwen/Qwen2.5-1.5B-Instruct', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=1024, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto,  device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='xgrammar', reasoning_backend=None), observability_config=ObservabilityConfig(show_hidden_metrics=False, otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=None, served_model_name=Qwen/Qwen2.5-1.5B-Instruct, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=True, chunked_prefill_enabled=True, use_async_output_proc=True, disable_mm_preprocessor_cache=False, mm_processor_kwargs=None, pooler_config=None, compilation_config={"level":3,"custom_ops":["none"],"splitting_ops":["vllm.unified_attention","vllm.unified_attention_with_output"],"use_inductor":true,"compile_sizes":[],"use_cudagraph":true,"cudagraph_num_of_warmups":1,"cudagraph_capture_sizes":[512,504,496,488,480,472,464,456,448,440,432,424,416,408,400,392,384,376,368,360,352,344,336,328,320,312,304,296,288,280,272,264,256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"max_capture_size":512}
WARNING 03-14 16:36:00 [utils.py:2304] Methods determine_num_available_blocks,device_config,get_cache_block_size_bytes,initialize_cache not implemented in <vllm.v1.worker.gpu_worker.Worker object at 0x7f72a0ab2c70>
INFO 03-14 16:36:02 [parallel_state.py:948] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0
INFO 03-14 16:36:02 [cuda.py:215] Using Flash Attention backend on V1 engine.
INFO 03-14 16:36:02 [gpu_model_runner.py:1112] Starting to load model Qwen/Qwen2.5-1.5B-Instruct...
WARNING 03-14 16:36:02 [topk_topp_sampler.py:46] FlashInfer is not available. Falling back to the PyTorch-native implementation of top-p & top-k sampling. For the best performance, please install FlashInfer.
INFO 03-14 16:36:03 [weight_utils.py:257] Using model weights format ['*.safetensors']
INFO 03-14 16:36:03 [weight_utils.py:307] No model.safetensors.index.json found in remote.
Loading safetensors checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:01<00:00,  1.49s/it]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:01<00:00,  1.49s/it]

INFO 03-14 16:36:05 [loader.py:429] Loading weights took 1.62 seconds
INFO 03-14 16:36:05 [gpu_model_runner.py:1124] Model loading took 2.8875 GB and 3.048408 seconds
INFO 03-14 16:36:20 [backends.py:409] Using cache directory: /home/cyrus/.cache/vllm/torch_compile_cache/96a4ca999c/rank_0_0 for vLLM's torch.compile
INFO 03-14 16:36:20 [backends.py:419] Dynamo bytecode transform time: 14.99 s
INFO 03-14 16:36:21 [backends.py:115] Directly load the compiled graph for shape None from the cache
INFO 03-14 16:36:33 [monitor.py:33] torch.compile takes 14.99 s in total
INFO 03-14 16:36:34 [kv_cache_utils.py:537] GPU KV cache size: 409,856 tokens
INFO 03-14 16:36:34 [kv_cache_utils.py:540] Maximum concurrency for 1,024 tokens per request: 400.25x
INFO 03-14 16:37:00 [gpu_model_runner.py:1434] Graph capturing finished in 26 secs, took 1.42 GiB
INFO 03-14 16:37:00 [core.py:138] init engine (profile, create kv cache, warmup model) took 54.44 seconds
Processed prompts:   0%|                                                                                                                      | 0/2 [00:00<?, ?it/s, est. speed input: 0.00 toks/s, output: 0.00 toks/s]ERROR 03-14 16:37:04 [core.py:337] EngineCore hit an exception: Traceback (most recent call last):
ERROR 03-14 16:37:04 [core.py:337]   File "/home/cyrus/vllm/vllm/v1/engine/core.py", line 330, in run_engine_core
ERROR 03-14 16:37:04 [core.py:337]     engine_core.run_busy_loop()
ERROR 03-14 16:37:04 [core.py:337]   File "/home/cyrus/vllm/vllm/v1/engine/core.py", line 364, in run_busy_loop
ERROR 03-14 16:37:04 [core.py:337]     outputs = step_fn()
ERROR 03-14 16:37:04 [core.py:337]   File "/home/cyrus/vllm/vllm/v1/engine/core.py", line 193, in step
ERROR 03-14 16:37:04 [core.py:337]     engine_core_outputs = self.scheduler.update_from_output(
ERROR 03-14 16:37:04 [core.py:337]   File "/home/cyrus/vllm/vllm/v1/core/scheduler.py", line 621, in update_from_output
ERROR 03-14 16:37:04 [core.py:337]     request.structured_output_request.grammar.accept_tokens(  # type: ignore[union-attr]
ERROR 03-14 16:37:04 [core.py:337]   File "/home/cyrus/vllm/vllm/v1/structured_output/grammar.py", line 57, in accept_tokens
ERROR 03-14 16:37:04 [core.py:337]     if not self.matcher.accept_token(token):
ERROR 03-14 16:37:04 [core.py:337]   File "/home/cyrus/miniconda3/envs/vllm/lib/python3.9/site-packages/xgrammar/matcher.py", line 220, in accept_token
ERROR 03-14 16:37:04 [core.py:337]     return self._handle.accept_token(token_id, debug_print)
ERROR 03-14 16:37:04 [core.py:337] RuntimeError: [16:37:04] /project/cpp/grammar_matcher.cc:362: Check failed: (token_id >= 0 && token_id < tokenizer_info_.GetVocabSize()) is false: Invalid token id 151850 for GrammarMatcher
ERROR 03-14 16:37:04 [core.py:337] 
ERROR 03-14 16:37:04 [core.py:337] 
CRITICAL 03-14 16:37:04 [core_client.py:260] Got fatal signal from worker processes, shutting down. See stack trace above for root cause issue.
Killed
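
The failed check in the traceback (token_id < tokenizer_info_.GetVocabSize()) points at a vocabulary-size mismatch: the grammar matcher is built from the tokenizer's vocabulary, while sampling can emit ids up to the model config's vocab_size, which for Qwen2.5 appears to be padded larger (consistent with id 151850 being rejected). The sketch below (an illustration, not part of this PR) shows one way to inspect that gap locally.

```python
# Sketch (not part of this PR) to inspect the gap behind "Invalid token id
# 151850": compare the tokenizer's vocabulary with the model config's
# vocab_size, which is the actual upper bound on sampled token ids.
from transformers import AutoConfig, AutoTokenizer

model_id = "Qwen/Qwen2.5-1.5B-Instruct"
config = AutoConfig.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)

print("config.vocab_size:", config.vocab_size)  # upper bound on sampled ids
print("len(tokenizer):   ", len(tokenizer))     # what the grammar matcher was built from
# If config.vocab_size > len(tokenizer), any sampled id in that gap (e.g.
# 151850 here) fails GrammarMatcher's bounds check as seen in the traceback.
```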

@DarkLight1337 (Member)

This is with @russellb's latest commits merged in

@russellb (Member)

> This is with @russellb's latest commits merged in

I'm working on this over here: #14619

still trying to fully understand the problem, but I think the tests are passing there now (they are locally, still running in CI)

@russellb (Member)

> > This is with @russellb's latest commits merged in
>
> I'm working on this over here: #14619
>
> still trying to fully understand the problem, but I think the tests are passing there now (they are locally, still running in CI)

sorry, I'm mixing up PRs ... here: #14832

@russellb (Member)

and I think the 2 PRs are duplicates?

@DarkLight1337 removed this from the v0.8.0 milestone Mar 15, 2025
@robertgshaw2-redhat deleted the enable-v1-tests branch March 24, 2025 18:04
