
Conversation

@kebe7jun (Contributor) commented Jun 4, 2025

Fix #19120

Essential Elements of an Effective PR Description Checklist

  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results

Purpose

Fixed an issue where the V1 CPU worker could not be started on macOS because os.sched_getaffinity is not available there.

Test Plan

vllm serve Qwen/Qwen2.5-0.5B-Instruct
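
A quick way to confirm generation end-to-end (not part of the PR; a minimal sketch that assumes the default port 8000 and the OpenAI-compatible routes listed under Test Result):

import requests

base = "http://localhost:8000"

# The server is ready once /health returns 200.
assert requests.get(f"{base}/health", timeout=30).status_code == 200

# Discover the served model name rather than hard-coding it.
model = requests.get(f"{base}/v1/models", timeout=30).json()["data"][0]["id"]

# Issue a tiny completion request through the OpenAI-compatible API.
resp = requests.post(
    f"{base}/v1/completions",
    json={"model": model, "prompt": "Hello, my name is", "max_tokens": 16},
    timeout=120,
)
print(resp.json()["choices"][0]["text"])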

Test Result

INFO 06-04 11:39:59 [config.py:822] This model supports multiple tasks: {'generate', 'score', 'embed', 'reward', 'classify'}. Defaulting to 'generate'.
WARNING 06-04 11:39:59 [config.py:3199] Your device 'cpu' doesn't support torch.bfloat16. Falling back to torch.float16 for compatibility.
WARNING 06-04 11:39:59 [config.py:3250] Casting torch.bfloat16 to torch.float16.
INFO 06-04 11:39:59 [config.py:1933] Defaulting to use mp for distributed inference
INFO 06-04 11:39:59 [config.py:1967] Disabled the custom all-reduce kernel because it is not supported on current platform.
WARNING 06-04 11:39:59 [cpu.py:135] Environment variable VLLM_CPU_KVCACHE_SPACE (GiB) for CPU backend is not set, using 4 by default.
/Users/kebeliu/workspace/vllm/.venv/lib/python3.9/site-packages/urllib3/__init__.py:35: NotOpenSSLWarning: urllib3 v2 only supports OpenSSL 1.1.1+, currently the 'ssl' module is compiled with 'LibreSSL 2.8.3'. See: https://github.com/urllib3/urllib3/issues/3020
  warnings.warn(
INFO 06-04 11:40:05 [importing.py:17] Triton not installed or not compatible; certain GPU-related functions will not be available.
WARNING 06-04 11:40:05 [importing.py:29] Triton is not installed. Using dummy decorators. Install it via `pip install triton` to enable kernel compilation.
INFO 06-04 11:40:08 [__init__.py:244] Automatically detected platform cpu.
INFO 06-04 11:40:13 [core.py:455] Waiting for init message from front-end.
WARNING 06-04 11:40:13 [cpu.py:135] Environment variable VLLM_CPU_KVCACHE_SPACE (GiB) for CPU backend is not set, using 4 by default.
INFO 06-04 11:40:13 [core.py:70] Initializing a V1 LLM engine (v0.9.1.dev345+g8e5939caf) with config: model='/Users/kebeliu/.cache/huggingface/hub/models--Qwen--Qwen2.5-0.5B-Instruct/snapshots/7ae557604adf67be50417f59c2c2f167def9a775/', speculative_config=None, tokenizer='/Users/kebeliu/.cache/huggingface/hub/models--Qwen--Qwen2.5-0.5B-Instruct/snapshots/7ae557604adf67be50417f59c2c2f167def9a775/', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config={}, tokenizer_revision=None, trust_remote_code=True, dtype=torch.float16, max_seq_len=2048, download_dir=None, load_format=auto, tensor_parallel_size=2, pipeline_parallel_size=1, disable_custom_all_reduce=True, quantization=None, enforce_eager=True, kv_cache_dtype=auto,  device_config=cpu, decoding_config=DecodingConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_backend=''), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None), seed=0, served_model_name=model, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=False, chunked_prefill_enabled=False, use_async_output_proc=False, pooler_config=None, compilation_config={"level":2,"debug_dump_path":"","cache_dir":"","backend":"eager","custom_ops":["none","none"],"splitting_ops":["vllm.unified_attention","vllm.unified_attention_with_output"],"use_inductor":true,"compile_sizes":[],"inductor_compile_config":{"enable_auto_functionalized_v2":false,"dce":true,"size_asserts":false,"nan_asserts":false,"memory_planning":true,"epilogue_fusion":true},"inductor_passes":{},"use_cudagraph":true,"cudagraph_num_of_warmups":1,"cudagraph_capture_sizes":[],"cudagraph_copy_inputs":false,"full_cuda_graph":false,"max_capture_size":0,"local_cache_dir":null}
INFO 06-04 11:40:13 [shm_broadcast.py:251] vLLM message queue communication handle: Handle(local_reader_ranks=[0, 1], buffer_handle=(2, 10485760, 10, 'psm_b1adad50'), local_subscribe_addr='ipc:///var/folders/f4/fp0rrg2123nbs7c6rghvl77w0000gn/T/07bee9ee-d946-4855-9e0c-3c473795ec74', remote_subscribe_addr=None, remote_addr_ipv6=False)
/Users/kebeliu/workspace/vllm/.venv/lib/python3.9/site-packages/urllib3/__init__.py:35: NotOpenSSLWarning: urllib3 v2 only supports OpenSSL 1.1.1+, currently the 'ssl' module is compiled with 'LibreSSL 2.8.3'. See: https://github.com/urllib3/urllib3/issues/3020
  warnings.warn(
/Users/kebeliu/workspace/vllm/.venv/lib/python3.9/site-packages/urllib3/__init__.py:35: NotOpenSSLWarning: urllib3 v2 only supports OpenSSL 1.1.1+, currently the 'ssl' module is compiled with 'LibreSSL 2.8.3'. See: https://github.com/urllib3/urllib3/issues/3020
  warnings.warn(
INFO 06-04 11:40:17 [importing.py:17] Triton not installed or not compatible; certain GPU-related functions will not be available.
INFO 06-04 11:40:17 [importing.py:17] Triton not installed or not compatible; certain GPU-related functions will not be available.
WARNING 06-04 11:40:17 [importing.py:29] Triton is not installed. Using dummy decorators. Install it via `pip install triton` to enable kernel compilation.
WARNING 06-04 11:40:17 [importing.py:29] Triton is not installed. Using dummy decorators. Install it via `pip install triton` to enable kernel compilation.
INFO 06-04 11:40:20 [__init__.py:244] Automatically detected platform cpu.
INFO 06-04 11:40:20 [__init__.py:244] Automatically detected platform cpu.
WARNING 06-04 11:40:26 [utils.py:2722] Methods determine_num_available_blocks,device_config,get_cache_block_size_bytes,initialize_cache not implemented in <vllm.v1.worker.cpu_worker.CPUWorker object at 0x31c09d070>
WARNING 06-04 11:40:26 [utils.py:2722] Methods determine_num_available_blocks,device_config,get_cache_block_size_bytes,initialize_cache not implemented in <vllm.v1.worker.cpu_worker.CPUWorker object at 0x30d0de070>
(VllmWorker rank=1 pid=57317) (VllmWorker rank=0 pid=57316) INFO 06-04 11:40:26 [shm_broadcast.py:251] vLLM message queue communication handle: Handle(local_reader_ranks=[0], buffer_handle=(1, 10485760, 10, 'psm_96bb3b4c'), local_subscribe_addr='ipc:///var/folders/f4/fp0rrg2123nbs7c6rghvl77w0000gn/T/52915ade-4a80-4b9d-8ec9-eee67485db7f', remote_subscribe_addr=None, remote_addr_ipv6=False)
INFO 06-04 11:40:26 [shm_broadcast.py:251] vLLM message queue communication handle: Handle(local_reader_ranks=[0], buffer_handle=(1, 10485760, 10, 'psm_c1de700d'), local_subscribe_addr='ipc:///var/folders/f4/fp0rrg2123nbs7c6rghvl77w0000gn/T/e60cdb14-153c-4ae8-a9a1-298fc4af1f44', remote_subscribe_addr=None, remote_addr_ipv6=False)
(VllmWorker rank=0 pid=57316) INFO 06-04 11:40:26 [shm_broadcast.py:251] vLLM message queue communication handle: Handle(local_reader_ranks=[1], buffer_handle=(1, 4194304, 6, 'psm_05615c62'), local_subscribe_addr='ipc:///var/folders/f4/fp0rrg2123nbs7c6rghvl77w0000gn/T/0c82ec11-8443-4ca4-b15c-fbfbb8d79d5c', remote_subscribe_addr=None, remote_addr_ipv6=False)
(VllmWorker rank=0 pid=57316) (VllmWorker rank=1 pid=57317) INFO 06-04 11:40:26 [parallel_state.py:1065] rank 0 in world size 2 is assigned as DP rank 0, PP rank 0, TP rank 0, EP rank 0
INFO 06-04 11:40:26 [parallel_state.py:1065] rank 1 in world size 2 is assigned as DP rank 0, PP rank 0, TP rank 1, EP rank 1
(VllmWorker rank=1 pid=57317) WARNING 06-04 11:40:26 [cpu.py:242] Pin memory is not supported on CPU.
(VllmWorker rank=0 pid=57316) WARNING 06-04 11:40:26 [cpu.py:242] Pin memory is not supported on CPU.
(VllmWorker rank=1 pid=57317) INFO 06-04 11:40:26 [cpu_model_runner.py:52] Starting to load model /Users/kebeliu/.cache/huggingface/hub/models--Qwen--Qwen2.5-0.5B-Instruct/snapshots/7ae557604adf67be50417f59c2c2f167def9a775/...
(VllmWorker rank=0 pid=57316) INFO 06-04 11:40:26 [cpu_model_runner.py:52] Starting to load model /Users/kebeliu/.cache/huggingface/hub/models--Qwen--Qwen2.5-0.5B-Instruct/snapshots/7ae557604adf67be50417f59c2c2f167def9a775/...
(VllmWorker rank=1 pid=57317) (VllmWorker rank=0 pid=57316) INFO 06-04 11:40:26 [cpu.py:69] Using Torch SDPA backend.
INFO 06-04 11:40:26 [cpu.py:69] Using Torch SDPA backend.
Loading safetensors checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00,  1.01it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00,  1.01it/s]
(VllmWorker rank=0 pid=57316) 
(VllmWorker rank=1 pid=57317) INFO 06-04 11:40:27 [default_loader.py:272] Loading weights took 0.99 seconds
(VllmWorker rank=0 pid=57316) INFO 06-04 11:40:27 [default_loader.py:272] Loading weights took 1.00 seconds
INFO 06-04 11:40:27 [kv_cache_utils.py:679] GPU KV cache size: 699,040 tokens
INFO 06-04 11:40:27 [kv_cache_utils.py:683] Maximum concurrency for 2,048 tokens per request: 341.33x
INFO 06-04 11:40:27 [kv_cache_utils.py:679] GPU KV cache size: 699,040 tokens
INFO 06-04 11:40:27 [kv_cache_utils.py:683] Maximum concurrency for 2,048 tokens per request: 341.33x
(VllmWorker rank=1 pid=57317) (VllmWorker rank=0 pid=57316) INFO 06-04 11:40:27 [cpu.py:69] Using Torch SDPA backend.
INFO 06-04 11:40:27 [cpu.py:69] Using Torch SDPA backend.
(VllmWorker rank=0 pid=57316) INFO 06-04 11:40:28 [cpu_model_runner.py:61] Warming up model for the compilation...
(VllmWorker rank=1 pid=57317) INFO 06-04 11:40:28 [cpu_model_runner.py:61] Warming up model for the compilation...
(VllmWorker rank=0 pid=57316) (VllmWorker rank=1 pid=57317) INFO 06-04 11:40:40 [cpu_model_runner.py:64] Warming up done.
INFO 06-04 11:40:40 [cpu_model_runner.py:64] Warming up done.
INFO 06-04 11:40:40 [core.py:171] init engine (profile, create kv cache, warmup model) took 13.27 seconds
INFO 06-04 11:40:41 [loggers.py:137] Engine 000: vllm cache_config_info with initialization after num_gpu_blocks is: 43690
WARNING 06-04 11:40:41 [config.py:1362] Default sampling parameters have been overridden by the model's Hugging Face generation config recommended from the model creator. If this is not intended, please relaunch vLLM instance with `--generation-config vllm`.
INFO 06-04 11:40:41 [serving_chat.py:118] Using default chat sampling params from model: {'repetition_penalty': 1.1, 'temperature': 0.7, 'top_k': 20, 'top_p': 0.8}
INFO 06-04 11:40:41 [serving_completion.py:66] Using default completion sampling params from model: {'repetition_penalty': 1.1, 'temperature': 0.7, 'top_k': 20, 'top_p': 0.8}
INFO 06-04 11:40:41 [api_server.py:1351] Starting vLLM API server 0 on http://0.0.0.0:8000
INFO 06-04 11:40:41 [launcher.py:29] Available routes are:
INFO 06-04 11:40:41 [launcher.py:37] Route: /openapi.json, Methods: HEAD, GET
INFO 06-04 11:40:41 [launcher.py:37] Route: /docs, Methods: HEAD, GET
INFO 06-04 11:40:41 [launcher.py:37] Route: /docs/oauth2-redirect, Methods: HEAD, GET
INFO 06-04 11:40:41 [launcher.py:37] Route: /redoc, Methods: HEAD, GET
INFO 06-04 11:40:41 [launcher.py:37] Route: /health, Methods: GET
INFO 06-04 11:40:41 [launcher.py:37] Route: /load, Methods: GET
INFO 06-04 11:40:41 [launcher.py:37] Route: /ping, Methods: POST
INFO 06-04 11:40:41 [launcher.py:37] Route: /ping, Methods: GET
INFO 06-04 11:40:41 [launcher.py:37] Route: /tokenize, Methods: POST
INFO 06-04 11:40:41 [launcher.py:37] Route: /detokenize, Methods: POST
INFO 06-04 11:40:41 [launcher.py:37] Route: /v1/models, Methods: GET
INFO 06-04 11:40:41 [launcher.py:37] Route: /version, Methods: GET
INFO 06-04 11:40:41 [launcher.py:37] Route: /v1/chat/completions, Methods: POST
INFO 06-04 11:40:41 [launcher.py:37] Route: /v1/completions, Methods: POST
INFO 06-04 11:40:41 [launcher.py:37] Route: /v1/embeddings, Methods: POST
INFO 06-04 11:40:41 [launcher.py:37] Route: /pooling, Methods: POST
INFO 06-04 11:40:41 [launcher.py:37] Route: /classify, Methods: POST
INFO 06-04 11:40:41 [launcher.py:37] Route: /score, Methods: POST
INFO 06-04 11:40:41 [launcher.py:37] Route: /v1/score, Methods: POST
INFO 06-04 11:40:41 [launcher.py:37] Route: /v1/audio/transcriptions, Methods: POST
INFO 06-04 11:40:41 [launcher.py:37] Route: /rerank, Methods: POST
INFO 06-04 11:40:41 [launcher.py:37] Route: /v1/rerank, Methods: POST
INFO 06-04 11:40:41 [launcher.py:37] Route: /v2/rerank, Methods: POST
INFO 06-04 11:40:41 [launcher.py:37] Route: /invocations, Methods: POST
INFO 06-04 11:40:41 [launcher.py:37] Route: /metrics, Methods: GET
INFO:     Started server process [56977]
INFO:     Waiting for application startup.
INFO:     Application startup complete.

ref: #16441

cc @bigPYJ1151

github-actions bot commented Jun 4, 2025

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, covering a small, essential subset of CI tests to catch errors quickly. You can run other CI tests on top of those by going to your fastcheck build on the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either: Add ready label to the PR or enable auto-merge.

🚀

@gemini-code-assist bot (Contributor) left a comment

Hello @kebe7jun, I'm Gemini Code Assist[1]! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

Summary of Changes

Hello! Gemini or gemini-code-assist here, providing a summary of this pull request to help everyone quickly understand the changes.

This PR aims to fix a specific bug where the v1 CPU worker fails when running on macOS. The root cause appears to be related to how the maximum number of threads is determined and set for libraries like NUMEXPR, which is typically done using os.sched_getaffinity. This function is not available on macOS, leading to the failure.

The core change introduces a new helper function get_max_threads that provides an OS-aware way to determine the available threads. It uses os.sched_getaffinity on systems where it's available (like Linux) and falls back to using os.cpu_count() for macOS (Darwin). The code that sets the NUMEXPR_MAX_THREADS environment variable is then updated to use this new, more robust function, resolving the macOS compatibility issue.

Highlights

  • Bugfix: Addresses and fixes a bug causing the v1 CPU worker to fail specifically on macOS.
  • Platform Compatibility: Improves compatibility for the CPU backend on macOS by handling OS-specific differences in determining available CPU threads.
  • New Helper Function: Introduces a new utility function get_max_threads to abstract away OS-specific logic for getting thread information.

Changelog

  • vllm/platforms/cpu.py
    • Imported the platform module to check the operating system.
    • Added a new function get_max_threads(pid=0) which returns the set of available CPU IDs using os.sched_getaffinity if available, or set(range(os.cpu_count())) if the system is macOS ('Darwin'). Raises NotImplementedError for other unsupported systems.
    • Modified the check_and_update_config method to call the new get_max_threads() function instead of the direct len(os.sched_getaffinity(0)) when setting the NUMEXPR_MAX_THREADS environment variable.
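
A minimal sketch of the helper as described above (names taken from this summary; the exact code in the PR may differ, and the thread count is passed through len(), which the code review below flags as the critical detail):

import os
import platform


def get_max_threads(pid=0):
    # Prefer the scheduler affinity mask where the OS exposes it (Linux).
    if hasattr(os, "sched_getaffinity"):
        return os.sched_getaffinity(pid)
    # macOS has no sched_getaffinity; fall back to the logical CPU count.
    if platform.system() == "Darwin":
        return set(range(os.cpu_count()))
    raise NotImplementedError("Unsupported OS")


# check_and_update_config would then store the *count* of threads:
os.environ["NUMEXPR_MAX_THREADS"] = str(len(get_max_threads()))
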
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point in your pull request by creating an issue comment (i.e. a comment on the pull request page) using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands.

Feature               Command              Description
Code Review           /gemini review       Performs a code review for the current pull request in its current state.
Pull Request Summary  /gemini summary      Provides a summary of the current pull request in its current state.
Comment               @gemini-code-assist  Responds in comments when explicitly tagged, both in issue comments and review comments.
Help                  /gemini help         Displays a list of available commands.

Customization

To customize your Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist is currently in preview and may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments to provide feedback.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.


Mac threads were lost,
Now platform finds the way,
CPU runs free.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

@gemini-code-assist bot (Contributor) left a comment

Code Review

This pull request effectively addresses the AttributeError for os.sched_getaffinity on macOS by introducing a fallback mechanism using os.cpu_count(). This is a good fix for improving cross-platform compatibility for CPU workers.

However, there's a critical issue in how the result of the new get_max_threads function is used, which needs to be addressed. Additionally, adding documentation and type hints to the new function would enhance code clarity and maintainability.

Summary of Findings

  • Incorrect NUMEXPR_MAX_THREADS Value: The NUMEXPR_MAX_THREADS environment variable is being set with the string representation of a set (e.g., "{0, 1, 2, 3}") instead of the count of threads (e.g., "4"). This is because len() is missing when calling get_max_threads(). An illustration follows after this list.
  • Missing Documentation and Type Hints: The newly added get_max_threads function lacks a docstring and type hints, which would improve its readability and maintainability.
  • Error Message Specificity (Not Commented): The NotImplementedError in get_max_threads uses a generic "Unsupported OS" message. A more specific message like "Unsupported OS for determining CPU affinity or count." could be more informative. This was not added as a direct comment due to review settings (low severity).
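
To make the first finding concrete, a hypothetical before/after (illustrative only, not copied from the diff):

import os

# get_max_threads as in the sketch above.
# Buggy: stores the string form of a set, e.g. "{0, 1, 2, 3}", which is not a valid thread count.
os.environ["NUMEXPR_MAX_THREADS"] = str(get_max_threads())

# Fixed: stores the number of threads, e.g. "4".
os.environ["NUMEXPR_MAX_THREADS"] = str(len(get_max_threads()))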

Merge Readiness

This pull request makes a valuable fix for macOS compatibility. However, there is a critical issue regarding the incorrect setting of the NUMEXPR_MAX_THREADS environment variable that must be addressed before merging. Additionally, enhancing the new get_max_threads function with a docstring and type hints is recommended for better code quality.

I am unable to approve pull requests. Please ensure the critical issue is resolved, and consider the medium-severity feedback. After these changes, the PR should be in a much better state for further review and merging.

@kebe7jun force-pushed the fix/cpu-worker-affinity branch from bc4ae18 to 8e5939c on June 4, 2025 02:07
@mgoin (Member) left a comment

Thanks for investigating this. With this change, does V1 run on MacOS now? It would be good to report your testing strategy in the PR description

@kebe7jun (Contributor, Author) commented Jun 4, 2025

Thanks for investigating this. With this change, does V1 run on MacOS now? It would be good to report your testing strategy in the PR description

Updated the PR description.
This allows vllm to start on macOS, but it doesn't work yet because the default chunked prefill causes failures, so I haven't written a test plan. I plan to fix it in another PR (I'm working on a solution).
@mgoin

@bigPYJ1151 (Member) commented

We should only enable CPU V1 for the x86 arch because other archs have no chunked prefill op support... I forgot to consider this :(

vllm/vllm/engine/arg_utils.py, lines 1434 to 1440 in b124e10:

# Non-[CUDA, TPU] may be supported on V1, but off by default for now.
v0_hardware = not any(
    (current_platform.is_cuda(), current_platform.is_tpu(),
     current_platform.is_cpu()))
if v0_hardware and _warn_or_fallback(  # noqa: SIM103
        current_platform.device_name):
    return False
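
One way such an architecture guard could be expressed (a hypothetical sketch only; the actual fallback logic added in the PR may be shaped differently):

import platform


def cpu_v1_supported() -> bool:
    # Chunked-prefill ops are currently only available for x86 CPUs,
    # so other CPU architectures should stay on V0 by default.
    return platform.machine().lower() in ("x86_64", "amd64", "i386", "i686")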

@kebe7jun force-pushed the fix/cpu-worker-affinity branch from 8e5939c to 493089d on June 4, 2025 07:36
@kebe7jun (Contributor, Author) commented Jun 4, 2025

@bigPYJ1151 Yes, I updated the fallback logic.

@kebe7jun force-pushed the fix/cpu-worker-affinity branch from 493089d to 5999e75 on June 4, 2025 07:48
@kebe7jun force-pushed the fix/cpu-worker-affinity branch from 5999e75 to 1f676e5 on June 4, 2025 09:01
@mgoin (Member) commented Jun 4, 2025

LGTM. Please fix the precommit so I can enable CI

Signed-off-by: Kebe <mail@kebe7jun.com>
@kebe7jun force-pushed the fix/cpu-worker-affinity branch from 1f676e5 to 2116c22 on June 4, 2025 13:29
@kebe7jun (Contributor, Author) commented Jun 4, 2025

LGTM. Please fix the precommit so I can enable CI

@mgoin fixed.

@mgoin added the bug (Something isn't working) and ready (ONLY add when PR is ready to merge/full CI is needed) labels on Jun 4, 2025
@mgoin enabled auto-merge (squash) on June 4, 2025 13:32
@mgoin merged commit ef3f98b into vllm-project:main on Jun 4, 2025
77 checks passed
@kebe7jun deleted the fix/cpu-worker-affinity branch on June 5, 2025 00:52
leoli1208 pushed a commit to leoli1208/vllm that referenced this pull request on Jul 22, 2025

Labels

bug (Something isn't working), ready (ONLY add when PR is ready to merge/full CI is needed)

Development

Successfully merging this pull request may close these issues:

  • [Bug]: CPU v1 worker run fails on macOS

3 participants