fix: vllm serve on Apple silicon
#17473
Conversation
Right now commands like `vllm serve TinyLlama/TinyLlama-1.1B-Chat-v1.0` on
Apple silicon fail with Triton import errors like the one below.
```
$ vllm serve TinyLlama/TinyLlama-1.1B-Chat-v1.0
INFO 04-30 09:33:49 [importing.py:17] Triton not installed or not compatible; certain GPU-related functions will not be available.
WARNING 04-30 09:33:49 [importing.py:28] Triton is not installed. Using dummy decorators. Install it via `pip install triton` to enable kernel compilation.
INFO 04-30 09:33:49 [importing.py:53] Triton module has been replaced with a placeholder.
INFO 04-30 09:33:50 [__init__.py:239] Automatically detected platform cpu.
Traceback (most recent call last):
File "/Users/dxia/src/github.com/vllm-project/vllm/.venv/bin/vllm", line 5, in <module>
from vllm.entrypoints.cli.main import main
File "/Users/dxia/src/github.com/vllm-project/vllm/vllm/entrypoints/cli/main.py", line 7, in <module>
import vllm.entrypoints.cli.benchmark.main
File "/Users/dxia/src/github.com/vllm-project/vllm/vllm/entrypoints/cli/benchmark/main.py", line 6, in <module>
import vllm.entrypoints.cli.benchmark.throughput
File "/Users/dxia/src/github.com/vllm-project/vllm/vllm/entrypoints/cli/benchmark/throughput.py", line 4, in <module>
from vllm.benchmarks.throughput import add_cli_args, main
File "/Users/dxia/src/github.com/vllm-project/vllm/vllm/benchmarks/throughput.py", line 18, in <module>
from vllm.benchmarks.datasets import (AIMODataset, BurstGPTDataset,
File "/Users/dxia/src/github.com/vllm-project/vllm/vllm/benchmarks/datasets.py", line 34, in <module>
from vllm.lora.utils import get_adapter_absolute_path
File "/Users/dxia/src/github.com/vllm-project/vllm/vllm/lora/utils.py", line 15, in <module>
from vllm.lora.fully_sharded_layers import (
File "/Users/dxia/src/github.com/vllm-project/vllm/vllm/lora/fully_sharded_layers.py", line 14, in <module>
from vllm.lora.layers import (ColumnParallelLinearWithLoRA,
File "/Users/dxia/src/github.com/vllm-project/vllm/vllm/lora/layers.py", line 29, in <module>
from vllm.model_executor.layers.logits_processor import LogitsProcessor
File "/Users/dxia/src/github.com/vllm-project/vllm/vllm/model_executor/layers/logits_processor.py", line 13, in <module>
from vllm.model_executor.layers.vocab_parallel_embedding import (
File "/Users/dxia/src/github.com/vllm-project/vllm/vllm/model_executor/layers/vocab_parallel_embedding.py", line 139, in <module>
@torch.compile(dynamic=True, backend=current_platform.simple_compile_backend)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/dxia/src/github.com/vllm-project/vllm/.venv/lib/python3.12/site-packages/torch/__init__.py", line 2543, in fn
return compile(
^^^^^^^^
File "/Users/dxia/src/github.com/vllm-project/vllm/.venv/lib/python3.12/site-packages/torch/__init__.py", line 2572, in compile
return torch._dynamo.optimize(
^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/dxia/src/github.com/vllm-project/vllm/.venv/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py", line 944, in optimize
return _optimize(rebuild_ctx, *args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/dxia/src/github.com/vllm-project/vllm/.venv/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py", line 998, in _optimize
backend = get_compiler_fn(backend)
^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/dxia/src/github.com/vllm-project/vllm/.venv/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py", line 878, in get_compiler_fn
from .repro.after_dynamo import wrap_backend_debug
File "/Users/dxia/src/github.com/vllm-project/vllm/.venv/lib/python3.12/site-packages/torch/_dynamo/repro/after_dynamo.py", line 35, in <module>
from torch._dynamo.debug_utils import (
File "/Users/dxia/src/github.com/vllm-project/vllm/.venv/lib/python3.12/site-packages/torch/_dynamo/debug_utils.py", line 44, in <module>
from torch._dynamo.testing import rand_strided
File "/Users/dxia/src/github.com/vllm-project/vllm/.venv/lib/python3.12/site-packages/torch/_dynamo/testing.py", line 33, in <module>
from torch._dynamo.backends.debugging import aot_eager
File "/Users/dxia/src/github.com/vllm-project/vllm/.venv/lib/python3.12/site-packages/torch/_dynamo/backends/debugging.py", line 35, in <module>
from functorch.compile import min_cut_rematerialization_partition
File "/Users/dxia/src/github.com/vllm-project/vllm/.venv/lib/python3.12/site-packages/functorch/compile/__init__.py", line 2, in <module>
from torch._functorch.aot_autograd import (
File "/Users/dxia/src/github.com/vllm-project/vllm/.venv/lib/python3.12/site-packages/torch/_functorch/aot_autograd.py", line 26, in <module>
from torch._inductor.output_code import OutputCode
File "/Users/dxia/src/github.com/vllm-project/vllm/.venv/lib/python3.12/site-packages/torch/_inductor/output_code.py", line 52, in <module>
from .runtime.autotune_cache import AutotuneCacheBundler
File "/Users/dxia/src/github.com/vllm-project/vllm/.venv/lib/python3.12/site-packages/torch/_inductor/runtime/autotune_cache.py", line 23, in <module>
from .triton_compat import Config
File "/Users/dxia/src/github.com/vllm-project/vllm/.venv/lib/python3.12/site-packages/torch/_inductor/runtime/triton_compat.py", line 16, in <module>
from triton import Config
ImportError: cannot import name 'Config' from 'triton' (unknown location)
```
We cannot install `triton` on Apple silicon because there are no [available
distributions][1].
This change adds more placeholders for triton modules and classes that are
imported when calling `vllm serve`.
[1]: https://pypi.org/project/triton/#files
Signed-off-by: David Xia <david@davidxia.com>
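For context, here is a minimal sketch of the kind of placeholder this change adds. The helper name below is illustrative, not the actual diff; the real logic lives in vLLM's Triton importing shim.
```python
# Illustrative sketch only: register dummy triton modules so that
# downstream `from triton import ...` statements resolve when the real
# package cannot be installed (e.g. on Apple silicon).
import sys
import types


class _PlaceholderConfig:
    """Stand-in for triton.Config; it only needs to exist so imports succeed."""

    def __init__(self, *args, **kwargs):
        pass


def install_triton_placeholders() -> None:  # hypothetical helper name
    try:
        import triton  # noqa: F401
        return  # real triton is available; nothing to do
    except ImportError:
        pass

    triton_stub = types.ModuleType("triton")
    triton_stub.Config = _PlaceholderConfig  # what the traceback above trips over

    # Common submodules that other code may import.
    runtime_stub = types.ModuleType("triton.runtime")
    language_stub = types.ModuleType("triton.language")
    triton_stub.runtime = runtime_stub
    triton_stub.language = language_stub

    sys.modules["triton"] = triton_stub
    sys.modules["triton.runtime"] = runtime_stub
    sys.modules["triton.language"] = language_stub
```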
|
I feel this is more of an inductor issue. Shall we just turn off inductor by default when we detect a non-GPU platform? |
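To make that suggestion concrete, here is a rough sketch of the idea, not a verified fix: the helper name is made up, and whether an eager backend actually avoids the import chain in the traceback depends on the torch version.
```python
# Illustrative only: choose a torch.compile backend that avoids inductor
# (and therefore triton) when no GPU is present.
import torch


def pick_compile_backend() -> str:  # made-up helper name
    # "inductor" can pull in triton transitively; "eager" stays in plain PyTorch.
    return "inductor" if torch.cuda.is_available() else "eager"


@torch.compile(dynamic=True, backend=pick_compile_backend())
def _example(x: torch.Tensor) -> torch.Tensor:
    # Placeholder body standing in for the decorated function in
    # vllm/model_executor/layers/vocab_parallel_embedding.py.
    return x * 2
```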
|
cc: @zou3519 thoughts? |
|
#17317 also works for me on Apple silicon. That PR looks more mature and mentions inductor, so maybe this one isn't necessary? |
This PR overlaps with #17317. Also, as I mentioned in the other PR, I don't like inserting a dummy module into sys.modules["triton"] -- this is asking for trouble. What if triton changes, or what if a third-party library (torch) imports triton?
If we need a short-term fix we can figure something out, but IMO the right fix is to stop monkey-patching sys.modules["triton"].
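To make the concern concrete, a toy illustration of the failure mode (not vLLM code): any attribute the stub does not define breaks imports elsewhere, e.g. inside torch.
```python
import sys
import types

sys.modules["triton"] = types.ModuleType("triton")  # dummy module, defines nothing

try:
    from triton import Config  # what torch's inductor runtime effectively does
except ImportError as exc:
    # Essentially the error in the traceback above:
    # "cannot import name 'Config' from 'triton'"
    print(exc)
```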
|
closing in favor of #17317 |