BUG: Erroring out on Mac m1 with qwen-chat #328
Comments
Currently the qwen-chat model does not support quantization on macOS, so please set the quantization to none. What's more, the qwen-chat model requires a sufficiently recent macOS version.
Thanks for the info. I have macOS 13, but I'm running the apps in a Docker container, and setting the quantization attribute to none didn't help either.
Is your Docker container an Ubuntu system? If so, the model can only run on the CPU, but the CUDA device is used by default on Linux, which results in an error. (Automatic device selection based on the environment will be implemented in the next version.) It is recommended that you build a conda environment directly on the Mac to use xinference, so that the model can use the Mac's native device.
Yes, that's correct. I have an Ubuntu container on a Mac M1. Looking forward to the device selection feature.
@padamshrestha Hi, this issue has been resolved by #322 and #331, and the fix is now available in the latest release v0.1.3!
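For reference, a minimal sketch of launching the model with quantization disabled through the xinference Python client, as suggested above. The import path, endpoint, and `launch_model` parameters are assumptions based on the v0.1.x API; adjust them to your installed version.

```python
# Minimal sketch, assuming the xinference RESTful client API of the v0.1.x releases.
# The endpoint and parameter names below are assumptions; check your installed version.
from xinference.client import Client

client = Client("http://127.0.0.1:9997")  # default endpoint printed at startup

# Launch qwen-chat without quantization, as recommended in the comments above.
model_uid = client.launch_model(
    model_name="qwen-chat",
    model_format="pytorch",
    model_size_in_billions=7,
    quantization="none",
)
print(f"Launched model with uid: {model_uid}")
```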
Describe the bug
Erroring out on Mac m1 with qwen-chat
Upon running with:
qwen-chat, pytorch, 7, 4-bit
root@b75edd526665:/app# xinference
INFO:xinference:Xinference successfully started. Endpoint: http://127.0.0.1:9997
INFO:xinference.core.supervisor:Worker 127.0.0.1:20355 has been added successfully
INFO:xinference.deploy.worker:Xinference worker successfully started.
Fetching 31 files: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 31/31 [00:00<00:00, 73.92steps/s]
/usr/local/lib/python3.8/dist-packages/bitsandbytes/cextension.py:34: UserWarning: The installed version of bitsandbytes was compiled without GPU support. 8-bit optimizers, 8-bit multiplication, and GPU quantization are unavailable.
warn("The installed version of bitsandbytes was compiled without GPU support. "
'NoneType' object has no attribute 'cadam32bit_grad_fp32'
INFO:torch.distributed.nn.jit.instantiator:Created a temporary directory at /tmp/tmpdguicm39
INFO:torch.distributed.nn.jit.instantiator:Writing /tmp/tmpdguicm39/_remote_module_non_scriptable.py
Traceback (most recent call last):
File "/usr/local/lib/python3.8/dist-packages/gradio/routes.py", line 442, in run_predict
output = await app.get_blocks().process_api(
File "/usr/local/lib/python3.8/dist-packages/gradio/blocks.py", line 1392, in process_api
result = await self.call_function(
File "/usr/local/lib/python3.8/dist-packages/gradio/blocks.py", line 1097, in call_function
prediction = await anyio.to_thread.run_sync(
File "/usr/local/lib/python3.8/dist-packages/anyio/to_thread.py", line 33, in run_sync
return await get_asynclib().run_sync_in_worker_thread(
File "/usr/local/lib/python3.8/dist-packages/anyio/_backends/_asyncio.py", line 877, in run_sync_in_worker_thread
return await future
File "/usr/local/lib/python3.8/dist-packages/anyio/_backends/_asyncio.py", line 807, in run
result = context.run(func, *args)
File "/usr/local/lib/python3.8/dist-packages/gradio/utils.py", line 703, in wrapper
response = f(*args, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/gradio/utils.py", line 703, in wrapper
response = f(*args, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/xinference/core/gradio.py", line 333, in select_model
model_uid = self._create_model(
File "/usr/local/lib/python3.8/dist-packages/xinference/core/gradio.py", line 60, in _create_model
return self._api.launch_model(
File "/usr/local/lib/python3.8/dist-packages/xinference/core/api.py", line 110, in launch_model
return self._isolation.call(_launch_model())
File "/usr/local/lib/python3.8/dist-packages/xinference/isolation.py", line 44, in call
return fut.result()
File "/usr/lib/python3.8/concurrent/futures/_base.py", line 444, in result
return self.__get_result()
File "/usr/lib/python3.8/concurrent/futures/_base.py", line 389, in __get_result
raise self._exception
File "/usr/local/lib/python3.8/dist-packages/xinference/core/api.py", line 100, in _launch_model
await supervisor_ref.launch_builtin_model(
File "xoscar/core.pyx", line 288, in __pyx_actor_method_wrapper
File "xoscar/core.pyx", line 422, in _handle_actor_result
File "xoscar/core.pyx", line 465, in _run_actor_async_generator
File "xoscar/core.pyx", line 466, in xoscar.core._BaseActor._run_actor_async_generator
File "xoscar/core.pyx", line 471, in xoscar.core._BaseActor._run_actor_async_generator
File "/usr/local/lib/python3.8/dist-packages/xinference/core/supervisor.py", line 165, in launch_builtin_model
model_ref = yield worker_ref.launch_builtin_model(
File "xoscar/core.pyx", line 476, in xoscar.core._BaseActor._run_actor_async_generator
File "xoscar/core.pyx", line 396, in _handle_actor_result
File "xoscar/core.pyx", line 284, in __pyx_actor_method_wrapper
File "xoscar/core.pyx", line 287, in xoscar.core.__pyx_actor_method_wrapper
File "/usr/local/lib/python3.8/dist-packages/xinference/core/utils.py", line 25, in wrapped
ret = await func(*args, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/xinference/core/worker.py", line 183, in launch_builtin_model
await model_ref.load()
File "/usr/local/lib/python3.8/dist-packages/xoscar/backends/context.py", line 227, in send
return self._process_result_message(result)
File "/usr/local/lib/python3.8/dist-packages/xoscar/backends/context.py", line 102, in _process_result_message
raise message.as_instanceof_cause()
File "/usr/local/lib/python3.8/dist-packages/xoscar/backends/pool.py", line 657, in send
result = await self._run_coro(message.message_id, coro)
File "/usr/local/lib/python3.8/dist-packages/xoscar/backends/pool.py", line 368, in _run_coro
return await coro
File "/usr/local/lib/python3.8/dist-packages/xoscar/api.py", line 306, in on_receive
return await super().on_receive(message) # type: ignore
File "xoscar/core.pyx", line 545, in on_receive
File "xoscar/core.pyx", line 515, in xoscar.core._BaseActor.on_receive
File "xoscar/core.pyx", line 516, in xoscar.core._BaseActor.on_receive
File "xoscar/core.pyx", line 519, in xoscar.core._BaseActor.on_receive
File "/usr/local/lib/python3.8/dist-packages/xinference/core/model.py", line 86, in load
self._model.load()
File "/usr/local/lib/python3.8/dist-packages/xinference/model/llm/pytorch/core.py", line 189, in load
self._model, self._tokenizer = self._load_model(kwargs)
File "/usr/local/lib/python3.8/dist-packages/xinference/model/llm/pytorch/core.py", line 126, in _load_model
model = AutoModelForCausalLM.from_pretrained(
File "/usr/local/lib/python3.8/dist-packages/transformers/models/auto/auto_factory.py", line 488, in from_pretrained
return model_class.from_pretrained(
File "/usr/local/lib/python3.8/dist-packages/transformers/modeling_utils.py", line 2842, in from_pretrained
raise ValueError(
ValueError: [address=127.0.0.1:42575, pid=12626]
Some modules are dispatched on the CPU or the disk. Make sure you have enough GPU RAM to fit
the quantized model. If you want to dispatch the model on the CPU or the disk while keeping
these modules in 32-bit, you need to set load_in_8bit_fp32_cpu_offload=True and pass a custom
device_map to from_pretrained. Check
https://huggingface.co/docs/transformers/main/en/main_classes/quantization#offload-between-cpu-and-gpu
for more details.
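For context, a sketch of the CPU-offload workaround the error message points to, following the linked Hugging Face docs. The model id and the module names in device_map are illustrative, and this path still needs a CUDA-capable GPU for the quantized layers, so it does not apply to the Docker-on-M1 setup in this issue.

```python
# Sketch of the CPU-offload workaround described in the error message and the linked
# Hugging Face docs. Module names in device_map are illustrative; this still requires
# a CUDA GPU for the 8-bit layers, so it is not a fix for the Docker-on-M1 setup.
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

quantization_config = BitsAndBytesConfig(
    load_in_8bit=True,
    llm_int8_enable_fp32_cpu_offload=True,  # keep CPU-offloaded modules in fp32
)

device_map = {
    "transformer.word_embeddings": 0,             # on GPU 0
    "transformer.word_embeddings_layernorm": 0,
    "lm_head": "cpu",                             # offloaded to CPU in 32-bit
    "transformer.h": 0,
    "transformer.ln_f": 0,
}

model = AutoModelForCausalLM.from_pretrained(
    "bigscience/bloom-1b7",        # example model used in the linked docs
    device_map=device_map,
    quantization_config=quantization_config,
)
```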