
BUG: Erroring out on Mac m1 with qwen-chat #328

Closed
padamshrestha opened this issue Aug 8, 2023 · 5 comments
Labels
bug, gpu
Milestone
v0.2.0

Comments

@padamshrestha

Describe the bug

Erroring out on Mac m1 with qwen-chat

Upon running with

qwen-chat, pytorch, 7, 4-bit

root@b75edd526665:/app# xinference
INFO:xinference:Xinference successfully started. Endpoint: http://127.0.0.1:9997
INFO:xinference.core.supervisor:Worker 127.0.0.1:20355 has been added successfully
INFO:xinference.deploy.worker:Xinference worker successfully started.
Fetching 31 files: 100%|██████████| 31/31 [00:00<00:00, 73.92steps/s]
/usr/local/lib/python3.8/dist-packages/bitsandbytes/cextension.py:34: UserWarning: The installed version of bitsandbytes was compiled without GPU support. 8-bit optimizers, 8-bit multiplication, and GPU quantization are unavailable.
warn("The installed version of bitsandbytes was compiled without GPU support. "
'NoneType' object has no attribute 'cadam32bit_grad_fp32'
INFO:torch.distributed.nn.jit.instantiator:Created a temporary directory at /tmp/tmpdguicm39
INFO:torch.distributed.nn.jit.instantiator:Writing /tmp/tmpdguicm39/_remote_module_non_scriptable.py
Traceback (most recent call last):
File "/usr/local/lib/python3.8/dist-packages/gradio/routes.py", line 442, in run_predict
output = await app.get_blocks().process_api(
File "/usr/local/lib/python3.8/dist-packages/gradio/blocks.py", line 1392, in process_api
result = await self.call_function(
File "/usr/local/lib/python3.8/dist-packages/gradio/blocks.py", line 1097, in call_function
prediction = await anyio.to_thread.run_sync(
File "/usr/local/lib/python3.8/dist-packages/anyio/to_thread.py", line 33, in run_sync
return await get_asynclib().run_sync_in_worker_thread(
File "/usr/local/lib/python3.8/dist-packages/anyio/_backends/_asyncio.py", line 877, in run_sync_in_worker_thread
return await future
File "/usr/local/lib/python3.8/dist-packages/anyio/_backends/_asyncio.py", line 807, in run
result = context.run(func, *args)
File "/usr/local/lib/python3.8/dist-packages/gradio/utils.py", line 703, in wrapper
response = f(*args, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/gradio/utils.py", line 703, in wrapper
response = f(*args, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/xinference/core/gradio.py", line 333, in select_model
model_uid = self._create_model(
File "/usr/local/lib/python3.8/dist-packages/xinference/core/gradio.py", line 60, in _create_model
return self._api.launch_model(
File "/usr/local/lib/python3.8/dist-packages/xinference/core/api.py", line 110, in launch_model
return self._isolation.call(_launch_model())
File "/usr/local/lib/python3.8/dist-packages/xinference/isolation.py", line 44, in call
return fut.result()
File "/usr/lib/python3.8/concurrent/futures/_base.py", line 444, in result
return self.__get_result()
File "/usr/lib/python3.8/concurrent/futures/_base.py", line 389, in __get_result
raise self._exception
File "/usr/local/lib/python3.8/dist-packages/xinference/core/api.py", line 100, in _launch_model
await supervisor_ref.launch_builtin_model(
File "xoscar/core.pyx", line 288, in __pyx_actor_method_wrapper
File "xoscar/core.pyx", line 422, in _handle_actor_result
File "xoscar/core.pyx", line 465, in _run_actor_async_generator
File "xoscar/core.pyx", line 466, in xoscar.core._BaseActor._run_actor_async_generator
File "xoscar/core.pyx", line 471, in xoscar.core._BaseActor._run_actor_async_generator
File "/usr/local/lib/python3.8/dist-packages/xinference/core/supervisor.py", line 165, in launch_builtin_model
model_ref = yield worker_ref.launch_builtin_model(
File "xoscar/core.pyx", line 476, in xoscar.core._BaseActor._run_actor_async_generator
File "xoscar/core.pyx", line 396, in _handle_actor_result
File "xoscar/core.pyx", line 284, in __pyx_actor_method_wrapper
File "xoscar/core.pyx", line 287, in xoscar.core.__pyx_actor_method_wrapper
File "/usr/local/lib/python3.8/dist-packages/xinference/core/utils.py", line 25, in wrapped
ret = await func(*args, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/xinference/core/worker.py", line 183, in launch_builtin_model
await model_ref.load()
File "/usr/local/lib/python3.8/dist-packages/xoscar/backends/context.py", line 227, in send
return self._process_result_message(result)
File "/usr/local/lib/python3.8/dist-packages/xoscar/backends/context.py", line 102, in _process_result_message
raise message.as_instanceof_cause()
File "/usr/local/lib/python3.8/dist-packages/xoscar/backends/pool.py", line 657, in send
result = await self._run_coro(message.message_id, coro)
File "/usr/local/lib/python3.8/dist-packages/xoscar/backends/pool.py", line 368, in _run_coro
return await coro
File "/usr/local/lib/python3.8/dist-packages/xoscar/api.py", line 306, in on_receive
return await super().on_receive(message) # type: ignore
File "xoscar/core.pyx", line 545, in on_receive
File "xoscar/core.pyx", line 515, in xoscar.core._BaseActor.on_receive
File "xoscar/core.pyx", line 516, in xoscar.core._BaseActor.on_receive
File "xoscar/core.pyx", line 519, in xoscar.core._BaseActor.on_receive
File "/usr/local/lib/python3.8/dist-packages/xinference/core/model.py", line 86, in load
self._model.load()
File "/usr/local/lib/python3.8/dist-packages/xinference/model/llm/pytorch/core.py", line 189, in load
self._model, self._tokenizer = self._load_model(kwargs)
File "/usr/local/lib/python3.8/dist-packages/xinference/model/llm/pytorch/core.py", line 126, in _load_model
model = AutoModelForCausalLM.from_pretrained(
File "/usr/local/lib/python3.8/dist-packages/transformers/models/auto/auto_factory.py", line 488, in from_pretrained
return model_class.from_pretrained(
File "/usr/local/lib/python3.8/dist-packages/transformers/modeling_utils.py", line 2842, in from_pretrained
raise ValueError(
ValueError: [address=127.0.0.1:42575, pid=12626]
Some modules are dispatched on the CPU or the disk. Make sure you have enough GPU RAM to fit
the quantized model. If you want to dispatch the model on the CPU or the disk while keeping
these modules in 32-bit, you need to set load_in_8bit_fp32_cpu_offload=True and pass a custom
device_map to from_pretrained. Check
https://huggingface.co/docs/transformers/main/en/main_classes/quantization#offload-between-cpu-and-gpu
for more details.
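
For reference, the CPU-offload workaround the ValueError points at looks roughly like the sketch below when calling transformers directly (xinference builds this call internally, so this is only illustrative; the kwarg is named llm_int8_enable_fp32_cpu_offload in recent transformers releases, and the model id and device_map here are placeholders):

```python
# Sketch of the offloading workaround mentioned in the error message, assuming
# a direct transformers call. Kwarg names may differ across transformers versions.
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

quant_config = BitsAndBytesConfig(
    load_in_8bit=True,
    llm_int8_enable_fp32_cpu_offload=True,  # keep offloaded modules in fp32 on the CPU
)

model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen-7B-Chat",              # model id used here only for illustration
    quantization_config=quant_config,
    device_map="auto",                # or a custom dict mapping modules to "cpu"/GPU ids
    trust_remote_code=True,
)
```

Note that this only helps when a CUDA device is actually present; it would not make bitsandbytes quantization work on an M1 Mac or inside a CPU-only container.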

@XprobeBot XprobeBot added the gpu label Aug 8, 2023
@XprobeBot XprobeBot added this to the v0.2.0 milestone Aug 8, 2023
@UranusSeven UranusSeven changed the title Erroring out on Mac m1 with qwen-chat BUG: Erroring out on Mac m1 with qwen-chat Aug 8, 2023
@XprobeBot XprobeBot added the bug Something isn't working label Aug 8, 2023
@pangyoki
Contributor

pangyoki commented Aug 8, 2023

Currently, the qwen-chat model does not support quantization on macOS, so please set the quantization attribute to none.

What's more, the qwen-chat model requires macOS 13, because the model uses torch.sort, and the Half dtype for that operator is only supported on macOS 13.
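
Launching without quantization via the Python client would look roughly like this (a sketch: it assumes the server from the log above is reachable at 127.0.0.1:9997, and the parameter names follow the xinference client API of that time):

```python
# Minimal sketch: launch qwen-chat without quantization through the xinference client.
from xinference.client import Client

client = Client("http://127.0.0.1:9997")
model_uid = client.launch_model(
    model_name="qwen-chat",
    model_format="pytorch",
    model_size_in_billions=7,
    quantization="none",  # skip 4-bit/8-bit quantization, which is unsupported on macOS here
)
model = client.get_model(model_uid)
```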

@padamshrestha
Author

Thanks for the info. I have macOS 13, but I'm running the apps in a Docker container, and setting the quantization attribute to none didn't help either.

@pangyoki
Contributor

pangyoki commented Aug 8, 2023

Is your Docker container running Ubuntu? If so, the model can only run on the CPU, but the CUDA device is used by default on Linux, which causes this error. (Automatic device selection based on the environment will be implemented in the next version.)

It is recommended that you build a conda environment directly on the Mac to use xinference, so that you can use the MPS backend; running the model on the CPU alone is very slow.
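
As a quick sanity check that the MPS backend is actually usable in a given environment (it will not be inside a Linux container), something like this works with a standard PyTorch install:

```python
# Quick check for the Apple MPS backend; inside an Ubuntu/arm64 container this
# prints False and PyTorch falls back to the CPU.
import torch

print("MPS available:", torch.backends.mps.is_available())
print("MPS built:", torch.backends.mps.is_built())
```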

@padamshrestha
Author

Yes, that's correct. I have an Ubuntu container on a Mac M1. Looking forward to the device selection feature.

@UranusSeven
Contributor

@padamshrestha Hi, this issue has been resolved by #322 and #331, and the fix is available in the latest release, v0.1.3!
