
RuntimeError: Unknown platform: darwin #6

Closed
LeeeSe opened this issue Mar 14, 2023 · 45 comments

@LeeeSe

LeeeSe commented Mar 14, 2023

Mac support?

@qizhanchan

qizhanchan commented Mar 14, 2023

Seems it's built with CUDA kernels, so no Mac support.

@TengLi931128

MacBook here, got the same error.

@rjk-git

rjk-git commented Mar 14, 2023

+1

@Sengxian
Contributor

Currently, quantization requires the use of custom CUDA kernels. Without quantization, inference should be able to support Macs with FP16 precision. I will fix this issue later.

@wangsouc

+1

@SirLYC

SirLYC commented Mar 15, 2023

Please notify us when it's fixed. Thanks!

@JokerQyou

JokerQyou commented Mar 15, 2023

You can actually hack your way around it and run the demo on a macOS M1 machine. You only need to change two lines:

model = AutoModel.from_pretrained("THUDM/chatglm-6b", trust_remote_code=True).half().cuda()

Change .half().cuda() to .float()

https://huggingface.co/THUDM/chatglm-6b/blob/9d1509a1ade49240535535aa020232c8a4b1c114/modeling_chatglm.py#L1155

Remove this line. (The file resides in the Hugging Face cache directory, by default ~/.cache/huggingface/.)

Notice:

  • It runs purely on the CPU, so inference is relatively slow (~30 seconds for a question like "你是谁?")
  • Memory usage is doubled (relative to the total model size), so expect to use around 24–25 GB.

In the long term I'd expect the authors to optimize it for Apple Silicon chips (using mps for better performance).
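For reference, a minimal sketch of what the demo's loading code looks like after the first change (assuming the stock web_demo.py quoted above; the cached modeling_chatglm.py edit still has to be made by hand):

from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("THUDM/chatglm-6b", trust_remote_code=True)
# FP32 on CPU instead of FP16 on CUDA
model = AutoModel.from_pretrained("THUDM/chatglm-6b", trust_remote_code=True).float()
model = model.eval()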

@Raphtalia-chann

But the problem still exists, and I get this message:
Backend MacOSX is interactive backend. Turning interactive mode on.
How do I turn it on?

@duzx16
Member

duzx16 commented Mar 15, 2023

The model implementation without quantization does support macOS. The error occurs because when you load a remote repo, transformers checks all the libraries imported in modeling_chatglm.py, including cpm_kernel, which doesn't support macOS. But in fact, you don't need to import cpm_kernel if you don't use quantization.
Since I don't have access to change either transformers or cpm_kernel, I can't fix this on my side. But the solution on the user side is also simple: just clone the repo locally (git clone https://huggingface.co/THUDM/chatglm-6b) and replace "THUDM/chatglm-6b" with your local repo path.
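For example, a sketch of the suggested workaround (the local path ./chatglm-6b below is simply whatever directory you cloned into):

# after: git clone https://huggingface.co/THUDM/chatglm-6b ./chatglm-6b
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("./chatglm-6b", trust_remote_code=True)
model = AutoModel.from_pretrained("./chatglm-6b", trust_remote_code=True).float()  # no quantization, CPU FP32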


@mice33

mice33 commented Mar 16, 2023

Has anyone been able to successfully run it on a Mac?

@jeffwcx

jeffwcx commented Mar 16, 2023

@mice33 I succeeded.

@jeffwcx

jeffwcx commented Mar 16, 2023

Hardware specifications are as follows:

MacBook Pro
Model Identifier: Mac14,10
Chip: Apple M2 Pro
Total Number of Cores: 12 (8 performance and 4 efficiency)
Memory: 32 GB

My steps are as follows:

1. Download the model and modify the code

brew install git-lfs
# Clone to the `chatglm` folder
git clone https://huggingface.co/THUDM/chatglm-6b chatglm
cd chatglm
git lfs install
# This will take a long time
git lfs pull

Modify the modeling_chatglm.py file and comment out the following two lines:

--- a/modeling_chatglm.py
+++ b/modeling_chatglm.py
@@ -1166,6 +1166,6 @@ class ChatGLMForConditionalGeneration(ChatGLMPreTrainedModel):
         return torch.tensor(return_seqs, dtype=torch.long, device=kwargs['input_ids'].device)
 
     def quantize(self, bits: int):
-        from .quantization import quantize
-        self.transformer = quantize(self.transformer, bits)
+        # from .quantization import quantize
+        # self.transformer = quantize(self.transformer, bits)

2. Modify the demo code

 git clone https://github.com/THUDM/ChatGLM-6B.git
 cd ChatGLM-6B
 pip install -r requirements.txt
 pip install gradio

Modify web_demo.py

--- a/web_demo.py
+++ b/web_demo.py
@@ -1,8 +1,8 @@
 from transformers import AutoModel, AutoTokenizer
 import gradio as gr
 
-tokenizer = AutoTokenizer.from_pretrained("THUDM/chatglm-6b", trust_remote_code=True)
-model = AutoModel.from_pretrained("THUDM/chatglm-6b", trust_remote_code=True).half().cuda()
+tokenizer = AutoTokenizer.from_pretrained("../chatglm", trust_remote_code=True)
+model = AutoModel.from_pretrained("../chatglm", trust_remote_code=True).float()
 model = model.eval()

then

python web_demo.py

After successful execution, it will display:

 running on local URL:  http://127.0.0.1:7860
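As an optional check before launching the web demo, the locally cloned weights can be exercised directly from the ChatGLM-6B repo root (a sketch; the chat() helper comes from the model's custom modeling code, so its signature may differ between revisions):

from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("../chatglm", trust_remote_code=True)
model = AutoModel.from_pretrained("../chatglm", trust_remote_code=True).float().eval()

response, history = model.chat(tokenizer, "你好", history=[])
print(response)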

@noahzark

Succeeded using @jeffwcx's solution, thanks for sharing. I cloned the Hugging Face repo into the repositories folder. The modification in step 1 is not necessary, since we use float() directly and never call the quantize() function locally, so cpm_kernel won't be loaded.

But as mentioned above, pure CPU is really slow; looking forward to an mps solution, so the model would be pretty "portable" (lol).

The first round takes 25 seconds and the second round takes nearly 4 minutes.


MacBook Pro
Model Identifier: Mac14,10
Chip: Apple M2 Max (12c32g)


@noahzark

Tried to force the MPS backend and got NotImplementedError: Unknown device for graph fuser. It seems MPS is not a fusable device.


@chaucerling

chaucerling commented Mar 17, 2023

# install the nightly build of PyTorch
pip3 install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/cpu

import torch
from transformers import AutoModel, AutoTokenizer

# use the mps backend
mps_device = torch.device("mps")
tokenizer = AutoTokenizer.from_pretrained("../chatglm", trust_remote_code=True)
model = AutoModel.from_pretrained("../chatglm", trust_remote_code=True).half()
model.to(mps_device)

The model can be initialized, but an error occurs when generating output. It appears that some of the required PyTorch operators may only be available on macOS 13.3 Beta. I will try again later and hope that it runs successfully.
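For anyone experimenting along these lines, a hedged device-selection sketch that falls back to CPU when the MPS backend is unavailable (torch.backends.mps.is_available() needs a reasonably recent PyTorch build):

import torch
from transformers import AutoModel, AutoTokenizer

# prefer Apple's Metal (MPS) backend when available, otherwise run on CPU
device = torch.device("mps" if torch.backends.mps.is_available() else "cpu")
dtype = torch.half if device.type == "mps" else torch.float32

tokenizer = AutoTokenizer.from_pretrained("../chatglm", trust_remote_code=True)
model = AutoModel.from_pretrained("../chatglm", trust_remote_code=True).to(dtype).to(device).eval()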

@chaucerling

chaucerling commented Mar 17, 2023

After upgrading to macOS 13.3 Beta 4, I ran the CLI demo but got the error NotImplementedError: Unknown device for graph fuser. This may be because the JIT fusion settings are not compatible with MPS, so I commented out the JIT settings in modeling_chatglm.py at lines 34–37.

# flags required to enable jit fusion kernels
# torch._C._jit_set_profiling_mode(False)
# torch._C._jit_set_profiling_executor(False)
# torch._C._jit_override_can_fuse_on_cpu(True)
# torch._C._jit_override_can_fuse_on_gpu(True)

After running the code again, I got another error: invalid type: 'torch.mps.HalfTensor' at line 274. It can be fixed by editing line 268:

dtype = attention_scores.dtype  # returns torch.float16

After fixing the above errors, the code runs without errors, but no output is generated even after waiting a long time. I still need some help to identify the issue.

python: 3.10.9
torch: 2.1.0.dev20230317
transformers: 4.26.1

@icowan

icowan commented Mar 21, 2023

(quoting @jeffwcx's step-by-step instructions above)

How many GB does git lfs pull download? My disk is almost full.

@Erickrus

Erickrus commented Mar 21, 2023

My Mac has 16 GB RAM and 12 cores. Starting from the ubuntu:18.04 image, I built a docker image chatglm-6b-int4:0.1, set the docker CPU and RAM limits, and then started the container:
docker run -d --memory-swap -1 -v `pwd`:/workspace chatglm-6b-int4:0.1
It starts; the CPU is maxed out and a lot of swap gets used. The speed is painfully slow, but it does produce results.
Good luck.

@jeffwcx

jeffwcx commented Mar 21, 2023

@icowan About 26 GB.

@taozhijiang

Has anyone else noticed that the CPU work isn't parallelized? Only one Python process maxes out a single core while the other cores sit idle...

@fxkgwlzm

Setup: 12900K with 32 GB RAM, Windows 7.

It runs, but slowly. Tolerable.

@taozhijiang

Using the method above, it does run on macOS, but it cannot be parallelized.

import cpm_kernels
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/xxx/miniconda3/envs/pytorch-1.7.0/lib/python3.8/site-packages/cpm_kernels/__init__.py", line 1, in <module>
    from . import library
  File "/Users/xxx/miniconda3/envs/pytorch-1.7.0/lib/python3.8/site-packages/cpm_kernels/library/__init__.py", line 1, in <module>
    from . import nvrtc
  File "/Users/xxx/miniconda3/envs/pytorch-1.7.0/lib/python3.8/site-packages/cpm_kernels/library/nvrtc.py", line 5, in <module>
    nvrtc = Lib("nvrtc")
  File "/Users/xxx/miniconda3/envs/pytorch-1.7.0/lib/python3.8/site-packages/cpm_kernels/library/base.py", line 59, in __init__
    raise RuntimeError("Unknown platform: %s" % sys.platform)
RuntimeError: Unknown platform: darwin

Without parallelization only one core is busy, so it can't be fast; it's really just for playing around...
PS: is cpm-kernels not open source?

@JasonChenJC

"Our model implementation does support macOS"

I cloned it locally and downloaded the model, but with the local model path it still errors out.

@duzx16
Member

duzx16 commented Apr 1, 2023

"Our model implementation does support macOS"

"I cloned it locally and downloaded the model, but with the local model path it still errors out."

What is the full error message?

@jussker

jussker commented Apr 4, 2023

On an M2 you can try changing the device the model uses in web_demo.py.
To install PyTorch for Apple Silicon, see https://developer.apple.com/metal/pytorch/

import torch
...
# model = AutoModel.from_pretrained("THUDM/chatglm-6b", trust_remote_code=True).half().cuda()
# use PyTorch's MPS acceleration
model = AutoModel.from_pretrained("./chatglm", trust_remote_code=True).quantize(8).half().to('mps')
...

@chang-zy

chang-zy commented Apr 7, 2023

git lfs pull
Could not pull anything.

@shenzhenjiaszj

Same here, git lfs pull had nothing to pull.

@dakele895

Apple M1
Traceback (most recent call last):
  File "/Users/dukangle/code/gpt/ChatGLM-6B/web_demo.py", line 7, in <module>
    model = AutoModel.from_pretrained("THUDM/chatglm-6b-int4",trust_remote_code=True).float()
  File "/opt/homebrew/lib/python3.10/site-packages/transformers/models/auto/auto_factory.py", line 462, in from_pretrained
    model_class = get_class_from_dynamic_module(
  File "/opt/homebrew/lib/python3.10/site-packages/transformers/dynamic_module_utils.py", line 388, in get_class_from_dynamic_module
    final_module = get_cached_module_file(
  File "/opt/homebrew/lib/python3.10/site-packages/transformers/dynamic_module_utils.py", line 299, in get_cached_module_file
    get_cached_module_file(
  File "/opt/homebrew/lib/python3.10/site-packages/transformers/dynamic_module_utils.py", line 269, in get_cached_module_file
    modules_needed = check_imports(resolved_module_file)
  File "/opt/homebrew/lib/python3.10/site-packages/transformers/dynamic_module_utils.py", line 134, in check_imports
    importlib.import_module(imp)
  File "/opt/homebrew/Cellar/python@3.10/3.10.7/Frameworks/Python.framework/Versions/3.10/lib/python3.10/importlib/__init__.py", line 126, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 1050, in _gcd_import
  File "<frozen importlib._bootstrap>", line 1027, in _find_and_load
  File "<frozen importlib._bootstrap>", line 1006, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 688, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 883, in exec_module
  File "<frozen importlib._bootstrap>", line 241, in _call_with_frames_removed
  File "/opt/homebrew/lib/python3.10/site-packages/cpm_kernels/__init__.py", line 1, in <module>
    from . import library
  File "/opt/homebrew/lib/python3.10/site-packages/cpm_kernels/library/__init__.py", line 1, in <module>
    from . import nvrtc
  File "/opt/homebrew/lib/python3.10/site-packages/cpm_kernels/library/nvrtc.py", line 5, in <module>
    nvrtc = Lib("nvrtc")
  File "/opt/homebrew/lib/python3.10/site-packages/cpm_kernels/library/base.py", line 59, in __init__
    raise RuntimeError("Unknown platform: %s" % sys.platform)
RuntimeError: Unknown platform: darwin

@duzx16
Member

duzx16 commented Apr 8, 2023

@dakele895 #6 (comment)

@chang-zy

chang-zy commented Apr 8, 2023

An earlier answer said you need to run git lfs pull; on my machine it worked without that step. I followed the top-voted answer for everything else.

@dakele895


Inference is really slow.

@chch1970

chch1970 commented Apr 9, 2023

For "cpm_kernels/library/base.py RuntimeError: Unknown platform: darwin":
replace the line #59

raise RuntimeError("Unknown platform: %s" % sys.platform)

with

self.__lib = None

@juleskuehn

I can run the quantized models on CPU on a MacBook Air M2 (the GPU does not work). It is very slow, though!

I followed the instructions to install the PyTorch nightly build, but I don't think this is really necessary since we aren't using the mps backend:

conda install pytorch torchvision torchaudio -c pytorch-nightly

Clone the Huggingface repo for the quantized model (e.g. in the parent directory of this repo). You need git-lfs.

brew install git-lfs
git lfs install
git clone https://huggingface.co/THUDM/chatglm-6b-int4

Back in this repo, I had to install some additional requirements:

pip install cchardet chardet gradio

Then, to run the quantized model on CPU, just change line 6 in web_demo.py to:

model = AutoModel.from_pretrained("../chatglm-6b-int4", trust_remote_code=True).float()

@TabithaES

TabithaES commented Apr 9, 2023

My Mac has 16 GB RAM and 12 cores. Starting from the ubuntu:18.04 image, I built a docker image chatglm-6b-int4:0.1, set the docker CPU and RAM limits, and then started the container: docker run -d --memory-swap -1 -v `pwd`:/workspace chatglm-6b-int4:0.1. It starts; the CPU is maxed out and a lot of swap gets used. The speed is painfully slow, but it does produce results. Good luck.

$ docker run -d --memory-swap -1 -v pwd:/workspace chatglm-6b-int4:0.1
Unable to find image 'chatglm-6b-int4:0.1' locally
docker: Error response from daemon: pull access denied for chatglm-6b-int4, repository does not exist or may require 'docker login': denied: requested access to the resource is denied.

Help?

Hooray! https://old.reddit.com/r/selfhosted/comments/11u0sot/chatglm_an_opensource_selfhosted_dialogue/ + https://github.com/hyperonym/basaran/blob/master/README.md

No luck... 😢

$ docker logs -f peakji92-chatglm-6b
Loading checkpoint shards: 100%|██████████| 8/8 [00:05<00:00,  1.35it/s]
start listening on 0.0.0.0:80
ERROR:waitress:Exception while serving /v1/completions
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/waitress/channel.py", line 428, in service
    task.service()
  File "/usr/local/lib/python3.8/dist-packages/waitress/task.py", line 168, in service
    self.execute()
  File "/usr/local/lib/python3.8/dist-packages/waitress/task.py", line 456, in execute
    for chunk in app_iter:
  File "/usr/local/lib/python3.8/dist-packages/werkzeug/wsgi.py", line 500, in __next__
    return self._next()
  File "/usr/local/lib/python3.8/dist-packages/werkzeug/wrappers/response.py", line 50, in _iter_encoded
    for item in iterable:
  File "/app/basaran/__main__.py", line 163, in stream
    for choice in stream_model(**options):
  File "/app/basaran/model.py", line 73, in __call__
    for (
  File "/app/basaran/model.py", line 237, in generate
    outputs = self.model(
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1423, in _call_impl
    return forward_call(*input, **kwargs)
  File "/root/.cache/huggingface/modules/transformers_modules/model/modeling_chatglm.py", line 1160, in forward
    transformer_outputs = self.transformer(
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1423, in _call_impl
    return forward_call(*input, **kwargs)
  File "/root/.cache/huggingface/modules/transformers_modules/model/modeling_chatglm.py", line 973, in forward
    layer_ret = layer(
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1423, in _call_impl
    return forward_call(*input, **kwargs)
  File "/root/.cache/huggingface/modules/transformers_modules/model/modeling_chatglm.py", line 611, in forward
    attention_input = self.input_layernorm(hidden_states)
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1423, in _call_impl
    return forward_call(*input, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/normalization.py", line 190, in forward
    return F.layer_norm(
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/functional.py", line 2515, in layer_norm
    return torch.layer_norm(input, normalized_shape, weight, bias, eps, torch.backends.cudnn.enabled)
RuntimeError: "LayerNormKernelImpl" not implemented for 'Half'

@Erickrus

docker: Error response from daemon: pull access denied for chatglm-6b-int4, repository does not exist or may require 'docker login': denied: requested access to the resource is denied.

As I mentioned, you need to build the docker image yourself from scratch (starting from ubuntu:18.04).

@pchancellor

The model implementation without quantization does support MacOS. The error occurs because when you load a remote repo, transformers will check all the libraries imported in modeling_chatglm.py, including cpm_kernel, which doesn't support MacOS. But in fact, you don't need to import cpm_kernel if you don't use quantization. Since I don't have access to change either transformers or cpm_kernel, I can't fix it by my side. But the solution on the user side is also simple: just clone the repo locally (git clone https://huggingface.co/THUDM/chatglm-6b) and replace "THUDM/chatglm-6b" with your local repo path.


Running the chatglm-6b-int4 model on a Mac M1: I already switched to the local directory, but at run time quantization.py is still loaded, and importing cpm_kernels fails. How can I solve this?
Traceback (most recent call last):
  File "/Users/pengxu/.cache/huggingface/modules/transformers_modules/chatglm-6b-int4/quantization.py", line 18, in <module>
    from cpm_kernels.kernels.base import LazyKernelCModule, KernelFunction, round_up
  File "/opt/homebrew/lib/python3.11/site-packages/cpm_kernels/__init__.py", line 1, in <module>
    from . import library
  File "/opt/homebrew/lib/python3.11/site-packages/cpm_kernels/library/__init__.py", line 1, in <module>
    from . import nvrtc
  File "/opt/homebrew/lib/python3.11/site-packages/cpm_kernels/library/nvrtc.py", line 5, in <module>
    nvrtc = Lib("nvrtc")
           ^^^^^^^^^^^^
  File "/opt/homebrew/lib/python3.11/site-packages/cpm_kernels/library/base.py", line 59, in __init__
    raise RuntimeError("Unknown platform: %s" % sys.platform)
RuntimeError: Unknown platform: darwin

@QqqingYuan

QqqingYuan commented Apr 25, 2023

Mac users can modify chatglm-6b-int4/quantization.py to fix these errors:

  1. cpm_kernel does not support macOS yet, so delete lines 17 to 43 and keep only
    kernels = None

  2. Apple's gcc does not support OpenMP; you can install llvm and libomp:
    brew install llvm libomp
    and modify the gcc compiler options at line 140. But there are still bugs and it doesn't work with OpenMP, so
    just comment out or delete lines 140 to 154.

@kannayshu

Mac
3.1 GHz dual-core Intel Core i7
16 GB 1867 MHz DDR3
Intel Iris Graphics 6100 1536 MB
Even this fairly ordinary setup can run it.

Following the top-voted method from jeffwcx above, it worked. Thanks!
I used chatglm-6b-int4; the modeling_chatglm.py change mentioned in step 1 is not needed, but I did comment out lines 18 to 46 of modeling_chatglm.py (the kernels-related part).
For step 2, replace the model path in web_demo.py with the local directory where your model is stored; don't copy it verbatim.

@fangkuann

fangkuann commented May 8, 2023

On an M2 you can try changing the device the model uses in web_demo.py. To install PyTorch for Apple Silicon, see https://developer.apple.com/metal/pytorch/

import torch
...
# model = AutoModel.from_pretrained("THUDM/chatglm-6b", trust_remote_code=True).half().cuda()
# use PyTorch's MPS acceleration
model = AutoModel.from_pretrained("./chatglm", trust_remote_code=True).quantize(8).half().to('mps')
...

Does this MPS approach only work on M2 machines? On an M1 with 16 GB RAM, I'm stuck forever at Loading checkpoint shards: 100%.

MODEL_PATH = "../chatglm-6b"  # no quantization
model = AutoModel.from_pretrained(MODEL_PATH, trust_remote_code=True).quantize(8).half().to('mps')

Dependencies

  • python 3.9.6
  • torch 2.1.0.dev20230508
  • transformers 4.27.1
  • MacOS 13.3 (22E252)

@icowan

icowan commented May 10, 2023

(quoting @jeffwcx's step-by-step instructions above)

Thanks! M2, 64 GB. It runs; the speed is okay, the output comes out a few words at a time.

@Doubledongli

Error: curl: (7) Failed to connect to raw.githubusercontent.com port 443: Connection refused

@BryanOne

(quoting @fangkuann's comment above)

Same here: M1, 16 GB, same error. I suspect it's running out of memory.

@chenbt-hz

(quoting @jeffwcx's step-by-step instructions above)

I pulled the model this week; following these steps, chatglm-6b-int4 would not run, while chatglm-6b works fine.
Environment:
macOS 12.4
Python 3.10.6

I finally got it running with the following changes (the warning is still there, but adding the ignore_mismatched_sizes=True parameter makes it run):
web_demo.py:
tokenizer = AutoTokenizer.from_pretrained("./chatglm-6b-int4", trust_remote_code=True)
model = AutoModel.from_pretrained("./chatglm-6b-int4", trust_remote_code=True, ignore_mismatched_sizes=True).float()
./chatglm-6b-int4/modeling_chatglm.py:
def quantize(self, bits: int, quantize_embeddings=False, use_quantization_cache=False, empty_init=False, **kwargs):
    if bits == 0:
        return

    # from .quantization import quantize, QuantizedEmbedding, QuantizedLinear, load_cpu_kernel

    # if self.quantized:
    #     if self.device == torch.device("cpu"):
    #         logger.info("Already quantized, reloading cpu kernel.")
    #         load_cpu_kernel(**kwargs)
    #     else:
    #         logger.info("Already quantized.")
    #     return self

    self.quantized = True

    self.config.quantization_bit = bits
    self.config.quantization_embeddings = quantize_embeddings

    # self.transformer = quantize(self.transformer, bits, use_quantization_cache=use_quantization_cache, empty_init=empty_init, **kwargs)

    if self.device == torch.device("cpu"):
        dtype = torch.float32
    else:
        dtype = torch.half

    # if quantize_embeddings:
    #     logger.info("Applying quantization to embeddings")
    #     self.transformer.word_embeddings = QuantizedEmbedding(
    #         weight_bit_width=bits,
    #         weight_tensor=self.transformer.word_embeddings.weight.to(self.device),
    #         num_embeddings=self.transformer.word_embeddings.num_embeddings,
    #         embedding_dim=self.transformer.word_embeddings.embedding_dim,
    #         dtype=dtype,
    #         empty_init=empty_init,
    #         device=self.transformer.word_embeddings.weight.device,
    #     )
    #     self.lm_head = QuantizedLinear(
    #         weight_bit_width=bits,
    #         weight_tensor=self.lm_head.weight.to(self.device),
    #         bias_tensor=None,
    #         in_features=self.lm_head.in_features,
    #         out_features=self.lm_head.out_features,
    #         bias=False,
    #         quantized_weight=self.transformer.word_embeddings.weight,
    #         quantized_weight_scale=self.transformer.word_embeddings.weight_scale,
    #         dtype=dtype,
    #         empty_init=empty_init,
    #         device=self.lm_head.weight.device,
    #     )

    return self

@quhongwei

(quoting @kannayshu's comment above)

Could you paste exactly what lines 18 to 46 contain?

@championcp

Is there a definitive solution to this issue yet?
