Merge pull request #1 from yfyang86/mac-m1-dev
[Document] Update Mac deployment
yfyang86 authored May 3, 2023
2 parents baec759 + b13f1a6 commit 16b3a43
Showing 2 changed files with 119 additions and 2 deletions.
60 changes: 58 additions & 2 deletions README.md
@@ -191,14 +191,70 @@ model = AutoModel.from_pretrained("THUDM/chatglm-6b-int4",trust_remote_code=True

If you encounter the error `Could not find module 'nvcuda.dll'` or `RuntimeError: Unknown platform: darwin` (MacOS), please [load the model locally](README.md#从本地加载模型).
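
For reference, downloading the model files for local loading can be done with Git LFS; this is only a sketch, and the exact steps are described in the linked section:

```bash
# Fetch the model weights with Git LFS (requires git-lfs to be installed).
git lfs install
git clone https://huggingface.co/THUDM/chatglm-6b-int4
```

Then point `AutoModel.from_pretrained(...)` at the cloned directory instead of the `THUDM/chatglm-6b-int4` hub id.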

### CPU Deployment and Acceleration on Mac

Loading the quantized model directly on a Mac runs, but only on a single core, because macOS lacks OpenMP (`omp`) by default. You may see errors like:

```sh
clang: error: unsupported option '-fopenmp'
clang: error: unsupported option '-fopenmp'
```

Taking the quantized model [chatglm-6b-int4](https://huggingface.co/THUDM/chatglm-6b-int4) as an example, the following configuration is needed:

1. Install `libomp`;
2. Configure the `gcc` compile options.

```bash
# Step 1: see `https://mac.r-project.org/openmp/`
## Assuming `gcc -v` reports version 14.x; for other versions see the table provided by the R-Project
curl -O https://mac.r-project.org/openmp/openmp-14.0.6-darwin20-Release.tar.gz
sudo tar fvxz openmp-14.0.6-darwin20-Release.tar.gz -C /
## This installs the following files:
# usr/local/lib/libomp.dylib
# usr/local/include/ompt.h
# usr/local/include/omp.h
# usr/local/include/omp-tools.h
```
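
As an optional sanity check (not part of the original instructions), confirm that the library and headers actually landed in `/usr/local` before recompiling the kernels:

```bash
# Should list both files; if not, re-run the tar step above.
ls -l /usr/local/lib/libomp.dylib /usr/local/include/omp.h
```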

For `chatglm-6b-int4`, modify [quantization.py](https://huggingface.co/THUDM/chatglm-6b-int4/blob/main/quantization.py): change the hard-coded `gcc -O3 -fPIC -pthread -fopenmp -std=c99` command to `gcc -O3 -fPIC -Xclang -fopenmp -pthread -lomp -std=c99`. The [corresponding code](https://huggingface.co/THUDM/chatglm-6b-int4/blob/63d66b0572d11cedd5574b38da720299599539b3/quantization.py#L168) is shown below:

```python
# Step 2
## Find the line containing `gcc -O3 -fPIC -pthread -fopenmp -std=c99`
## and change it to:
compile_command = "gcc -O3 -fPIC -Xclang -fopenmp -pthread -lomp -std=c99 {} -shared -o {}".format(source_code, kernel_file)
```

For compatibility, this can also be written as:
```python
## Add an import near the top of the file
import platform
## ...
## then change the corresponding part above to (note the indentation):
if platform.uname()[0] == 'Darwin':
    compile_command = "gcc -O3 -fPIC -Xclang -fopenmp -pthread -lomp -std=c99 {} -shared -o {}".format(
        source_code, kernel_file)
else:
    compile_command = "gcc -O3 -fPIC -pthread -fopenmp -std=c99 {} -shared -o {}".format(
        source_code, kernel_file)
```
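
With the edit in place, a quick end-to-end check looks roughly like the sketch below; the local path is an assumption, and the CPU kernels are recompiled the first time the model is loaded:

```python
from transformers import AutoTokenizer, AutoModel

# Hypothetical local directory containing the downloaded int4 model files.
local_path = "./chatglm-6b-int4"
tokenizer = AutoTokenizer.from_pretrained(local_path, trust_remote_code=True)
# .float() keeps inference on the CPU; with OpenMP working, it should now use multiple cores.
model = AutoModel.from_pretrained(local_path, trust_remote_code=True).float().eval()
response, history = model.chat(tokenizer, "你好", history=[])
print(response)
```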

> Note: if a previous run failed, it is best to clear the Huggingface cache first, i.e. `rm -rf ${HOME}/.cache/huggingface/modules/transformers_modules/chatglm-6b-int4`. Since this uses `rm`, make sure you know exactly what you are deleting.

### GPU Acceleration on Mac
For Macs (and MacBooks) with Apple Silicon, the MPS backend can be used to run ChatGLM-6B on the GPU. Follow Apple's [official instructions](https://developer.apple.com/metal/pytorch) to install PyTorch-Nightly.
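
Before switching the code over, it may help to confirm that the installed PyTorch build actually exposes MPS; a small check using PyTorch's standard flags:

```python
import torch

print(torch.backends.mps.is_available())  # True if this machine/OS supports MPS
print(torch.backends.mps.is_built())      # True if this PyTorch build was compiled with MPS support
```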

Currently, only [loading the model locally](README.md#从本地加载模型) is supported on MacOS. Change the model loading in the code to load from a local path, and use the mps backend:
```python
model = AutoModel.from_pretrained("your local path", trust_remote_code=True).half().to('mps')
```
The model can then run GPU-accelerated inference on the Mac. If an error related to `half` appears (for example on MacOS 13.3.x), change it to:
```python
model = AutoModel.from_pretrained("your local path", trust_remote_code=True).float().to('mps')
```

> Note: the method above runs without issue for the non-quantized model. For running the quantized model on an MPS device, see [this issue](https://github.com/THUDM/ChatGLM-6B/issues/462); the problem mainly comes from the [kernel](https://huggingface.co/THUDM/chatglm-6b/blob/658202d88ac4bb782b99e99ac3adff58b4d0b813/quantization.py#L27) — unpacking this `ELF` file shows that it is a CUDA implementation.

### Multi-GPU Deployment
If you have multiple GPUs, but no single GPU has enough memory to hold the full model, you can split the model across multiple GPUs. First install accelerate: `pip install accelerate`, then load the model as follows:
61 changes: 61 additions & 0 deletions README_en.md
@@ -188,15 +188,76 @@ model = AutoModel.from_pretrained("THUDM/chatglm-6b-int4", trust_remote_code=Tru

If you encounter the error `Could not find module 'nvcuda.dll'` or `RuntimeError: Unknown platform: darwin` (MacOS), please [load the model locally](README_en.md#load-the-model-locally).
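
A minimal sketch of what local loading can look like — the `./chatglm-6b-int4` path is only an example, and the linked section describes how to download the files:

```python
from transformers import AutoTokenizer, AutoModel

# Hypothetical local directory containing the downloaded model files.
local_path = "./chatglm-6b-int4"
tokenizer = AutoTokenizer.from_pretrained(local_path, trust_remote_code=True)
model = AutoModel.from_pretrained(local_path, trust_remote_code=True)
```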


### CPU Deployment on Mac

The default Mac environment does not support OpenMP. One may encounter warnings/errors such as the following when executing the `AutoModel.from_pretrained(...)` command:

```sh
clang: error: unsupported option '-fopenmp'
clang: error: unsupported option '-fopenmp'
```

Take the quantized int4 version [chatglm-6b-int4](https://huggingface.co/THUDM/chatglm-6b-int4) as an example; the following extra steps are needed:

1. Install `libomp`;
2. Configure `gcc`.

```bash
# STEP 1: install libomp, check `https://mac.r-project.org/openmp/` for details
## Assumption: `gcc -v` reports version 14.x; read the R-Project page before running the code:
curl -O https://mac.r-project.org/openmp/openmp-14.0.6-darwin20-Release.tar.gz
sudo tar fvxz openmp-14.0.6-darwin20-Release.tar.gz -C /
## Four files are installed:
# usr/local/lib/libomp.dylib
# usr/local/include/ompt.h
# usr/local/include/omp.h
# usr/local/include/omp-tools.h
```
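
Optionally, verify that the four files listed above are actually present before moving on:

```bash
# All four paths should exist after the tar step above.
ls -l /usr/local/lib/libomp.dylib /usr/local/include/omp.h /usr/local/include/ompt.h /usr/local/include/omp-tools.h
```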

For `chatglm-6b-int4`, modify the [quantization.py](https://huggingface.co/THUDM/chatglm-6b-int4/blob/main/quantization.py) file: change the `gcc -O3 -fPIC -pthread -fopenmp -std=c99` configuration to `gcc -O3 -fPIC -Xclang -fopenmp -pthread -lomp -std=c99` ([corresponding python code](https://huggingface.co/THUDM/chatglm-6b-int4/blob/63d66b0572d11cedd5574b38da720299599539b3/quantization.py#L168)), i.e.:

```python
# STEP 2
## Change the line containing `gcc -O3 -fPIC -pthread -fopenmp -std=c99` to:
compile_command = "gcc -O3 -fPIC -Xclang -fopenmp -pthread -lomp -std=c99 {} -shared -o {}".format(source_code, kernel_file)
```

For production code, one could use the `platform` library to make this neater:

```python
## import platform just after `import os`
import platform
## ...
## change the corresponding lines to:
if platform.uname()[0] == 'Darwin':
    compile_command = "gcc -O3 -fPIC -Xclang -fopenmp -pthread -lomp -std=c99 {} -shared -o {}".format(
        source_code, kernel_file)
else:
    compile_command = "gcc -O3 -fPIC -pthread -fopenmp -std=c99 {} -shared -o {}".format(
        source_code, kernel_file)
```
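
After the edit, reloading the model triggers a recompilation of the CPU kernels; a rough smoke test (the local path and prompt are assumptions) could look like:

```python
from transformers import AutoTokenizer, AutoModel

local_path = "./chatglm-6b-int4"  # hypothetical local path to the int4 model files
tokenizer = AutoTokenizer.from_pretrained(local_path, trust_remote_code=True)
model = AutoModel.from_pretrained(local_path, trust_remote_code=True).float().eval()
# With OpenMP available, CPU inference should now use multiple cores.
response, history = model.chat(tokenizer, "Hello", history=[])
print(response)
```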

> Notice: if a previous run of the `chatglm` project failed, you may want to clear the Huggingface cache before the next try, i.e. `rm -rf ${HOME}/.cache/huggingface/modules/transformers_modules/chatglm-6b-int4`. Since `rm` is used, please MAKE SURE that the command deletes the right files.

### GPU Inference on Mac
For Macs (and MacBooks) with Apple Silicon, it is possible to use the MPS backend to run ChatGLM-6B on the GPU. First, you need to refer to Apple's [official instructions](https://developer.apple.com/metal/pytorch) to install PyTorch-Nightly.
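
To confirm that the PyTorch-Nightly build exposes the MPS backend, one can run a quick check using PyTorch's standard flags:

```python
import torch

print(torch.backends.mps.is_available())  # machine/OS support for MPS
print(torch.backends.mps.is_built())      # build-time MPS support
```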

Currently you must [load the model locally](README_en.md#load-the-model-locally) on MacOS. Change the code to load the model from your local path, and use the mps backend:
```python
model = AutoModel.from_pretrained("your local path", trust_remote_code=True).half().to('mps')
```

For Mac users on macOS 13.3 or later, errors related to the `half()` method may appear. Use `float()` instead:

```python
model = AutoModel.from_pretrained("your local path", trust_remote_code=True).float().to('mps')
```

Then you can use GPU-accelerated model inference on Mac.

> Notice: there is no problem running the non-quantized version of ChatGLM with MPS. To run the quantized version with MPS as the backend (which currently raises `ERRORS`), one can follow [this issue](https://github.com/THUDM/ChatGLM-6B/issues/462). Unpacking the [kernel](https://huggingface.co/THUDM/chatglm-6b/blob/658202d88ac4bb782b99e99ac3adff58b4d0b813/quantization.py#L27) as an `ELF` file shows that its backend is `cuda`.

### Multi-GPU Deployment
If you have multiple GPUs, but the memory size of each GPU is not sufficient to accommodate the entire model, you can split the model across multiple GPUs.

