[BUG/Help] INT-4 quantized model fails to load on Windows #162

Closed

OedoSoldier opened this issue Mar 19, 2023 · 9 comments

Comments

@OedoSoldier
Contributor

OedoSoldier commented Mar 19, 2023

Is there an existing issue for this?

  • I have searched the existing issues

Current Behavior

Compiling kernels : C:\Users\{}\.cache\huggingface\modules\transformers_modules\chatglm-6b-int4\quantization_kernels_parallel.c
Compiling gcc -O3 -pthread -fopenmp -std=c99 C:\Users\{}\.cache\huggingface\modules\transformers_modules\chatglm-6b-int4\quantization_kernels_parallel.c -shared -o C:\Users\{}\.cache\huggingface\modules\transformers_modules\chatglm-6b-int4\quantization_kernels_parallel.so
Kernels compiled : C:\Users\{}\.cache\huggingface\modules\transformers_modules\chatglm-6b-int4\quantization_kernels_parallel.so
Traceback (most recent call last):
  File "{}\main.py", line 33, in <module>
    model = AutoModel.from_pretrained(args.path, trust_remote_code=True).float()
  File "C:\Users\{}\anaconda3\envs\lora\lib\site-packages\transformers\models\auto\auto_factory.py", line 466, in from_pretrained
    return model_class.from_pretrained(
  File "C:\Users\{}\anaconda3\envs\lora\lib\site-packages\transformers\modeling_utils.py", line 2498, in from_pretrained
    model = cls(config, *model_args, **model_kwargs)
  File "C:\Users\{}/.cache\huggingface\modules\transformers_modules\chatglm-6b-int4\modeling_chatglm.py", line 940, in __init__
    self.quantize(self.config.quantization_bit, self.config.quantization_embeddings, use_quantization_cache=True, empty_init=True)
  File "C:\Users\{}/.cache\huggingface\modules\transformers_modules\chatglm-6b-int4\modeling_chatglm.py", line 1277, in quantize
    self.transformer = quantize(self.transformer, bits, use_quantization_cache=use_quantization_cache, empty_init=empty_init, **kwargs)
  File "C:\Users\{}/.cache\huggingface\modules\transformers_modules\chatglm-6b-int4\quantization.py", line 399, in quantize
    load_cpu_kernel(**kwargs)
  File "C:\Users\{}/.cache\huggingface\modules\transformers_modules\chatglm-6b-int4\quantization.py", line 386, in load_cpu_kernel
    cpu_kernels = CPUKernel(**kwargs)
  File "C:\Users\{}/.cache\huggingface\modules\transformers_modules\chatglm-6b-int4\quantization.py", line 137, in __init__
    kernels = ctypes.CDLL(kernel_file, winmode=0)
  File "C:\Users\{}\anaconda3\envs\lora\lib\ctypes\__init__.py", line 374, in __init__
    self._handle = _dlopen(self._name, mode)
FileNotFoundError: Could not find module 'C:\Users\{}\.cache\huggingface\modules\transformers_modules\chatglm-6b-int4\quantization_kernels_parallel.so' (or one of its dependencies). Try using the full path with constructor syntax.

Expected Behavior

No response

Steps To Reproduce

When loading the INT-4 quantized model on Windows, the log reports that the CPU kernel compiled successfully, but the compiled kernel cannot be loaded. I checked that quantization_kernels_parallel.so was indeed compiled, and os.path.exists() on the file returns True.

Everything works fine under WSL.
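
(A minimal diagnostic sketch, not from the original report, that reproduces the check described above; the cache path and user name are illustrative. On Windows, "Could not find module ... (or one of its dependencies)" often means a dependent DLL, such as the MinGW OpenMP runtime, cannot be found even though the .so itself exists.)

import ctypes
import os

# Illustrative path; substitute your own user name / cache location.
kernel_file = os.path.expanduser(
    r"~\.cache\huggingface\modules\transformers_modules"
    r"\chatglm-6b-int4\quantization_kernels_parallel.so"
)

print("exists:", os.path.exists(kernel_file))  # reported as True in this issue

try:
    # Same call as quantization.py makes; if this raises FileNotFoundError
    # even though the file exists, a dependent DLL (e.g. libgomp) is likely
    # missing from the DLL search path.
    ctypes.CDLL(kernel_file, winmode=0)
    print("kernel loaded")
except OSError as exc:
    print("load failed:", exc)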

Environment

- OS: Windows 11
- Python: 3.10.6
- Transformers: 4.27.1
- PyTorch: 1.13.1+cu117
- CUDA Support (`python -c "import torch; print(torch.cuda.is_available())"`) : True

Anything else?

No response

@kenneth104

Not sure whether this helps the developers.
I'm trying to load the INT-4 model on CPU, but it reports that there is no CPU kernel.

Setup:
model = AutoModel.from_pretrained("THUDM/chatglm-6b-int4",trust_remote_code=True).float()

Launch command:
venv\Scripts\activate && streamlit run web_demo2.py --server.port 6006

Explicitly passing a `revision` is encouraged when loading a model with custom code to ensure no malicious code has been contributed in a newer revision.
Explicitly passing a `revision` is encouraged when loading a configuration with custom code to ensure no malicious code has been contributed in a newer revision.
Explicitly passing a `revision` is encouraged when loading a model with custom code to ensure no malicious code has been contributed in a newer revision.
No compiled kernel found.
Compiling kernels : C:\Users\username\.cache\huggingface\modules\transformers_modules\THUDM\chatglm-6b-int4\a93efa90f5b012b13a1197b9f47835b8ef1cc307\quantization_kernels_parallel.c
Compiling gcc -O3 -pthread -fopenmp -std=c99 C:\Users\username\.cache\huggingface\modules\transformers_modules\THUDM\chatglm-6b-int4\a93efa90f5b012b13a1197b9f47835b8ef1cc307\quantization_kernels_parallel.c -shared -o C:\Users\username\.cache\huggingface\modules\transformers_modules\THUDM\chatglm-6b-int4\a93efa90f5b012b13a1197b9f47835b8ef1cc307\quantization_kernels_parallel.so
Kernels compiled : C:\Users\username\.cache\huggingface\modules\transformers_modules\THUDM\chatglm-6b-int4\a93efa90f5b012b13a1197b9f47835b8ef1cc307\quantization_kernels_parallel.so
Cannot load cpu kernel, don't use quantized model on cpu.
Using quantization cache
Applying quantization to glm layers

@songxxzp
Collaborator

songxxzp commented Mar 20, 2023

Please make sure gcc and OpenMP are installed.
You can compile the kernels yourself (try quantization_kernels.c first; quantization_kernels_parallel.c requires OpenMP):

gcc -fPIC -std=c99 quantization_kernels.c -shared -o quantization_kernels.so
gcc -pthread -fopenmp -std=c99 quantization_kernels_parallel.c -shared -o quantization_kernels_parallel.so

Adding -O3 when compiling speeds things up considerably, but it can cause errors on some platforms, so add optimization flags as appropriate for your setup.
Then, after the model has been loaded as usual, manually load the kernel you compiled:

model = AutoModel.from_pretrained("THUDM/chatglm-6b-int4",trust_remote_code=True).float()
model = model.quantize(bits=4, kernel_file="Your Kernel Path")

My guess is that either the parallel kernel cannot be loaded because OpenMP is missing, or the path is complicated enough that ctypes does not handle it correctly.
Also, please check your antivirus software.
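
A minimal end-to-end sketch of the manual workaround above, assuming the INT-4 model repository has been cloned locally to ./chatglm-6b-int4 and that gcc is on the PATH (the directory name and paths are illustrative):

import os
import subprocess

from transformers import AutoModel

model_dir = "./chatglm-6b-int4"  # assumed local clone of the INT-4 model repo

# Build the single-threaded kernel first; it does not require OpenMP.
src = os.path.join(model_dir, "quantization_kernels.c")
kernel = os.path.abspath(os.path.join(model_dir, "quantization_kernels.so"))
subprocess.run(["gcc", "-fPIC", "-std=c99", src, "-shared", "-o", kernel], check=True)

# Load the model as before, then point quantize() at the hand-built kernel.
model = AutoModel.from_pretrained(model_dir, trust_remote_code=True).float()
model = model.quantize(bits=4, kernel_file=kernel)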

@kenneth104

@songxxzp

Thank you very much for your help. I switched to a Linux platform and it works now.

@fxb392

fxb392 commented Apr 3, 2023

No compiled kernel found.
Compiling kernels : /root/.cache/huggingface/modules/transformers_modules/local/quantization_kernels_parallel.c
Compiling gcc -O3 -fPIC -pthread -fopenmp -std=c99 /root/.cache/huggingface/modules/transformers_modules/local/quantization_kernels_parallel.c -shared -o /root/.cache/huggingface/modules/transformers_modules/local/quantization_kernels_parallel.so
Compile failed, using default cpu kernel code.
Compiling gcc -O3 -fPIC -std=c99 /root/.cache/huggingface/modules/transformers_modules/local/quantization_kernels.c -shared -o /root/.cache/huggingface/modules/transformers_modules/local/quantization_kernels.so
Kernels compiled : /root/.cache/huggingface/modules/transformers_modules/local/quantization_kernels.so
Cannot load cpu kernel, don't use quantized model on cpu.
Using quantization cache
Applying quantization to glm layers
Welcome to the ChatGLM-6B model. Type your message to chat, enter "clear" to clear the conversation history, or "stop" to exit the program.

Fixed it following @songxxzp's help in #183 (comment), thanks!
To summarize, the error was that kernel compilation failed. The fix:
(1) Compile manually, in the model path:

gcc -fPIC -pthread -fopenmp -std=c99 quantization_kernels_parallel.c -shared -o quantization_kernels_parallel.so
gcc -fPIC -pthread -fopenmp -std=c99 quantization_kernels.c -shared -o quantization_kernels.so

(2) Then, after loading the model as before, manually load the kernel you compiled:

model = AutoModel.from_pretrained("THUDM/chatglm-6b-int4",trust_remote_code=True).float()
model = model.quantize(bits=4, kernel_file="Your Kernel Path")

It still reports a compile error, but the model is now usable.

@sgb25sgb

It reports a compile error but can still be used; how does that happen?

@fxb392

fxb392 commented Apr 13, 2023

@sgb25sgb Loading the CPU kernel the default way failed, but loading it with model = model.quantize(bits=4, kernel_file="Your Kernel Path") succeeded.

@sgb25sgb

fxb392 Thank you!

@deapge

deapge commented Jul 17, 2023

Try this:
In the chatglm-6b-int4/quantization.py file you downloaded, search for the three lines that look like this:
kernels = ctypes.cdll.LoadLibrary(kernel_file)
and change each of them to kernels = ctypes.CDLL(kernel_file, winmode=0)

[Screenshot of the modification]
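
For reference, a sketch of how the edit above looks in quantization.py (surrounding code omitted); winmode=0 falls back to the classic Windows DLL search order, so dependent DLLs found via PATH, such as the MinGW runtime, can be resolved:

# Before (as shipped in some revisions of chatglm-6b-int4/quantization.py):
kernels = ctypes.cdll.LoadLibrary(kernel_file)

# After: load through ctypes.CDLL with winmode=0 on Windows.
kernels = ctypes.CDLL(kernel_file, winmode=0)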
