
Support --gpu-layers #45

Closed
lindeer opened this issue Nov 22, 2023 · 7 comments

Comments

@lindeer

lindeer commented Nov 22, 2023

Problem:

build/bin/main -m /app/ecr/models/qwen-7b-ggml/qwen7b-ggml.bin --tiktoken /app/ecr/models/qwen-7b-ggml/qwen.tiktoken -t 6 -i
ggml_init_cublas: found 1 CUDA devices:
  Device 0: NVIDIA GeForce GTX 1660, compute capability 7.5
Welcome to Qwen.cpp! Ask whatever you want. Type 'clear' to clear context. Type 'stop' to exit.

Prompt > 三国演义都有哪些人物?  (Which characters appear in Romance of the Three Kingdoms?)

CUDA error 2 at /home/ecr/projects/qwen.cpp/third_party/ggml/src/ggml-cuda.cu:7310: out of memory
current device: 0

It would be good to support setting a --gpu-layers value the way llama.cpp does, so the model can run on devices with insufficient VRAM.
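For reference, llama.cpp exposes this as -ngl / --n-gpu-layers. A minimal sketch of partial offloading (the model path and layer count here are placeholders, not from this issue):

# offload only 20 layers to the GPU; the remaining layers stay on the CPU,
# so a 6 GB card like the GTX 1660 is not exhausted
./main -m ./models/qwen7b.gguf -ngl 20 -t 6 -i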

@fann1993814

@lindeer
I'm not sure whether this PR will help you, since it has only been tested on Apple Metal.
But maybe you can give it a try? #41

@lindeer
Author

lindeer commented Jan 3, 2024

Same problem: #55

@lindeer
Author

lindeer commented Jan 3, 2024

Resolved by merging into llama.cpp.

@lindeer lindeer closed this as completed Jan 3, 2024
@cl886699

Resolved by merging into llama.cpp.

Hi, how did you do the merge? When I use llama.cpp to run Qwen inference, the output always has problems, and I can't figure out how to use multiple GPUs with this repository.

@lindeer
Author

lindeer commented Jan 26, 2024

@cl886699 What problem are you seeing? You normally only need to pass in a gpu-layers value when VRAM is insufficient; the option only matters when running on a GPU, and you need to compile llama.cpp as a GPU-enabled library.

@cl886699

@cl886699 What problem are you seeing? You normally only need to pass in a gpu-layers value when VRAM is insufficient; the option only matters when running on a GPU, and you need to compile llama.cpp as a GPU-enabled library.

I'm using llama.cpp build 1954 with a model converted by convert-hf-to-gguf.py. Inference fails like this, while the Yi model runs fine.
[screenshot of the inference error]
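For context, the conversion step in question usually looks roughly like this (a sketch; the model directory and output filename are assumptions, not taken from this thread):

# convert a Hugging Face checkpoint to GGUF with llama.cpp's converter
python convert-hf-to-gguf.py /path/to/Qwen-7B --outfile qwen-7b-f16.gguf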

@lindeer
Author

lindeer commented Feb 2, 2024

Just download an already-converted model from HF instead; your pipeline has too many intermediate steps, so there's no way to tell where the problem is. My guess is that you compiled llama.cpp without LLAMA_CUBLAS=on; only with that build option do you get a GPU-capable binary.
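For completeness, a minimal sketch of a GPU-enabled llama.cpp build from that era (assuming the CUDA toolkit is installed; newer llama.cpp versions have since renamed this flag):

# CMake build with cuBLAS support
cmake -B build -DLLAMA_CUBLAS=ON
cmake --build build --config Release

# or, with the Makefile build:
make LLAMA_CUBLAS=1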
