Replies: 4 comments 1 reply
-
Try using fewer threads.
-
You can specify a different number of threads to use during processing and generation. Please check blasthreads on the wiki.
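For example, something like this (a minimal sketch; I'm assuming the standard koboldcpp launch flags that correspond to the settings keys in the config posted later in this thread) keeps 8 threads for generation while using only 4 during BLAS prompt processing:

```
koboldcpp.exe --model E:/models/openhermes-2.5-mistral-7b.Q4_K_M.gguf --usecublas normal 0 --threads 8 --blasthreads 4
```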
-
Setting it to 'max threads' will slow it down, in my experience. Try something like four, or experiment with a few values (0, 2, 4, 6).
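If you launch from a settings file rather than the command line, that would mean changing only the BLAS thread count, e.g. (a sketch based on the config posted below; the value 4 is just this comment's suggestion, not a measured optimum):

```json
"threads": 8,
"blasthreads": 4
```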
-
I just noticed that koboldcpp 1.62 fixed this 🙂 So some newer commit in llama.cpp must have addressed this behavior again. For anyone reading this, it's very much worth updating for this bug fix.
-
Hi, basically what I noticed is that since version 1.48.1, the program uses all CPU cores during the initial prompt ingestion, with seemingly no performance benefit when using cuBLAS. With version 1.47.2, it only used a few CPU cores during prompt ingestion. Since the GPU does the primary work during initial prompt ingestion with cuBLAS, I have not seen any speed benefit from all CPU cores being used in the newer versions during that phase.
So I got curious: was a change made to use all cores anyway to speed the process up a bit, or is it a bug? Since the GPU is much faster and full CPU usage is only needed after prompt ingestion (when generating new text), it seems to me that this may be unintended, as it unnecessarily increases power usage. I also tested intentionally throttling my processor, and it barely changed the initial prompt ingestion speed on the GPU, which is why I find this behavior unusual.
Just curious of course; the program still works fine with this behavior. Thank you!
Some additional information:
Operating system: Windows 10 22H2
Model used: openhermes-2.5-mistral-7b.Q4_K_M.gguf
Settings from the settings file:
{"model": null, "model_param": "E:/models/openhermes-2.5-mistral-7b.Q4_K_M.gguf", "port": 5001, "port_param": 5001, "host": "", "launch": false, "lora": null, "config": null, "threads": 8, "blasthreads": 8, "highpriority": false, "contextsize": 4096, "blasbatchsize": 256, "ropeconfig": [0.0, 10000.0], "smartcontext": false, "noshift": false, "bantokens": null, "forceversion": 0, "nommap": false, "usemlock": false, "noavx2": false, "debugmode": 0, "skiplauncher": false, "hordeconfig": null, "noblas": false, "useclblast": null, "usecublas": ["normal", "0"], "gpulayers": 0, "tensor_split": null, "onready": "", "multiuser": false, "remotetunnel": false, "foreground": false}