Not sure if a GitHub issue is the right forum for this question, but I was wondering if it's possible to use the GPU for prompt ingestion. I have an AMD GPU, and with CLBlast I get about 3x faster ingestion on long prompts compared to the CPU.
However, a 12-thread CPU is around 30% faster than the GPU for inference.
I was wondering if I could combine the two so I can eat my cake and have it too!
> However, a 12-thread CPU is around 30% faster than the GPU for inference.
> I was wondering if I could combine the two so I can eat my cake and have it too!
That is (or should be) already the case! Can you tell us more about your setup, etc.?
Quoting from the README:
> Building the program with BLAS support may lead to some performance improvements in prompt processing using batch sizes higher than 32 (the default is 512). BLAS doesn't affect the normal generation performance.
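To make that concrete, here is a minimal, purely illustrative C sketch (not ggml's actual code) of why BLAS only matters for prompt ingestion: during ingestion many tokens go through the weight matrices in one batched matrix multiply, which can be handed to a BLAS/CLBlast backend, while generation evaluates a single token per step and stays on the plain CPU path. The threshold constant and function name below are made up for illustration.

```c
// Illustrative sketch only -- not ggml's actual source. It shows the kind of
// batch-size check that makes a BLAS path kick in for prompt processing but
// not for token-by-token generation.
#include <stdbool.h>
#include <stdio.h>

// Hypothetical threshold, mirroring the ">= 32" batch-size note in the README.
#define BLAS_MIN_BATCH 32

// n_tokens is how many tokens are evaluated in one call: the whole prompt
// (e.g. 512) during ingestion, but only 1 per step during generation.
static bool should_use_blas(int n_tokens) {
    return n_tokens >= BLAS_MIN_BATCH;
}

int main(void) {
    printf("prompt batch of 512 tokens -> BLAS path: %s\n",
           should_use_blas(512) ? "yes" : "no");
    printf("single generated token     -> BLAS path: %s\n",
           should_use_blas(1) ? "yes" : "no");
    return 0;
}
```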
It would be possible if the ggml executor could compute multiple graph nodes in parallel and choose which device to run each one on. Right now it can only split a single operation across multiple CPU threads.
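For illustration only, here is a rough C sketch of what such a backend-aware executor could look like. None of these types or functions exist in ggml today; the node names, device assignments, and scheduling loop are all hypothetical, and the "compute" is faked with printf just to show the scheduling shape.

```c
// Hypothetical sketch (not ggml's API): a tiny graph executor that picks a
// device per node and launches each node once its dependencies are done.
#include <stdbool.h>
#include <stdio.h>

#define MAX_NODES 8

typedef enum { DEV_CPU, DEV_GPU } device_t;

struct node {
    const char *name;
    device_t    device;          /* chosen by some cost model               */
    int         deps[MAX_NODES]; /* indices of prerequisite nodes           */
    int         n_deps;
    bool        done;
};

static void run_node(struct node *n) {
    printf("running %-8s on %s\n", n->name, n->device == DEV_GPU ? "GPU" : "CPU");
    n->done = true;
}

/* Minimal "executor": repeatedly launch every node whose deps are finished.
   A real one would hand CPU and GPU nodes to separate worker queues so the
   two devices overlap; here they run sequentially for simplicity. */
static void execute(struct node *nodes, int n) {
    int remaining = n;
    while (remaining > 0) {
        for (int i = 0; i < n; ++i) {
            if (nodes[i].done) continue;
            bool ready = true;
            for (int d = 0; d < nodes[i].n_deps; ++d)
                if (!nodes[nodes[i].deps[d]].done) ready = false;
            if (ready) { run_node(&nodes[i]); --remaining; }
        }
    }
}

int main(void) {
    /* Made-up graph: big matmul goes to the GPU, the rest stays on the CPU. */
    struct node g[] = {
        { "embed",   DEV_CPU, {0}, 0, false },
        { "matmul",  DEV_GPU, {0}, 1, false },
        { "softmax", DEV_CPU, {1}, 1, false },
    };
    execute(g, 3);
    return 0;
}
```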