
[Enhancement] Simultaneous CLBLAS/CUBLAS instances. #1494


Closed
AlphaAtlas opened this issue May 17, 2023 · 5 comments

@AlphaAtlas

AlphaAtlas commented May 17, 2023

Prerequisites

Please answer the following questions for yourself before submitting an issue.

  • [x] I am running the latest code. Development is very rapid so there are no tagged versions as of now.
  • [x] I carefully followed the README.md.
  • [x] I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
  • [x] I reviewed the Discussions, and have a new bug or useful enhancement to share.

Enhancement

If not already possible through a config I missed, would offloading some layers to CLBLAS and other layers to CUBLAS be viable? Or maybe offloading layers to multiple CLBLAS devices?

A common hardware config is a CPU with an IGP plus a discrete GPU, and this would allow the IGP to be utilized on systems with weak CPUs and low-VRAM dGPUs. Much more powerful four-channel IGPs are also rumored to be in development at Intel and AMD.

With the extra transfers and possible CPU memory-bandwidth starvation, this may or may not improve performance much... I'm not sure.
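To make the request concrete, here is a toy sketch of what a proportional layer-to-device assignment could look like. All names here (device labels, the `split_layers` helper) are hypothetical illustrations, not llama.cpp API:

```python
# Toy sketch: assign transformer layers to heterogeneous devices in
# proportion to user-supplied weights. Hypothetical helper, not
# llama.cpp API.

def split_layers(n_layers, device_weights):
    """Return a device label for each layer, proportional to weights.

    device_weights: e.g. {"cublas:0": 3.0, "clblast:0": 1.0}
    """
    total = sum(device_weights.values())
    devices = list(device_weights.items())
    assignment = []
    dev_idx = 0
    for layer in range(n_layers):
        # Advance to the next device once this device's share of layers
        # (its weight fraction) has been filled.
        while dev_idx < len(devices) - 1 and \
                (layer + 0.5) / n_layers >= \
                sum(w for _, w in devices[:dev_idx + 1]) / total:
            dev_idx += 1
        assignment.append(devices[dev_idx][0])
    return assignment

# A dGPU weighted 3x an IGP: 24 layers to CUBLAS, 8 to CLBlast.
plan = split_layers(32, {"cublas:0": 3.0, "clblast:0": 1.0})
print(plan.count("cublas:0"), plan.count("clblast:0"))  # 24 8
```

The weights play the same role as a user-visible "how much of the model goes to each device" knob; the open question in this issue is whether the runtime could dispatch the two contiguous layer ranges to two different BLAS backends.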

@deep-pipeline

I like this idea because many folks will be scraping together whatever RAM and whatever old, new, or mismatched GPU hardware they can find to maximise VRAM and throughput / model size (and having clarity of specification would help with this, as well as possible future things like chaining across machines).

Having a clear way of specifying which layers go to which device might also help with debugging code or performance problems on different GPUs, because anyone with both could compare relative throughput by switching different layers of the model to different devices and re-running a test.

@AlphaAtlas
Author

Also, while I am here, is simultaneous OpenBLAS/CUBLAS possible? I can't build with both at the same time, but it seems like OpenBLAS would be beneficial for the CPU-offloaded layers unless CUBLAS is already replicating that functionality.

@FNsi
Contributor

FNsi commented May 21, 2023

I don't think that will work well, though. The many copies between devices will simply reduce the speed.

@AlphaAtlas
Author

Hmmm, does CLBlast reduce generation speed on IGPs now?

I would think the transfers would be fine over 1 PCIe bus and to 1 IGP.
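A rough back-of-the-envelope calculation supports this. The dimensions below are assumed for a 7B-class model and the bandwidth figure is nominal, not measured, but they show that the per-token activations crossing one device boundary are tiny relative to PCIe bandwidth:

```python
# Back-of-envelope: cost of shipping activations across one device
# boundary per generated token. Assumed 7B-class dimensions and a
# nominal PCIe bandwidth figure, not measurements.

hidden_size = 4096       # assumed embedding width for a 7B-class model
bytes_per_val = 2        # fp16 activations
pcie_gb_per_s = 8        # assumed effective PCIe bandwidth (GB/s)

bytes_per_token = hidden_size * bytes_per_val        # 8192 bytes
transfer_s = bytes_per_token / (pcie_gb_per_s * 1e9)
print(f"{bytes_per_token} bytes, ~{transfer_s * 1e6:.1f} us "
      "per token per boundary")  # ~1 microsecond
```

At roughly a microsecond per token per boundary, the transfer itself would be negligible next to per-token compute; the real risk is more likely the IGP and CPU contending for the same memory bandwidth, as FNsi suggests.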

@github-actions github-actions bot added the stale label Mar 25, 2024
Contributor

github-actions bot commented Apr 9, 2024

This issue was closed because it has been inactive for 14 days since being marked as stale.
