
Use the same batch size threshold for enabling OpenBLAS and disabling ggml threading #577

Merged 1 commit into ggml-org:master on Mar 29, 2023

Conversation

@Piezoid (Contributor) commented Mar 28, 2023

Commit 4640eff disabled ggml's multi-threading when OpenBLAS is used for processing large prompts.
This avoids running two thread pools at the same time.

However, ggml uses OpenBLAS on tensors with dims >= 32, whereas llama.cpp only reduces the number of threads for batch sizes > 255.
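
For illustration, here is a minimal sketch of the two checks in C; the function and parameter names are assumptions for the sketch, not the real diff, and ggml's actual check also requires contiguous tensors:

```c
#include <stdbool.h>

/* ggml side: a mul_mat is routed to OpenBLAS once every relevant
   dimension reaches 32 (simplified from ggml's internal check). */
static bool mul_mat_use_blas(int ne0, int ne1, int ne10) {
    return ne0 >= 32 && ne1 >= 32 && ne10 >= 32;
}

/* llama.cpp side: pick the ggml thread count for a batch of N tokens.
   Before this PR, only batches of more than 255 tokens dropped to a
   single thread, leaving a window (32 <= N <= 255) in which the ggml
   thread pool and the OpenBLAS pool ran at the same time. */
static int pick_n_threads(int N, int n_threads, bool has_blas) {
    /* old behaviour: return N > 255 ? 1 : n_threads; */
    /* unified behaviour: reuse ggml's BLAS threshold */
    return (has_blas && N >= 32) ? 1 : n_threads;
}
```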

See also this discussion: #229 (reply in thread) and issue #578

@linouxis9

I confirm that your branch fixes the issue I had in the aforementioned discussion and in issue #578. Thank you @Piezoid!!

@ggerganov (Member)

@linouxis9 Does this improve the performance on your machine for processing the initial prompt when it is larger than 31 tokens and less than 256?

@linouxis9 commented Mar 28, 2023

It's slightly faster than no BLAS, @ggerganov (34s vs 40s for initial ingestion on llama-30B with the new chat example), but it heavily depends on the chosen number of threads and batch size. I'm also having a hard time finding the best parameters for evaluating the performance (number of BLAS threads, number of ggml threads, batch sizes...) and monitoring the speed of each run.
In any case, IMHO this PR is needed to fix the discrepancy between the checks in ggml and llama.cpp (to avoid having both ggml threads and BLAS threads spawned at the same time) for batch sizes b >= 32 and b <= 255, even until we find the proper threshold b for switching to BLAS.
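(One hedged example of such a sweep: time ingestion of the same prompt under different combinations of `OPENBLAS_NUM_THREADS`, `-t` (ggml threads), and `-b` (batch size), e.g. `OPENBLAS_NUM_THREADS=8 ./main -m models/30B/ggml-model-q4_0.bin -t 4 -b 64 -p "<prompt>"`; the model path and values here are placeholders, not recommended settings.)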

@rabidcopy (Contributor) commented Mar 28, 2023

This seemed to provide a small but noticeable bump in performance for me.

@ggerganov merged commit 41318d7 into ggml-org:master on Mar 29, 2023
@Piezoid deleted the oblas_thread_limit branch on Mar 29, 2023
AAbushady pushed a commit to AAbushady/llama.cpp that referenced this pull request on Jan 27, 2024.