
Use the same batch size threshold for enabling OpenBLAS and disabling ggml threading #577

Merged 1 commit into ggml-org:master on Mar 29, 2023

Conversation

@Piezoid (Contributor) commented Mar 28, 2023

Commit 4640eff disabled ggml's multi-threading when OpenBLAS is used for processing large prompts.
This avoids running two thread pools at the same time.

However, ggml uses OpenBLAS on tensors with dims >= 32, whereas llama.cpp only reduces the number of threads for batch sizes > 255.
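
For illustration, here is a minimal sketch of the two checks in C; the function and parameter names are assumptions for the sketch, not the real diff, and ggml's actual check also requires contiguous tensors:

```c
#include <stdbool.h>

/* ggml side: a mul_mat is routed to OpenBLAS once every relevant
   dimension reaches 32 (simplified from ggml's internal check). */
static bool mul_mat_use_blas(int ne0, int ne1, int ne10) {
    return ne0 >= 32 && ne1 >= 32 && ne10 >= 32;
}

/* llama.cpp side: pick the ggml thread count for a batch of N tokens.
   Before this PR, only batches of more than 255 tokens dropped to a
   single thread, leaving a window (32 <= N <= 255) in which the ggml
   thread pool and the OpenBLAS pool ran at the same time. */
static int pick_n_threads(int N, int n_threads, bool has_blas) {
    /* old behaviour: return N > 255 ? 1 : n_threads; */
    /* unified behaviour: reuse ggml's BLAS threshold */
    return (has_blas && N >= 32) ? 1 : n_threads;
}
```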

See also this discussion: #229 (reply in thread) and issue #578

@linouxis9

I confirm that your branch fixes the issue I had in the aforementioned discussion and in issue #578. Thank you @Piezoid!!

@ggerganov (Member)

@linouxis9 Does this improve the performance on your machine for processing the initial prompt when it is larger than 31 tokens and less than 256?

@linouxis9 commented Mar 28, 2023

It's slightly faster than no BLAS, @ggerganov (34s vs 40s for initial ingestion on llama-30B with the new chat example), but it heavily depends on the chosen number of threads and batch size. I'm also having a hard time finding the best parameters for evaluating the performance (number of BLAS threads, number of ggml threads, batch sizes...) and monitoring the speed of each run.
In any case, IMHO this PR is needed to fix the discrepancy between the checks in ggml and llama.cpp (to avoid having both ggml threads and BLAS threads spawned at the same time) for batch sizes b >= 32 and b <= 255, even until we find the proper threshold b for switching to BLAS.
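(One hedged example of such a sweep: time ingestion of the same prompt under different combinations of `OPENBLAS_NUM_THREADS`, `-t` (ggml threads), and `-b` (batch size), e.g. `OPENBLAS_NUM_THREADS=8 ./main -m models/30B/ggml-model-q4_0.bin -t 4 -b 64 -p "<prompt>"`; the model path and values here are placeholders, not recommended settings.)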

@rabidcopy (Contributor) commented Mar 28, 2023

This seemed to provide a small but noticeable bump in performance for me.

@ggerganov merged commit 41318d7 into ggml-org:master on Mar 29, 2023
@Piezoid deleted the oblas_thread_limit branch on Mar 29, 2023
AAbushady pushed a commit to AAbushady/llama.cpp that referenced this pull request on Jan 27, 2024.