
ggml : move LLAMAFILE/tinyBLAS into a backend #10183


Closed

ggerganov opened this issue Nov 5, 2024 · 5 comments
Labels
refactoring Refactoring

Comments

@ggerganov
Member

The LLAMAFILE SGEMM routines are currently called directly from within ggml-cpu.c based on compile-time conditionals:

https://github.com/ggerganov/llama.cpp/blob/a9e8a9a0306a8093eef93b0022d9f45510490072/ggml/src/ggml-cpu.c#L7454-L7481

In order to simplify the logic and reduce the coupling of the different BLAS implementations, the LLAMAFILE code should be moved into a ggml backend, similar to the other BLAS implementations.
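
For context, the shape of that coupling looks roughly like the sketch below. It is a minimal, self-contained illustration with hypothetical stand-ins (`fast_sgemm`, `generic_sgemm`, `mul_mat`); `GGML_USE_LLAMAFILE` is the real compile-time flag, but this is not the upstream code:

```c
// Simplified sketch of the compile-time dispatch pattern, not the upstream code.
#include <stdbool.h>
#include <stdio.h>

// generic fallback path (stand-in for the regular ggml CPU matmul kernels)
static void generic_sgemm(int m, int n, int k,
                          const float *A, const float *B, float *C) {
    for (int i = 0; i < m; i++) {
        for (int j = 0; j < n; j++) {
            float sum = 0.0f;
            for (int p = 0; p < k; p++) {
                sum += A[i*k + p] * B[p*n + j];
            }
            C[i*n + j] = sum;
        }
    }
}

// stand-in for the tinyBLAS entry point: returns false when it cannot handle
// the shape, and the caller then falls back to the generic path
static bool fast_sgemm(int m, int n, int k,
                       const float *A, const float *B, float *C) {
    if (m % 8 != 0 || n % 8 != 0 || k % 8 != 0) {
        return false; // pretend the fast path only supports aligned shapes
    }
    generic_sgemm(m, n, k, A, B, C); // a real kernel would be blocked/vectorized
    return true;
}

static void mul_mat(int m, int n, int k,
                    const float *A, const float *B, float *C) {
#ifdef GGML_USE_LLAMAFILE
    // compile-time coupling: the CPU matmul code calls the tinyBLAS path directly
    if (fast_sgemm(m, n, k, A, B, C)) {
        return;
    }
#endif
    generic_sgemm(m, n, k, A, B, C);
}

int main(void) {
    float A[4] = {1, 2, 3, 4};
    float B[4] = {5, 6, 7, 8};
    float C[4];
    mul_mat(2, 2, 2, A, B, C);
    printf("%g %g %g %g\n", C[0], C[1], C[2], C[3]); // 19 22 43 50
    return 0;
}
```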

Not sure if it has to be a new backend, or if we can move it into the existing ggml-blas backend - TBD.

ggerganov added the good first issue and refactoring labels Nov 5, 2024
ggerganov moved this to Todo in ggml : roadmap Nov 5, 2024
ggerganov removed the good first issue label Nov 5, 2024
@Djip007
Contributor

Djip007 commented Nov 9, 2024

I think I'll have a try at putting it in a new backend...

It does not use the standard sgemm API, nor does it handle threading internally.
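
For reference, the "standard sgemm API" mentioned above is the BLAS-style interface, e.g. `cblas_sgemm`. A minimal usage sketch is below (it assumes a CBLAS implementation such as OpenBLAS is available; link with `-lopenblas` or similar). tinyBLAS's entry point does not follow this interface and expects the caller to drive threading, which is why it does not drop straight into the existing ggml-blas backend:

```c
// Minimal cblas_sgemm usage example: C = 1.0 * A * B + 0.0 * C, row-major.
#include <cblas.h>
#include <stdio.h>

int main(void) {
    enum { M = 2, N = 2, K = 2 };
    float A[M*K] = {1, 2, 3, 4};  // row-major M x K
    float B[K*N] = {5, 6, 7, 8};  // row-major K x N
    float C[M*N] = {0};

    cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                M, N, K, 1.0f, A, K, B, N, 0.0f, C, N);

    printf("%g %g %g %g\n", C[0], C[1], C[2], C[3]); // 19 22 43 50
    return 0;
}
```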

@slaren
Member

slaren commented Nov 10, 2024

It may be better to keep it in the CPU backend to avoid the overhead of stopping and starting the threads that happens when switching to a different backend.

@Djip007
Contributor

Djip007 commented Nov 10, 2024

I'll have a look, but I don't think the threads have to be (or are?) started/stopped. They can be left in the thread pool.

I have created a backend that only computes matmul for an FP8 test and uses OpenMP threads internally; it was even faster than tinyBLAS. And that is the case for the other BLAS backends. (A rough sketch of that pattern is shown below, after this comment.)

Never mind, I'll give it a try to see how hard it is, and if I succeed we can benchmark it.
If it causes a slowdown, we can think about a ggml thread API to avoid it.

Update: I have looked at ggml_graph_compute and #1999 ... I need more time to get a complete view of the threading part. My first impression is that maybe we should move the thread provisioning out of the CPU backend and make it usable by other backends, but I haven't spent much time analyzing it.

Update: On the CPU "backend" (at least), BLAS and AMX only compute part of the graph and have their own thread management.
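
For illustration, the "threads managed inside the backend" idea mentioned earlier could look roughly like the sketch below. This is not the ggml backend API, just a minimal, self-contained example of a matmul routine that owns its parallelism via OpenMP (hypothetical `backend_mul_mat` name; compile with `-fopenmp`):

```c
// Minimal sketch of a matmul routine that manages its own parallelism with
// OpenMP, the way a BLAS-style backend can, instead of relying on the CPU
// backend's thread pool.
#include <omp.h>
#include <stdio.h>

static void backend_mul_mat(int m, int n, int k,
                            const float *A, const float *B, float *C) {
    // rows of C are split across OpenMP threads owned by this "backend"
    #pragma omp parallel for schedule(static)
    for (int i = 0; i < m; i++) {
        for (int j = 0; j < n; j++) {
            float sum = 0.0f;
            for (int p = 0; p < k; p++) {
                sum += A[i*k + p] * B[p*n + j];
            }
            C[i*n + j] = sum;
        }
    }
}

int main(void) {
    enum { M = 2, N = 2, K = 2 };
    float A[M*K] = {1, 2, 3, 4};
    float B[K*N] = {5, 6, 7, 8};
    float C[M*N];
    backend_mul_mat(M, N, K, A, B, C);
    printf("threads=%d -> %g %g %g %g\n",
           omp_get_max_threads(), C[0], C[1], C[2], C[3]);
    return 0;
}
```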

Djip007 mentioned this issue Nov 16, 2024
@ggerganov
Member Author

See discussion in #10343 (comment)

ggerganov closed this as not planned Nov 17, 2024
@Djip007
Contributor

Djip007 commented Nov 17, 2024

Should I start a Discussion to get a better idea of how to do it, or is there a better place to move forward on this topic?
