add llama_matmul_demo2_bf16.c with other parallelize experiment #586

Djip007 · 2024-10-12T23:11:50Z

You were not far from good speed.

But if you look at the BLIS paper, bloc_A needs to be kept in the L2 cache. On x86 CPUs (Zen ...) there is one L2 cache per core, so each block computed on a core needs its own bloc_A.

With this demo I get a speedup of 1.77x on my 8-core Zen 4 (AMD Ryzen 9 7940HS).

Note: there is more to do for best performance:

- B is broadcast, so I think it does not need to be transposed.
- bloc_B needs to be kept in the L3 cache. That is the case on my 8-core Zen 4, but not on Zen 4 parts with 16+ cores or on Zen 2 (one L3 per 4 cores...). For best results we could have one bloc_B per L3 cache.

Next, if N is big enough, we can parallelize on the first loop:

for (int j = ith * NC; j < N; j += NT) { 
    [...]
}

With this one, bloc_A is kept in the L2 cache, and bloc_B is shared and kept in the L3/L1 cache.
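A minimal sketch of how that outer loop could deal column blocks out to threads. It assumes `ith` is the thread index and `nth` the thread count, and that the `NT` stride in the snippet above stands for `nth * NC`. That reading is my assumption, not something the demo confirms. The helper name `columns_for_thread` is hypothetical, and the real per-block work is left as a comment.

```c
#include <assert.h>

#define NC 1024  /* column-block width; NR*64 with NR = 16 (example value) */

/* Returns how many columns of an N-column matrix thread `ith` of `nth`
 * would handle when column blocks of width NC are dealt out round-robin.
 * Hypothetical helper, only meant to illustrate the loop bounds. */
static int columns_for_thread(int N, int ith, int nth) {
    assert(nth > 0 && ith >= 0 && ith < nth);
    int cols = 0;
    /* Assumed stride: every thread skips over the blocks of the others. */
    for (int jc = ith * NC; jc < N; jc += nth * NC) {
        int nc = (N - jc < NC) ? (N - jc) : NC;  /* last block may be partial */
        /* ... per-block work on columns [jc, jc + nc) would go here ... */
        cols += nc;
    }
    return cols;
}
```

With N = 5000 and 3 threads, the five column blocks are split 2/2/1, and the last block only covers the remaining 904 columns. If N is smaller than `nth * NC`, some threads get no work, which is why this only pays off when N is big enough.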

Djip007 · 2024-10-12T23:22:08Z

I finally found some time to make some "comments" on this branch.

I did not change the existing code, I just added a demo to show what we can get.
It may not be perfectly clean, but I hope it is useful for this experiment.

Note: I use these 5 loops in my fp8 branch. But for best performance I repack A at weight load, so there is no need to do it in the compute part. However, that requires creating a completely new backend to have control over the backend_buffer...

Djip007 · 2024-10-13T14:07:50Z

One more advantage of using bloc_A/bloc_B: we can (if needed) do the dequantization in the pack_ step, which puts the data in a form suitable for the CPU GEMM compute.

That way it may be easy to accept more input types...
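To illustrate the idea, here is a rough sketch of a pack routine that converts bf16 to fp32 while it copies a block of A into MR-row panels. The function name `pack_A_bf16_to_f32`, the row-major input layout, and the zero padding of the last panel are my assumptions for the example, not the layout used in the demo file.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define MR 16  /* micro-kernel row count (example value) */

/* bf16 is the upper 16 bits of an IEEE-754 binary32 value. */
static float bf16_to_f32(uint16_t h) {
    uint32_t bits = (uint32_t)h << 16;
    float f;
    memcpy(&f, &bits, sizeof f);
    return f;
}

/* Hypothetical pack routine: reads an mc x kc block of row-major bf16
 * data (leading dimension lda) and writes fp32 panels of MR rows each.
 * Inside a panel, the MR values of one column are stored contiguously.
 * Rows past mc in the last panel are filled with zeros. */
static void pack_A_bf16_to_f32(const uint16_t *A, int lda,
                               int mc, int kc, float *bloc_A) {
    assert(mc > 0 && kc > 0 && lda >= kc);
    for (int i0 = 0; i0 < mc; i0 += MR) {
        for (int p = 0; p < kc; ++p) {
            for (int i = 0; i < MR; ++i) {
                int row = i0 + i;
                *bloc_A++ = (row < mc)
                    ? bf16_to_f32(A[(size_t)row * lda + p])
                    : 0.0f;
            }
        }
    }
}
```

Because the conversion happens inside the pack, the micro-kernel only ever sees fp32 panels. Supporting another input type would then mostly mean writing another pack routine.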

With some more testing I get the best speed on my CPU (x2.05) with:

#define MR 16
#define NR 16
#define MC (MR*16)
#define NC (NR*64)
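A quick way to check these tile sizes against the cache sizes of a given CPU is to compute the footprint of the packed blocks. This is only a sketch. The depth `kc` and the element size are not given above, so they are parameters here, and the helper names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define MR 16
#define NR 16
#define MC (MR * 16)  /* 256 rows per bloc_A */
#define NC (NR * 64)  /* 1024 columns per bloc_B */

/* Bytes used by one packed bloc_A (MC x kc elements). */
static size_t bloc_A_bytes(size_t kc, size_t elem_size) {
    assert(kc > 0 && elem_size > 0);
    return (size_t)MC * kc * elem_size;
}

/* Bytes used by one packed bloc_B (kc x NC elements). */
static size_t bloc_B_bytes(size_t kc, size_t elem_size) {
    assert(kc > 0 && elem_size > 0);
    return (size_t)NC * kc * elem_size;
}

/* Nonzero if a block of `bytes` fits in a cache of `cache_bytes`. */
static int fits_in_cache(size_t bytes, size_t cache_bytes) {
    return bytes <= cache_bytes;
}
```

For example, with an assumed `kc` of 256 and 4-byte elements, bloc_A takes 256 KiB and bloc_B takes 1 MiB. Those numbers can then be compared with the per-core L2 size and the shared L3 size of the target machine.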

add llama_matmul_demo2_bf16.c for other // experiment

86af076

github-actions bot added the llamafile label Oct 12, 2024
