
perf(ggml): tall and skinny GEMM for LoRA: F32 mul_mat([16 X 5120], [16 X 5120]) takes 120ms - 24x slower than expected #956

Closed
jon-chuang opened this issue Apr 14, 2023 · 13 comments

@jon-chuang (Contributor) commented Apr 14, 2023

Context:
LoRA requires computing $W' = W + \Delta W$ where $\Delta W = BA^T$ and $A,B$ are tall and skinny.

See #820 for the use-case.

Problem
The estimated cost of each of these matmuls is 16 * 5120 * 5120 ~= 0.419 GFLOP (counting one FLOP per multiply-add).

My setup sustains about 360 GFLOP/s for F16 mul_mat (OpenBLAS). That would imply a compute time of ~1.2 ms for F16, so F32 should be in the same ballpark.

The model's mul_mat([5120 X 5120], [5120 X 8]), with a cost of 8 * 5120 * 5120 ~= 0.210 GFLOP, takes 2.647 ms.

Scaling that measurement, we should be seeing roughly 5 ms, not 120 ms. So we are observing a ~24x deviation from the expected compute time.

The above observed time is roughly the same with or without OpenBLAS.


EDIT: with the optimizations described below, we are now at about 30-35 ms, so there is roughly a 6-7x gap left to close.
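
For reference, a quick sketch of the arithmetic behind this estimate (my own check, not ggml code). It simply scales the measured 2.647 ms model mul_mat by the ratio of multiply-add counts:

#include <stdio.h>

// Back-of-the-envelope check of the numbers above (illustrative sketch, not ggml code).
// "FLOP" here counts one FLOP per multiply-add, matching the convention in this issue.
int main(void) {
    const double macs_lora  = 16.0 * 5120 * 5120; // mul_mat([16 x 5120], [16 x 5120]) -> ~0.419 G
    const double macs_model =  8.0 * 5120 * 5120; // mul_mat([5120 x 5120], [5120 x 8]) -> ~0.210 G
    const double measured_model_ms = 2.647;       // observed time of the model's own mul_mat

    // Scale the measured time linearly by work to get an expected time for the LoRA mul_mat.
    const double expected_lora_ms = measured_model_ms * (macs_lora / macs_model); // ~5.3 ms
    printf("expected ~%.1f ms, observed ~120 ms (~%.0fx slower than expected)\n",
           expected_lora_ms, 120.0 / expected_lora_ms);
    return 0;
}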

@jon-chuang (Contributor, Author)

See https://arxiv.org/pdf/2208.08088.pdf, a specialized GEMM for tall and skinny matrices that reports a 4x speedup over BLAS; but even that is nowhere near the 24x we need to explain here.

@jon-chuang (Contributor, Author) commented Apr 14, 2023

To my understanding, the vectorization in BLAS and ggml happens along the row dimension, so tall and skinny matrices get essentially no benefit from it.

Is that correct? @ggerganov

I will try to unroll the loops to get the vectorization.
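
For illustration, here is roughly the shape of a per-output-element dot-product kernel (my own sketch, not ggml's actual code). With a shared dimension of only K = 16, each output element gets just two 8-wide FMAs, and the loop setup plus horizontal reduction is paid 5120 * 5120 times:

#include <immintrin.h>

// Illustrative only, not ggml's kernel: one short dot product per output element.
// A is M x K row-major, B is N x K row-major (shared dimension K contiguous in both),
// C is M x N row-major. Assumes K is a multiple of 8 (K = 16 here).
static void matmul_dot_per_output(const float * A, const float * B, float * C,
                                  int M, int N, int K) {
    for (int i = 0; i < M; ++i) {
        for (int j = 0; j < N; ++j) {
            __m256 acc = _mm256_setzero_ps();
            for (int k = 0; k < K; k += 8) {               // K = 16 -> only two iterations of useful work
                __m256 a = _mm256_loadu_ps(&A[i * K + k]);
                __m256 b = _mm256_loadu_ps(&B[j * K + k]);
                acc = _mm256_fmadd_ps(a, b, acc);
            }
            // Horizontal reduction, paid once per output element - 5120 * 5120 times for the LoRA delta.
            __m128 lo  = _mm256_castps256_ps128(acc);
            __m128 hi  = _mm256_extractf128_ps(acc, 1);
            __m128 sum = _mm_add_ps(lo, hi);
            sum = _mm_hadd_ps(sum, sum);
            sum = _mm_hadd_ps(sum, sum);
            C[i * N + j] = _mm_cvtss_f32(sum);
        }
    }
}

The tall-and-skinny case inverts the usual balance: almost all of the time goes into loop control and the reduction rather than into the FMAs themselves.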

@MillionthOdin16

Where are you seeing dimensions like mul_mat([16 X 5120], [16 X 5120])? All my timings are accumulations of a bunch of 1ms mul_mats.

I'm seeing outputs more like this:

 - 1022: [ 62, 1, 40]    DIAG_MASK_INF   (  1) cpu =   0.001 /   0.001 ms, wall =   0.001 /   0.001 ms
 - 1023: [ 62, 1, 40]         SOFT_MAX   (  1) cpu =   0.009 /   0.009 ms, wall =   0.009 /   0.009 ms
 - 1024: [ 128, 1, 40]          MUL_MAT   (  1) cpu =   0.114 /   0.114 ms, wall =   0.115 /   0.115 ms
 - 1025: [ 128, 40, 1]          PERMUTE   (  1) cpu =   0.001 /   0.001 ms, wall =   0.001 /   0.001 ms
 - 1026: [ 5120, 1, 1]              CPY   (  1) cpu =   0.004 /   0.004 ms, wall =   0.004 /   0.004 ms
 - 1027: [ 5120, 1, 1]          MUL_MAT   (  1) cpu =   0.471 /   0.471 ms, wall =   0.471 /   0.471 ms
 - 1028: [ 5120, 1, 1]              ADD   (  1) cpu =   0.006 /   0.006 ms, wall =   0.006 /   0.006 ms
 - 1029: [ 5120, 1, 1]         RMS_NORM   (  1) cpu =   0.009 /   0.009 ms, wall =   0.009 /   0.009 ms
 - 1030: [ 5120, 1, 1]              MUL   (  1) cpu =   0.003 /   0.003 ms, wall =   0.004 /   0.004 ms
 - 1031: [ 13824, 1, 1]          MUL_MAT   (  1) cpu =   1.257 /   1.257 ms, wall =   1.259 /   1.259 ms
 - 1032: [ 13824, 1, 1]             SILU   (  1) cpu =   0.053 /   0.053 ms, wall =   0.053 /   0.053 ms
 - 1033: [ 13824, 1, 1]          MUL_MAT   (  1) cpu =   1.106 /   1.106 ms, wall =   1.109 /   1.109 ms
 - 1034: [ 13824, 1, 1]              MUL   (  1) cpu =   0.008 /   0.008 ms, wall =   0.008 /   0.008 ms
 - 1035: [ 5120, 1, 1]          MUL_MAT   (  1) cpu =   1.112 /   1.112 ms, wall =   1.116 /   1.116 ms

@jon-chuang (Contributor, Author) commented Apr 14, 2023

The timings you show are from multiplying square-ish matrices with short and wide matrices (vectors).

We are multiplying tall and skinny matrices to obtain the delta update from LoRA adapters (low-rank adapters from fine-tuning).

See: https://arxiv.org/abs/2106.09685
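
Concretely, with the shapes in this issue (model dimension 5120 and LoRA rank 16, which is my reading of the mul_mat sizes above):

$\Delta W = B A^\top, \quad B \in \mathbb{R}^{5120 \times 16}, \quad A \in \mathbb{R}^{5120 \times 16}, \quad \Delta W \in \mathbb{R}^{5120 \times 5120}$

so two tall, skinny factors produce a full 5120 x 5120 update.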

@jon-chuang

This comment was marked as off-topic.

@MillionthOdin16

LOLOL

@MillionthOdin16

> The timings you show are from multiplying square-ish matrices with short and wide matrices (vectors).
>
> We are multiplying tall and skinny matrices to obtain the delta update from LoRA adapters (low-rank adapters from fine-tuning).

Oh dang, I thought you were talking about general optimizations to the regular mat_mul / q_f32 path. I didn't know about LoRA. I've been looking at that all day haha

@jon-chuang (Contributor, Author)

GPT-4 suggested some AVX2 code, but it segfaults. Let me collapse it as it's off-topic.

@jon-chuang (Contributor, Author)

> See https://arxiv.org/pdf/2208.08088.pdf, a specialized GEMM for tall and skinny matrices that reports a 4x speedup over BLAS; but even that is nowhere near the 24x we need to explain here.

Actually, we are not even using BLAS here, just the ggml implementation, since this shape does not meet the criteria for dispatching to BLAS. Hopefully that is enough to account for the gap.

@jon-chuang (Contributor, Author) commented Apr 14, 2023

This manages to run about 4X faster.

#include <immintrin.h>

// A is K x M (row-major), BT is K x N (row-major), C is M x N (row-major).
// Assumes N is a multiple of 8 and that BT and C are 32-byte aligned,
// since the aligned 256-bit load/store intrinsics are used below.
void multiply_tall_skinny_matrices(const float * A, const float * BT, float * C, int M, int N, int K, int ir0, int ir1) {
    for (int i = ir0; i < ir1; ++i) {
        for (int j = 0; j < N; j += 8) { // Process 8 elements of C's row at a time - 256 bits / sizeof(float)

            __m256 c_row = _mm256_setzero_ps(); // Initialize the accumulator to all zeros

            for (int k = 0; k < K; ++k) {
                __m256 a = _mm256_broadcast_ss(&A[i + k * M]); // Broadcast element (k, i) of A
                __m256 b = _mm256_load_ps(&BT[j + k * N]); // Load 8 elements of the k-th row of BT
                c_row = _mm256_fmadd_ps(a, b, c_row); // FMA: c_row += a * b
            }

            // Store the result into the corresponding row of C
            _mm256_store_ps(&C[i * N + j], c_row);
        }
    }
}
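
For context, here is roughly how this could be exercised for the LoRA shapes (an illustrative driver with my own allocation and fill values, not the actual ggml integration). Note the 32-byte alignment and N being a multiple of 8, which the aligned load/store in the kernel relies on:

#include <stdlib.h>

// Kernel from the comment above; declaration only.
void multiply_tall_skinny_matrices(const float * A, const float * BT, float * C,
                                   int M, int N, int K, int ir0, int ir1);

int main(void) {
    const int M = 5120, N = 5120, K = 16; // C is M x N, built from two rank-16 factors

    // 32-byte aligned buffers, since the kernel uses aligned 256-bit loads/stores.
    float * A  = aligned_alloc(32, (size_t)K * M * sizeof(float));
    float * BT = aligned_alloc(32, (size_t)K * N * sizeof(float));
    float * C  = aligned_alloc(32, (size_t)M * N * sizeof(float));
    if (!A || !BT || !C) return 1;

    for (size_t x = 0; x < (size_t)K * M; ++x) A[x]  = 1.0f; // dummy data, just for timing
    for (size_t x = 0; x < (size_t)K * N; ++x) BT[x] = 1.0f;

    // Single-threaded here; ir0/ir1 give the row range of C, so a thread pool
    // could split [0, M) into per-thread chunks the same way ggml splits rows.
    multiply_tall_skinny_matrices(A, BT, C, M, N, K, /*ir0=*/0, /*ir1=*/M);

    free(A); free(BT); free(C);
    return 0;
}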

GPT4 almost delivered after all. Luckily I'm still smarter than it.

@jon-chuang (Contributor, Author) commented Apr 14, 2023

The 128-bit AVX version is about 3x faster.

void multiply_tall_skinny_matrices_avx(const float * A, const float * BT, float * C, int M, int N, int K, int ir0, int ir1) {
    for (int i = ir0; i < ir1; ++i) {
        for (int j = 0; j < N; j += 4) { // Process 4 elements of C's row at a time - 128 bits / sizeof(float)
            __m128 c_row = _mm_setzero_ps(); // Initialize the accumulator to all zeros

            for (int k = 0; k < K; ++k) {
                __m128 a = _mm_broadcast_ss(&A[i + k * M]); // Broadcast element (k, i) of A
                __m128 b = _mm_loadu_ps(&BT[j + k * N]); // Load 4 elements of the k-th row of BT
                c_row = _mm_fmadd_ps(a, b, c_row); // FMA: c_row += a * b
            }

            // Store the result into the corresponding row of C (requires C 16-byte aligned and N a multiple of 4)
            _mm_store_ps(&C[i * N + j], c_row);
        }
    }
}

@jon-chuang (Contributor, Author) commented Apr 14, 2023

I hope this can be optimized further, to close the remaining gap (24x -> ~6x so far -> ideally 1x).
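
One direction that might help close more of the gap (an untested sketch of 4-row register blocking on top of the AVX2 kernel above, my own idea rather than anything currently in ggml): reuse each loaded vector of BT across several accumulators so the FMA-to-load ratio improves.

#include <immintrin.h>

// Untested sketch, not ggml code. Same layout as the earlier kernels:
// A is K x M, BT is K x N, C is M x N. Assumes ir1 - ir0 is a multiple of 4
// and N is a multiple of 8; unaligned stores avoid any alignment requirement on C.
void multiply_tall_skinny_blocked(const float * A, const float * BT, float * C,
                                  int M, int N, int K, int ir0, int ir1) {
    for (int i = ir0; i < ir1; i += 4) {        // 4 rows of C per pass
        for (int j = 0; j < N; j += 8) {        // 8 columns of C per pass
            __m256 c0 = _mm256_setzero_ps();
            __m256 c1 = _mm256_setzero_ps();
            __m256 c2 = _mm256_setzero_ps();
            __m256 c3 = _mm256_setzero_ps();
            for (int k = 0; k < K; ++k) {
                // One load of BT is shared by four accumulators, so the
                // load-to-FMA ratio drops from 2:1 (above) to 5:4 here.
                __m256 b = _mm256_loadu_ps(&BT[j + k * N]);
                c0 = _mm256_fmadd_ps(_mm256_broadcast_ss(&A[(i + 0) + k * M]), b, c0);
                c1 = _mm256_fmadd_ps(_mm256_broadcast_ss(&A[(i + 1) + k * M]), b, c1);
                c2 = _mm256_fmadd_ps(_mm256_broadcast_ss(&A[(i + 2) + k * M]), b, c2);
                c3 = _mm256_fmadd_ps(_mm256_broadcast_ss(&A[(i + 3) + k * M]), b, c3);
            }
            _mm256_storeu_ps(&C[(i + 0) * N + j], c0);
            _mm256_storeu_ps(&C[(i + 1) * N + j], c1);
            _mm256_storeu_ps(&C[(i + 2) * N + j], c2);
            _mm256_storeu_ps(&C[(i + 3) * N + j], c3);
        }
    }
}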

github-actions bot commented Apr 9, 2024

This issue was closed because it has been inactive for 14 days since being marked as stale.

github-actions bot closed this as completed on Apr 9, 2024