ggml : test dot product q4_0 x f32 #1043

Closed
wants to merge 1 commit into from

Conversation

ggerganov (Member) commented Apr 18, 2023

Plugged @ikawrakow's idea from #1041
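
For context, the idea is to compute the dot product directly between a Q4_0 weight row and the unquantized F32 activations, instead of first quantizing the activations to Q8_0. Below is a minimal scalar sketch of such a kernel (the name vec_dot_q4_0_f32_ref is just for illustration), assuming the Q4_0 block layout of the time: one F32 scale followed by 32 values packed two per byte, each stored with an offset of 8. The actual commit uses the SIMD paths in ggml.c.

// Scalar reference sketch of a Q4_0 x F32 dot product (illustrative, not the PR code).
#include <assert.h>
#include <stdint.h>

#define QK 32

typedef struct {
    float   d;          // block scale
    uint8_t qs[QK / 2]; // 32 values packed as 4-bit nibbles
} block_q4_0;

static void vec_dot_q4_0_f32_ref(const int n, float * s, const block_q4_0 * x, const float * y) {
    assert(n % QK == 0);
    const int nb = n / QK;

    float sum = 0.0f;
    for (int i = 0; i < nb; ++i) {
        const float d = x[i].d;
        for (int j = 0; j < QK/2; ++j) {
            const uint8_t b  = x[i].qs[j];
            const float   v0 = ((int)(b & 0x0F) - 8)*d; // low nibble  -> element 2*j
            const float   v1 = ((int)(b >>   4) - 8)*d; // high nibble -> element 2*j + 1
            sum += v0*y[i*QK + 2*j + 0];
            sum += v1*y[i*QK + 2*j + 1];
        }
    }
    *s = sum;
}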

On master, I get ~51 ms / token:

 $  make -j && ./main -m ./models/7B/ggml-model-q4_0.bin -p "I believe the meaning of life is" -c 2048 -n 512 -t 8 --ignore-eos -s 3 -n 64 -t 8
I llama.cpp build info: 
I UNAME_S:  Darwin
I UNAME_P:  arm
I UNAME_M:  arm64
I CFLAGS:   -I.              -O3 -DNDEBUG -std=c11   -fPIC -Wall -Wextra -Wpedantic -Wcast-qual -Wdouble-promotion -Wshadow -Wstrict-prototypes -Wpointer-arith -Wno-unused-function -pthread -DGGML_USE_ACCELERATE
I CXXFLAGS: -I. -I./examples -O3 -DNDEBUG -std=c++11 -fPIC -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread
I LDFLAGS:   -framework Accelerate
I CC:       Apple clang version 14.0.3 (clang-1403.0.22.14.1)
I CXX:      Apple clang version 14.0.3 (clang-1403.0.22.14.1)

cc  -I.              -O3 -DNDEBUG -std=c11   -fPIC -Wall -Wextra -Wpedantic -Wcast-qual -Wdouble-promotion -Wshadow -Wstrict-prototypes -Wpointer-arith -Wno-unused-function -pthread -DGGML_USE_ACCELERATE   -c ggml.c -o ggml.o
c++ -I. -I./examples -O3 -DNDEBUG -std=c++11 -fPIC -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread examples/main/main.cpp ggml.o llama.o common.o -o main  -framework Accelerate
c++ -I. -I./examples -O3 -DNDEBUG -std=c++11 -fPIC -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread examples/quantize/quantize.cpp ggml.o llama.o -o quantize  -framework Accelerate
c++ -I. -I./examples -O3 -DNDEBUG -std=c++11 -fPIC -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread examples/quantize-stats/quantize-stats.cpp ggml.o llama.o -o quantize-stats  -framework Accelerate
c++ -I. -I./examples -O3 -DNDEBUG -std=c++11 -fPIC -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread examples/perplexity/perplexity.cpp ggml.o llama.o common.o -o perplexity  -framework Accelerate
c++ -I. -I./examples -O3 -DNDEBUG -std=c++11 -fPIC -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread examples/embedding/embedding.cpp ggml.o llama.o common.o -o embedding  -framework Accelerate

====  Run ./main -h for help.  ====

main: seed = 3
llama.cpp: loading model from ./models/7B/ggml-model-q4_0.bin
llama_model_load_internal: format     = ggjt v1 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 2048
llama_model_load_internal: n_embd     = 4096
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 32
llama_model_load_internal: n_layer    = 32
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 2 (mostly Q4_0)
llama_model_load_internal: n_ff       = 11008
llama_model_load_internal: n_parts    = 1
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size =  59.11 KB
llama_model_load_internal: mem required  = 5809.32 MB (+ 1026.00 MB per state)
llama_init_from_file: kv self size  = 1024.00 MB

system_info: n_threads = 8 / 10 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | VSX = 0 | 
sampling: temp = 0.800000, top_k = 40, top_p = 0.950000, repeat_last_n = 64, repeat_penalty = 1.100000
generate: n_ctx = 2048, n_batch = 8, n_predict = 64, n_keep = 0


 I believe the meaning of life is to serve others.
I am a mother, wife and daughter who believes in community service and helping others. My career started as a legal assistant for a criminal defense attorney but soon realized that I was more interested in assisting my clients with their personal matters than with their court cases. I switched to working as
llama_print_timings:        load time =   398.00 ms
llama_print_timings:      sample time =    47.12 ms /    64 runs   (    0.74 ms per run)
llama_print_timings: prompt eval time =   380.06 ms /     8 tokens (   47.51 ms per token)
llama_print_timings:        eval time =  3270.89 ms /    63 runs   (   51.92 ms per run)
llama_print_timings:       total time =  3717.02 ms

On this branch I get ~226 ms / token for the same run:

$  make -j && ./main -m ./models/7B/ggml-model-q4_0.bin -p "I believe the meaning of life is" -c 2048 -n 512 -t 8 --ignore-eos -s 3 -n 64 -t 8
I llama.cpp build info: 
I UNAME_S:  Darwin
I UNAME_P:  arm
I UNAME_M:  arm64
I CFLAGS:   -I.              -O3 -DNDEBUG -std=c11   -fPIC -Wall -Wextra -Wpedantic -Wcast-qual -Wdouble-promotion -Wshadow -Wstrict-prototypes -Wpointer-arith -Wno-unused-function -pthread -DGGML_USE_ACCELERATE
I CXXFLAGS: -I. -I./examples -O3 -DNDEBUG -std=c++11 -fPIC -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread
I LDFLAGS:   -framework Accelerate
I CC:       Apple clang version 14.0.3 (clang-1403.0.22.14.1)
I CXX:      Apple clang version 14.0.3 (clang-1403.0.22.14.1)

make: Nothing to be done for `default'.
main: seed = 3
llama.cpp: loading model from ./models/7B/ggml-model-q4_0.bin
llama_model_load_internal: format     = ggjt v1 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 2048
llama_model_load_internal: n_embd     = 4096
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 32
llama_model_load_internal: n_layer    = 32
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 2 (mostly Q4_0)
llama_model_load_internal: n_ff       = 11008
llama_model_load_internal: n_parts    = 1
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size =  59.11 KB
llama_model_load_internal: mem required  = 5809.32 MB (+ 1026.00 MB per state)
llama_init_from_file: kv self size  = 1024.00 MB

system_info: n_threads = 8 / 10 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | VSX = 0 | 
sampling: temp = 0.800000, top_k = 40, top_p = 0.950000, repeat_last_n = 64, repeat_penalty = 1.100000
generate: n_ctx = 2048, n_batch = 8, n_predict = 64, n_keep = 0


 I believe the meaning of life is to learn, love and leave a legacy.
I believe that if you give it away, it will always come back. If you treat others with kindness and respect, they will reciprocate in time.
If you put out good energy, it will return to you.
What I would like my children
llama_print_timings:        load time =  1595.40 ms
llama_print_timings:      sample time =    47.13 ms /    64 runs   (    0.74 ms per run)
llama_print_timings: prompt eval time =  1586.76 ms /     8 tokens (  198.35 ms per token)
llama_print_timings:        eval time = 14298.88 ms /    63 runs   (  226.97 ms per run)
llama_print_timings:       total time = 15942.39 ms

If I have to guess, at 8 threads the computation becomes memory bound and therefore, even though the Q4_0 x F32 dot product is faster, the Q4_0 x Q8_0 version ends up being more performant because it reads less data from memory.
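
As a rough back-of-envelope check of that guess (assuming the Q8_0 layout of the time: one F32 scale plus 32 int8 values per 32-element block, i.e. 36 bytes): a Q8_0 activation value costs about 36/32 ≈ 1.1 bytes, while an F32 activation value costs 4 bytes, so the Q4_0 x Q8_0 path streams roughly 3.5x less activation data per dot product than Q4_0 x F32.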

// wdata += row_size;
// }
// }
//}
ggerganov (Member Author) commented on this hunk:

Here I disable the Q8_0 quantization - we don't need it

slaren (Member) commented Apr 18, 2023

I tested this a while ago because I was also very surprised that quantizing first is faster, but it is indeed faster. I even tried an AVX2 implementation of ggml_vec_dot_q4_0_f32. Here it is in case anyone is interested, though it doesn't incorporate the latest optimizations to ggml_vec_dot_q4_0:

static void ggml_vec_dot_q4_0_f32(const int n, float * restrict s, const void * restrict vx, const float * restrict y) {
    const int nb = n / QK;

    assert(n % QK == 0);
    assert(nb % 2 == 0);

    const block_q4_0 * restrict x = (const block_q4_0*)vx;

    __m256 acc = _mm256_setzero_ps();

    // Main loop
    for (int i = 0; i < nb; ++i) {
        const __m256 d_v = _mm256_broadcast_ss(&x[i].d);

        // Load 32x4-bit integers into 32x8-bit integers
        __m256i vx8 = bytesFromNibbles(x[i].qs);

        // Subtract 8 from the integers
        vx8 = _mm256_sub_epi8(vx8, _mm256_set1_epi8(8));

        // Convert to 16-bit int
        const __m256i vx16_lo = _mm256_cvtepi8_epi16(_mm256_extracti128_si256(vx8, 0));
        const __m256i vx16_hi = _mm256_cvtepi8_epi16(_mm256_extracti128_si256(vx8, 1));

        // Convert to 32-bit int -> float 32
        const __m256 vf[4] = {
            _mm256_cvtepi32_ps(_mm256_cvtepi16_epi32(_mm256_extracti128_si256(vx16_lo, 0))),
            _mm256_cvtepi32_ps(_mm256_cvtepi16_epi32(_mm256_extracti128_si256(vx16_lo, 1))),
            _mm256_cvtepi32_ps(_mm256_cvtepi16_epi32(_mm256_extracti128_si256(vx16_hi, 0))),
            _mm256_cvtepi32_ps(_mm256_cvtepi16_epi32(_mm256_extracti128_si256(vx16_hi, 1)))
        };

        // Scale and fma
        const float * yj = y + i*32;
        for (int j = 0; j < 4; j++) {
            const __m256 jx = _mm256_mul_ps(vf[j], d_v);
            const __m256 jy = _mm256_loadu_ps(yj + j*8);
            acc = _mm256_fmadd_ps(jx, jy, acc);
        }
    }

    // Return horizontal sum of the acc vector
    __m128 res = _mm256_extractf128_ps( acc, 1 );
    res = _mm_add_ps( res, _mm256_castps256_ps128( acc ) );
    res = _mm_add_ps( res, _mm_movehl_ps( res, res ) );
    res = _mm_add_ss( res, _mm_movehdup_ps( res ) );

    *s = _mm_cvtss_f32( res );
}
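
For completeness, bytesFromNibbles is the one helper above that is not shown; it expands 16 bytes of packed nibbles into 32 bytes holding one unsigned 4-bit value each. Here is a sketch of such a helper, modeled on the AVX2 version that existed in ggml.c around this time (treat it as illustrative rather than byte-for-byte identical):

// Expand 32 packed 4-bit values (16 bytes) into 32 bytes, one value per byte.
// Requires <immintrin.h> and AVX2.
static inline __m256i bytesFromNibbles(const uint8_t * rsi) {
    // Load 16 bytes (32 packed nibbles)
    const __m128i tmp = _mm_loadu_si128((const __m128i *) rsi);

    // Zero-extend each byte into a 16-bit lane; each lane now holds two nibbles
    const __m256i bytes = _mm256_cvtepu8_epi16(tmp);

    // Split every lane into its low and high nibble and place them in adjacent bytes
    const __m256i lowMask = _mm256_set1_epi8(0x0F);
    const __m256i low     = _mm256_and_si256(lowMask, bytes);
    const __m256i high    = _mm256_slli_epi16(_mm256_andnot_si256(lowMask, bytes), 4);

    // Each resulting byte holds one value in [0, 15]
    return _mm256_or_si256(low, high);
}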

ggerganov closed this Apr 19, 2023
ikawrakow (Contributor) commented:

The thing that I did not realize is that the number of values being quantized is many times smaller than the number of values used in the dot products (512x or more). This means that my simple test code from #1041 does not adequately measure the actual situation, where the time spent on quantization is small compared to the time spent on the dot products between quantized values. If I only consider the dot products, then the quantized version is indeed faster.
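
To put a rough number on that ratio using the 7B model from the logs above (n_embd = 4096, n_ff = 11008): for a single mat-vec, the activation row is quantized once (4096 values), but the row-by-row dot products then reuse that quantized row once per output row, i.e. on the order of 4096 x 4096 ≈ 16.8M multiply-adds for a square projection and even more for the FFN matrices, so the per-token quantization cost is amortized over thousands of dot products.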

Sorry for wasting @ggerganov's time.
