
Add AVX2 implementation of dequantize_row_q4_0 #467

Merged: 1 commit merged into ggml-org:master from avx-dequantize on Mar 25, 2023

Conversation

@slaren (Member) commented Mar 24, 2023

I couldn't notice a big performance improvement; more testing is necessary.
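
For context, the general idea is to expand each block's 32 packed 4-bit values to bytes with a few AVX2 instructions, recentre them around zero, widen them to 32-bit, and scale by the block's fp32 factor. Below is a minimal sketch of that approach, not the exact code in this PR; the struct name and the interleaved low/high nibble layout are assumptions based on the q4_0 format of the time.

#include <immintrin.h>
#include <cstdint>

// Assumed block layout: one fp32 scale followed by 16 bytes holding
// 32 packed 4-bit weights.
struct block_q4_0_sketch {
    float   d;
    uint8_t qs[16];
};

// Expand 16 packed bytes into 32 bytes: output byte 2k gets the low nibble
// of input byte k, output byte 2k+1 gets the high nibble.
static inline __m256i nibbles_to_bytes(const uint8_t * p) {
    const __m128i packed = _mm_loadu_si128((const __m128i *) p);
    const __m256i spread = _mm256_cvtepu8_epi16(packed);           // b_k -> {b_k, 0}
    const __m256i mask   = _mm256_set1_epi8(0x0F);
    const __m256i lo     = _mm256_and_si256(spread, mask);          // low nibble in the even byte
    const __m256i hi     = _mm256_slli_epi16(_mm256_andnot_si256(mask, spread), 4); // high nibble into the odd byte
    return _mm256_or_si256(lo, hi);
}

static void dequantize_row_q4_0_avx2_sketch(const block_q4_0_sketch * x, float * y, int k) {
    const int nb = k / 32;                                           // 32 values per block
    for (int i = 0; i < nb; i++) {
        const __m256  d = _mm256_broadcast_ss(&x[i].d);              // broadcast the block scale
        const __m256i v = _mm256_sub_epi8(nibbles_to_bytes(x[i].qs),
                                          _mm256_set1_epi8(8));      // [0,15] -> [-8,7]
        // widen 8-bit -> 16-bit -> 32-bit -> float, 8 values at a time
        const __m256i w0 = _mm256_cvtepi8_epi16(_mm256_extracti128_si256(v, 0));
        const __m256i w1 = _mm256_cvtepi8_epi16(_mm256_extracti128_si256(v, 1));
        const __m256i q[4] = {
            _mm256_cvtepi16_epi32(_mm256_extracti128_si256(w0, 0)),
            _mm256_cvtepi16_epi32(_mm256_extracti128_si256(w0, 1)),
            _mm256_cvtepi16_epi32(_mm256_extracti128_si256(w1, 0)),
            _mm256_cvtepi16_epi32(_mm256_extracti128_si256(w1, 1)),
        };
        for (int j = 0; j < 4; j++) {
            _mm256_storeu_ps(y + i*32 + j*8,
                             _mm256_mul_ps(_mm256_cvtepi32_ps(q[j]), d));
        }
    }
}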

@slaren (Member, Author) commented Mar 24, 2023

A quick performance test shows significant improvement in the function itself (with k=4096):

Running ./test-dq
Run on (16 X 3600 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x8)
  L1 Instruction 32 KiB (x8)
  L2 Unified 256 KiB (x8)
  L3 Unified 16384 KiB (x1)
Load Average: 1.56, 1.26, 1.46
----------------------------------------------------------------------
Benchmark                            Time             CPU   Iterations
----------------------------------------------------------------------
BM_dequantize_row_q4_0           10351 ns        10351 ns        66698
BM_dequantize_row_q4_0_avx2       1384 ns         1384 ns       509491

@slaren (Member, Author) commented Mar 24, 2023

The first chunks of the perplexity computation show the same values. I didn't run the full test but I have no reason to believe that it would produce different values.
[1]4.5690,[2]5.2058,[3]6.0526,

@slaren marked this pull request as ready for review on March 24, 2023 17:14
@Green-Sky (Collaborator)

@ggerganov we need some sort of benchmarking suite for ggml.

@slaren How complex is ./test-dq? Can you provide the code? Does it require the model files, or is it standalone? (It should be easy to create synthetic data.)

@slaren (Member, Author) commented Mar 24, 2023

It's a standalone test using the Google Benchmark library. Here is the code: https://gist.github.com/slaren/ba732ed08abd0ba148129eab3335dfb7
To do that, I split the AVX2 and scalar implementations into dequantize_row_q4_0_avx2 and dequantize_row_q4_0 beforehand.
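
For anyone who wants to reproduce something similar without the gist, a skeleton of such a standalone Google Benchmark harness could look like the following. The function signatures, block size, and synthetic (zeroed) data here are assumptions; the actual code is in the gist above.

#include <benchmark/benchmark.h>
#include <cstdint>
#include <vector>

// Assumed signatures for the split functions described above; the real
// declarations live in ggml.c.
extern "C" void dequantize_row_q4_0(const void * x, float * y, int k);
extern "C" void dequantize_row_q4_0_avx2(const void * x, float * y, int k);

static void BM_dequantize_row_q4_0(benchmark::State & state) {
    const int k = 4096;
    // 20 bytes per 32-value block: fp32 scale + 16 bytes of packed nibbles (assumed layout)
    std::vector<uint8_t> x(k / 32 * 20, 0);
    std::vector<float>   y(k);
    for (auto _ : state) {
        dequantize_row_q4_0(x.data(), y.data(), k);
        benchmark::DoNotOptimize(y.data());
    }
}
BENCHMARK(BM_dequantize_row_q4_0);

static void BM_dequantize_row_q4_0_avx2(benchmark::State & state) {
    const int k = 4096;
    std::vector<uint8_t> x(k / 32 * 20, 0);
    std::vector<float>   y(k);
    for (auto _ : state) {
        dequantize_row_q4_0_avx2(x.data(), y.data(), k);
        benchmark::DoNotOptimize(y.data());
    }
}
BENCHMARK(BM_dequantize_row_q4_0_avx2);

BENCHMARK_MAIN();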

@ggerganov (Member)

> The first chunks of the perplexity computation show the same values. I didn't run the full test but I have no reason to believe that it would produce different values. [1]4.5690,[2]5.2058,[3]6.0526,

The dequantize functions are only used if you link against BLAS and use -b 32 or bigger:

make clean
LLAMA_OPENBLAS=1 make

Otherwise, they will never be called.
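
For example, after the build above, a run with a batch size large enough to hit the BLAS path would look something like this (the model path and prompt are placeholders):

./main -m ./models/7B/ggml-model-q4_0.bin -p "a sufficiently long prompt ..." -b 64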

@slaren (Member, Author) commented Mar 24, 2023

@ggerganov That's not what I am seeing; here is a stack trace, for example:

#2  0x00005555555660d4 in dequantize_row_q4_0 (x=0x7ffedc43a0d0, y=0x7ffe585b81a0, k=k@entry=4096) at ggml.c:767
#3  0x000055555556b1e7 in ggml_compute_forward_get_rows_q4_0 (params=<optimized out>, params=<optimized out>,
    dst=0x7ffe585b8100, src1=0x7ffe585b8030, src0=0x7ffedc43a030) at ggml.c:7249
#4  ggml_compute_forward_get_rows (dst=0x7ffe585b8100, src1=0x7ffe585b8030, src0=0x7ffedc43a030, params=<optimized out>)
    at ggml.c:7345
#5  ggml_compute_forward (params=<optimized out>, tensor=0x7ffe585b8100) at ggml.c:9027
#6  0x0000555555571435 in ggml_graph_compute (ctx=<optimized out>, cgraph=0x7ffffffe4d90) at ggml.c:9911
#7  0x00005555555793f5 in llama_eval_internal (lctx=..., tokens=<optimized out>, n_tokens=4, n_past=0, n_threads=<optimized out>)
    at llama.cpp:822
#8  0x000055555557976d in llama_eval (ctx=<optimized out>, tokens=<optimized out>, n_tokens=<optimized out>,
    n_past=<optimized out>, n_threads=<optimized out>) at llama.cpp:1493
#9  0x000055555555c396 in main (argc=<optimized out>, argv=<optimized out>) at main.cpp:224

@ggerganov (Member)

Ah yes, there is one exception: the ggml_get_rows call at the start of the inference. It is a very lightweight call, so I don't expect it to take a measurable amount of time.
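
In other words, when the embedding tensor is q4_0, the row lookup at the start of a batch dequantizes one row per token before anything else runs. A rough sketch of that path (simplified, not a verbatim excerpt from ggml.c; names and signature are approximate):

#include <cstddef>
#include <cstdint>

extern "C" void dequantize_row_q4_0(const void * x, float * y, int k);  // assumed signature

// For each looked-up token id, expand one row of packed 4-bit weights to fp32.
// row_size_bytes is the byte size of one quantized row; row_len is its element count.
static void get_rows_q4_0_sketch(const void * src, const int32_t * ids, int n_ids,
                                 int row_len, size_t row_size_bytes, float * dst) {
    for (int i = 0; i < n_ids; i++) {
        const char * row = (const char *) src + (size_t) ids[i] * row_size_bytes;
        dequantize_row_q4_0(row, dst + (size_t) i * row_len, row_len);  // one row per token
    }
}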

@slaren (Member, Author) commented Mar 24, 2023

Ah, I see. I am running some tests with BLAS now and will report back when I have some results. Unfortunately it seems to be much slower; I probably need to find a better BLAS library than the libopenblas-dev package from Ubuntu.

@slaren (Member, Author) commented Mar 24, 2023

@ggerganov When building with BLAS and using -b 32 with a long enough prompt, I only get garbage generation (not just bad, but random tokens). This happens on master too. Is it possible that BLAS support is broken at the moment?

@ggerganov (Member)

Yes, it is broken. Weird...

@ggerganov (Member)

OK, BLAS has been fixed, and for large prompts and batch sizes (> 256) there is a significant benefit to enabling BLAS.
Tested on M1 so far, but I expect the same results on x86.

@ggerganov merged commit 09aecbf into ggml-org:master on Mar 25, 2023
@slaren deleted the avx-dequantize branch on March 25, 2023 15:43
@slaren (Member, Author) commented Mar 25, 2023

I am seeing a very significant improvement on x86 as well; for instance, the perplexity computation went from ~8 hours to ~5 hours.
