
[Bug] Different outputs when undefining GGML_SIMD #766

Closed
4 tasks done
ivanstepanovftw opened this issue Apr 4, 2023 · 2 comments

Comments


ivanstepanovftw commented Apr 4, 2023

Prerequisites

Please answer the following questions for yourself before submitting an issue.

  • I am running the latest code. Development is very rapid so there are no tagged versions as of now.
  • I carefully followed the README.md.
  • I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
  • I reviewed the Discussions, and have a new bug or useful enhancement to share.

Expected Behavior

The model should produce the same output whether SIMD instructions are enabled or disabled.

Environment and Context

$ lscpu

Architecture:            x86_64
  CPU op-mode(s):        32-bit, 64-bit
  Address sizes:         48 bits physical, 48 bits virtual
  Byte Order:            Little Endian
CPU(s):                  16
  On-line CPU(s) list:   0-15
Vendor ID:               AuthenticAMD
  Model name:            AMD Ryzen 7 5800U with Radeon Graphics
    CPU family:          25
    Model:               80
    Thread(s) per core:  2
    Core(s) per socket:  8
    Socket(s):           1
    Stepping:            0
    Frequency boost:     enabled
    CPU(s) scaling MHz:  70%
    CPU max MHz:         4505.0781
    CPU min MHz:         1600.0000
    BogoMIPS:            3792.78
    Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm
                         constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf rapl pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx
                         f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpext
                         perfctr_llc mwaitx cpb cat_l3 cdp_l3 hw_pstate ssbd mba ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 erms invpcid cqm rdt_a rdseed adx smap
                         clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local clzero irperf xsaveerptr rdpru wbnoinvd cppc arat
                         npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif v_spec_ctrl umip pku ospke
                         vaes vpclmulqdq rdpid overflow_recov succor smca fsrm

  • Operating System, e.g. for Linux:

$ uname -a

Linux fedora 6.2.8-200.fc37.x86_64 #1 SMP PREEMPT_DYNAMIC Wed Mar 22 19:11:02 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux

$ g++ --version

g++ (GCC) 12.2.1 20221121 (Red Hat 12.2.1-4)

Failure Information (for bugs)

Seems like there is a bug in ggml_vec_dot_f16.
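
For reference, the scalar fallback in ggml_vec_dot_f16 looks roughly like this (a sketch based on the ggml.c structure at the time; the SIMD branch computes the same dot product with the GGML_F16_VEC_* macros and a different accumulation order):

    inline static void ggml_vec_dot_f16(const int n, float * restrict s, ggml_fp16_t * restrict x, ggml_fp16_t * restrict y) {
        ggml_float sumf = 0.0;

    #if defined(GGML_SIMD)
        // vectorized FP16 dot product (AVX/NEON/... via the GGML_F16_VEC_* macros)
    #else
        // scalar reference path
        for (int i = 0; i < n; ++i) {
            sumf += GGML_FP16_TO_FP32(x[i]) * GGML_FP16_TO_FP32(y[i]);
        }
    #endif

        *s = sumf;
    }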

Steps to Reproduce

Please provide detailed steps for reproducing the issue. We are not sitting in front of your screen, so the more detail the better.

  1. Evaluate the model
  2. Undefine GGML_SIMD (see the sketch after this list)
  3. Evaluate the model
  4. See difference
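
One way to do step 2 (an assumed workflow, not spelled out in the issue) is to force the flag off near the top of ggml.c so every vector routine falls back to its scalar path:

    // hypothetical edit in ggml.c, placed after the architecture detection
    // that normally defines GGML_SIMD
    #ifdef GGML_SIMD
    #undef GGML_SIMD
    #endif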

Output

With GGML_SIMD defined (current behavior):

$ ./main -m ./models/7B/ggml-model-q4_0.bin -c 128 -t 12 -n 40 -s 1680277445 -p '## Question: What is best in life? ## Jeeves: '
main: seed = 1680277445
llama_model_load: loading model from './models/7B/ggml-model-q4_0.bin' - please wait ...
llama_model_load: n_vocab = 32000
llama_model_load: n_ctx   = 128
llama_model_load: n_embd  = 4096
llama_model_load: n_mult  = 256
llama_model_load: n_head  = 32
llama_model_load: n_layer = 32
llama_model_load: n_rot   = 128
llama_model_load: f16     = 2
llama_model_load: n_ff    = 11008
llama_model_load: n_parts = 1
llama_model_load: type    = 1
llama_model_load: ggml map size = 4017.70 MB
llama_model_load: ggml ctx size =  81.25 KB
llama_model_load: mem required  = 5809.78 MB (+ 1026.00 MB per state)
llama_model_load: loading tensors from './models/7B/ggml-model-q4_0.bin'
llama_model_load: model size =  4017.27 MB / num tensors = 291
llama_init_from_file: kv self size  =   64.00 MB

system_info: n_threads = 12 / 16 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | VSX = 0 | 
sampling: temp = 0.800000, top_k = 40, top_p = 0.950000, repeat_last_n = 64, repeat_penalty = 1.100000
generate: n_ctx = 128, n_batch = 8, n_predict = 40, n_keep = 0


 ## Question: What is best in life? ## Jeeves: 42 ## Bertie Wooster: That’s got to be the answer! I mean, you can’t argue with it. ## Question: What is best in life? ## Jee
llama_print_timings:        load time =  1139.16 ms
llama_print_timings:      sample time =    21.33 ms /    40 runs   (    0.53 ms per run)
llama_print_timings: prompt eval time =   953.96 ms /    16 tokens (   59.62 ms per token)
llama_print_timings:        eval time =  7448.75 ms /    39 runs   (  190.99 ms per run)
llama_print_timings:       total time =  9072.95 ms

With GGML_SIMD undefined:

make && ./main -m ./models/7B/ggml-model-q4_0.bin -c 128 -t 12 -n 40 -s 1680277445 -p '## Question: What is best in life? ## Jeeves: '
main: seed = 1680277445
llama_model_load: loading model from './models/7B/ggml-model-q4_0.bin' - please wait ...
llama_model_load: n_vocab = 32000
llama_model_load: n_ctx   = 128
llama_model_load: n_embd  = 4096
llama_model_load: n_mult  = 256
llama_model_load: n_head  = 32
llama_model_load: n_layer = 32
llama_model_load: n_rot   = 128
llama_model_load: f16     = 2
llama_model_load: n_ff    = 11008
llama_model_load: n_parts = 1
llama_model_load: type    = 1
llama_model_load: ggml map size = 4017.70 MB
llama_model_load: ggml ctx size =  81.25 KB
llama_model_load: mem required  = 5809.78 MB (+ 1026.00 MB per state)
llama_model_load: loading tensors from './models/7B/ggml-model-q4_0.bin'
llama_model_load: model size =  4017.27 MB / num tensors = 291
llama_init_from_file: kv self size  =   64.00 MB

system_info: n_threads = 12 / 16 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | VSX = 0 | 
sampling: temp = 0.800000, top_k = 40, top_p = 0.950000, repeat_last_n = 64, repeat_penalty = 1.100000
generate: n_ctx = 128, n_batch = 8, n_predict = 40, n_keep = 0


 ## Question: What is best in life? ## Jeeves: 42, sir.
Were a big fan of the RSA Animate series here at TNW, and today we have another great video to show you that we think you will enjoy
llama_print_timings:        load time =  1043.75 ms
llama_print_timings:      sample time =    21.00 ms /    40 runs   (    0.53 ms per run)
llama_print_timings: prompt eval time =   894.68 ms /    16 tokens (   55.92 ms per token)
llama_print_timings:        eval time =  7486.81 ms /    39 runs   (  191.97 ms per run)
llama_print_timings:       total time =  8990.76 ms

Same build as above (GGML_SIMD still defined), but with the branch #if defined(GGML_SIMD) inside inline static void ggml_vec_dot_f16(const int n, float * restrict s, ggml_fp16_t * restrict x, ggml_fp16_t * restrict y) inverted to #if !defined(GGML_SIMD).
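
The change amounts to flipping one preprocessor condition; a sketch (the function body itself is untouched):

    // inverted condition: the SIMD block becomes dead code, so this one
    // function runs its scalar fallback even though GGML_SIMD is still
    // defined for the rest of ggml.c
    #if !defined(GGML_SIMD)
        // ... SIMD accumulation (now skipped) ...
    #else
        for (int i = 0; i < n; ++i) {
            sumf += GGML_FP16_TO_FP32(x[i]) * GGML_FP16_TO_FP32(y[i]);
        }
    #endif

With only ggml_vec_dot_f16 forced onto the scalar path, the output matches the fully non-SIMD run above: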

main: seed = 1680277445
llama_model_load: loading model from './models/7B/ggml-model-q4_0.bin' - please wait ...
llama_model_load: n_vocab = 32000
llama_model_load: n_ctx   = 128
llama_model_load: n_embd  = 4096
llama_model_load: n_mult  = 256
llama_model_load: n_head  = 32
llama_model_load: n_layer = 32
llama_model_load: n_rot   = 128
llama_model_load: f16     = 2
llama_model_load: n_ff    = 11008
llama_model_load: n_parts = 1
llama_model_load: type    = 1
llama_model_load: ggml map size = 4017.70 MB
llama_model_load: ggml ctx size =  81.25 KB
llama_model_load: mem required  = 5809.78 MB (+ 1026.00 MB per state)
llama_model_load: loading tensors from './models/7B/ggml-model-q4_0.bin'
llama_model_load: model size =  4017.27 MB / num tensors = 291
llama_init_from_file: kv self size  =   64.00 MB

system_info: n_threads = 12 / 16 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | VSX = 0 | 
sampling: temp = 0.800000, top_k = 40, top_p = 0.950000, repeat_last_n = 64, repeat_penalty = 1.100000
generate: n_ctx = 128, n_batch = 8, n_predict = 40, n_keep = 0


 ## Question: What is best in life? ## Jeeves: 42, sir.
Were a big fan of the RSA Animate series here at TNW, and today we have another great video to show you that we think you will enjoy
llama_print_timings:        load time =  1237.43 ms
llama_print_timings:      sample time =    20.65 ms /    40 runs   (    0.52 ms per run)
llama_print_timings: prompt eval time =  1051.29 ms /    16 tokens (   65.71 ms per token)
llama_print_timings:        eval time =  7831.31 ms /    39 runs   (  200.80 ms per run)
llama_print_timings:       total time =  9617.35 ms

Git blame for this code points to @ggerganov.


ivanstepanovftw commented Apr 4, 2023

I have checked the difference between the optimized (SIMD) and non-optimized (scalar) sums:

    if (fabs(sumf - sumf2) > 1e-5) {
        printf("sumf = %f, sumf2 = %f\n", sumf, sumf2);
    }
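
Presumably sumf comes from the SIMD path and sumf2 is recomputed with the plain scalar loop; such instrumentation at the end of ggml_vec_dot_f16 could look roughly like this (an assumption, the comment does not show the full snippet):

    // hypothetical instrumentation: recompute the dot product with the
    // scalar loop and compare it against the SIMD accumulator sumf
    ggml_float sumf2 = 0.0;
    for (int i = 0; i < n; ++i) {
        sumf2 += GGML_FP16_TO_FP32(x[i]) * GGML_FP16_TO_FP32(y[i]);
    }
    if (fabs(sumf - sumf2) > 1e-5) {
        printf("sumf = %f, sumf2 = %f\n", sumf, sumf2);
    }

During generation the check prints: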
 ## Question: What is best insumf = -100.021179, sumf2 = -100.021189
sumf = -88.343201, sumf2 = -88.343212
sumf = -112.589180, sumf2 = -112.589190
sumf = -147.711853, sumf2 = -147.711864
sumf = -139.611694, sumf2 = -139.611704
sumf = -75.613693, sumf2 = -75.613704
sumf = -33.389286, sumf2 = -33.389276
sumf = -64.019043, sumf2 = -64.019031
sumf = -65.800705, sumf2 = -65.800715
sumf = -51.870102, sumf2 = -51.870112
sumf = -78.770439, sumf2 = -78.770450
sumf = -66.570221, sumf2 = -66.570232
sumf = -66.457993, sumf2 = -66.458004
sumf = -85.540108, sumf2 = -85.540118
sumf = -48.718475, sumf2 = -48.718486
sumf = -67.481247, sumf2 = -67.481233
sumf = -71.567154, sumf2 = -71.567139
sumf = -68.680618, sumf2 = -68.680605
sumf = -51.564140, sumf2 = -51.564153
sumf = -203.349487, sumf2 = -203.349500
sumf = -70.176712, sumf2 = -70.176702
sumf = -127.550659, sumf2 = -127.550673
sumf = -277.602295, sumf2 = -277.602284
sumf = -630.498535, sumf2 = -630.498567
sumf = -185.314880, sumf2 = -185.314892
sumf = -191.300049, sumf2 = -191.300036
sumf = -134.104126, sumf2 = -134.104114
sumf = -321.063751, sumf2 = -321.063762
...

With an epsilon of 1e-4 there is no output, i.e. all differences stay below 1e-4.
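
Differences of this order are consistent with the two paths simply accumulating in a different order (and, on this build, likely in different precision: the scalar loop uses a ggml_float accumulator while the F16C/AVX path reduces FP32 lanes). A minimal standalone illustration, not taken from the issue, of how accumulation order alone changes a float sum:

    #include <stdio.h>

    int main(void) {
        // the same 8 values summed left-to-right vs. in two interleaved
        // "lanes" that are reduced at the end, all in single precision
        float v[8] = {1e8f, 1.5f, -1e8f, 2.5f, 3.5f, -4.25f, 0.125f, 7.0f};

        float seq = 0.0f;
        for (int i = 0; i < 8; ++i) seq += v[i];          // 1.5f is absorbed by 1e8f

        float lane0 = 0.0f, lane1 = 0.0f;
        for (int i = 0; i < 8; i += 2) { lane0 += v[i]; lane1 += v[i + 1]; }
        float vec = lane0 + lane1;                        // exact here: 10.375

        printf("sequential = %.3f, lane-wise = %.3f\n", seq, vec);
        return 0;
    }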

@ivanstepanovftw

Perplexity test.

With SIMD enabled in ggml_vec_dot_f16:

[1]4.5619,[2]5.1787,[3]6.0491,[4]6.7494,[5]6.7852,[6]6.7668,[7]6.9845,[8]7.0877,^C

Without SIMD in ggml_vec_dot_f16:

[1]4.7091,[2]5.4287,[3]6.2562,[4]6.9295,[5]6.9925,[6]6.9556,[7]7.1235,[8]7.2074,^C

Tested with:

./perplexity -m models/7B/ggml-model-q4_0.bin -f ~/Downloads/wikitext-2-raw-v1/wikitext-2-raw/wiki.test.raw -t 8
