ci : integrate with ggml-org/ci #2250
Conversation
Force-pushed from 540a225 to 2a62535 (ggml-ci)
Force-pushed from 3a1b8af to b7df039 (ggml-ci)
if [ -z $GG_BUILD_LOW_PERF ]; then
    rm -rf ${SRC}/models-mnt

    mnt_models=$(realpath ${MNT}/models)
Based on the example in #2250 of creating a fresh tmp folder and passing ./tmp/mnt as the ${MNT} arg:

mkdir tmp
bash ./ci/run.sh ./tmp/results ./tmp/mnt

This line fails when calling realpath because the path is concatenated before the directory exists, so realpath is handed a nonexistent path:

realpath: /Users/mac/code/llama.cpp/tmp/mnt/models: No such file or directory

IMHO the expected behavior should be like this:
diff --git a/ci/run.sh b/ci/run.sh
index c823bc4..41804f1 100644
--- a/ci/run.sh
+++ b/ci/run.sh
@@ -243,7 +243,7 @@ function gg_sum_open_llama_3b_v2 {
if [ -z $GG_BUILD_LOW_PERF ]; then
rm -rf ${SRC}/models-mnt
- mnt_models=$(realpath ${MNT}/models)
+ mnt_models=$(realpath ${MNT})/models
mkdir -p ${mnt_models}
ln -sfn ${mnt_models} ${SRC}/models-mnt
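After applying this change, re-running the original reproduction steps against a freshly created ./tmp should resolve the path correctly (same commands as above, repeated here only to confirm the fix):

rm -rf ./tmp
mkdir tmp
bash ./ci/run.sh ./tmp/results ./tmp/mnt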
That's strange, I thought that this should prevent it from happening:

Line 12 in 294f424:

MNT=$(realpath "$2")

It works on my Ubuntu and Mac OS machines.
If you confirm by double-checking that it fails after recreating a fresh ./tmp, then I'll push the proposed change, but I don't see how it can help.
Hi @ggerganov,

I ran into this problem when trying to run this script on my MacOS machine (M2 Max) with a clean checkout.

There is a difference between Ubuntu and MacOS in how realpath evaluates this line when the target path does not exist yet. Run this command on both systems and you will see the difference:

mkdir ./tmp
echo `realpath ./tmp/this_is_an_inexistent_path`
MacOS:
realpath: ./tmp/this_is_an_inexistent_path: No such file or directory
Ubuntu:
/home/mrsparc/code/llama.cpp/tmp/this_is_an_inexistent_path
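A portable pattern (just a sketch of the idea, not the exact ci/run.sh code) is to call realpath only on the part of the path that already exists and append the missing component afterwards:

mkdir -p ./tmp/mnt                       # the mount dir exists at this point
mnt_models=$(realpath ./tmp/mnt)/models  # resolve the existing dir, then append
mkdir -p ${mnt_models}                   # create the models dir afterwards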
I created the following ci/test.sh script, which simplifies the approach used in the original script, to show the difference between the two OSes and the proposed fix:
#!/bin/bash

if [ -z "$2" ]; then
    echo "usage: $0 <output-dir> <mnt-dir>"
    exit 1
fi

mkdir -p "$1"
mkdir -p "$2"

OUT=$(realpath "$1")
MNT=$(realpath "$2")
SRC=$(pwd)

echo "OUT realpath: ${OUT}"
echo "MNT realpath: ${MNT}"
echo "SRC: ${SRC}"

# current behavior: realpath is called on a path that does not exist yet
mnt_models=$(realpath ${MNT}/models)
echo "mnt_models: ${mnt_models}"

# proposed fix: resolve the existing ${MNT} first, then append /models
mnt_models_fixed=$(realpath ${MNT})/models
echo "mnt_models_fixed: ${mnt_models_fixed}"
Result on MacOS:
mkdir ./tmp
bash ./ci/test.sh ./tmp/results ./tmp/mnt
OUT realpath: /Users/mrsparc/Developer/llama.cpp/tmp/results
MNT realpath: /Users/mrsparc/Developer/llama.cpp/tmp/mnt
SRC: /Users/mrsparc/Developer/llama.cpp
realpath: /Users/mrsparc/Developer/llama.cpp/tmp/mnt/models: No such file or directory
mnt_models:
mnt_models_fixed: /Users/mrsparc/Developer/llama.cpp/tmp/mnt/models
Result on Ubuntu:
mkdir ./tmp
bash ./ci/test.sh ./tmp/results ./tmp/mnt
OUT realpath: /home/mrsparc/code/llama.cpp/tmp/results
MNT realpath: /home/mrsparc/code/llama.cpp/tmp/mnt
SRC: /home/mrsparc/code/llama.cpp
mnt_models: /home/mrsparc/code/llama.cpp/tmp/mnt/models
mnt_models_fixed: /home/mrsparc/code/llama.cpp/tmp/mnt/models
After fixing the script I was finally able to run the full CI on my local machine. Here are the results, in case you find it interesting to take a look:

Results on Apple M2 Max (12-core CPU, 38-core GPU, 96 GB)
### open_llama_3b_v2
OpenLLaMA 3B-v2:
- status: 0
- perplexity:
- f16 @ 8.5293 OK
- q8_0 @ 8.5655 OK
- q4_0 @ 8.9745 OK
- q4_1 @ 9.1845 OK
- q5_0 @ 8.8299 OK
- q5_1 @ 8.6805 OK
- q3_k @ 9.5844 OK
- q4_k @ 9.0285 OK
- q5_k @ 8.7041 OK
- q6_k @ 8.5635 OK
- f16:
+ ./bin/main --model ../models-mnt/open-llama/3B-v2/ggml-model-f16.bin -s 1234 -n 64 -p 'I believe the meaning of life is'
main: build = 849 (d01bccd)
main: seed = 1234
llama.cpp: loading model from ../models-mnt/open-llama/3B-v2/ggml-model-f16.bin
llama_model_load_internal: format = ggjt v1 (pre #1405)
llama_model_load_internal: n_vocab = 32000
llama_model_load_internal: n_ctx = 512
llama_model_load_internal: n_embd = 3200
llama_model_load_internal: n_mult = 240
llama_model_load_internal: n_head = 32
llama_model_load_internal: n_layer = 26
llama_model_load_internal: n_rot = 100
llama_model_load_internal: freq_base = 10000.0
llama_model_load_internal: freq_scale = 1
llama_model_load_internal: ftype = 1 (mostly F16)
llama_model_load_internal: n_ff = 8640
llama_model_load_internal: model size = 3B
llama_model_load_internal: ggml ctx size = 0.06 MB
llama_model_load_internal: mem required = 7465.87 MB (+ 682.00 MB per state)
llama_new_context_with_model: kv self size = 162.50 MB
system_info: n_threads = 8 / 12 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | VSX = 0 |
sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = 0.800000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 512, n_batch = 512, n_predict = 64, n_keep = 0
I believe the meaning of life is to do something great for humanity. If you are not doing that, then what's your purpose in being born?
I love photography and filmmaking/editing! It allows me a creative way to explore all the beautiful things around us every day.. [end of text]
llama_print_timings: load time = 210.98 ms
llama_print_timings: sample time = 36.61 ms / 53 runs ( 0.69 ms per token, 1447.81 tokens per second)
llama_print_timings: prompt eval time = 348.28 ms / 8 tokens ( 43.53 ms per token, 22.97 tokens per second)
llama_print_timings: eval time = 3105.93 ms / 52 runs ( 59.73 ms per token, 16.74 tokens per second)
llama_print_timings: total time = 3495.01 ms
real 0m4.048s
user 0m27.696s
sys 0m1.054s
+ ./bin/perplexity --model ../models-mnt/open-llama/3B-v2/ggml-model-f16.bin -f ../models-mnt/wikitext/wikitext-2-raw/wiki.test-60.raw -c 128 -b 128 --chunks 3
main: build = 849 (d01bccd)
main: seed = 1689724027
llama.cpp: loading model from ../models-mnt/open-llama/3B-v2/ggml-model-f16.bin
llama_model_load_internal: format = ggjt v1 (pre #1405)
llama_model_load_internal: n_vocab = 32000
llama_model_load_internal: n_ctx = 128
llama_model_load_internal: n_embd = 3200
llama_model_load_internal: n_mult = 240
llama_model_load_internal: n_head = 32
llama_model_load_internal: n_layer = 26
llama_model_load_internal: n_rot = 100
llama_model_load_internal: freq_base = 10000.0
llama_model_load_internal: freq_scale = 1
llama_model_load_internal: ftype = 1 (mostly F16)
llama_model_load_internal: n_ff = 8640
llama_model_load_internal: model size = 3B
llama_model_load_internal: ggml ctx size = 0.06 MB
llama_model_load_internal: mem required = 7439.87 MB (+ 682.00 MB per state)
llama_new_context_with_model: kv self size = 40.62 MB
system_info: n_threads = 8 / 12 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | VSX = 0 |
perplexity: calculating perplexity over 3 chunks, batch_size=128
perplexity: 1.69 seconds per pass - ETA 0 minutes
[1]4.2669,[2]7.2736,[3]8.5293,
llama_print_timings: load time = 1718.03 ms
llama_print_timings: sample time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
llama_print_timings: prompt eval time = 4268.92 ms / 384 tokens ( 11.12 ms per token, 89.95 tokens per second)
llama_print_timings: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
llama_print_timings: total time = 4333.69 ms
real 0m4.561s
user 0m5.291s
sys 0m0.768s
- q8_0:
+ ./bin/main --model ../models-mnt/open-llama/3B-v2/ggml-model-q8_0.bin -s 1234 -n 64 -p 'I believe the meaning of life is'
main: build = 849 (d01bccd)
main: seed = 1234
llama.cpp: loading model from ../models-mnt/open-llama/3B-v2/ggml-model-q8_0.bin
llama_model_load_internal: format = ggjt v3 (latest)
llama_model_load_internal: n_vocab = 32000
llama_model_load_internal: n_ctx = 512
llama_model_load_internal: n_embd = 3200
llama_model_load_internal: n_mult = 240
llama_model_load_internal: n_head = 32
llama_model_load_internal: n_layer = 26
llama_model_load_internal: n_rot = 100
llama_model_load_internal: freq_base = 10000.0
llama_model_load_internal: freq_scale = 1
llama_model_load_internal: ftype = 7 (mostly Q8_0)
llama_model_load_internal: n_ff = 8640
llama_model_load_internal: model size = 3B
llama_model_load_internal: ggml ctx size = 0.06 MB
llama_model_load_internal: mem required = 4381.15 MB (+ 682.00 MB per state)
llama_new_context_with_model: kv self size = 162.50 MB
system_info: n_threads = 8 / 12 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | VSX = 0 |
sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = 0.800000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 512, n_batch = 512, n_predict = 64, n_keep = 0
I believe the meaning of life is to experience it and find out what’s in store for you.
I love traveling, meeting new people especially from different cultures! So far my biggest international trip was when i went with a friend here at HBUC on our way home we stopped by Paris France (the city of lights), but the most fun
llama_print_timings: load time = 116.43 ms
llama_print_timings: sample time = 44.40 ms / 64 runs ( 0.69 ms per token, 1441.47 tokens per second)
llama_print_timings: prompt eval time = 131.18 ms / 8 tokens ( 16.40 ms per token, 60.99 tokens per second)
llama_print_timings: eval time = 2220.21 ms / 63 runs ( 35.24 ms per token, 28.38 tokens per second)
llama_print_timings: total time = 2401.49 ms
real 0m2.533s
user 0m18.621s
sys 0m0.648s
+ ./bin/perplexity --model ../models-mnt/open-llama/3B-v2/ggml-model-q8_0.bin -f ../models-mnt/wikitext/wikitext-2-raw/wiki.test-60.raw -c 128 -b 128 --chunks 3
main: build = 849 (d01bccd)
main: seed = 1689724032
llama.cpp: loading model from ../models-mnt/open-llama/3B-v2/ggml-model-q8_0.bin
llama_model_load_internal: format = ggjt v3 (latest)
llama_model_load_internal: n_vocab = 32000
llama_model_load_internal: n_ctx = 128
llama_model_load_internal: n_embd = 3200
llama_model_load_internal: n_mult = 240
llama_model_load_internal: n_head = 32
llama_model_load_internal: n_layer = 26
llama_model_load_internal: n_rot = 100
llama_model_load_internal: freq_base = 10000.0
llama_model_load_internal: freq_scale = 1
llama_model_load_internal: ftype = 7 (mostly Q8_0)
llama_model_load_internal: n_ff = 8640
llama_model_load_internal: model size = 3B
llama_model_load_internal: ggml ctx size = 0.06 MB
llama_model_load_internal: mem required = 4355.15 MB (+ 682.00 MB per state)
llama_new_context_with_model: kv self size = 40.62 MB
system_info: n_threads = 8 / 12 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | VSX = 0 |
perplexity: calculating perplexity over 3 chunks, batch_size=128
perplexity: 1.46 seconds per pass - ETA 0 minutes
[1]4.2715,[2]7.3133,[3]8.5655,
llama_print_timings: load time = 1466.72 ms
llama_print_timings: sample time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
llama_print_timings: prompt eval time = 3895.19 ms / 384 tokens ( 10.14 ms per token, 98.58 tokens per second)
llama_print_timings: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
llama_print_timings: total time = 3944.11 ms
real 0m3.971s
user 0m5.096s
sys 0m0.549s
- q4_0:
+ ./bin/main --model ../models-mnt/open-llama/3B-v2/ggml-model-q4_0.bin -s 1234 -n 64 -p 'I believe the meaning of life is'
main: build = 849 (d01bccd)
main: seed = 1234
llama.cpp: loading model from ../models-mnt/open-llama/3B-v2/ggml-model-q4_0.bin
llama_model_load_internal: format = ggjt v3 (latest)
llama_model_load_internal: n_vocab = 32000
llama_model_load_internal: n_ctx = 512
llama_model_load_internal: n_embd = 3200
llama_model_load_internal: n_mult = 240
llama_model_load_internal: n_head = 32
llama_model_load_internal: n_layer = 26
llama_model_load_internal: n_rot = 100
llama_model_load_internal: freq_base = 10000.0
llama_model_load_internal: freq_scale = 1
llama_model_load_internal: ftype = 2 (mostly Q4_0)
llama_model_load_internal: n_ff = 8640
llama_model_load_internal: model size = 3B
llama_model_load_internal: ggml ctx size = 0.06 MB
llama_model_load_internal: mem required = 2796.19 MB (+ 682.00 MB per state)
llama_new_context_with_model: kv self size = 162.50 MB
system_info: n_threads = 8 / 12 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | VSX = 0 |
sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = 0.800000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 512, n_batch = 512, n_predict = 64, n_keep = 0
I believe the meaning of life is to create, explore and enjoy. That’s why my work often involves creating something new or taking a look into an old topic in another way that has not been explored before.
In this blog it will mainly be about what inspires me for art/design projects but also other things I feel inspired by like travelling
llama_print_timings: load time = 74.21 ms
llama_print_timings: sample time = 44.70 ms / 64 runs ( 0.70 ms per token, 1431.74 tokens per second)
llama_print_timings: prompt eval time = 138.32 ms / 8 tokens ( 17.29 ms per token, 57.84 tokens per second)
llama_print_timings: eval time = 1463.84 ms / 63 runs ( 23.24 ms per token, 43.04 tokens per second)
llama_print_timings: total time = 1652.36 ms
real 0m1.738s
user 0m12.584s
sys 0m0.400s
+ ./bin/perplexity --model ../models-mnt/open-llama/3B-v2/ggml-model-q4_0.bin -f ../models-mnt/wikitext/wikitext-2-raw/wiki.test-60.raw -c 128 -b 128 --chunks 3
main: build = 849 (d01bccd)
main: seed = 1689724036
llama.cpp: loading model from ../models-mnt/open-llama/3B-v2/ggml-model-q4_0.bin
llama_model_load_internal: format = ggjt v3 (latest)
llama_model_load_internal: n_vocab = 32000
llama_model_load_internal: n_ctx = 128
llama_model_load_internal: n_embd = 3200
llama_model_load_internal: n_mult = 240
llama_model_load_internal: n_head = 32
llama_model_load_internal: n_layer = 26
llama_model_load_internal: n_rot = 100
llama_model_load_internal: freq_base = 10000.0
llama_model_load_internal: freq_scale = 1
llama_model_load_internal: ftype = 2 (mostly Q4_0)
llama_model_load_internal: n_ff = 8640
llama_model_load_internal: model size = 3B
llama_model_load_internal: ggml ctx size = 0.06 MB
llama_model_load_internal: mem required = 2770.19 MB (+ 682.00 MB per state)
llama_new_context_with_model: kv self size = 40.62 MB
system_info: n_threads = 8 / 12 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | VSX = 0 |
perplexity: calculating perplexity over 3 chunks, batch_size=128
perplexity: 1.40 seconds per pass - ETA 0 minutes
[1]4.1754,[2]7.8351,[3]8.9745,
llama_print_timings: load time = 1409.26 ms
llama_print_timings: sample time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
llama_print_timings: prompt eval time = 3856.99 ms / 384 tokens ( 10.04 ms per token, 99.56 tokens per second)
llama_print_timings: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
llama_print_timings: total time = 3904.17 ms
real 0m3.929s
user 0m5.151s
sys 0m0.446s
- q4_1:
+ ./bin/main --model ../models-mnt/open-llama/3B-v2/ggml-model-q4_1.bin -s 1234 -n 64 -p 'I believe the meaning of life is'
main: build = 849 (d01bccd)
main: seed = 1234
llama.cpp: loading model from ../models-mnt/open-llama/3B-v2/ggml-model-q4_1.bin
llama_model_load_internal: format = ggjt v3 (latest)
llama_model_load_internal: n_vocab = 32000
llama_model_load_internal: n_ctx = 512
llama_model_load_internal: n_embd = 3200
llama_model_load_internal: n_mult = 240
llama_model_load_internal: n_head = 32
llama_model_load_internal: n_layer = 26
llama_model_load_internal: n_rot = 100
llama_model_load_internal: freq_base = 10000.0
llama_model_load_internal: freq_scale = 1
llama_model_load_internal: ftype = 3 (mostly Q4_1)
llama_model_load_internal: n_ff = 8640
llama_model_load_internal: model size = 3B
llama_model_load_internal: ggml ctx size = 0.06 MB
llama_model_load_internal: mem required = 2994.31 MB (+ 682.00 MB per state)
llama_new_context_with_model: kv self size = 162.50 MB
system_info: n_threads = 8 / 12 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | VSX = 0 |
sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = 0.800000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 512, n_batch = 512, n_predict = 64, n_keep = 0
I believe the meaning of life is to have fun, so that's what we do.
Living in New York City and going for bike rides with my dog...and taking photos everywhere! [end of text]
llama_print_timings: load time = 79.20 ms
llama_print_timings: sample time = 23.52 ms / 34 runs ( 0.69 ms per token, 1445.64 tokens per second)
llama_print_timings: prompt eval time = 148.47 ms / 8 tokens ( 18.56 ms per token, 53.88 tokens per second)
llama_print_timings: eval time = 779.92 ms / 33 runs ( 23.63 ms per token, 42.31 tokens per second)
llama_print_timings: total time = 954.32 ms
real 0m1.046s
user 0m7.455s
sys 0m0.384s
+ ./bin/perplexity --model ../models-mnt/open-llama/3B-v2/ggml-model-q4_1.bin -f ../models-mnt/wikitext/wikitext-2-raw/wiki.test-60.raw -c 128 -b 128 --chunks 3
main: build = 849 (d01bccd)
main: seed = 1689724040
llama.cpp: loading model from ../models-mnt/open-llama/3B-v2/ggml-model-q4_1.bin
llama_model_load_internal: format = ggjt v3 (latest)
llama_model_load_internal: n_vocab = 32000
llama_model_load_internal: n_ctx = 128
llama_model_load_internal: n_embd = 3200
llama_model_load_internal: n_mult = 240
llama_model_load_internal: n_head = 32
llama_model_load_internal: n_layer = 26
llama_model_load_internal: n_rot = 100
llama_model_load_internal: freq_base = 10000.0
llama_model_load_internal: freq_scale = 1
llama_model_load_internal: ftype = 3 (mostly Q4_1)
llama_model_load_internal: n_ff = 8640
llama_model_load_internal: model size = 3B
llama_model_load_internal: ggml ctx size = 0.06 MB
llama_model_load_internal: mem required = 2968.31 MB (+ 682.00 MB per state)
llama_new_context_with_model: kv self size = 40.62 MB
system_info: n_threads = 8 / 12 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | VSX = 0 |
perplexity: calculating perplexity over 3 chunks, batch_size=128
perplexity: 1.97 seconds per pass - ETA 0 minutes
[1]4.4332,[2]8.1054,[3]9.1845,
llama_print_timings: load time = 1985.46 ms
llama_print_timings: sample time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
llama_print_timings: prompt eval time = 5561.09 ms / 384 tokens ( 14.48 ms per token, 69.05 tokens per second)
llama_print_timings: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
llama_print_timings: total time = 5609.09 ms
real 0m5.634s
user 0m6.779s
sys 0m0.540s
- q5_0:
+ ./bin/main --model ../models-mnt/open-llama/3B-v2/ggml-model-q5_0.bin -s 1234 -n 64 -p 'I believe the meaning of life is'
main: build = 849 (d01bccd)
main: seed = 1234
llama.cpp: loading model from ../models-mnt/open-llama/3B-v2/ggml-model-q5_0.bin
llama_model_load_internal: format = ggjt v3 (latest)
llama_model_load_internal: n_vocab = 32000
llama_model_load_internal: n_ctx = 512
llama_model_load_internal: n_embd = 3200
llama_model_load_internal: n_mult = 240
llama_model_load_internal: n_head = 32
llama_model_load_internal: n_layer = 26
llama_model_load_internal: n_rot = 100
llama_model_load_internal: freq_base = 10000.0
llama_model_load_internal: freq_scale = 1
llama_model_load_internal: ftype = 8 (mostly Q5_0)
llama_model_load_internal: n_ff = 8640
llama_model_load_internal: model size = 3B
llama_model_load_internal: ggml ctx size = 0.06 MB
llama_model_load_internal: mem required = 3192.43 MB (+ 682.00 MB per state)
llama_new_context_with_model: kv self size = 162.50 MB
system_info: n_threads = 8 / 12 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | VSX = 0 |
sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = 0.800000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 512, n_batch = 512, n_predict = 64, n_keep = 0
I believe the meaning of life is to do something worthwhile and meaningful. My purpose in this world, then, would be for me not just make ends meet but also give back to society what others have given me so that everyone could live happier lives too." -Michael
This website was made with a lot of love by Michael Hing. [end of text]
llama_print_timings: load time = 87.98 ms
llama_print_timings: sample time = 45.82 ms / 63 runs ( 0.73 ms per token, 1375.04 tokens per second)
llama_print_timings: prompt eval time = 164.64 ms / 8 tokens ( 20.58 ms per token, 48.59 tokens per second)
llama_print_timings: eval time = 1673.24 ms / 62 runs ( 26.99 ms per token, 37.05 tokens per second)
llama_print_timings: total time = 1888.68 ms
real 0m1.989s
user 0m14.604s
sys 0m0.451s
+ ./bin/perplexity --model ../models-mnt/open-llama/3B-v2/ggml-model-q5_0.bin -f ../models-mnt/wikitext/wikitext-2-raw/wiki.test-60.raw -c 128 -b 128 --chunks 3
main: build = 849 (d01bccd)
main: seed = 1689724045
llama.cpp: loading model from ../models-mnt/open-llama/3B-v2/ggml-model-q5_0.bin
llama_model_load_internal: format = ggjt v3 (latest)
llama_model_load_internal: n_vocab = 32000
llama_model_load_internal: n_ctx = 128
llama_model_load_internal: n_embd = 3200
llama_model_load_internal: n_mult = 240
llama_model_load_internal: n_head = 32
llama_model_load_internal: n_layer = 26
llama_model_load_internal: n_rot = 100
llama_model_load_internal: freq_base = 10000.0
llama_model_load_internal: freq_scale = 1
llama_model_load_internal: ftype = 8 (mostly Q5_0)
llama_model_load_internal: n_ff = 8640
llama_model_load_internal: model size = 3B
llama_model_load_internal: ggml ctx size = 0.06 MB
llama_model_load_internal: mem required = 3166.43 MB (+ 682.00 MB per state)
llama_new_context_with_model: kv self size = 40.62 MB
system_info: n_threads = 8 / 12 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | VSX = 0 |
perplexity: calculating perplexity over 3 chunks, batch_size=128
perplexity: 1.54 seconds per pass - ETA 0 minutes
[1]4.3727,[2]7.6258,[3]8.8299,
llama_print_timings: load time = 1554.33 ms
llama_print_timings: sample time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
llama_print_timings: prompt eval time = 4300.45 ms / 384 tokens ( 11.20 ms per token, 89.29 tokens per second)
llama_print_timings: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
llama_print_timings: total time = 4347.28 ms
real 0m4.371s
user 0m5.556s
sys 0m0.497s
- q5_1:
+ ./bin/main --model ../models-mnt/open-llama/3B-v2/ggml-model-q5_1.bin -s 1234 -n 64 -p 'I believe the meaning of life is'
main: build = 849 (d01bccd)
main: seed = 1234
llama.cpp: loading model from ../models-mnt/open-llama/3B-v2/ggml-model-q5_1.bin
llama_model_load_internal: format = ggjt v3 (latest)
llama_model_load_internal: n_vocab = 32000
llama_model_load_internal: n_ctx = 512
llama_model_load_internal: n_embd = 3200
llama_model_load_internal: n_mult = 240
llama_model_load_internal: n_head = 32
llama_model_load_internal: n_layer = 26
llama_model_load_internal: n_rot = 100
llama_model_load_internal: freq_base = 10000.0
llama_model_load_internal: freq_scale = 1
llama_model_load_internal: ftype = 9 (mostly Q5_1)
llama_model_load_internal: n_ff = 8640
llama_model_load_internal: model size = 3B
llama_model_load_internal: ggml ctx size = 0.06 MB
llama_model_load_internal: mem required = 3390.55 MB (+ 682.00 MB per state)
llama_new_context_with_model: kv self size = 162.50 MB
system_info: n_threads = 8 / 12 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | VSX = 0 |
sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = 0.800000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 512, n_batch = 512, n_predict = 64, n_keep = 0
I believe the meaning of life is to give something back and make a difference.
I have been in business for 20 years now with companies such as Kellogg’s, Rowntree Mackintosh and Tesco! As part of our company ethos we support charities within local communities where possible so that they can reach their goals;
llama_print_timings: load time = 96.11 ms
llama_print_timings: sample time = 45.73 ms / 64 runs ( 0.71 ms per token, 1399.58 tokens per second)
llama_print_timings: prompt eval time = 171.68 ms / 8 tokens ( 21.46 ms per token, 46.60 tokens per second)
llama_print_timings: eval time = 1848.87 ms / 63 runs ( 29.35 ms per token, 34.07 tokens per second)
llama_print_timings: total time = 2071.32 ms
real 0m2.180s
user 0m16.050s
sys 0m0.479s
+ ./bin/perplexity --model ../models-mnt/open-llama/3B-v2/ggml-model-q5_1.bin -f ../models-mnt/wikitext/wikitext-2-raw/wiki.test-60.raw -c 128 -b 128 --chunks 3
main: build = 849 (d01bccd)
main: seed = 1689724050
llama.cpp: loading model from ../models-mnt/open-llama/3B-v2/ggml-model-q5_1.bin
llama_model_load_internal: format = ggjt v3 (latest)
llama_model_load_internal: n_vocab = 32000
llama_model_load_internal: n_ctx = 128
llama_model_load_internal: n_embd = 3200
llama_model_load_internal: n_mult = 240
llama_model_load_internal: n_head = 32
llama_model_load_internal: n_layer = 26
llama_model_load_internal: n_rot = 100
llama_model_load_internal: freq_base = 10000.0
llama_model_load_internal: freq_scale = 1
llama_model_load_internal: ftype = 9 (mostly Q5_1)
llama_model_load_internal: n_ff = 8640
llama_model_load_internal: model size = 3B
llama_model_load_internal: ggml ctx size = 0.06 MB
llama_model_load_internal: mem required = 3364.55 MB (+ 682.00 MB per state)
llama_new_context_with_model: kv self size = 40.62 MB
system_info: n_threads = 8 / 12 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | VSX = 0 |
perplexity: calculating perplexity over 3 chunks, batch_size=128
perplexity: 1.52 seconds per pass - ETA 0 minutes
[1]4.2748,[2]7.4203,[3]8.6805,
llama_print_timings: load time = 1526.48 ms
llama_print_timings: sample time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
llama_print_timings: prompt eval time = 4171.38 ms / 384 tokens ( 10.86 ms per token, 92.06 tokens per second)
llama_print_timings: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
llama_print_timings: total time = 4217.65 ms
real 0m4.245s
user 0m5.422s
sys 0m0.477s
- q3_k:
+ ./bin/main --model ../models-mnt/open-llama/3B-v2/ggml-model-q3_k.bin -s 1234 -n 64 -p 'I believe the meaning of life is'
main: build = 849 (d01bccd)
main: seed = 1234
llama.cpp: loading model from ../models-mnt/open-llama/3B-v2/ggml-model-q3_k.bin
llama_model_load_internal: format = ggjt v3 (latest)
llama_model_load_internal: n_vocab = 32000
llama_model_load_internal: n_ctx = 512
llama_model_load_internal: n_embd = 3200
llama_model_load_internal: n_mult = 240
llama_model_load_internal: n_head = 32
llama_model_load_internal: n_layer = 26
llama_model_load_internal: n_rot = 100
llama_model_load_internal: freq_base = 10000.0
llama_model_load_internal: freq_scale = 1
llama_model_load_internal: ftype = 12 (mostly Q3_K - Medium)
llama_model_load_internal: n_ff = 8640
llama_model_load_internal: model size = 3B
llama_model_load_internal: ggml ctx size = 0.06 MB
llama_model_load_internal: mem required = 2586.41 MB (+ 682.00 MB per state)
llama_new_context_with_model: kv self size = 162.50 MB
system_info: n_threads = 8 / 12 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | VSX = 0 |
sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = 0.800000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 512, n_batch = 512, n_predict = 64, n_keep = 0
I believe the meaning of life is to make it worthwhile, and that everyone has a purpose in this world.
I am passionate about what you bring into your life: health & wellness; community service projects with youth or senior citizens (we need people my age); volunteering at events like Relay for Life – we can do more as women! I hope
llama_print_timings: load time = 76.44 ms
llama_print_timings: sample time = 44.62 ms / 64 runs ( 0.70 ms per token, 1434.40 tokens per second)
llama_print_timings: prompt eval time = 190.59 ms / 8 tokens ( 23.82 ms per token, 41.97 tokens per second)
llama_print_timings: eval time = 1899.48 ms / 63 runs ( 30.15 ms per token, 33.17 tokens per second)
llama_print_timings: total time = 2140.13 ms
real 0m2.228s
user 0m16.518s
sys 0m0.371s
+ ./bin/perplexity --model ../models-mnt/open-llama/3B-v2/ggml-model-q3_k.bin -f ../models-mnt/wikitext/wikitext-2-raw/wiki.test-60.raw -c 128 -b 128 --chunks 3
main: build = 849 (d01bccd)
main: seed = 1689724054
llama.cpp: loading model from ../models-mnt/open-llama/3B-v2/ggml-model-q3_k.bin
llama_model_load_internal: format = ggjt v3 (latest)
llama_model_load_internal: n_vocab = 32000
llama_model_load_internal: n_ctx = 128
llama_model_load_internal: n_embd = 3200
llama_model_load_internal: n_mult = 240
llama_model_load_internal: n_head = 32
llama_model_load_internal: n_layer = 26
llama_model_load_internal: n_rot = 100
llama_model_load_internal: freq_base = 10000.0
llama_model_load_internal: freq_scale = 1
llama_model_load_internal: ftype = 12 (mostly Q3_K - Medium)
llama_model_load_internal: n_ff = 8640
llama_model_load_internal: model size = 3B
llama_model_load_internal: ggml ctx size = 0.06 MB
llama_model_load_internal: mem required = 2560.41 MB (+ 682.00 MB per state)
llama_new_context_with_model: kv self size = 40.62 MB
system_info: n_threads = 8 / 12 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | VSX = 0 |
perplexity: calculating perplexity over 3 chunks, batch_size=128
perplexity: 1.51 seconds per pass - ETA 0 minutes
[1]4.4517,[2]8.0657,[3]9.5844,
llama_print_timings: load time = 1524.91 ms
llama_print_timings: sample time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
llama_print_timings: prompt eval time = 4253.67 ms / 384 tokens ( 11.08 ms per token, 90.27 tokens per second)
llama_print_timings: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
llama_print_timings: total time = 4302.28 ms
real 0m4.327s
user 0m5.546s
sys 0m0.464s
- q4_k:
+ ./bin/main --model ../models-mnt/open-llama/3B-v2/ggml-model-q4_k.bin -s 1234 -n 64 -p 'I believe the meaning of life is'
main: build = 849 (d01bccd)
main: seed = 1234
llama.cpp: loading model from ../models-mnt/open-llama/3B-v2/ggml-model-q4_k.bin
llama_model_load_internal: format = ggjt v3 (latest)
llama_model_load_internal: n_vocab = 32000
llama_model_load_internal: n_ctx = 512
llama_model_load_internal: n_embd = 3200
llama_model_load_internal: n_mult = 240
llama_model_load_internal: n_head = 32
llama_model_load_internal: n_layer = 26
llama_model_load_internal: n_rot = 100
llama_model_load_internal: freq_base = 10000.0
llama_model_load_internal: freq_scale = 1
llama_model_load_internal: ftype = 15 (mostly Q4_K - Medium)
llama_model_load_internal: n_ff = 8640
llama_model_load_internal: model size = 3B
llama_model_load_internal: ggml ctx size = 0.06 MB
llama_model_load_internal: mem required = 3012.68 MB (+ 682.00 MB per state)
llama_new_context_with_model: kv self size = 162.50 MB
system_info: n_threads = 8 / 12 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | VSX = 0 |
sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = 0.800000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 512, n_batch = 512, n_predict = 64, n_keep = 0
I believe the meaning of life is to make a difference, and that there are no limits on what you can accomplish.
I am proud...the way we work together at Lundin Legal has created an environment where each member feels welcomed in any new challenges or opportunities presented within our firm." [end of text]
llama_print_timings: load time = 81.58 ms
llama_print_timings: sample time = 36.70 ms / 53 runs ( 0.69 ms per token, 1444.14 tokens per second)
llama_print_timings: prompt eval time = 166.78 ms / 8 tokens ( 20.85 ms per token, 47.97 tokens per second)
llama_print_timings: eval time = 1354.00 ms / 52 runs ( 26.04 ms per token, 38.40 tokens per second)
llama_print_timings: total time = 1562.10 ms
real 0m1.654s
user 0m12.122s
sys 0m0.409s
+ ./bin/perplexity --model ../models-mnt/open-llama/3B-v2/ggml-model-q4_k.bin -f ../models-mnt/wikitext/wikitext-2-raw/wiki.test-60.raw -c 128 -b 128 --chunks 3
main: build = 849 (d01bccd)
main: seed = 1689724058
llama.cpp: loading model from ../models-mnt/open-llama/3B-v2/ggml-model-q4_k.bin
llama_model_load_internal: format = ggjt v3 (latest)
llama_model_load_internal: n_vocab = 32000
llama_model_load_internal: n_ctx = 128
llama_model_load_internal: n_embd = 3200
llama_model_load_internal: n_mult = 240
llama_model_load_internal: n_head = 32
llama_model_load_internal: n_layer = 26
llama_model_load_internal: n_rot = 100
llama_model_load_internal: freq_base = 10000.0
llama_model_load_internal: freq_scale = 1
llama_model_load_internal: ftype = 15 (mostly Q4_K - Medium)
llama_model_load_internal: n_ff = 8640
llama_model_load_internal: model size = 3B
llama_model_load_internal: ggml ctx size = 0.06 MB
llama_model_load_internal: mem required = 2986.68 MB (+ 682.00 MB per state)
llama_new_context_with_model: kv self size = 40.62 MB
system_info: n_threads = 8 / 12 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | VSX = 0 |
perplexity: calculating perplexity over 3 chunks, batch_size=128
perplexity: 1.51 seconds per pass - ETA 0 minutes
[1]4.2788,[2]7.6835,[3]9.0285,
llama_print_timings: load time = 1517.63 ms
llama_print_timings: sample time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
llama_print_timings: prompt eval time = 4173.48 ms / 384 tokens ( 10.87 ms per token, 92.01 tokens per second)
llama_print_timings: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
llama_print_timings: total time = 4221.51 ms
real 0m4.246s
user 0m5.403s
sys 0m0.524s
- q5_k:
+ ./bin/main --model ../models-mnt/open-llama/3B-v2/ggml-model-q5_k.bin -s 1234 -n 64 -p 'I believe the meaning of life is'
main: build = 849 (d01bccd)
main: seed = 1234
llama.cpp: loading model from ../models-mnt/open-llama/3B-v2/ggml-model-q5_k.bin
llama_model_load_internal: format = ggjt v3 (latest)
llama_model_load_internal: n_vocab = 32000
llama_model_load_internal: n_ctx = 512
llama_model_load_internal: n_embd = 3200
llama_model_load_internal: n_mult = 240
llama_model_load_internal: n_head = 32
llama_model_load_internal: n_layer = 26
llama_model_load_internal: n_rot = 100
llama_model_load_internal: freq_base = 10000.0
llama_model_load_internal: freq_scale = 1
llama_model_load_internal: ftype = 17 (mostly Q5_K - Medium)
llama_model_load_internal: n_ff = 8640
llama_model_load_internal: model size = 3B
llama_model_load_internal: ggml ctx size = 0.06 MB
llama_model_load_internal: mem required = 3350.21 MB (+ 682.00 MB per state)
llama_new_context_with_model: kv self size = 162.50 MB
system_info: n_threads = 8 / 12 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | VSX = 0 |
sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = 0.800000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 512, n_batch = 512, n_predict = 64, n_keep = 0
I believe the meaning of life is to give yourself. We are all so busy with our lives that we forget about ourselves and what makes us feel loved, appreciated or special!
I want a woman who can be herself in my presence!! Not pretentious but confident!!! No games no drama just love sex n fun times together i promise u wont regret
llama_print_timings: load time = 93.42 ms
llama_print_timings: sample time = 44.17 ms / 64 runs ( 0.69 ms per token, 1449.05 tokens per second)
llama_print_timings: prompt eval time = 241.56 ms / 8 tokens ( 30.20 ms per token, 33.12 tokens per second)
llama_print_timings: eval time = 1890.88 ms / 63 runs ( 30.01 ms per token, 33.32 tokens per second)
llama_print_timings: total time = 2182.23 ms
real 0m2.287s
user 0m16.935s
sys 0m0.479s
+ ./bin/perplexity --model ../models-mnt/open-llama/3B-v2/ggml-model-q5_k.bin -f ../models-mnt/wikitext/wikitext-2-raw/wiki.test-60.raw -c 128 -b 128 --chunks 3
main: build = 849 (d01bccd)
main: seed = 1689724063
llama.cpp: loading model from ../models-mnt/open-llama/3B-v2/ggml-model-q5_k.bin
llama_model_load_internal: format = ggjt v3 (latest)
llama_model_load_internal: n_vocab = 32000
llama_model_load_internal: n_ctx = 128
llama_model_load_internal: n_embd = 3200
llama_model_load_internal: n_mult = 240
llama_model_load_internal: n_head = 32
llama_model_load_internal: n_layer = 26
llama_model_load_internal: n_rot = 100
llama_model_load_internal: freq_base = 10000.0
llama_model_load_internal: freq_scale = 1
llama_model_load_internal: ftype = 17 (mostly Q5_K - Medium)
llama_model_load_internal: n_ff = 8640
llama_model_load_internal: model size = 3B
llama_model_load_internal: ggml ctx size = 0.06 MB
llama_model_load_internal: mem required = 3324.21 MB (+ 682.00 MB per state)
llama_new_context_with_model: kv self size = 40.62 MB
system_info: n_threads = 8 / 12 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | VSX = 0 |
perplexity: calculating perplexity over 3 chunks, batch_size=128
perplexity: 1.59 seconds per pass - ETA 0 minutes
[1]4.3469,[2]7.5239,[3]8.7041,
llama_print_timings: load time = 1598.12 ms
llama_print_timings: sample time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
llama_print_timings: prompt eval time = 4412.82 ms / 384 tokens ( 11.49 ms per token, 87.02 tokens per second)
llama_print_timings: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
llama_print_timings: total time = 4460.90 ms
real 0m4.486s
user 0m5.662s
sys 0m0.533s
- q6_k:
+ ./bin/main --model ../models-mnt/open-llama/3B-v2/ggml-model-q6_k.bin -s 1234 -n 64 -p 'I believe the meaning of life is'
main: build = 849 (d01bccd)
main: seed = 1234
llama.cpp: loading model from ../models-mnt/open-llama/3B-v2/ggml-model-q6_k.bin
llama_model_load_internal: format = ggjt v3 (latest)
llama_model_load_internal: n_vocab = 32000
llama_model_load_internal: n_ctx = 512
llama_model_load_internal: n_embd = 3200
llama_model_load_internal: n_mult = 240
llama_model_load_internal: n_head = 32
llama_model_load_internal: n_layer = 26
llama_model_load_internal: n_rot = 100
llama_model_load_internal: freq_base = 10000.0
llama_model_load_internal: freq_scale = 1
llama_model_load_internal: ftype = 18 (mostly Q6_K)
llama_model_load_internal: n_ff = 8640
llama_model_load_internal: model size = 3B
llama_model_load_internal: ggml ctx size = 0.06 MB
llama_model_load_internal: mem required = 3687.73 MB (+ 682.00 MB per state)
llama_new_context_with_model: kv self size = 162.50 MB
system_info: n_threads = 8 / 12 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | VSX = 0 |
sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = 0.800000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 512, n_batch = 512, n_predict = 64, n_keep = 0
I believe the meaning of life is to create a family that loves and protects each other. My husband has been my best friend for 12 years, we have three amazing children together (Mason -7, Ellie-5 and Clementine-(3). We met at college when he was in his first year studying Maths and I graduated
llama_print_timings: load time = 144.84 ms
llama_print_timings: sample time = 44.24 ms / 64 runs ( 0.69 ms per token, 1446.75 tokens per second)
llama_print_timings: prompt eval time = 208.43 ms / 8 tokens ( 26.05 ms per token, 38.38 tokens per second)
llama_print_timings: eval time = 1999.88 ms / 63 runs ( 31.74 ms per token, 31.50 tokens per second)
llama_print_timings: total time = 2257.85 ms
real 0m2.417s
user 0m17.571s
sys 0m0.896s
+ ./bin/perplexity --model ../models-mnt/open-llama/3B-v2/ggml-model-q6_k.bin -f ../models-mnt/wikitext/wikitext-2-raw/wiki.test-60.raw -c 128 -b 128 --chunks 3
main: build = 849 (d01bccd)
main: seed = 1689724067
llama.cpp: loading model from ../models-mnt/open-llama/3B-v2/ggml-model-q6_k.bin
llama_model_load_internal: format = ggjt v3 (latest)
llama_model_load_internal: n_vocab = 32000
llama_model_load_internal: n_ctx = 128
llama_model_load_internal: n_embd = 3200
llama_model_load_internal: n_mult = 240
llama_model_load_internal: n_head = 32
llama_model_load_internal: n_layer = 26
llama_model_load_internal: n_rot = 100
llama_model_load_internal: freq_base = 10000.0
llama_model_load_internal: freq_scale = 1
llama_model_load_internal: ftype = 18 (mostly Q6_K)
llama_model_load_internal: n_ff = 8640
llama_model_load_internal: model size = 3B
llama_model_load_internal: ggml ctx size = 0.06 MB
llama_model_load_internal: mem required = 3661.73 MB (+ 682.00 MB per state)
llama_new_context_with_model: kv self size = 40.62 MB
system_info: n_threads = 8 / 12 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | VSX = 0 |
perplexity: calculating perplexity over 3 chunks, batch_size=128
perplexity: 1.68 seconds per pass - ETA 0 minutes
[1]4.2841,[2]7.3233,[3]8.5635,
llama_print_timings: load time = 1686.78 ms
llama_print_timings: sample time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
llama_print_timings: prompt eval time = 4729.79 ms / 384 tokens ( 12.32 ms per token, 81.19 tokens per second)
llama_print_timings: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
llama_print_timings: total time = 4777.77 ms
real 0m4.804s
user 0m5.914s
sys 0m0.591s
Thanks for the detailed info!
I just pushed the following change:
- mnt_models=$(realpath ${MNT}/models)
+ mnt_models=${MNT}/models
I think it should be equivalent to the proposed fix, since ${MNT} has already been resolved with realpath earlier in the script.

ref ggerganov/ggml#295
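For completeness, a minimal sketch of why the two forms agree (the paths are illustrative; ${MNT} is already absolute at this point thanks to the realpath call on line 12):

MNT=$(realpath ./tmp/mnt)       # e.g. /home/user/llama.cpp/tmp/mnt
mnt_models=${MNT}/models        # /home/user/llama.cpp/tmp/mnt/models, no extra realpath needed
mkdir -p ${mnt_models}          # the directory is created right after, as in ci/run.sh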
Description

In addition to GitHub Actions, llama.cpp uses a custom CI framework: https://github.com/ggml-org/ci

It monitors the master branch for new commits and runs the ci/run.sh script on dedicated cloud instances. This allows us to execute heavier workloads compared to just using GitHub Actions. Also, with time, the cloud instances will be scaled to cover various hardware architectures, including GPU and Apple Silicon instances.

Collaborators can optionally trigger the CI run by adding the ggml-ci keyword to their commit message. Only the branches of this repo are monitored for this keyword.

It is good practice, before publishing changes, to execute the full CI locally on your machine:
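For example (the same scratch directories used earlier in this thread work fine; any writable paths can be substituted):

mkdir tmp
bash ./ci/run.sh ./tmp/results ./tmp/mnt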
TODO: