llama : greatly reduce output buffer memory usage #6122

compilade · 2024-03-17T23:41:55Z

Supersedes #2700.

As I've noted in #6017 (comment), the logits buffer is way too big and mostly unused when the batch size is very large. This PR fixes this waste of memory by introducing another, smaller buffer of ids (ctx->output_ids) which point into the logits and embeddings buffers, allowing to keep most of the behavior of llama_get_logits_ith() and llama_get_embeddings_ith() while making the output buffer's content contiguous. While I was at it, I've also noticed it was relatively easy to skip computing unused logits with the new output buffer layout introduced in this change.

From @slaren's suggestion in #6017 (comment), this PR allocates space for n_seq_max outputs, then dynamically reallocates the output buffer when more logits and/or embeddings are necessary.
(Thanks for making me think to look for #2700 when alluding to it in #6017 (comment))

API changes

llama_get_logits and llama_get_embeddings now return contiguous arrays (still float *).
- The logits and/or the embeddings for which llama_batch.logits[i] == true are stored contiguously in the order they have in the batch.
- The layout is the same as before if logits_all == true and when using llama_batch_get_one() (which doesn't set llama_batch.logits).
- Only the necessary logits and embeddings are computed, so reducing the use of logits_all now has a slight performance incentive
llama_get_logits_ith and llama_get_embeddings_ith now always verify if the passed id is valid.
- The result was undefined in some cases before.
- Note that the perplexity example previously relied on the non-verification of this (assertion error when running with a Debug build on master), which has been fixed. (in other words, this fixes perplexity assert fails #6246)
The session file format changed slightly, and is now much smaller (a lot of space for unused logits was wasted before). 🎉
- For example, storing the state of Tinyllama with a batch size of 512 previously used at least 63 MiB, but now the equivalent session file takes only 1 MiB (mostly composed of the used KV cache cells from my test prompt, it now only includes the needed logits from the last batch), regardless of the batch size, whereas a batch size of 1024 previously would use around 126 MiB with Tinyllama.
- It's now possible to load a session file with a different context size than what it was saved with, as long as the new context size is at least enough to contain the state for the saved tokens.

Notes

The perplexity example used the previous layout of the logits very extensively.
I've adapted most of it, but the parts of it which still use logits_all (even though this still works) will need to be changed in the future to benefit from skipping the computation of unused logits.

The perplexity example should have the exact same output as master when compiled with the same flags. ~~Well, except with Winogrande, because I've modified the implementation.~~ (EDIT: Winogrande output should be the same as master as of 8f70dcb.)
(note to @ikawrakow: the logic for skipping the choice words in winogrande_score() shouldn't be required, but I still kept it. so I've simplified it by also fixing the logits used when evaluating the second choice (previously, the logits of the end of the first choice were used there instead of the logits of the end of the common prefix, which caused a big skew in the log-likelyhood of some second choices). This will need to be fixed in a separate PR.)

TODO

Since a (small) model-specific change was required for each of the 23+ architectures supported by llama.cpp, I'd like to at least ensure I didn't break any of them. I'd really like to know if it works on GPU and/or with MoE models.

Feel free to edit the list below to add more tested models and examples.

Compare the output with --temp 0 when using this PR vs master.

Mamba (on CPU: main, parallel (with and without -cb), perplexity (v1, v2, hellaswag, multiple-choices), imatrix, save-load-state, embedding)
LLama (with Tinyllama (on CPU: speculative (with Llama-160M as a draft model), save-load-state))
- EDIT (2024-03-19): seems to differ now, investigating... False alarm, it's because repetition penalty is now disabled by default in master since common : disable repeat penalties by default #6127, so for the outputs to be the same, --repeat-penalty 1 now has to be passed.
Phi-2 (on CPU: main, save-load-state)
GritLM
BERT (from the CI test-suite: server)
Any Moe Model (Mixtral (with HipBLAS: llama-bench, main))
(17 other model types (TODO: expand this list))

Known issues:

~~server with embeddings~~
Some backends don't support GGML_OP_GET_ROWS and don't fallback properly, (e.g. the Vulkan backend)

The first logits used to evaluate the second choice were not from the end of the common prefix; instead, they were the logits from the end of the first choice. This has been corrected. The previous implementation sometimes had outliers in the scores of choices for some tasks, and the logic to skip choices words in the log-likelihood evaluation probably was an attempt to reduce those, but it was complex and didn't quite seem to be the right thing. This is simpler now, and the outlier scores aren't there anymore.

A mismatch happened when using a smaller n_ubatch than n_batch and then using llama_batch_get_one(). The decision of what n_outputs should be now almost fully depends on how lctx.n_outputs is set in llama_decode_internal. The conditions are simpler this way. * llama : when saving the state, recalculate n_outputs This ensures the correct number of outputs for the entire previous batch is stored in the session file, even when n_ubatch is smaller than n_batch.

llama.cpp

slaren · 2024-03-18T00:21:11Z

As it is, this breaks pipeline parallelism because changes in the graph topology force a synchronization. I think it should be possible to fix this if the final get_rows is done unconditionally.

You can test this without multiple GPUs by building in debug with LLAMA_DEBUG, and using llama-bench with n_batch > n_ubatch, or perplexity with multple sequences. If you get ggml_backend_sched_alloc_splits: failed to allocate graph, reserving messages on every eval, then it is breaking pipeline parallelism.

It previously worked because lctx.inp_out_ids was not initialized, so it pointed to some garbage address which was somehow still valid when I ran my tests.

compilade · 2024-03-18T01:06:33Z

As it is, this breaks pipeline parallelism because changes in the graph topology force a synchronization. I think it should be possible to fix this if the final get_rows is done unconditionally.

Yes, this should be possible. I initially put the skip of the rest of the graph there because tensors with 0 rows caused division by zero problems in ggml_can_repeat. But working around this could allow keeping the same graphs (assuming no other problems, like more divisions by zero elsewhere).

You can test this without multiple GPUs by building in debug with LLAMA_DEBUG, and using llama-bench with n_batch > n_ubatch, or perplexity with multple sequences. If you get ggml_backend_sched_alloc_splits: failed to allocate graph, reserving messages on every eval, then it is breaking pipeline parallelism.

I'll see if I can try this on my Intel UHD Graphics 615 (since splits don't seem to happen on CPU-only).
But on CPU, when using the parallel example (and also perplexity with multiple sequences), I do see a bunch of

ggml_gallocr_needs_realloc: graph has different number of nodes
ggml_gallocr_alloc_graph: reallocating buffers automatically

which may be problematic when using a GPU.

I'll try to avoid changing the graph topology.

slaren · 2024-03-18T01:14:14Z

I'll see if I can try this on my Intel UHD Graphics 615 (since splits don't seem to happen on CPU-only).

Yes, I forgot that the reallocation is automatic if there is only one backend, but that message also shows the problem. If that message disappears, then it should also work with pipeline parallelism.

* ggml : future-proof ggml_is_empty by using GGML_MAX_DIMS - 1

fgdfgfthgr-fox · 2024-03-18T06:04:32Z

Test result using Radeon VII HipBLAS build under Linux environment.
I don't see any significant decrease in VRAM use though. In what case should this pr 's change to be significant?
(please ignore the koboldcpp file path. It's running using llama.cpp, I just store my model weight under koboldcpp's file path)
master:

$ ./llama-bench -m /mnt/2878EBCCAED823C6/koboldcpp-rocm/mixtral/mixtral-8x7b-iq3_xxs.gguf -ngl 22
ggml_init_cublas: GGML_CUDA_FORCE_MMQ:   no
ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes
ggml_init_cublas: found 1 ROCm devices:
  Device 0: AMD Radeon VII, compute capability 9.0, VMM: no
| model                          |       size |     params | backend    | ngl | test       |              t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ---------- | ---------------: |
| llama 7B IQ3_XXS - 3.0625 bpw  |  16.99 GiB |    46.70 B | ROCm       |  22 | pp 512     |     87.74 ± 0.11 |
| llama 7B IQ3_XXS - 3.0625 bpw  |  16.99 GiB |    46.70 B | ROCm       |  22 | tg 128     |      7.55 ± 0.21 |

build: d01b3c4c (2450)
Peak VRAM Used: 13376MB

$ ./main -ngl 22 -m /mnt/2878EBCCAED823C6/koboldcpp-rocm/mixtral/mixtral-8x7b-iq3_xxs.gguf -n 64 --temp 0  -p "In a land before time"
...
generate: n_ctx = 512, n_batch = 2048, n_predict = 64, n_keep = 1


 In a land before time, when the internet was still in its infancy and social media didn’t exist, there were only two ways to get your message out: print or broadcast.

Print is great for reaching people who are already interested in what you have to say – but it can be expensive and hard to measure results. Broad
llama_print_timings:        load time =    2615.59 ms
llama_print_timings:      sample time =       7.09 ms /    64 runs   (    0.11 ms per token,  9026.80 tokens per second)
llama_print_timings: prompt eval time =     715.49 ms /     6 tokens (  119.25 ms per token,     8.39 tokens per second)
llama_print_timings:        eval time =    8572.94 ms /    63 runs   (  136.08 ms per token,     7.35 tokens per second)
llama_print_timings:       total time =    9315.96 ms /    69 tokens

pr:

$ ./llama-bench -m /mnt/2878EBCCAED823C6/koboldcpp-rocm/mixtral/mixtral-8x7b-iq3_xxs.gguf -ngl 22
ggml_init_cublas: GGML_CUDA_FORCE_MMQ:   no
ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes
ggml_init_cublas: found 1 ROCm devices:
  Device 0: AMD Radeon VII, compute capability 9.0, VMM: no
| model                          |       size |     params | backend    | ngl | test       |              t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ---------- | ---------------: |
| llama 7B IQ3_XXS - 3.0625 bpw  |  16.99 GiB |    46.70 B | ROCm       |  22 | pp 512     |     90.56 ± 0.13 |
| llama 7B IQ3_XXS - 3.0625 bpw  |  16.99 GiB |    46.70 B | ROCm       |  22 | tg 128     |      7.41 ± 0.13 |

build: 6bf7f3f4 (2464)
Peak VRAM Used: 13325MB

$ ./main -ngl 22 -m /mnt/2878EBCCAED823C6/koboldcpp-rocm/mixtral/mixtral-8x7b-iq3_xxs.gguf -n 64 --temp 0  -p "In a land before time"
...
generate: n_ctx = 512, n_batch = 2048, n_predict = 64, n_keep = 1


 In a land before time, when the internet was still in its infancy and social media didn’t exist, there were only two ways to get your message out: print or broadcast.

Print is great for reaching people who are already interested in what you have to say – but it can be expensive and hard to measure results. Broad
llama_print_timings:        load time =    2659.17 ms
llama_print_timings:      sample time =       7.00 ms /    64 runs   (    0.11 ms per token,  9148.08 tokens per second)
llama_print_timings: prompt eval time =     551.66 ms /     6 tokens (   91.94 ms per token,    10.88 tokens per second)
llama_print_timings:        eval time =    8741.20 ms /    63 runs   (  138.75 ms per token,     7.21 tokens per second)
llama_print_timings:       total time =    9320.75 ms /    69 tokens

Dampfinchen · 2024-03-18T09:10:40Z

That sounds really promising! However, when trying to run this PR, I'm getting this error:

fixed or limited cutCUDA error: invalid configuration argument
  current device: 0, in function ggml_cuda_op_flatten at /userdir/llama.cpp memory/llama.cpp/ggml-cuda.cu:9149
  cudaGetLastError()
GGML_ASSERT: /userdir/llama.cpp/ggml-cuda.cu:267: !"CUDA error"
Abgebrochen (Speicherabzug geschrieben)

I was compiling it with CMake, Cuda Toolkit 12.3. RTX 2060, Core i7 9750H, 32 GB.

Does this PR only support llama or its derivatives like Mistral as well?

My model was Fimbulvetr-v2 at IQ4_XS, which is a Solar 10.7B merge.

ggml.c

llama.cpp

* llama : rework reallocation logic for llama_output_reserve Now comparing the actual size with the new total size of the output buffer to allow more efficient enabling and disabling of the embeddings and/or logits output in the future.

compilade · 2024-03-19T02:27:12Z

Answering 2 messages at once:

I don't see any significant decrease in VRAM use though.

@fgdfgfthgr-fox The expected decrease is at most n_vocab*(n_batch - 1)*sizeof(float) bytes, so for Mixtral, if its n_vocab is around 50000 and if using n_batch == 512, the maximum decrease will be 97 MiB, while with llama-bench, you're seeing a decrease of 50 MiB, which seems reasonable. BTW, thanks for testing this with Mixtral :)

In what case should this pr 's change to be significant?

It should be significant when using a very large batch size. For example, with -b 8192, this PR could save up to 1.5 GiB of memory (or 3.05 GiB of memory with -b 16384, etc.) compared to master. (EDIT: this should only affect RAM, not VRAM, because the output buffer is using a CPU buffer (ref: #6122 (comment)). This still means bigger logical batch sizes will be usable without wasting lots of RAM, hopefully making pipeline parallelism more scalable.)
With llama-bench, I think this can be tested by adding the flags -b 8192 -p 8192, while with main, -b 8192 -c 8192 could be used (to ensure the batch size won't be clamped by the context size). Of course, if the KV cache (unaffected by this PR) is too big for your system, you can use smaller numbers.

That sounds really promising! However, when trying to run this PR, I'm getting this error:

@Dampfinchen Thanks for testing this with CUDA. A more complete backtrace would be helpful¹, but in this case this is probably happening because I didn't yet make the other backends than CPU skip empty tensors, so what you're seeing is likely the symptoms of a GPU-accelerated division by zero. Hopefully fixed by 8b826c5. This didn't happen to @fgdfgfthgr-fox because they didn't offload enough layers for the last one to be offloaded (which can now have tensors with no elements when no logits are used, causing problems when dividing dimensions with each other if not skipped).

Does this PR only support llama or its derivatives like Mistral as well?

The goal is for this to support every of the 23+ model architectures supported by llama.cpp. In theory, they should all work, but maybe there's something I overlooked somewhere², which is why I encourage testing models of different types and different backends. Mistral should work, and from #6122 (comment), Mixtral too.

Apparently, GGML_ASSERT prints full backtraces when gdb is installed on Linux and Unix-like systems. I didn't know for a long time because the only debugger I had already installed was lldb. Knowing this (or coredumpctl debug --debugger=lldb, followed by bt) two months ago would have saved me a lot of printf-based debugging. ↩
like the Vulkan backend not properly supporting GGML_GET_ROWS... Probably wasn't too important before because ggml_get_rows was mostly used on input tensors which are always using the CPU backend. (but ggml_get_rows was also used for MoE inference, so backends on which Mixtral worked likely do not have this problem) ↩

slaren · 2024-03-19T02:29:59Z

Works well with CUDA, improves pp performance by 2-3% with a single GPU.

GPU	Model	Model Size [GiB]	Test	t/s master	t/s PR	Speedup
RTX 3090 Ti	llama 7B F16	12.55	pp512	5288.37	5378.25	1.02
RTX 3090 Ti	llama 7B F16	12.55	pp1024	5031.00	5144.94	1.02
RTX 3090 Ti	llama 7B F16	12.55	pp2048	4600.25	4710.85	1.02
RTX 3090 Ti	llama 7B F16	12.55	pp4096	3956.06	4032.93	1.02
RTX 3090 Ti	llama 7B Q4_0	3.56	pp512	4395.17	4532.78	1.03
RTX 3090 Ti	llama 7B Q4_0	3.56	pp1024	4190.07	4316.03	1.03
RTX 3090 Ti	llama 7B Q4_0	3.56	pp2048	3904.58	4011.76	1.03
RTX 3090 Ti	llama 7B Q4_0	3.56	pp4096	3429.24	3512.89	1.02
RTX 3090 Ti	llama 7B F16	12.55	tg128	54.76	54.81	1.00
RTX 3090 Ti	llama 7B Q4_0	3.56	tg128	126.81	126.19	1.00

GPU	Model	Model Size [GiB]	Test	t/s master	t/s PR	Speedup
RTX 3090 Ti/RTX 3080	llama 7B F16	12.55	pp512	4726.59	4861.25	1.03
RTX 3090 Ti/RTX 3080	llama 7B F16	12.55	pp1024	5724.30	5759.07	1.01
RTX 3090 Ti/RTX 3080	llama 7B F16	12.55	pp2048	6146.06	6129.15	1.00
RTX 3090 Ti/RTX 3080	llama 7B F16	12.55	pp4096	5783.30	5803.46	1.00
RTX 3090 Ti/RTX 3080	llama 7B Q4_0	3.56	pp512	3949.96	3869.63	0.98
RTX 3090 Ti/RTX 3080	llama 7B Q4_0	3.56	pp1024	4825.07	4909.32	1.02
RTX 3090 Ti/RTX 3080	llama 7B Q4_0	3.56	pp2048	5275.33	5319.94	1.01
RTX 3090 Ti/RTX 3080	llama 7B Q4_0	3.56	pp4096	5026.15	5091.23	1.01
RTX 3090 Ti/RTX 3080	llama 7B F16	12.55	tg128	48.34	47.83	0.99
RTX 3090 Ti/RTX 3080	llama 7B Q4_0	3.56	tg128	108.05	108.19	1.00

slaren · 2024-03-19T02:33:56Z

Note that the output buffer is always allocated in a CPU buffer, so this shouldn't affect VRAM usage.

Dampfinchen · 2024-03-19T13:08:25Z

@compilade I can confirm the issue I've had has been fixed. Good work!

Wuzzooy · 2024-03-28T08:59:47Z

command-r seems broken, it works until build b2536.
I'm getting "GGML_ASSERT: D:\a\llama.cpp\llama.cpp\ggml.c:3648: ggml_can_repeat(b, a)"

INFO [ server_params_parse] logging to file is disabled. | tid="196984" timestamp=1711615937
INFO [ main] build info | tid="196984" timestamp=1711615937 build=2554 commit="25f4a613"
INFO [ main] system info | tid="196984" timestamp=1711615937 n_threads=7 n_threads_batch=-1 total_threads=8 system_info="AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 0 | VSX = 0 | MATMUL_INT8 = 0 | "
llama_model_loader: loaded meta data with 23 key-value pairs and 322 tensors from G:\AI\models\c4ai-command-r-v01-Q4_K_S.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = command-r
llama_model_loader: - kv 1: general.name str = c4ai-command-r-v01
llama_model_loader: - kv 2: command-r.block_count u32 = 40
llama_model_loader: - kv 3: command-r.context_length u32 = 131072
llama_model_loader: - kv 4: command-r.embedding_length u32 = 8192
llama_model_loader: - kv 5: command-r.feed_forward_length u32 = 22528
llama_model_loader: - kv 6: command-r.attention.head_count u32 = 64
llama_model_loader: - kv 7: command-r.attention.head_count_kv u32 = 64
llama_model_loader: - kv 8: command-r.rope.freq_base f32 = 8000000.000000
llama_model_loader: - kv 9: command-r.attention.layer_norm_epsilon f32 = 0.000010
llama_model_loader: - kv 10: general.file_type u32 = 14
llama_model_loader: - kv 11: command-r.logit_scale f32 = 0.062500
llama_model_loader: - kv 12: command-r.rope.scaling.type str = none
llama_model_loader: - kv 13: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 14: tokenizer.ggml.tokens arr[str,256000] = ["", "", "", "", ...
llama_model_loader: - kv 15: tokenizer.ggml.token_type arr[i32,256000] = [3, 3, 3, 3, 3, 3, 3, 3, 1, 1, 1, 1, ...
llama_model_loader: - kv 16: tokenizer.ggml.merges arr[str,253333] = ["─á ─á", "─á t", "e r", "i n", "─á a...
llama_model_loader: - kv 17: tokenizer.ggml.bos_token_id u32 = 5
llama_model_loader: - kv 18: tokenizer.ggml.eos_token_id u32 = 255001
llama_model_loader: - kv 19: tokenizer.ggml.padding_token_id u32 = 0
llama_model_loader: - kv 20: tokenizer.ggml.add_bos_token bool = true
llama_model_loader: - kv 21: tokenizer.ggml.add_eos_token bool = false
llama_model_loader: - kv 22: general.quantization_version u32 = 2
llama_model_loader: - type f32: 41 tensors
llama_model_loader: - type q4_K: 271 tensors
llama_model_loader: - type q5_K: 9 tensors
llama_model_loader: - type q6_K: 1 tensors
llm_load_vocab: special tokens definition check successful ( 1008/256000 ).
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = command-r
llm_load_print_meta: vocab type = BPE
llm_load_print_meta: n_vocab = 256000
llm_load_print_meta: n_merges = 253333
llm_load_print_meta: n_ctx_train = 131072
llm_load_print_meta: n_embd = 8192
llm_load_print_meta: n_head = 64
llm_load_print_meta: n_head_kv = 64
llm_load_print_meta: n_layer = 40
llm_load_print_meta: n_rot = 128
llm_load_print_meta: n_embd_head_k = 128
llm_load_print_meta: n_embd_head_v = 128
llm_load_print_meta: n_gqa = 1
llm_load_print_meta: n_embd_k_gqa = 8192
llm_load_print_meta: n_embd_v_gqa = 8192
llm_load_print_meta: f_norm_eps = 1.0e-05
llm_load_print_meta: f_norm_rms_eps = 0.0e+00
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale = 6.2e-02
llm_load_print_meta: n_ff = 22528
llm_load_print_meta: n_expert = 0
llm_load_print_meta: n_expert_used = 0
llm_load_print_meta: causal attn = 1
llm_load_print_meta: pooling type = 0
llm_load_print_meta: rope type = 0
llm_load_print_meta: rope scaling = none
llm_load_print_meta: freq_base_train = 8000000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx = 131072
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: ssm_d_conv = 0
llm_load_print_meta: ssm_d_inner = 0
llm_load_print_meta: ssm_d_state = 0
llm_load_print_meta: ssm_dt_rank = 0
llm_load_print_meta: model type = 35B
llm_load_print_meta: model ftype = Q4_K - Small
llm_load_print_meta: model params = 34.98 B
llm_load_print_meta: model size = 18.97 GiB (4.66 BPW)
llm_load_print_meta: general.name = c4ai-command-r-v01
llm_load_print_meta: BOS token = 5 '<BOS_TOKEN>'
llm_load_print_meta: EOS token = 255001 '<|END_OF_TURN_TOKEN|>'
llm_load_print_meta: PAD token = 0 ''
llm_load_print_meta: LF token = 136 '├ä'
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: CUDA_USE_TENSOR_CORES: yes
ggml_cuda_init: found 2 CUDA devices:
Device 0: NVIDIA GeForce RTX 4070 Ti, compute capability 8.9, VMM: yes
Device 1: NVIDIA GeForce RTX 3090 Ti, compute capability 8.6, VMM: yes
llm_load_tensors: ggml ctx size = 0.37 MiB
llm_load_tensors: offloading 40 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 41/41 layers to GPU
llm_load_tensors: CPU buffer size = 1640.62 MiB
llm_load_tensors: CUDA0 buffer size = 5434.38 MiB
llm_load_tensors: CUDA1 buffer size = 13989.53 MiB
.......................................................................................
llama_new_context_with_model: n_ctx = 4096
llama_new_context_with_model: n_batch = 2048
llama_new_context_with_model: n_ubatch = 512
llama_new_context_with_model: freq_base = 8000000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: CUDA0 KV buffer size = 1536.00 MiB
llama_kv_cache_init: CUDA1 KV buffer size = 3584.00 MiB
llama_new_context_with_model: KV self size = 5120.00 MiB, K (f16): 2560.00 MiB, V (f16): 2560.00 MiB
llama_new_context_with_model: CUDA_Host output buffer size = 1.95 MiB
llama_new_context_with_model: pipeline parallelism enabled (n_copies=4)
llama_new_context_with_model: CUDA0 compute buffer size = 672.01 MiB
llama_new_context_with_model: CUDA1 compute buffer size = 672.02 MiB
llama_new_context_with_model: CUDA_Host compute buffer size = 48.02 MiB
llama_new_context_with_model: graph nodes = 1247
llama_new_context_with_model: graph splits = 3
[1711615946] warming up the model with an empty run
INFO [ init] initializing slots | tid="196984" timestamp=1711615946 n_slots=1
INFO [ init] new slot | tid="196984" timestamp=1711615946 id_slot=0 n_ctx_slot=4096
INFO [ main] model loaded | tid="196984" timestamp=1711615946
INFO [ main] chat template | tid="196984" timestamp=1711615946 chat_example="<|im_start|>system\nYou are a helpful assistant<|im_end|>\n<|im_start|>user\nHello<|im_end|>\n<|im_start|>assistant\nHi there<|im_end|>\n<|im_start|>user\nHow are you?<|im_end|>\n<|im_start|>assistant\n" built_in=true
INFO [ main] HTTP server listening | tid="196984" timestamp=1711615946 hostname="127.0.0.1" port="8080" n_threads_http="7"
INFO [ update_slots] all slots are idle | tid="196984" timestamp=1711615946
INFO [ launch_slot_with_task] slot is processing task | tid="196984" timestamp=1711615976 id_slot=0 id_task=0
INFO [ update_slots] kv cache rm [p0, end) | tid="196984" timestamp=1711615976 id_slot=0 id_task=0 p0=0
GGML_ASSERT: D:\a\llama.cpp\llama.cpp\ggml.c:3648: ggml_can_repeat(b, a)

compilade · 2024-03-28T10:34:38Z

command-r seems broken, it works until build b2536.
I'm getting "GGML_ASSERT: D:\a\llama.cpp\llama.cpp\ggml.c:3648: ggml_can_repeat(b, a)"

@Wuzzooy I'm truly sorry about that, it seems the model's graph was structured a bit differently than the other models (inpSA was named ffn_inp instead), and I overlooked a tensor which should have gone through the new inp_out_ids dance.

Should be fixed by #6367 (hopefully).

* Support xverse model convert to gguf format. * 1. Convert xverse models to gguf; 2. Add LLM_ARCH_XVERSE inference in llama.cpp; 3. Add xverse item in Supported models in README.md; * * gguf-py: remove redundant logs * llama: remove the init_mapping_prefetch custom parameter * llama.cpp: Include the changes from #6122 to exclude the unused outputs of the last layers. * - Fix format issues - Remove duplicate set kqv_out to llm_build_kv * Update llama.cpp --------- Co-authored-by: willhe <willhe@xverse.cn> Co-authored-by: willhe <hexin@xverse.cn>

op_getrows_f32 is required since ggerganov#6122 for the Vulkan w/ Kompute backend to be functional. As such, implement this op to make this backend functional again.

* llama : greatly reduce logits memory usage * llama : more compact state saving and reloading * llama : fix lctx.n_outputs not being set before building graph * perplexity : adapt to the logits API changes * perplexity : fix Winogrande, use correct logits for second choice start The first logits used to evaluate the second choice were not from the end of the common prefix; instead, they were the logits from the end of the first choice. This has been corrected. The previous implementation sometimes had outliers in the scores of choices for some tasks, and the logic to skip choices words in the log-likelihood evaluation probably was an attempt to reduce those, but it was complex and didn't quite seem to be the right thing. This is simpler now, and the outlier scores aren't there anymore. * perplexity : normalize spaces and punctuation in Winogrande sentences * llama : fix embedding conditions * llama : fix llama_get_embeddings_ith when the resulting id is 0 * llama : fix wrong n_outputs in llama_set_inputs A mismatch happened when using a smaller n_ubatch than n_batch and then using llama_batch_get_one(). The decision of what n_outputs should be now almost fully depends on how lctx.n_outputs is set in llama_decode_internal. The conditions are simpler this way. * llama : when saving the state, recalculate n_outputs This ensures the correct number of outputs for the entire previous batch is stored in the session file, even when n_ubatch is smaller than n_batch. * llama : fix not-skipping outputs of non-causal models * llama : fix running a batch with n_outputs == 0 It previously worked because lctx.inp_out_ids was not initialized, so it pointed to some garbage address which was somehow still valid when I ran my tests. * llama : keep same graph topology even when n_outputs == 0 * ggml : saner ggml_can_repeat with empty tensors * ggml : future-proof ggml_is_empty by using GGML_MAX_DIMS - 1 * ggml : do not multi-thread ops returning empty tensors * ggml : make ggml_is_empty public and work with views * llama : use a vector for ctx->output_ids * llama : rework reallocation logic for llama_output_reserve Now comparing the actual size with the new total size of the output buffer to allow more efficient enabling and disabling of the embeddings and/or logits output in the future. * ggml : skip empty tensors in all backends * llama : fix llama_output_reserve nullptr deref when new_size is 0 * perplexity : make Winogrande work as it does on master The problems with the Winogrande implementation will need to be fixed in a separate PR to ease review. * llama : clearer error messages for invalid logits or embeddings ids * llama : assert all models that can have inp_out_ids Since the graph topology is now constant, this presence check can be done even when there are no outputs. * llama : assert logits and embd buffers exist before writing to them * llama : handle errors from llama_output_reserve at call sites * perplexity : make hellaswag and multiple-choice outputs identical to master Due to how the KV cache is updated, the logprobs for tokens in a batch are very slightly affected by the other tokens present in the batch, so to make hellaswag and multiple-choice return exactly the same results as on master, the last token of each sequence needs to be evaluated even though its output is not used at all. This will probably be changed back in the future to make these benchmarks a tiny bit faster. * perplexity : fix division by zero when using less than 100 multiple-choice tasks * llama : allow loading state saved with a different ctx size When loading a session file, the context size is now only required to be at least enough to load the KV cells contained in that session file, instead of requiring to use exactly the same context size as when saving. Doing this enables the use-case of extending or shrinking the context size of a saved session. This breaks existing session files because the meaning of kv_buf_size is slightly changed (previously it was the size of the whole KV cache, now it's only the size of the saved part of it). This allows for finer-grained sanity checks when loading in an effort to keep kv_buf_size useful even when the kv_size is changed. * llama : minor ggml-ci * readme : update recent API changes, and warn about Vulkan --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* Support xverse model convert to gguf format. * 1. Convert xverse models to gguf; 2. Add LLM_ARCH_XVERSE inference in llama.cpp; 3. Add xverse item in Supported models in README.md; * * gguf-py: remove redundant logs * llama: remove the init_mapping_prefetch custom parameter * llama.cpp: Include the changes from ggerganov#6122 to exclude the unused outputs of the last layers. * - Fix format issues - Remove duplicate set kqv_out to llm_build_kv * Update llama.cpp --------- Co-authored-by: willhe <willhe@xverse.cn> Co-authored-by: willhe <hexin@xverse.cn>

* llama : greatly reduce logits memory usage * llama : more compact state saving and reloading * llama : fix lctx.n_outputs not being set before building graph * perplexity : adapt to the logits API changes * perplexity : fix Winogrande, use correct logits for second choice start The first logits used to evaluate the second choice were not from the end of the common prefix; instead, they were the logits from the end of the first choice. This has been corrected. The previous implementation sometimes had outliers in the scores of choices for some tasks, and the logic to skip choices words in the log-likelihood evaluation probably was an attempt to reduce those, but it was complex and didn't quite seem to be the right thing. This is simpler now, and the outlier scores aren't there anymore. * perplexity : normalize spaces and punctuation in Winogrande sentences * llama : fix embedding conditions * llama : fix llama_get_embeddings_ith when the resulting id is 0 * llama : fix wrong n_outputs in llama_set_inputs A mismatch happened when using a smaller n_ubatch than n_batch and then using llama_batch_get_one(). The decision of what n_outputs should be now almost fully depends on how lctx.n_outputs is set in llama_decode_internal. The conditions are simpler this way. * llama : when saving the state, recalculate n_outputs This ensures the correct number of outputs for the entire previous batch is stored in the session file, even when n_ubatch is smaller than n_batch. * llama : fix not-skipping outputs of non-causal models * llama : fix running a batch with n_outputs == 0 It previously worked because lctx.inp_out_ids was not initialized, so it pointed to some garbage address which was somehow still valid when I ran my tests. * llama : keep same graph topology even when n_outputs == 0 * ggml : saner ggml_can_repeat with empty tensors * ggml : future-proof ggml_is_empty by using GGML_MAX_DIMS - 1 * ggml : do not multi-thread ops returning empty tensors * ggml : make ggml_is_empty public and work with views * llama : use a vector for ctx->output_ids * llama : rework reallocation logic for llama_output_reserve Now comparing the actual size with the new total size of the output buffer to allow more efficient enabling and disabling of the embeddings and/or logits output in the future. * ggml : skip empty tensors in all backends * llama : fix llama_output_reserve nullptr deref when new_size is 0 * perplexity : make Winogrande work as it does on master The problems with the Winogrande implementation will need to be fixed in a separate PR to ease review. * llama : clearer error messages for invalid logits or embeddings ids * llama : assert all models that can have inp_out_ids Since the graph topology is now constant, this presence check can be done even when there are no outputs. * llama : assert logits and embd buffers exist before writing to them * llama : handle errors from llama_output_reserve at call sites * perplexity : make hellaswag and multiple-choice outputs identical to master Due to how the KV cache is updated, the logprobs for tokens in a batch are very slightly affected by the other tokens present in the batch, so to make hellaswag and multiple-choice return exactly the same results as on master, the last token of each sequence needs to be evaluated even though its output is not used at all. This will probably be changed back in the future to make these benchmarks a tiny bit faster. * perplexity : fix division by zero when using less than 100 multiple-choice tasks * llama : allow loading state saved with a different ctx size When loading a session file, the context size is now only required to be at least enough to load the KV cells contained in that session file, instead of requiring to use exactly the same context size as when saving. Doing this enables the use-case of extending or shrinking the context size of a saved session. This breaks existing session files because the meaning of kv_buf_size is slightly changed (previously it was the size of the whole KV cache, now it's only the size of the saved part of it). This allows for finer-grained sanity checks when loading in an effort to keep kv_buf_size useful even when the kv_size is changed. * llama : minor ggml-ci * readme : update recent API changes, and warn about Vulkan --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* Support xverse model convert to gguf format. * 1. Convert xverse models to gguf; 2. Add LLM_ARCH_XVERSE inference in llama.cpp; 3. Add xverse item in Supported models in README.md; * * gguf-py: remove redundant logs * llama: remove the init_mapping_prefetch custom parameter * llama.cpp: Include the changes from ggerganov#6122 to exclude the unused outputs of the last layers. * - Fix format issues - Remove duplicate set kqv_out to llm_build_kv * Update llama.cpp --------- Co-authored-by: willhe <willhe@xverse.cn> Co-authored-by: willhe <hexin@xverse.cn>

* llama : greatly reduce logits memory usage * llama : more compact state saving and reloading * llama : fix lctx.n_outputs not being set before building graph * perplexity : adapt to the logits API changes * perplexity : fix Winogrande, use correct logits for second choice start The first logits used to evaluate the second choice were not from the end of the common prefix; instead, they were the logits from the end of the first choice. This has been corrected. The previous implementation sometimes had outliers in the scores of choices for some tasks, and the logic to skip choices words in the log-likelihood evaluation probably was an attempt to reduce those, but it was complex and didn't quite seem to be the right thing. This is simpler now, and the outlier scores aren't there anymore. * perplexity : normalize spaces and punctuation in Winogrande sentences * llama : fix embedding conditions * llama : fix llama_get_embeddings_ith when the resulting id is 0 * llama : fix wrong n_outputs in llama_set_inputs A mismatch happened when using a smaller n_ubatch than n_batch and then using llama_batch_get_one(). The decision of what n_outputs should be now almost fully depends on how lctx.n_outputs is set in llama_decode_internal. The conditions are simpler this way. * llama : when saving the state, recalculate n_outputs This ensures the correct number of outputs for the entire previous batch is stored in the session file, even when n_ubatch is smaller than n_batch. * llama : fix not-skipping outputs of non-causal models * llama : fix running a batch with n_outputs == 0 It previously worked because lctx.inp_out_ids was not initialized, so it pointed to some garbage address which was somehow still valid when I ran my tests. * llama : keep same graph topology even when n_outputs == 0 * ggml : saner ggml_can_repeat with empty tensors * ggml : future-proof ggml_is_empty by using GGML_MAX_DIMS - 1 * ggml : do not multi-thread ops returning empty tensors * ggml : make ggml_is_empty public and work with views * llama : use a vector for ctx->output_ids * llama : rework reallocation logic for llama_output_reserve Now comparing the actual size with the new total size of the output buffer to allow more efficient enabling and disabling of the embeddings and/or logits output in the future. * ggml : skip empty tensors in all backends * llama : fix llama_output_reserve nullptr deref when new_size is 0 * perplexity : make Winogrande work as it does on master The problems with the Winogrande implementation will need to be fixed in a separate PR to ease review. * llama : clearer error messages for invalid logits or embeddings ids * llama : assert all models that can have inp_out_ids Since the graph topology is now constant, this presence check can be done even when there are no outputs. * llama : assert logits and embd buffers exist before writing to them * llama : handle errors from llama_output_reserve at call sites * perplexity : make hellaswag and multiple-choice outputs identical to master Due to how the KV cache is updated, the logprobs for tokens in a batch are very slightly affected by the other tokens present in the batch, so to make hellaswag and multiple-choice return exactly the same results as on master, the last token of each sequence needs to be evaluated even though its output is not used at all. This will probably be changed back in the future to make these benchmarks a tiny bit faster. * perplexity : fix division by zero when using less than 100 multiple-choice tasks * llama : allow loading state saved with a different ctx size When loading a session file, the context size is now only required to be at least enough to load the KV cells contained in that session file, instead of requiring to use exactly the same context size as when saving. Doing this enables the use-case of extending or shrinking the context size of a saved session. This breaks existing session files because the meaning of kv_buf_size is slightly changed (previously it was the size of the whole KV cache, now it's only the size of the saved part of it). This allows for finer-grained sanity checks when loading in an effort to keep kv_buf_size useful even when the kv_size is changed. * llama : minor ggml-ci * readme : update recent API changes, and warn about Vulkan --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* Support xverse model convert to gguf format. * 1. Convert xverse models to gguf; 2. Add LLM_ARCH_XVERSE inference in llama.cpp; 3. Add xverse item in Supported models in README.md; * * gguf-py: remove redundant logs * llama: remove the init_mapping_prefetch custom parameter * llama.cpp: Include the changes from ggerganov#6122 to exclude the unused outputs of the last layers. * - Fix format issues - Remove duplicate set kqv_out to llm_build_kv * Update llama.cpp --------- Co-authored-by: willhe <willhe@xverse.cn> Co-authored-by: willhe <hexin@xverse.cn>

op_getrows_f32 is required since #6122 for the Vulkan w/ Kompute backend to be functional. As such, implement this op to make this backend functional again.

== Relevant log messages from source repo: commit 3d7ebf63123b8652fb7bbecef7ba731202309901 Author: 0cc4m <picard12@live.de> Date: Mon Jun 3 10:59:14 2024 +0200 Vulkan Mixture of Experts (MoE) support (#7628) * Finish Vulkan mul_mat_id implementation * Add Vulkan sum_rows and div ops * Fix MUL_MAT_ID matrix matrix shader * Fix MUL_MAT_ID matrix vector shader dispatch size * Fix MUL_MAT_ID matrix vector shader and dispatch code * Update Vulkan CPU offload for MUL_MAT_ID * Fix crash when using split mode none and setting a main GPU commit a10cda58d3199cd85305e0f03a8c6056714ae2e8 Author: Andy Tai <andy-tai@users.noreply.github.com> Date: Mon Jun 3 01:06:24 2024 -0700 cmake : add pkg-config spec file for llama.cpp (#7702) commit 6f28a333c1e3fdfdc7b4f9d0367f2b41a9b7e9d4 Author: zhangkaihuo <zhangkaihuo@gmail.com> Date: Mon Jun 3 15:49:30 2024 +0800 llama : MiniCPM support tied embeddings (#7664) * support lm_head * remove the code block --------- Co-authored-by: zhangkaihuo <zhangkaihuo@modelbest.cn> commit 549279d8049d78620a2b081e26edb654f83c3bbd Author: Georgi Gerganov <ggerganov@gmail.com> Date: Mon Jun 3 08:34:43 2024 +0300 llama : avoid double token-to-piece cache (#7654) ggml-ci commit 9e405b6e2ecb888e860f7b92720b4809e21b3915 Author: woachk <24752637+woachk@users.noreply.github.com> Date: Mon Jun 3 07:32:16 2024 +0200 kompute : implement op_getrows_f32 (#6403) op_getrows_f32 is required since ggerganov/llama.cpp#6122 for the Vulkan w/ Kompute backend to be functional. As such, implement this op to make this backend functional again.

op_getrows_f32 is required since ggerganov/llama.cpp#6122 for the Vulkan w/ Kompute backend to be functional. As such, implement this op to make this backend functional again.

compilade added 9 commits March 17, 2024 15:36

llama : greatly reduce logits memory usage

1fd1918

llama : more compact state saving and reloading

98914c0

llama : fix lctx.n_outputs not being set before building graph

705d393

perplexity : adapt to the logits API changes

25981fc

perplexity : normalize spaces and punctuation in Winogrande sentences

d0129e8

llama : fix embedding conditions

487f89e

llama : fix llama_get_embeddings_ith when the resulting id is 0

408fcb0

slaren reviewed Mar 18, 2024

View reviewed changes

llama.cpp Outdated Show resolved Hide resolved

llama : fix not-skipping outputs of non-causal models

a57fa7f

llama : fix running a batch with n_outputs == 0

711b0bc

It previously worked because lctx.inp_out_ids was not initialized, so it pointed to some garbage address which was somehow still valid when I ran my tests.

compilade added 3 commits March 17, 2024 22:04

llama : keep same graph topology even when n_outputs == 0

d100502

ggml : saner ggml_can_repeat with empty tensors

99c37cc

* ggml : future-proof ggml_is_empty by using GGML_MAX_DIMS - 1

ggml : do not multi-thread ops returning empty tensors

6bf7f3f

slaren reviewed Mar 18, 2024

View reviewed changes

ggml.c Outdated Show resolved Hide resolved

slaren reviewed Mar 18, 2024

View reviewed changes

llama.cpp Outdated Show resolved Hide resolved

compilade added 4 commits March 18, 2024 20:21

ggml : make ggml_is_empty public and work with views

09bb15a

llama : use a vector for ctx->output_ids

4551e7e

* llama : rework reallocation logic for llama_output_reserve Now comparing the actual size with the new total size of the output buffer to allow more efficient enabling and disabling of the embeddings and/or logits output in the future.

ggml : skip empty tensors in all backends

8b826c5

llama : fix llama_output_reserve nullptr deref when new_size is 0

d04cfaf

compilade mentioned this pull request Mar 28, 2024

llama : fix command-r inference when omitting outputs #6367

Merged

compilade mentioned this pull request Mar 29, 2024

Vulkan evaluation failed: ggml_vulkan: Error: Missing op: GET_ROWS #6376

Closed

woachk mentioned this pull request Mar 31, 2024

kompute: implement op_getrows_f32 #6403

Merged

DisOOM mentioned this pull request Apr 3, 2024

Adding Support for Custom Qwen2moe Architectures with mergekit-qwen2 #6453

Draft

compilade mentioned this pull request Apr 7, 2024

llama_sampling_sample with default args is more naively usable #6519

Merged

compilade mentioned this pull request Apr 30, 2024

Server: add test for num slots, fails on master #6950

Merged

slaren mentioned this pull request May 9, 2024

Vulkan Bugfixes and Improvements #7084

Merged

JohannesGaessler mentioned this pull request May 19, 2024

Custom seed values ignored by llama.cpp HTTP server #7381

Closed

ggerganov pushed a commit that referenced this pull request Jun 3, 2024

kompute : implement op_getrows_f32 (#6403)

9e405b6

op_getrows_f32 is required since #6122 for the Vulkan w/ Kompute backend to be functional. As such, implement this op to make this backend functional again.

compilade mentioned this pull request Jul 24, 2024

Random seed possible problems. #8593

Closed

compilade mentioned this pull request Aug 11, 2024

llama : support RWKV v6 models #8980

Merged

2 tasks

compilade mentioned this pull request Aug 21, 2024

Fix todo: avoid relying on logits_all == true in perplexity_v2 #9102

Open

4 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

llama : greatly reduce output buffer memory usage #6122

llama : greatly reduce output buffer memory usage #6122

compilade commented Mar 17, 2024 •

edited

Loading

slaren commented Mar 18, 2024

compilade commented Mar 18, 2024

slaren commented Mar 18, 2024

fgdfgfthgr-fox commented Mar 18, 2024 •

edited

Loading

Dampfinchen commented Mar 18, 2024

compilade commented Mar 19, 2024 •

edited

Loading

slaren commented Mar 19, 2024 •

edited

Loading

slaren commented Mar 19, 2024

Dampfinchen commented Mar 19, 2024

Wuzzooy commented Mar 28, 2024 •

edited

Loading

compilade commented Mar 28, 2024

llama : greatly reduce output buffer memory usage #6122

llama : greatly reduce output buffer memory usage #6122

Conversation

compilade commented Mar 17, 2024 • edited Loading

API changes

Notes

TODO

slaren commented Mar 18, 2024

compilade commented Mar 18, 2024

slaren commented Mar 18, 2024

fgdfgfthgr-fox commented Mar 18, 2024 • edited Loading

Dampfinchen commented Mar 18, 2024

compilade commented Mar 19, 2024 • edited Loading

Footnotes

slaren commented Mar 19, 2024 • edited Loading

slaren commented Mar 19, 2024

Dampfinchen commented Mar 19, 2024

Wuzzooy commented Mar 28, 2024 • edited Loading

compilade commented Mar 28, 2024

compilade commented Mar 17, 2024 •

edited

Loading

fgdfgfthgr-fox commented Mar 18, 2024 •

edited

Loading

compilade commented Mar 19, 2024 •

edited

Loading

slaren commented Mar 19, 2024 •

edited

Loading

Wuzzooy commented Mar 28, 2024 •

edited

Loading