parallel.cpp exits when encountering a long prompt. #4086

Closed
littlebai3618 opened this issue Nov 15, 2023 · 24 comments

@littlebai3618

Prerequisites

Please answer the following questions for yourself before submitting an issue.

  • I am running the latest code. Development is very rapid so there are no tagged versions as of now.
  • I carefully followed the README.md.
  • I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
  • I reviewed the Discussions, and have a new bug or useful enhancement to share.

Expected Behavior

In ./examples/parallel/parallel.cpp, I added the following two lines to the final output:

int cache_count = llama_get_kv_cache_token_count(ctx);
LOG_TEE("Cache KV size %d\n", cache_count);

I believe that the logic in line 221 of parallel.cpp:

// all sequences have ended - clear the entire KV cache
for (int i = 0; i < n_clients; ++i) {
   llama_kv_cache_seq_rm(ctx, i, n_tokens_system, -1);
}

should release all the occupied cache when the entire task is completed. However, in reality, it does not release the cache.

Current Behavior

I expected the value of cache_count to be 0, but in reality, it is 1153.

Environment and Context

  1. GPU info
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.85.05    Driver Version: 525.85.05    CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Quadro RTX 8000     Off  | 00000000:3B:00.0 Off |                  N/A |
| 33%   29C    P8    22W / 260W |      0MiB / 49152MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
  2. make cmd
LLAMA_CUDA_NVCC=/usr/local/cuda-12/bin/nvcc make LLAMA_CUBLAS=1 -Wdeprecated-declarations
  3. test cmd
./parallel -m ./CodeLlama-7B/ggml-model-q8_0.gguf -n -1 -c 16324 -b 4096 --cont_batching --parallel 10 --sequences 600 --n-gpu-layers 1000
  4. model

I am using the CodeLlama-7B-hf model from https://huggingface.co/codellama/CodeLlama-7b-hf, converted with the convert.py script included in the repository.

$ lscpu

Architecture:                    x86_64
CPU op-mode(s):                  32-bit, 64-bit
Byte Order:                      Little Endian
Address sizes:                   46 bits physical, 48 bits virtual
CPU(s):                          64
On-line CPU(s) list:             0-63
Thread(s) per core:              2
Core(s) per socket:              16
Socket(s):                       2
NUMA node(s):                    2
Vendor ID:                       GenuineIntel
CPU family:                      6
Model:                           85
Model name:                      Intel(R) Xeon(R) Gold 5218 CPU @ 2.30GHz
Stepping:                        7
CPU MHz:                         1693.308
CPU max MHz:                     3900.0000
CPU min MHz:                     1000.0000
BogoMIPS:                        4600.00
Virtualization:                  VT-x
L1d cache:                       1 MiB
L1i cache:                       1 MiB
L2 cache:                        32 MiB
L3 cache:                        44 MiB
  • Operating System, e.g. for Linux:

$ uname -a

Linux studio-0 4.19.96 #1 SMP Tue Mar 10 10:34:01 CST 2020 x86_64 x86_64 x86_64 GNU/Linux
  • SDK version, e.g. for Linux:
$ python3 --version
Python 3.11.5
$ make --version
GNU Make 4.2.1
$ g++ --version
g++ (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0

Failure Information (for bugs)

I expected the value of cache_count to be 0, but in reality, it is 1153.

Steps to Reproduce

Please download the model from https://huggingface.co/codellama/CodeLlama-7b-hf

  1. python convert.py ./CodeLlama-7B/ --outtype q8_0
  2. In parallel.cpp, after LOG_TEE("Cache misses: %6d\n", n_cache_miss); add:
int cache_count = llama_get_kv_cache_token_count(ctx);
LOG_TEE("Cache KV size %d\n", cache_count);
  3. ./parallel -m ./CodeLlama-7B/ggml-model-q8_0.gguf -n -1 -c 16324 -b 4096 --cont_batching --parallel 10 --sequences 600 --n-gpu-layers 1000

Failure Logs


ggml_init_cublas: GGML_CUDA_FORCE_MMQ:   no
ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes
ggml_init_cublas: found 1 CUDA devices:
  Device 0: Quadro RTX 8000, compute capability 7.5
llama_model_loader: loaded meta data with 19 key-value pairs and 291 tensors from /aistudio/workspace/system-default/models/CodeLlama-7B/ggml-model-q8_0.gguf (version GGUF V3 (latest))
llama_model_loader: - tensor    0:                token_embd.weight q8_0     [  4096, 32016,     1,     1 ]
llama_model_loader: - tensor    1:           blk.0.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor    2:            blk.0.ffn_down.weight q8_0     [ 11008,  4096,     1,     1 ]
llama_model_loader: - tensor    3:            blk.0.ffn_gate.weight q8_0     [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor    4:              blk.0.ffn_up.weight q8_0     [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor    5:            blk.0.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor    6:              blk.0.attn_k.weight q8_0     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor    7:         blk.0.attn_output.weight q8_0     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor    8:              blk.0.attn_q.weight q8_0     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor    9:              blk.0.attn_v.weight q8_0     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor   10:           blk.1.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor   11:            blk.1.ffn_down.weight q8_0     [ 11008,  4096,     1,     1 ]
llama_model_loader: - tensor   12:            blk.1.ffn_gate.weight q8_0     [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor   13:              blk.1.ffn_up.weight q8_0     [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor   14:            blk.1.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor   15:              blk.1.attn_k.weight q8_0     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor   16:         blk.1.attn_output.weight q8_0     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor   17:              blk.1.attn_q.weight q8_0     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor   18:              blk.1.attn_v.weight q8_0     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor   19:          blk.10.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor   20:           blk.10.ffn_down.weight q8_0     [ 11008,  4096,     1,     1 ]
llama_model_loader: - tensor   21:           blk.10.ffn_gate.weight q8_0     [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor   22:             blk.10.ffn_up.weight q8_0     [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor   23:           blk.10.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor   24:             blk.10.attn_k.weight q8_0     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor   25:        blk.10.attn_output.weight q8_0     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor   26:             blk.10.attn_q.weight q8_0     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor   27:             blk.10.attn_v.weight q8_0     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor   28:          blk.11.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor   29:           blk.11.ffn_down.weight q8_0     [ 11008,  4096,     1,     1 ]
llama_model_loader: - tensor   30:           blk.11.ffn_gate.weight q8_0     [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor   31:             blk.11.ffn_up.weight q8_0     [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor   32:           blk.11.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor   33:             blk.11.attn_k.weight q8_0     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor   34:        blk.11.attn_output.weight q8_0     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor   35:             blk.11.attn_q.weight q8_0     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor   36:             blk.11.attn_v.weight q8_0     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor   37:          blk.12.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor   38:           blk.12.ffn_down.weight q8_0     [ 11008,  4096,     1,     1 ]
llama_model_loader: - tensor   39:           blk.12.ffn_gate.weight q8_0     [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor   40:             blk.12.ffn_up.weight q8_0     [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor   41:           blk.12.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor   42:             blk.12.attn_k.weight q8_0     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor   43:        blk.12.attn_output.weight q8_0     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor   44:             blk.12.attn_q.weight q8_0     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor   45:             blk.12.attn_v.weight q8_0     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor   46:          blk.13.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor   47:           blk.13.ffn_down.weight q8_0     [ 11008,  4096,     1,     1 ]
llama_model_loader: - tensor   48:           blk.13.ffn_gate.weight q8_0     [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor   49:             blk.13.ffn_up.weight q8_0     [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor   50:           blk.13.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor   51:             blk.13.attn_k.weight q8_0     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor   52:        blk.13.attn_output.weight q8_0     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor   53:             blk.13.attn_q.weight q8_0     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor   54:             blk.13.attn_v.weight q8_0     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor   55:          blk.14.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor   56:           blk.14.ffn_down.weight q8_0     [ 11008,  4096,     1,     1 ]
llama_model_loader: - tensor   57:           blk.14.ffn_gate.weight q8_0     [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor   58:             blk.14.ffn_up.weight q8_0     [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor   59:           blk.14.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor   60:             blk.14.attn_k.weight q8_0     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor   61:        blk.14.attn_output.weight q8_0     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor   62:             blk.14.attn_q.weight q8_0     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor   63:             blk.14.attn_v.weight q8_0     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor   64:          blk.15.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor   65:           blk.15.ffn_down.weight q8_0     [ 11008,  4096,     1,     1 ]
llama_model_loader: - tensor   66:           blk.15.ffn_gate.weight q8_0     [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor   67:             blk.15.ffn_up.weight q8_0     [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor   68:           blk.15.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor   69:             blk.15.attn_k.weight q8_0     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor   70:        blk.15.attn_output.weight q8_0     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor   71:             blk.15.attn_q.weight q8_0     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor   72:             blk.15.attn_v.weight q8_0     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor   73:          blk.16.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor   74:           blk.16.ffn_down.weight q8_0     [ 11008,  4096,     1,     1 ]
llama_model_loader: - tensor   75:           blk.16.ffn_gate.weight q8_0     [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor   76:             blk.16.ffn_up.weight q8_0     [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor   77:           blk.16.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor   78:             blk.16.attn_k.weight q8_0     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor   79:        blk.16.attn_output.weight q8_0     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor   80:             blk.16.attn_q.weight q8_0     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor   81:             blk.16.attn_v.weight q8_0     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor   82:          blk.17.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor   83:           blk.17.ffn_down.weight q8_0     [ 11008,  4096,     1,     1 ]
llama_model_loader: - tensor   84:           blk.17.ffn_gate.weight q8_0     [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor   85:             blk.17.ffn_up.weight q8_0     [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor   86:           blk.17.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor   87:             blk.17.attn_k.weight q8_0     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor   88:        blk.17.attn_output.weight q8_0     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor   89:             blk.17.attn_q.weight q8_0     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor   90:             blk.17.attn_v.weight q8_0     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor   91:          blk.18.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor   92:           blk.18.ffn_down.weight q8_0     [ 11008,  4096,     1,     1 ]
llama_model_loader: - tensor   93:           blk.18.ffn_gate.weight q8_0     [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor   94:             blk.18.ffn_up.weight q8_0     [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor   95:           blk.18.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor   96:             blk.18.attn_k.weight q8_0     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor   97:        blk.18.attn_output.weight q8_0     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor   98:             blk.18.attn_q.weight q8_0     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor   99:             blk.18.attn_v.weight q8_0     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  100:          blk.19.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  101:           blk.19.ffn_down.weight q8_0     [ 11008,  4096,     1,     1 ]
llama_model_loader: - tensor  102:           blk.19.ffn_gate.weight q8_0     [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor  103:             blk.19.ffn_up.weight q8_0     [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor  104:           blk.19.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  105:             blk.19.attn_k.weight q8_0     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  106:        blk.19.attn_output.weight q8_0     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  107:             blk.19.attn_q.weight q8_0     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  108:             blk.19.attn_v.weight q8_0     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  109:           blk.2.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  110:            blk.2.ffn_down.weight q8_0     [ 11008,  4096,     1,     1 ]
llama_model_loader: - tensor  111:            blk.2.ffn_gate.weight q8_0     [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor  112:              blk.2.ffn_up.weight q8_0     [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor  113:            blk.2.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  114:              blk.2.attn_k.weight q8_0     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  115:         blk.2.attn_output.weight q8_0     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  116:              blk.2.attn_q.weight q8_0     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  117:              blk.2.attn_v.weight q8_0     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  118:          blk.20.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  119:           blk.20.ffn_down.weight q8_0     [ 11008,  4096,     1,     1 ]
llama_model_loader: - tensor  120:           blk.20.ffn_gate.weight q8_0     [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor  121:             blk.20.ffn_up.weight q8_0     [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor  122:           blk.20.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  123:             blk.20.attn_k.weight q8_0     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  124:        blk.20.attn_output.weight q8_0     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  125:             blk.20.attn_q.weight q8_0     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  126:             blk.20.attn_v.weight q8_0     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  127:          blk.21.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  128:           blk.21.ffn_down.weight q8_0     [ 11008,  4096,     1,     1 ]
llama_model_loader: - tensor  129:           blk.21.ffn_gate.weight q8_0     [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor  130:             blk.21.ffn_up.weight q8_0     [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor  131:           blk.21.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  132:             blk.21.attn_k.weight q8_0     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  133:        blk.21.attn_output.weight q8_0     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  134:             blk.21.attn_q.weight q8_0     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  135:             blk.21.attn_v.weight q8_0     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  136:          blk.22.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  137:           blk.22.ffn_down.weight q8_0     [ 11008,  4096,     1,     1 ]
llama_model_loader: - tensor  138:           blk.22.ffn_gate.weight q8_0     [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor  139:             blk.22.ffn_up.weight q8_0     [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor  140:           blk.22.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  141:             blk.22.attn_k.weight q8_0     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  142:        blk.22.attn_output.weight q8_0     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  143:             blk.22.attn_q.weight q8_0     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  144:             blk.22.attn_v.weight q8_0     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  145:          blk.23.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  146:           blk.23.ffn_down.weight q8_0     [ 11008,  4096,     1,     1 ]
llama_model_loader: - tensor  147:           blk.23.ffn_gate.weight q8_0     [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor  148:             blk.23.ffn_up.weight q8_0     [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor  149:           blk.23.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  150:             blk.23.attn_k.weight q8_0     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  151:        blk.23.attn_output.weight q8_0     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  152:             blk.23.attn_q.weight q8_0     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  153:             blk.23.attn_v.weight q8_0     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  154:           blk.3.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  155:            blk.3.ffn_down.weight q8_0     [ 11008,  4096,     1,     1 ]
llama_model_loader: - tensor  156:            blk.3.ffn_gate.weight q8_0     [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor  157:              blk.3.ffn_up.weight q8_0     [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor  158:            blk.3.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  159:              blk.3.attn_k.weight q8_0     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  160:         blk.3.attn_output.weight q8_0     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  161:              blk.3.attn_q.weight q8_0     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  162:              blk.3.attn_v.weight q8_0     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  163:           blk.4.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  164:            blk.4.ffn_down.weight q8_0     [ 11008,  4096,     1,     1 ]
llama_model_loader: - tensor  165:            blk.4.ffn_gate.weight q8_0     [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor  166:              blk.4.ffn_up.weight q8_0     [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor  167:            blk.4.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  168:              blk.4.attn_k.weight q8_0     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  169:         blk.4.attn_output.weight q8_0     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  170:              blk.4.attn_q.weight q8_0     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  171:              blk.4.attn_v.weight q8_0     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  172:           blk.5.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  173:            blk.5.ffn_down.weight q8_0     [ 11008,  4096,     1,     1 ]
llama_model_loader: - tensor  174:            blk.5.ffn_gate.weight q8_0     [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor  175:              blk.5.ffn_up.weight q8_0     [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor  176:            blk.5.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  177:              blk.5.attn_k.weight q8_0     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  178:         blk.5.attn_output.weight q8_0     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  179:              blk.5.attn_q.weight q8_0     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  180:              blk.5.attn_v.weight q8_0     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  181:           blk.6.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  182:            blk.6.ffn_down.weight q8_0     [ 11008,  4096,     1,     1 ]
llama_model_loader: - tensor  183:            blk.6.ffn_gate.weight q8_0     [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor  184:              blk.6.ffn_up.weight q8_0     [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor  185:            blk.6.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  186:              blk.6.attn_k.weight q8_0     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  187:         blk.6.attn_output.weight q8_0     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  188:              blk.6.attn_q.weight q8_0     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  189:              blk.6.attn_v.weight q8_0     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  190:           blk.7.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  191:            blk.7.ffn_down.weight q8_0     [ 11008,  4096,     1,     1 ]
llama_model_loader: - tensor  192:            blk.7.ffn_gate.weight q8_0     [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor  193:              blk.7.ffn_up.weight q8_0     [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor  194:            blk.7.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  195:              blk.7.attn_k.weight q8_0     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  196:         blk.7.attn_output.weight q8_0     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  197:              blk.7.attn_q.weight q8_0     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  198:              blk.7.attn_v.weight q8_0     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  199:           blk.8.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  200:            blk.8.ffn_down.weight q8_0     [ 11008,  4096,     1,     1 ]
llama_model_loader: - tensor  201:            blk.8.ffn_gate.weight q8_0     [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor  202:              blk.8.ffn_up.weight q8_0     [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor  203:            blk.8.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  204:              blk.8.attn_k.weight q8_0     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  205:         blk.8.attn_output.weight q8_0     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  206:              blk.8.attn_q.weight q8_0     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  207:              blk.8.attn_v.weight q8_0     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  208:           blk.9.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  209:            blk.9.ffn_down.weight q8_0     [ 11008,  4096,     1,     1 ]
llama_model_loader: - tensor  210:            blk.9.ffn_gate.weight q8_0     [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor  211:              blk.9.ffn_up.weight q8_0     [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor  212:            blk.9.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  213:              blk.9.attn_k.weight q8_0     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  214:         blk.9.attn_output.weight q8_0     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  215:              blk.9.attn_q.weight q8_0     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  216:              blk.9.attn_v.weight q8_0     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  217:                    output.weight q8_0     [  4096, 32016,     1,     1 ]
llama_model_loader: - tensor  218:          blk.24.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  219:           blk.24.ffn_down.weight q8_0     [ 11008,  4096,     1,     1 ]
llama_model_loader: - tensor  220:           blk.24.ffn_gate.weight q8_0     [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor  221:             blk.24.ffn_up.weight q8_0     [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor  222:           blk.24.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  223:             blk.24.attn_k.weight q8_0     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  224:        blk.24.attn_output.weight q8_0     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  225:             blk.24.attn_q.weight q8_0     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  226:             blk.24.attn_v.weight q8_0     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  227:          blk.25.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  228:           blk.25.ffn_down.weight q8_0     [ 11008,  4096,     1,     1 ]
llama_model_loader: - tensor  229:           blk.25.ffn_gate.weight q8_0     [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor  230:             blk.25.ffn_up.weight q8_0     [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor  231:           blk.25.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  232:             blk.25.attn_k.weight q8_0     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  233:        blk.25.attn_output.weight q8_0     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  234:             blk.25.attn_q.weight q8_0     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  235:             blk.25.attn_v.weight q8_0     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  236:          blk.26.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  237:           blk.26.ffn_down.weight q8_0     [ 11008,  4096,     1,     1 ]
llama_model_loader: - tensor  238:           blk.26.ffn_gate.weight q8_0     [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor  239:             blk.26.ffn_up.weight q8_0     [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor  240:           blk.26.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  241:             blk.26.attn_k.weight q8_0     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  242:        blk.26.attn_output.weight q8_0     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  243:             blk.26.attn_q.weight q8_0     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  244:             blk.26.attn_v.weight q8_0     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  245:          blk.27.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  246:           blk.27.ffn_down.weight q8_0     [ 11008,  4096,     1,     1 ]
llama_model_loader: - tensor  247:           blk.27.ffn_gate.weight q8_0     [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor  248:             blk.27.ffn_up.weight q8_0     [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor  249:           blk.27.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  250:             blk.27.attn_k.weight q8_0     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  251:        blk.27.attn_output.weight q8_0     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  252:             blk.27.attn_q.weight q8_0     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  253:             blk.27.attn_v.weight q8_0     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  254:          blk.28.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  255:           blk.28.ffn_down.weight q8_0     [ 11008,  4096,     1,     1 ]
llama_model_loader: - tensor  256:           blk.28.ffn_gate.weight q8_0     [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor  257:             blk.28.ffn_up.weight q8_0     [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor  258:           blk.28.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  259:             blk.28.attn_k.weight q8_0     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  260:        blk.28.attn_output.weight q8_0     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  261:             blk.28.attn_q.weight q8_0     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  262:             blk.28.attn_v.weight q8_0     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  263:          blk.29.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  264:           blk.29.ffn_down.weight q8_0     [ 11008,  4096,     1,     1 ]
llama_model_loader: - tensor  265:           blk.29.ffn_gate.weight q8_0     [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor  266:             blk.29.ffn_up.weight q8_0     [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor  267:           blk.29.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  268:             blk.29.attn_k.weight q8_0     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  269:        blk.29.attn_output.weight q8_0     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  270:             blk.29.attn_q.weight q8_0     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  271:             blk.29.attn_v.weight q8_0     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  272:          blk.30.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  273:           blk.30.ffn_down.weight q8_0     [ 11008,  4096,     1,     1 ]
llama_model_loader: - tensor  274:           blk.30.ffn_gate.weight q8_0     [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor  275:             blk.30.ffn_up.weight q8_0     [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor  276:           blk.30.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  277:             blk.30.attn_k.weight q8_0     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  278:        blk.30.attn_output.weight q8_0     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  279:             blk.30.attn_q.weight q8_0     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  280:             blk.30.attn_v.weight q8_0     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  281:          blk.31.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  282:           blk.31.ffn_down.weight q8_0     [ 11008,  4096,     1,     1 ]
llama_model_loader: - tensor  283:           blk.31.ffn_gate.weight q8_0     [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor  284:             blk.31.ffn_up.weight q8_0     [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor  285:           blk.31.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  286:             blk.31.attn_k.weight q8_0     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  287:        blk.31.attn_output.weight q8_0     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  288:             blk.31.attn_q.weight q8_0     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  289:             blk.31.attn_v.weight q8_0     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  290:               output_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - kv   0:                       general.architecture str     
llama_model_loader: - kv   1:                               general.name str     
llama_model_loader: - kv   2:                       llama.context_length u32     
llama_model_loader: - kv   3:                     llama.embedding_length u32     
llama_model_loader: - kv   4:                          llama.block_count u32     
llama_model_loader: - kv   5:                  llama.feed_forward_length u32     
llama_model_loader: - kv   6:                 llama.rope.dimension_count u32     
llama_model_loader: - kv   7:                 llama.attention.head_count u32     
llama_model_loader: - kv   8:              llama.attention.head_count_kv u32     
llama_model_loader: - kv   9:     llama.attention.layer_norm_rms_epsilon f32     
llama_model_loader: - kv  10:                       llama.rope.freq_base f32     
llama_model_loader: - kv  11:                          general.file_type u32     
llama_model_loader: - kv  12:                       tokenizer.ggml.model str     
llama_model_loader: - kv  13:                      tokenizer.ggml.tokens arr     
llama_model_loader: - kv  14:                      tokenizer.ggml.scores arr     
llama_model_loader: - kv  15:                  tokenizer.ggml.token_type arr     
llama_model_loader: - kv  16:                tokenizer.ggml.bos_token_id u32     
llama_model_loader: - kv  17:                tokenizer.ggml.eos_token_id u32     
llama_model_loader: - kv  18:            tokenizer.ggml.unknown_token_id u32     
llama_model_loader: - type  f32:   65 tensors
llama_model_loader: - type q8_0:  226 tensors
llm_load_vocab: mismatch in special tokens definition ( 264/32016 vs 259/32016 ).
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = SPM
llm_load_print_meta: n_vocab          = 32016
llm_load_print_meta: n_merges         = 0
llm_load_print_meta: n_ctx_train      = 16384
llm_load_print_meta: n_embd           = 4096
llm_load_print_meta: n_head           = 32
llm_load_print_meta: n_head_kv        = 32
llm_load_print_meta: n_layer          = 32
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_gqa            = 1
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: n_ff             = 11008
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 1000000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx  = 16384
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: model type       = 7B
llm_load_print_meta: model ftype      = mostly Q8_0
llm_load_print_meta: model params     = 6.74 B
llm_load_print_meta: model size       = 6.67 GiB (8.50 BPW) 
llm_load_print_meta: general.name   = models
llm_load_print_meta: BOS token = 1 '<s>'
llm_load_print_meta: EOS token = 2 '</s>'
llm_load_print_meta: UNK token = 0 '<unk>'
llm_load_print_meta: LF token  = 13 '<0x0A>'
llm_load_tensors: ggml ctx size =    0.11 MB
llm_load_tensors: using CUDA for GPU acceleration
llm_load_tensors: mem required  =  132.99 MB
llm_load_tensors: offloading 32 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 35/35 layers to GPU
llm_load_tensors: VRAM used: 6695.89 MB
....................................................................................................
llama_new_context_with_model: n_ctx      = 16324
llama_new_context_with_model: freq_base  = 1000000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: offloading v cache to GPU
llama_kv_cache_init: offloading k cache to GPU
llama_kv_cache_init: VRAM kv self = 8162.00 MB
llama_new_context_with_model: kv self size  = 8162.00 MB
llama_build_graph: non-view tensors processed: 740/740
llama_new_context_with_model: compute buffer total size = 8615.72 MB
llama_new_context_with_model: VRAM scratch buffer: 8609.09 MB
llama_new_context_with_model: total VRAM used: 23466.99 MB (model: 6695.89 MB, context: 16771.09 MB)

No new questions so proceed with build-in defaults.


main: Simulating parallel requests from clients:
main: n_parallel = 10, n_sequences = 600, cont_batching = 1, system tokens = 305


[... log truncated ...]

main: clearing the KV cache

run parameters as at 2023-11-15 10:43:39

main: n_parallel = 10, n_sequences = 600, cont_batching = 1, system tokens = 305
External prompt file: used built-in defaults
Model and path used:  ./CodeLlama-7B/ggml-model-q8_0.gguf

Total prompt tokens:   8752, speed: 36.53 t/s
Total gen tokens:     29266, speed: 122.15 t/s
Total speed (AVG):           speed: 158.68 t/s
Cache misses:             0
Cache KV size 628

llama_print_timings:        load time =   37371.88 ms
llama_print_timings:      sample time =   12907.57 ms / 29866 runs   (    0.43 ms per token,  2313.84 tokens per second)
llama_print_timings: prompt eval time =  214845.96 ms / 38277 tokens (    5.61 ms per token,   178.16 tokens per second)
llama_print_timings:        eval time =     936.39 ms /    46 runs   (   20.36 ms per token,    49.13 tokens per second)
llama_print_timings:       total time =  239590.67 ms
@KerfuffleV2
Collaborator

Your cache actually got cleared correctly (probably, anyway!). The problem is that llama_get_kv_cache_token_count isn't doing what the name suggests. It's actually doing something like returning the index of the last KV cache cell that's populated.

@littlebai3618
Author

Your cache actually got cleared correctly (probably, anyway!). The problem is that llama_get_kv_cache_token_count isn't doing what the name suggests. It's actually doing something like returning the index of the last KV cache cell that's populated.

Thank you for your response. I am using the low-level API of llama-cpp-python (which mirrors llama.h) to implement dynamic batch processing in Python, following the example in parallel.cpp. However, I am encountering a strange phenomenon: when running with parallel > 1, GPU memory grows by an additional 30-200 MB when the program reaches the llama_decode() call. This increase accumulates over time until the program crashes. Yesterday I suspected that it might be a KV cache issue; currently I am reviewing the differences between my implementation and parallel.cpp.

Do you have any suggestions? For example, which parts might be causing the memory leak? Today I observed that after the abnormal increase in GPU memory, llama_get_state_size also reports a significantly larger value. I'm not sure whether there is any correlation; it seems this might be related to some variable in the context.

@KerfuffleV2
Collaborator

Do you have any suggestions?

Well, I can't tell you what the problem is but I can basically rule out the KV cache for you. You can pretty much trust that clearing the KV cache works correctly, but even if it didn't, it wouldn't matter for the purposes of memory leaks. As far as I know, all the KV cache memory gets allocated up front based on the context size you set. So for the most part, it just doesn't matter what's in it.
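(For reference, the size of that up-front allocation can be sanity-checked with a quick back-of-the-envelope calculation. The snippet below is only an illustrative sketch, using the values printed in the log earlier in this issue - n_layer = 32, n_embd = 4096, n_head_kv = n_head, default f16 cache - and it reproduces the 8162.00 MB "kv self size" reported above.)

#include <cstdio>

int main() {
    // values taken from the llm_load_print_meta / llama_new_context_with_model log above
    const double n_layer = 32;
    const double n_embd  = 4096;   // n_head_kv == n_head here, so the full embedding is cached
    const double n_ctx   = 16324;  // set with -c 16324
    const double bytes   = 2 /* K and V */ * n_layer * n_ctx * n_embd * 2 /* f16 */;
    printf("kv self size = %.2f MB\n", bytes / (1024.0 * 1024.0)); // prints 8162.00 MB
    return 0;
}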

This increase accumulates over time until the program crashes.

Are you saying that the memory increases while running decode on a large parallel sequence, or that the memory continues increasing in between calls to decode with the parallel sequence?

In other words, when the call to decode ends and you get your result, does the memory usage go back down?

I am using the low-level API of llama-cpp-python

I'd say you'd probably have better luck asking in their repo. These other projects that build on llama.cpp aren't necessarily using the latest version, they may have their own patches applied, they may be tweaking settings, etc.

Debugging problems is basically a process of elimination, and there are too many unknowns for someone who just knows about llama.cpp to deal with in this case. Or, you can try reproducing the issue using the latest version of llama.cpp directly.

@ggerganov
Owner

Diagnosing the llama-cpp-python problem would be difficult, and I don't have anything to add to @KerfuffleV2's comment.

Regarding llama_get_kv_cache_token_count() - as mentioned, it currently does not work and is also deprecated. However, I think we can easily restore its functionality by counting the number of tokens in the KV cache functions. It might be useful for debugging stuff, so we should probably de-deprecate it.

@KerfuffleV2
Collaborator

I think we can easily restore its functionality by counting the number of tokens in the KV cache functions.

Definitely would be easy. Slightly harder is answering the question "what's a token"? Should it return the number of populated cells or the sum of the cells' sequence lengths? Like, if a cell belongs to 10 sequences, is that 10 tokens?

@ggerganov
Owner

I guess we can have 2 counters - "number of tokens" and "number of occupied cells" - and add an API for the cells.
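(To make the two proposed counters concrete, here is a small illustrative sketch; the cell layout below is a simplified stand-in, not the real llama_kv_cache structure.)

#include <set>
#include <utility>
#include <vector>

struct kv_cell {
    int pos = -1;            // -1 means the cell is empty
    std::set<int> seq_id;    // sequences that reference this cell
};

// first  = number of occupied cells
// second = number of tokens, counting a shared cell once per sequence it belongs to
static std::pair<int, int> kv_cache_counts(const std::vector<kv_cell> & cells) {
    int n_cells  = 0;
    int n_tokens = 0;
    for (const auto & c : cells) {
        if (c.pos >= 0) {
            n_cells  += 1;
            n_tokens += (int) c.seq_id.size();
        }
    }
    return {n_cells, n_tokens};
}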

@KerfuffleV2
Collaborator

KerfuffleV2 commented Nov 16, 2023

If we're changing the API, how about something that basically just exports the whole KV cache state so people can extract whatever information is useful? Maybe even add the token id to it. Even for a 200,000 context size, that's only 800k if it's a 32-bit type.

Something like the batch API, where you create/destroy the structure and then another function copies the state into it (since you probably wouldn't want to allocate every time it was fetched).

#4035 wanted that functionality, I think (and also having the count function fixed).
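(A hypothetical shape for such an API - the names below are invented for illustration and were not part of llama.h at the time.)

#include <cstdint>

// one entry per KV cache cell
struct llama_kv_cell_info {
    int32_t pos;     // position stored in the cell, -1 if empty
    int32_t n_seq;   // number of sequences referencing the cell
    int32_t token;   // token id, if we decide to record it
};

// allocated once by the caller (like llama_batch) and refreshed on demand, e.g.:
//   llama_kv_cache_dump d = llama_kv_cache_dump_init(n_cells);
//   llama_kv_cache_dump_update(ctx, &d);   // copy the current cache state into d
//   llama_kv_cache_dump_free(&d);
struct llama_kv_cache_dump {
    int32_t n_cells;
    llama_kv_cell_info * cells;
};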

@littlebai3618
Author

Are you saying that the memory increases while running decode on a large parallel sequence, or that the memory continues increasing in between calls to decode with the parallel sequence?

In other words, when the call to decode ends and you get your result, does the memory usage go back down?

The memory usage increases during large-scale parallel decoding, but it doesn't increase with every decoding operation. Once the memory usage increases, it doesn't decrease even as the number of completed sequences grows. It keeps accumulating.

  1. I have raised a new issue on llama-cpp-python: #924.
  2. I suspect that the issue might be with my code or with the Python C-bindings.

Lastly, thank you very much for your response. My current work goal is to implement continuous batch processing in Python, so if you have any guesses or insights regarding the potential memory leak, please feel free to let me know.

Additionally, could you provide more detailed comments for the parallel.cpp example? For instance, explaining which parts of the code have potential pitfalls would make it easier to follow. I'm not proficient in C++, so clearer explanations would be very helpful.

@littlebai3618
Author

My apologies for disturbing you again. @KerfuffleV2 @ggerganov

My English is a little rusty, and I used translation software to help translate some parts.

I think I have found the cause. When I change the example questions in parallel.cpp to longer prompts of about 1500-2000 tokens, I can consistently reproduce the issue using the following command:

./parallel -m ./CodeLlama-7B/ggml-model-q8_0.gguf -n -1 -c 16324 -b 4096 --cont_batching --parallel 2 --sequences 600 --n-gpu-layers 1000

Failure Information

GPU memory suddenly increases.
[screenshot: GPU memory usage]
Then the program throws an error.

cuBLAS error 13 at ggml-cuda.cu:6464: the function failed to launch on the GPU
current device: 0

I am running only one task of parallel.cpp on the GPU. Clearly, I have sufficient resources to run it, but the program throws an error.

When running parallel.cpp with a shorter example prompt, the GPU memory does not increase, and the program can terminate without errors.

What configurations should I adjust or what actions should I take to avoid this problem? Or is this a bug?

Steps to Reproduce

Using the latest llama.cpp repository code (commit dae06c0).
I am testing with the CodeLlama-7B-hf model: https://huggingface.co/codellama/CodeLlama-7b-hf
Convert it:

python convert.py ./CodeLlama-7B/ --outtype q8_0

  1. Please use my code: parallel.cpp.zip (only k_prompts and k_system are changed - see the sketch below for the shape of those globals).
  2. Run: LLAMA_CUDA_NVCC=/usr/local/cuda-12/bin/nvcc make LLAMA_CUBLAS=1
  3. Execute: ./parallel -m ./CodeLlama-7B/ggml-model-q8_0.gguf -n -1 -c 16324 -b 4096 --cont_batching --parallel 2 --sequences 600 --n-gpu-layers 1000
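(For reference, the built-in prompts in parallel.cpp live in two globals roughly of the following shape; the exact strings are elided here. Reproducing the problem only requires replacing the short stock questions with prompts of roughly 1500-2000 tokens.)

#include <string>
#include <vector>

// shape of the globals in examples/parallel/parallel.cpp (contents shortened here)
static std::string k_system =
    R"(... system prompt shared by all client sequences ...)";

static std::vector<std::string> k_prompts = {
    // the stock questions are short; replace them with ~1500-2000 token prompts
    // to trigger the crash described above
    "What is the meaning of life?",
    // ...
};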

@littlebai3618
Author

The same situation is happening on the V100S. I reduced the batch parameter from 4096 to 1024, but the same error still persists.

@KerfuffleV2
Collaborator

It's my apologies for disturbing you again.

Sorry, you're not disturbing me but I don't really have much to add at this point. I don't really know enough about your specific problem to say something helpful. 4096 (or even 1024) sounds like a very, very high batch size though. Maybe it's normal for those cards.

Hopefully GG will be able to help you, he wrote the parallel example. (By the way, your English seemed fine to me.)

@littlebai3618 changed the title from "llama_kv_cache_seq_rm method fails to properly release the cache during continuous batch processing?" to "parallel.cpp exits when encountering a long prompt." on Nov 22, 2023
@littlebai3618
Author

I found that the comments on this pull request are similar to the issue I encountered, but verifying this problem is beyond my ability.

#3776
[screenshots of the relevant comments in #3776]

@ggerganov
Owner

Does it work with -b 512?

@ggerganov
Owner

@littlebai3618 I was able to reproduce the issue and find the root cause.

It is a pathological case of the issue with the CUDA memory pool described here: #3903 (comment)

Combined with non-optimal KV cache utilization and fragmentation in this specific case, it leads to extra memory allocations at runtime after several sequences have been processed.

A quick workaround is the following:

diff --git a/llama.cpp b/llama.cpp
index c2ad0486..5851a2ee 100644
--- a/llama.cpp
+++ b/llama.cpp
@@ -5469,6 +5469,8 @@ static int llama_decode_internal(
         batch.seq_id = seq_id_arr.data();
     }
 
+    kv_self.head = 0;
+
     if (!llama_kv_cache_find_slot(kv_self, batch)) {
         return 1;
     }

Can you confirm that fixes the issue on your side?

@ggerganov
Owner

Also, try the following branch: #4170

Should resolve the issue in a better way.

@littlebai3618
Author

I tested the code from the 'kv-cache-opts' branch on a V100S-PCIE-32G, using my modified parallel.cpp (only the longer prompts were swapped in).

The issue of abnormal memory growth seems to have been resolved.

I also compiled this branch and tested my Python continuous batching code with llama-cpp-python 0.2.19, running two concurrent requests with -n 4096 -c 8162 for 1 hour without any errors.

However, I have a new question, which may sound silly since I'm not familiar with this field: what is the relationship between the '-b' parameter and the '--parallel' parameter? When the '-b' value is large, setting a very high value for '--parallel' causes abnormal termination. My expectation was that with many batched tasks the speed of individual tasks would decrease, but the program would not crash. I believe my available memory is sufficient to handle a larger '--parallel' parameter.
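(For context, a rough sketch of the relevant loop in examples/parallel/parallel.cpp as I read it at the time: -b / n_batch only caps how many tokens are handed to a single llama_decode() call, while --parallel controls how many client sequences are interleaved into the shared batch and KV cache. The llama_batch field order below is assumed from the llama.h of that period - treat this as a sketch, not the actual source.)

// tokens from all active clients have already been collected into `batch`;
// decode them in chunks of at most n_batch tokens
for (int32_t i = 0; i < (int32_t) batch.n_tokens; i += n_batch) {
    const int32_t n_tokens = std::min(n_batch, (int32_t) batch.n_tokens - i);

    llama_batch batch_view = {
        n_tokens,
        batch.token    + i,
        nullptr,            // no embeddings
        batch.pos      + i,
        batch.n_seq_id + i,
        batch.seq_id   + i,
        batch.logits   + i,
        0, 0, 0,            // unused
    };

    if (llama_decode(ctx, batch_view) != 0) {
        // parallel.cpp retries with a smaller n_batch on failure
        break;
    }
}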

I will explain why I need such a large batch size and parallelism: I would like to use llama.cpp for automatic code completion, like GitHub Copilot, which has three requirements:

  1. Decent inference speed per single request (llama.cpp performs exceptionally well in this respect).
  2. The largest possible batch processing capability.
  3. Prompts that are usually quite lengthy.

Here are the test results:

   -c    -b  --parallel  --sequences  mem usage / result
 4096  4096           2           60  40.6%
 4096  4096           4           60  Segmentation fault (core dumped)
 8162  4096           2           60  53.3%
 8162  4096           4           60  an illegal memory access was encountered
 8162  3072           4           60  an illegal memory access was encountered
 8162  2048           4           60  51.3% -> 54.6%
16384  4096           2           60  79.0%
16384  4096           4           60  an illegal memory access was encountered
 8162  4096           2          600  53.3%

Test detail

1 test: -c 4096 -b 4096 --cont_batching --parallel 2
./parallel -m ./CodeLlama-7B/ggml-model-q8_0.gguf -n -1 -c 4096 -b 4096  --cont_batching --parallel 2 --sequences 60 --n-gpu-layers 1000

output:

run parameters as at 2023-11-23 03:14:42

main: n_parallel = 2, n_sequences = 60, cont_batching = 1, system tokens = 1
External prompt file: used built-in defaults
Model and path used:  ./CodeLlama-7B/ggml-model-q8_0.gguf

Total prompt tokens: 106860, speed: 1207.04 t/s
Total gen tokens:       444, speed:  5.02 t/s
Total speed (AVG):           speed: 1212.05 t/s
Cache misses:           122


llama_print_timings:        load time =    4846.97 ms
llama_print_timings:      sample time =     232.82 ms /   504 runs   (    0.46 ms per token,  2164.77 tokens per second)
llama_print_timings: prompt eval time =   87145.50 ms / 107274 tokens (    0.81 ms per token,  1230.98 tokens per second)
llama_print_timings:        eval time =     480.58 ms /    31 runs   (   15.50 ms per token,    64.51 tokens per second)
llama_print_timings:       total time =   88531.63 ms
2 test: -c 4096 -b 4096 --cont_batching --parallel 4
./parallel -m ./CodeLlama-7B/ggml-model-q8_0.gguf -n -1 -c 4096 -b 4096  --cont_batching --parallel 4 --sequences 60 --n-gpu-layers 1000

output:

main: Simulating parallel requests from clients:
main: n_parallel = 4, n_sequences = 60, cont_batching = 1, system tokens = 1

main: Evaluating the system prompt ...

Processing requests ...

main: clearing the KV cache
Client   0, seq    0, started decoding ...
Client   1, seq    1, started decoding ...
Segmentation fault (core dumped)
3 test: -c 8162 -b 4096 --cont_batching --parallel 2
./parallel -m ./CodeLlama-7B/ggml-model-q8_0.gguf -n -1 -c 8162 -b 4096  --cont_batching --parallel 2 --sequences 60 --n-gpu-layers 1000

output:

run parameters as at 2023-11-23 03:22:20

main: n_parallel = 2, n_sequences = 60, cont_batching = 1, system tokens = 1
External prompt file: used built-in defaults
Model and path used:  ./CodeLlama-7B/ggml-model-q8_0.gguf

Total prompt tokens: 106860, speed: 1182.29 t/s
Total gen tokens:       433, speed:  4.79 t/s
Total speed (AVG):           speed: 1187.08 t/s
Cache misses:             0


llama_print_timings:        load time =    6436.07 ms
llama_print_timings:      sample time =     218.26 ms /   493 runs   (    0.44 ms per token,  2258.82 tokens per second)
llama_print_timings: prompt eval time =   89482.04 ms / 107290 tokens (    0.83 ms per token,  1199.01 tokens per second)
llama_print_timings:        eval time =      72.03 ms /     4 runs   (   18.01 ms per token,    55.53 tokens per second)
llama_print_timings:       total time =   90384.80 ms
4 test: -c 8162 -b 4096 --cont_batching --parallel 4
./parallel -m ./CodeLlama-7B/ggml-model-q8_0.gguf -n -1 -c 8162 -b 4096  --cont_batching --parallel 4 --sequences 60 --n-gpu-layers 1000

output:

main: Simulating parallel requests from clients:
main: n_parallel = 4, n_sequences = 60, cont_batching = 1, system tokens = 1

main: Evaluating the system prompt ...

Processing requests ...

main: clearing the KV cache
Client   0, seq    0, started decoding ...
Client   1, seq    1, started decoding ...
Client   2, seq    2, started decoding ...
Client   3, seq    3, started decoding ...

CUDA error 700 at ggml-cuda.cu:6951: an illegal memory access was encountered
current device: 0
Test 5: -c 8162 -b 3072 --cont_batching --parallel 4
./parallel -m ./CodeLlama-7B/ggml-model-q8_0.gguf -n -1 -c 8162 -b 3072  --cont_batching --parallel 4 --sequences 60 --n-gpu-layers 1000

output:

main: n_parallel = 4, n_sequences = 60, cont_batching = 1, system tokens = 1

main: Evaluating the system prompt ...

Processing requests ...

main: clearing the KV cache
Client   0, seq    0, started decoding ...
Client   1, seq    1, started decoding ...
Client   2, seq    2, started decoding ...
Client   3, seq    3, started decoding ...

CUDA error 700 at ggml-cuda.cu:6951: an illegal memory access was encountered
current device: 0
Test 6: -c 8162 -b 2048 --cont_batching --parallel 4
./parallel -m ./CodeLlama-7B/ggml-model-q8_0.gguf -n -1 -c 8162 -b 2048  --cont_batching --parallel 4 --sequences 60 --n-gpu-layers 1000

output:

run parameters as at 2023-11-23 03:35:29

main: n_parallel = 4, n_sequences = 60, cont_batching = 1, system tokens = 1
External prompt file: used built-in defaults
Model and path used:  ./CodeLlama-7B/ggml-model-q8_0.gguf

Total prompt tokens: 106860, speed: 875.11 t/s
Total gen tokens:       301, speed:  2.46 t/s
Total speed (AVG):           speed: 877.58 t/s
Cache misses:            65


llama_print_timings:        load time =    5547.56 ms
llama_print_timings:      sample time =     152.90 ms /   361 runs   (    0.42 ms per token,  2361.10 tokens per second)
llama_print_timings: prompt eval time =  121359.56 ms / 107159 tokens (    1.13 ms per token,   882.99 tokens per second)
llama_print_timings:        eval time =      48.57 ms /     3 runs   (   16.19 ms per token,    61.77 tokens per second)
llama_print_timings:       total time =  122110.98 ms
Test 7: -c 16384 -b 4096 --cont_batching --parallel 2
./parallel -m ./CodeLlama-7B/ggml-model-q8_0.gguf -n -1 -c 16384 -b 4096  --cont_batching --parallel 2 --sequences 60 --n-gpu-layers 1000

output:

run parameters as at 2023-11-23 03:30:15

main: n_parallel = 2, n_sequences = 60, cont_batching = 1, system tokens = 1
External prompt file: used built-in defaults
Model and path used:  ./CodeLlama-7B/ggml-model-q8_0.gguf

Total prompt tokens: 106860, speed: 1105.28 t/s
Total gen tokens:       422, speed:  4.36 t/s
Total speed (AVG):           speed: 1109.64 t/s
Cache misses:             0


llama_print_timings:        load time =   11462.06 ms
llama_print_timings:      sample time =     218.06 ms /   482 runs   (    0.45 ms per token,  2210.38 tokens per second)
llama_print_timings: prompt eval time =   95798.40 ms / 107280 tokens (    0.89 ms per token,  1119.85 tokens per second)
llama_print_timings:        eval time =      49.38 ms /     3 runs   (   16.46 ms per token,    60.76 tokens per second)
llama_print_timings:       total time =   96682.93 ms
Test 8: -c 16384 -b 4096 --cont_batching --parallel 4
./parallel -m ./CodeLlama-7B/ggml-model-q8_0.gguf -n -1 -c 16384 -b 4096  --cont_batching --parallel 4 --sequences 60 --n-gpu-layers 1000

output:

main: Simulating parallel requests from clients:
main: n_parallel = 4, n_sequences = 60, cont_batching = 1, system tokens = 1

main: Evaluating the system prompt ...

Processing requests ...

main: clearing the KV cache
Client   0, seq    0, started decoding ...
Client   1, seq    1, started decoding ...
Client   2, seq    2, started decoding ...
Client   3, seq    3, started decoding ...

CUDA error 700 at ggml-cuda.cu:6951: an illegal memory access was encountered
current device: 0
Test 9: -c 8162 -b 4096 --cont_batching --parallel 2 --sequences 600
./parallel -m /aistudio/workspace/system-default/models/CodeLlama-7B/ggml-model-q8_0.gguf -n -1 -c 8162 -b 4096  --cont_batching --parallel 2 --sequences 600 --n-gpu-layers 1000

output:

run parameters as at 2023-11-23 04:15:22

main: n_parallel = 2, n_sequences = 600, cont_batching = 1, system tokens = 1
External prompt file: used built-in defaults
Model and path used:  /aistudio/workspace/system-default/models/CodeLlama-7B/ggml-model-q8_0.gguf

Total prompt tokens: 1068600, speed: 1175.07 t/s
Total gen tokens:      4699, speed:  5.17 t/s
Total speed (AVG):           speed: 1180.23 t/s
Cache misses:             0


llama_print_timings:        load time =    6629.32 ms
llama_print_timings:      sample time =    2562.92 ms /  5299 runs   (    0.48 ms per token,  2067.57 tokens per second)
llama_print_timings: prompt eval time =  900202.27 ms / 1073290 tokens (    0.84 ms per token,  1192.28 tokens per second)
llama_print_timings:        eval time =     166.83 ms /    10 runs   (   16.68 ms per token,    59.94 tokens per second)
llama_print_timings:       total time =  909396.98 ms

@KerfuffleV2
Collaborator

KerfuffleV2 commented Nov 23, 2023

@littlebai3618

However, I have a new question, which may sound silly, but I'm not familiar with this field. What is the relationship between the '-b' parameter and the '--parallel' parameter?

I wouldn't say it's a silly question. -b sets the maximum batch size, i.e. how many tokens will get submitted to the LLM in a single decode call. For example, if you have -b 256 and a prompt that's 514 tokens, those tokens will initially get submitted in three batches: 256, 256, 2. Setting -b high doesn't necessarily have any effect on its own: you can set -b 1000000, submit a prompt that's 20 tokens long, and the batch will still only have a size of 20.
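To make the arithmetic concrete, here is a tiny standalone sketch of that chunking (illustration only, not code from llama.cpp; the numbers are just the example above):

#include <algorithm>
#include <cstdio>
#include <vector>

// Illustration only: split a 514-token prompt into batches of at most 256 tokens.
int main() {
    const int n_prompt = 514;
    const int n_batch  = 256;

    std::vector<int> chunks;
    for (int i = 0; i < n_prompt; i += n_batch) {
        chunks.push_back(std::min(n_batch, n_prompt - i));
    }

    for (int c : chunks) {
        printf("%d ", c); // prints: 256 256 2
    }
    printf("\n");
    return 0;
}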

When parallel sequences enter the mix, there are two scenarios: either the sequences all share a prompt, or at least some of them have their own prompts. In the first scenario, suppose you set -b 4000 --parallel 64 and your prompt is 500 tokens. Your first batch will be 500 tokens, and after that you'll be submitting batches of 64 (one token for each sequence, feeding it the token that was sampled for that sequence in the previous step). It won't matter that you set a super high batch size limit.

On the other hand, if each sequence has a unique prompt and you use those same settings and prompt size (64 sequences, each with 500 tokens of unique prompt) then you have 32,000 tokens to evaluate at the start (500 * 64). Now you actually will be submitting batches of 4,000 initially. Once the prompt tokens have all been evaluated, then you'll be back to only submitting batches equal to the size of the number of sequences: so 64.
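Putting rough numbers on the two scenarios (back-of-the-envelope arithmetic with the hypothetical values above, not llama.cpp code):

#include <cstdio>

// Illustration only: decode-call sizes for 64 sequences, a 500-token prompt, -b 4000.
int main() {
    const int n_seq = 64, n_prompt = 500, n_batch = 4000;

    // shared prompt: the 500 prompt tokens fit in a single batch
    const int shared_batches = (n_prompt + n_batch - 1) / n_batch;             // 1
    // unique prompts: 64 * 500 = 32000 prompt tokens, chunked into batches of 4000
    const int unique_batches = (n_seq * n_prompt + n_batch - 1) / n_batch;     // 8

    printf("shared prompt : %d prompt batch(es), then %d tokens per generation step\n", shared_batches, n_seq);
    printf("unique prompts: %d prompt batches,   then %d tokens per generation step\n", unique_batches, n_seq);
    return 0;
}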

Hope this helps explain it. I think a large number of sequences combined with unique prompts is the main way you'd run into a case where setting a very high batch size matters. The other scenario, of course, is if your prompt is greater than or equal to the batch size.

@littlebai3618
Author

In my setup, I set -n 4096. If I input two sequences, one with 2500 tokens and the other with 2000 tokens, then based on my understanding I need to call the llama_decode method twice: once with 4096 tokens and once with 404 tokens. Is that correct? However, when I input it this way, it results in an "illegal memory access" error. I am unsure whether this is normal, whether it's a bug, or whether the value of -n is simply too large for my hardware.

@KerfuffleV2
Collaborator

@littlebai3618

In my setup, I set -n 4096.

-n isn't the same thing as -b. -n is the number of tokens to generate, -b is the batch size limit. I'm not sure if you meant to actually write -b there?

If you meant you set -b 4096 and you have two sequences, one with 2,500 tokens and the other with 2,000 tokens, then yes, that will get split into two batches: one with 4,096 tokens and one with 404 tokens.

I don't really know what a reasonable value for -b is with your hardware. For normal consumer-grade GPUs, -b 4096 sounds extremely high to me. I'd suggest starting with the default of -b 512 and increasing it in relatively small steps (like 512, 768, 1024, 1280) to see whether increasing it actually improves performance. If not, then you may as well leave it at a relatively low value like 512.

It's also something that's pretty much only going to have an effect during prompt processing, since you probably aren't going to actually be doing generation with 4,000+ parallel sequences.

@ggerganov
Owner

@littlebai3618 Currently, there is no reason to use -b larger than 512 because all kernels perform worse for larger values - we have optimized for 512. Maybe in the future, if we implement new kernels that are efficient with a larger batch size, it will make sense to increase it, but for now there is no point in doing so.

Using -b 512 means that however many tokens you submit for processing, the logic in parallel.cpp will chunk them into batches of at most 512 tokens, as explained by @KerfuffleV2:

    // process in chunks of params.n_batch
    int32_t n_batch = params.n_batch;

    for (int32_t i = 0; i < (int32_t) batch.n_tokens; i += n_batch) {
        // experiment: process in powers of 2
        //if (i + n_batch > (int32_t) batch.n_tokens && n_batch > 32) {
        //    n_batch /= 2;
        //    i -= n_batch;
        //    continue;
        //}

        const int32_t n_tokens = std::min(n_batch, (int32_t) (batch.n_tokens - i));

        llama_batch batch_view = {
            n_tokens,
            batch.token    + i,
            nullptr,
            batch.pos      + i,
            batch.n_seq_id + i,
            batch.seq_id   + i,
            batch.logits   + i,
            0, 0, 0, // unused
        };

        const int ret = llama_decode(ctx, batch_view);
        if (ret != 0) {
            if (n_batch == 1 || ret < 0) {
                // if you get here, it means the KV cache is full - try increasing it via the context size
                LOG_TEE("%s : failed to decode the batch, n_batch = %d, ret = %d\n", __func__, n_batch, ret);
                return 1;
            }

            LOG("%s : failed to decode the batch, retrying with n_batch = %d\n", __func__, n_batch / 2);

            n_cache_miss += 1;

            // retry with half the batch size to try to find a free slot in the KV cache
            n_batch /= 2;
            i -= n_batch;

            continue;
        }
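In other words, when a decode fails because no free KV cache slot is found, the example retries with half the batch size and counts a cache miss each time, giving up only once n_batch reaches 1 (this is what the "Cache misses" number in the logs above reflects). A minimal sketch of just that halving schedule (illustration only, not llama.cpp code):

#include <cstdio>

// Illustration only: the sequence of n_batch values tried by the retry-with-half logic.
int main() {
    int n_batch = 512;
    while (n_batch >= 1) {
        printf("trying n_batch = %d\n", n_batch); // 512, 256, 128, ..., 1
        if (n_batch == 1) break;                  // in parallel.cpp this is where it gives up
        n_batch /= 2;
    }
    return 0;
}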

Thanks for looking into the parallel.cpp example and let us know your further results and observations - these are very helpful.

@littlebai3618
Author

Do you need me to provide the results obtained by running with -n 512, or are these results sufficient? I didn't understand the meaning of "batch" correctly before. I need to do some further validation before closing this issue. By the way, will this branch be merged soon?

@KerfuffleV2
Collaborator

Do you need me to provide the results obtained by running with -n 512, or are these results sufficient?

I think it would be useful to see if you can still reproduce your original problem with a more normal -b. I guess that would mean testing with master rather than the pull.

@ggerganov
Owner

You are confusing the two parameters:

  • -n max number of tokens to generate for each sequence
  • -b the number of tokens to decode for each llama_decode() call

You just have to set -b 512 and don't change it.

-n depends on your application, but it can be -1 - sequences will generate until the EOT token, or until the following criterion is satisfied:

    if (client.n_decoded > 2 &&
            (id == llama_token_eos(model) ||
             (params.n_predict > 0 && client.n_decoded + client.n_prompt >= params.n_predict) ||
             client.response.find("User:") != std::string::npos ||
             client.response.find('\n') != std::string::npos)) {

You can adapt it to your needs.
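For an auto-complete use case, for example, you would probably not want to stop on the first newline or on "User:". A minimal sketch of an adapted condition, assuming the surrounding parallel.cpp loop stays as-is (the "\n\n\n" stop string is just an illustrative choice, not something from the example):

    // hypothetical stop condition for code completion: stop on EOS, on the
    // generation budget, or on a run of blank lines instead of the chat-style checks
    if (client.n_decoded > 2 &&
            (id == llama_token_eos(model) ||
             (params.n_predict > 0 && client.n_decoded + client.n_prompt >= params.n_predict) ||
             client.response.find("\n\n\n") != std::string::npos)) {
        // ... finish the sequence exactly as the original example does
    }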

@littlebai3618
Author

I tested the code from the 'kv-cache-opts' branch on a V100S-PCIE-32G. I ran the test with my modified 'parallel.cpp' (only the prompt was replaced with a longer one).

Here are the test results:

kv-cache-opts:

-c     -b   --parallel  --sequences  memusage
4096   512  2           60           31.0%
4096   512  4           60           Segmentation fault (core dumped)
8162   512  2           60           38.1%
8162   512  4           60           38.1%
16384  512  2           60           52.4%
16384  512  4           60           52.4% -> 55.2%
8162   512  2           600          38.1%

master: d103d93

-c     -b   --parallel  --sequences  memusage
4096   512  2           60           31.0%
4096   512  4           60           Segmentation fault (core dumped)
8162   512  2           60           38.1%
8162   512  4           60           38.1%
16384  512  2           60           52.4%
16384  512  4           60           52.4% -> 59.6% -> 63.9% -> 65.6%
8162   512  2           600          38.1%

When using -c 16384 with --parallel 4 on master, the memory usage gradually increases from 52.4% to 59.6%, then to 63.9%, and finally to 65.6%. However, it does not appear to crash.

I have only one sequence at a time in my use case, so it seems that I cannot trigger the KV cache fragmentation issue. I believe the problems I encountered before were due to my confusion between the -n and -b parameters (I was setting the batch size far too high).
