llama : add batched inference endpoint to server #3478

niubi-AI · 2023-10-04T19:10:07Z

for those not familiar with C like me.
it would be great if a new endpoint added to server.cpp to make batch inference.
for example:
endpoint: /completions
post: {"prompts":["promptA","promptB","promptC"]}
response:{"results":["sequenceA","sequenceB","sequenceC"]}

it is easy to do so with Hugging Face Transformers (as i do right now), but it's quite inefficient，hope to use llama.cpp to increase the efficiency oneday, cause I am not familiar with C, so can not use baby llama. I can only use javascript to Interact data with server.cpp。

PenutChen · 2023-10-05T01:32:21Z

Since there are many efficient quantization levels in llama.cpp, adding batch inference and continuous batching to the server will make it highly competitive with other inference frameworks like vllm or hf-tgi.

niubi-AI · 2023-10-05T10:10:08Z

Since there are many efficient quantization levels in llama.cpp, adding batch inference and continuous batching to the server will make it highly competitive with other inference frameworks like vllm or hf-tgi.

Yes, vllm and agi seem to be not available on windows。with transformers a batch of 10 sequences costs about 25 seconds, i think it would just costs 15 seconds if with llama.cpp. i have no idea cause i have not tested successfully.

IridiumMaster · 2023-10-11T04:03:57Z

I would also be interested in this one

ggerganov · 2023-10-24T16:38:46Z

Fixed via #3589 #3677

yanndupis · 2023-11-22T02:56:10Z

Thank you @ggerganov for adding this feature.

Maybe I am missing something; when I give an array of prompts to the request as described in the README, I will get the response only for the last element of the array instead of for each prompt in the array. Here is an example to reproduce:

Server:

./server --model mistral-7b-instruct-v0.1.Q4_0.gguf --port 8080 -c 8092 -cb

Client:

curl --request POST \
--url http://localhost:8080/completion \
--header "Content-Type: application/json" \
--data '{"prompt": ["<s>[INST] What is the capital of the US? [/INST]", "<s>[INST] What is the capital of France? [/INST]"], "n_predict": 2048}'

Output:

{"content":" The capital of France is Paris.","generation_settings":{"frequency_penalty":0.0,"grammar":"","ignore_eos":false,"logit_bias":[],"min_p":0.05000000074505806,"mirostat":0,"mirostat_eta":0.10000000149011612,"mirostat_tau":5.0,"model":"mistral-7b-instruct-v0.1.Q4_0.gguf","n_ctx":8092,"n_keep":0,"n_predict":2048,"n_probs":0,"penalize_nl":true,"presence_penalty":0.0,"repeat_last_n":64,"repeat_penalty":1.100000023841858,"seed":4294967295,"stop":[],"stream":false,"temp":0.800000011920929,"tfs_z":1.0,"top_k":40,"top_p":0.949999988079071,"typical_p":1.0},"model":"mistral-7b-instruct-v0.1.Q4_0.gguf","prompt":["<s>[INST] What is the capital of the US? [/INST]","<s>[INST] What is the capital of France? [/INST]"],"slot_id":0,"stop":true,"stopped_eos":true,"stopped_limit":false,"stopped_word":false,"stopping_word":"","timings":{"predicted_ms":167.096,"predicted_n":7,"predicted_per_second":41.892085986498785,"predicted_per_token_ms":23.870857142857144,"prompt_ms":1600.153,"prompt_n":36,"prompt_per_second":22.49784864322349,"prompt_per_token_ms":44.44869444444444},"tokens_cached":43,"tokens_evaluated":36,"tokens_predicted":7,"truncated":false}

I am on main (8e672ef)

Any idea? thank you!

brucethemoose · 2023-11-22T21:13:58Z

> curl --request POST \
> --url http://localhost:8080/completion \
> --header "Content-Type: application/json" \
> --data '{"prompt": ["<s>[INST] What is the capital of the US? [/INST]", "<s>[INST] What is the capital of France? [/INST]"], "n_predict": 2048}'

Did you ever solve this? I'm running into the same issue, I assume its a formatting error?

ggerganov · 2023-11-23T09:20:19Z

Ah, I think I got confused. We solved serving clients in parallel, but not processing prompts in parallel.
A workaround is to submit the prompts in separate requests.

yanndupis · 2023-11-27T17:53:14Z

Ah, I think I got confused. We solved serving clients in parallel but not processing prompts in parallel.
A workaround is to submit the prompts in separate requests.

Thank you for your response. Makes sense. Last question. I ran some benchmarks early last week using the workaround you described (submitting prompts in separate requests). The benchmarks were done on CPU only with OPENBLAS with a c5.4xlarge and m6i.12xlarge. I found it was faster or equivalent to run requests in sequence rather than running them in parallel. Do you think it's expected on CPU only, or did I miss some configurations?

I observed the following results running a Mistral model as follow:

./server --model mistral-7b-instruct-v0.1.Q4_0.gguf --port 8080  -np <nb parallel requests> -cb --ctx-size 8092

Thanks again.

ggerganov · 2023-11-27T18:00:43Z

@yanndupis

I think for quantum models, using OpenBLAS should be slower.
Could you post the results from the following commands:

make clean && make -j
./batched-bench mistral-7b-instruct-v0.1.Q4_0.gguf 8192 0 0 0 128,256,512 64,128,256 1,2,3,4

make clean && LLAMA_OPENBLAS=1 make -j
./batched-bench mistral-7b-instruct-v0.1.Q4_0.gguf 8192 0 0 0 128,256,512 64,128,256 1,2,3,4

yanndupis · 2023-11-27T22:08:02Z

@ggerganov

Here are the results with a c5.4xlarge instance.

make clean && make -j
./batched-bench mistral-7b-instruct-v0.1.Q4_0.gguf 8192 0 0 0 128,256,512 64,128,256 1,2,3,4


|    PP |     TG |    B |   N_KV |   T_PP s | S_PP t/s |   T_TG s | S_TG t/s |      T s |    S t/s |
|-------|--------|------|--------|----------|----------|----------|----------|----------|----------|
|   128 |     64 |    1 |    192 |    9.134 |    14.01 |    6.032 |    10.61 |   15.166 |    12.66 |
|   128 |     64 |    2 |    384 |   18.403 |    13.91 |   10.567 |    12.11 |   28.970 |    13.26 |
|   128 |     64 |    3 |    576 |   27.853 |    13.79 |   15.411 |    12.46 |   43.264 |    13.31 |
|   128 |     64 |    4 |    768 |   37.396 |    13.69 |   20.212 |    12.67 |   57.607 |    13.33 |
|   128 |    128 |    1 |    256 |    9.119 |    14.04 |   12.110 |    10.57 |   21.229 |    12.06 |
|   128 |    128 |    2 |    512 |   18.384 |    13.92 |   21.292 |    12.02 |   39.676 |    12.90 |
|   128 |    128 |    3 |    768 |   27.812 |    13.81 |   30.970 |    12.40 |   58.783 |    13.07 |
|   128 |    128 |    4 |   1024 |   37.379 |    13.70 |   40.625 |    12.60 |   78.005 |    13.13 |
|   128 |    256 |    1 |    384 |    9.121 |    14.03 |   24.325 |    10.52 |   33.446 |    11.48 |
|   128 |    256 |    2 |    768 |   18.385 |    13.92 |   42.826 |    11.96 |   61.212 |    12.55 |
|   128 |    256 |    3 |   1152 |   27.812 |    13.81 |   62.700 |    12.25 |   90.512 |    12.73 |
|   128 |    256 |    4 |   1536 |   37.365 |    13.70 |   82.518 |    12.41 |  119.883 |    12.81 |
|   256 |     64 |    1 |    320 |   18.373 |    13.93 |    6.100 |    10.49 |   24.473 |    13.08 |
|   256 |     64 |    2 |    640 |   37.377 |    13.70 |   10.739 |    11.92 |   48.116 |    13.30 |
|   256 |     64 |    3 |    960 |   56.377 |    13.62 |   15.807 |    12.15 |   72.184 |    13.30 |
|   256 |     64 |    4 |   1280 |   75.979 |    13.48 |   20.796 |    12.31 |   96.775 |    13.23 |
|   256 |    128 |    1 |    384 |   18.377 |    13.93 |   12.262 |    10.44 |   30.638 |    12.53 |
|   256 |    128 |    2 |    768 |   37.374 |    13.70 |   21.536 |    11.89 |   58.910 |    13.04 |
|   256 |    128 |    3 |   1152 |   56.354 |    13.63 |   31.689 |    12.12 |   88.042 |    13.08 |
|   256 |    128 |    4 |   1536 |   75.947 |    13.48 |   41.910 |    12.22 |  117.856 |    13.03 |
|   256 |    256 |    1 |    512 |   18.388 |    13.92 |   24.531 |    10.44 |   42.919 |    11.93 |
|   256 |    256 |    2 |   1024 |   37.364 |    13.70 |   43.535 |    11.76 |   80.899 |    12.66 |
|   256 |    256 |    3 |   1536 |   56.337 |    13.63 |   64.210 |    11.96 |  120.547 |    12.74 |
|   256 |    256 |    4 |   2048 |   75.977 |    13.48 |   85.182 |    12.02 |  161.159 |    12.71 |
|   512 |     64 |    1 |    576 |   37.360 |    13.70 |    6.212 |    10.30 |   43.572 |    13.22 |
|   512 |     64 |    2 |   1152 |   75.952 |    13.48 |   11.064 |    11.57 |   87.015 |    13.24 |
|   512 |     64 |    3 |   1728 |  115.824 |    13.26 |   16.481 |    11.65 |  132.306 |    13.06 |
|   512 |     64 |    4 |   2304 |  156.991 |    13.05 |   21.984 |    11.64 |  178.976 |    12.87 |
|   512 |    128 |    1 |    640 |   37.355 |    13.71 |   12.377 |    10.34 |   49.733 |    12.87 |
|   512 |    128 |    2 |   1280 |   75.980 |    13.48 |   22.245 |    11.51 |   98.225 |    13.03 |
|   512 |    128 |    3 |   1920 |  115.843 |    13.26 |   33.178 |    11.57 |  149.021 |    12.88 |
|   512 |    128 |    4 |   2560 |  156.904 |    13.05 |   44.468 |    11.51 |  201.372 |    12.71 |
|   512 |    256 |    1 |    768 |   37.368 |    13.70 |   25.535 |    10.03 |   62.903 |    12.21 |
|   512 |    256 |    2 |   1536 |   76.349 |    13.41 |   45.810 |    11.18 |  122.159 |    12.57 |
|   512 |    256 |    3 |   2304 |  116.028 |    13.24 |   67.129 |    11.44 |  183.157 |    12.58 |
|   512 |    256 |    4 |   3072 |  156.851 |    13.06 |   89.953 |    11.38 |  246.804 |    12.45 |

make clean && LLAMA_OPENBLAS=1 make -j
./batched-bench mistral-7b-instruct-v0.1.Q4_0.gguf 8192 0 0 0 128,256,512 64,128,256 1,2,3,4

|    PP |     TG |    B |   N_KV |   T_PP s | S_PP t/s |   T_TG s | S_TG t/s |      T s |    S t/s |
|-------|--------|------|--------|----------|----------|----------|----------|----------|----------|
|   128 |     64 |    1 |    192 |   18.174 |     7.04 |    6.118 |    10.46 |   24.292 |     7.90 |
|   128 |     64 |    2 |    384 |   36.632 |     6.99 |   10.612 |    12.06 |   47.244 |     8.13 |
|   128 |     64 |    3 |    576 |   55.341 |     6.94 |   15.441 |    12.43 |   70.783 |     8.14 |
|   128 |     64 |    4 |    768 |   74.304 |     6.89 |   20.213 |    12.66 |   94.517 |     8.13 |
|   128 |    128 |    1 |    256 |   18.149 |     7.05 |   12.244 |    10.45 |   30.392 |     8.42 |
|   128 |    128 |    2 |    512 |   36.582 |     7.00 |   21.313 |    12.01 |   57.894 |     8.84 |
|   128 |    128 |    3 |    768 |   55.278 |     6.95 |   31.022 |    12.38 |   86.300 |     8.90 |
|   128 |    128 |    4 |   1024 |   74.250 |     6.90 |   40.818 |    12.54 |  115.069 |     8.90 |
|   128 |    256 |    1 |    384 |   18.177 |     7.04 |   24.641 |    10.39 |   42.818 |     8.97 |
|   128 |    256 |    2 |    768 |   36.531 |     7.01 |   42.994 |    11.91 |   79.524 |     9.66 |
|   128 |    256 |    3 |   1152 |   55.250 |     6.95 |   62.940 |    12.20 |  118.190 |     9.75 |
|   128 |    256 |    4 |   1536 |   74.210 |     6.90 |   82.951 |    12.34 |  157.161 |     9.77 |
|   256 |     64 |    1 |    320 |   36.591 |     7.00 |    6.152 |    10.40 |   42.743 |     7.49 |
|   256 |     64 |    2 |    640 |   74.140 |     6.91 |   10.775 |    11.88 |   84.915 |     7.54 |
|   256 |     64 |    3 |    960 |  111.936 |     6.86 |   15.903 |    12.07 |  127.839 |     7.51 |
|   256 |     64 |    4 |   1280 |  150.739 |     6.79 |   20.831 |    12.29 |  171.570 |     7.46 |
|   256 |    128 |    1 |    384 |   36.584 |     7.00 |   12.359 |    10.36 |   48.943 |     7.85 |
|   256 |    128 |    2 |    768 |   74.294 |     6.89 |   21.733 |    11.78 |   96.027 |     8.00 |
|   256 |    128 |    3 |   1152 |  112.010 |     6.86 |   31.817 |    12.07 |  143.826 |     8.01 |
|   256 |    128 |    4 |   1536 |  150.994 |     6.78 |   42.019 |    12.18 |  193.013 |     7.96 |
|   256 |    256 |    1 |    512 |   36.614 |     6.99 |   24.766 |    10.34 |   61.380 |     8.34 |
|   256 |    256 |    2 |   1024 |   74.287 |     6.89 |   43.797 |    11.69 |  118.085 |     8.67 |
|   256 |    256 |    3 |   1536 |  112.089 |     6.85 |   64.478 |    11.91 |  176.566 |     8.70 |
|   256 |    256 |    4 |   2048 |  150.856 |     6.79 |   85.181 |    12.02 |  236.037 |     8.68 |
|   512 |     64 |    1 |    576 |   74.310 |     6.89 |    6.246 |    10.25 |   80.556 |     7.15 |
|   512 |     64 |    2 |   1152 |  150.789 |     6.79 |   11.167 |    11.46 |  161.956 |     7.11 |
|   512 |     64 |    3 |   1728 |  229.332 |     6.70 |   16.562 |    11.59 |  245.894 |     7.03 |
|   512 |     64 |    4 |   2304 |  310.210 |     6.60 |   22.083 |    11.59 |  332.294 |     6.93 |
|   512 |    128 |    1 |    640 |   74.204 |     6.90 |   12.524 |    10.22 |   86.728 |     7.38 |
|   512 |    128 |    2 |   1280 |  150.649 |     6.80 |   22.385 |    11.44 |  173.034 |     7.40 |
|   512 |    128 |    3 |   1920 |  229.219 |     6.70 |   33.261 |    11.54 |  262.480 |     7.31 |
|   512 |    128 |    4 |   2560 |  310.168 |     6.60 |   44.466 |    11.51 |  354.634 |     7.22 |
|   512 |    256 |    1 |    768 |   74.255 |     6.90 |   25.159 |    10.18 |   99.414 |     7.73 |
|   512 |    256 |    2 |   1536 |  150.639 |     6.80 |   45.051 |    11.36 |  195.690 |     7.85 |
|   512 |    256 |    3 |   2304 |  229.249 |     6.70 |   67.291 |    11.41 |  296.540 |     7.77 |
|   512 |    256 |    4 |   3072 |  310.155 |     6.60 |   90.179 |    11.36 |  400.334 |     7.67 |

For the use case I was benchmarking, my prompt was much longer than the generated response, so it might be similar to the scenario PP=256, TG=64 or PP=512, TG=64. I didn't realize it would be overall slower for a quantized model with OpenBLAS.

ggerganov · 2023-11-27T22:22:27Z

You can try the OpenBLAS bench with this PR: #4240 because currently on master it is not being called at all. But my expectation is that it still would be slower, because with quantum models we need to dequantize to F32 which is expensive.

So based on the results, indeed the performance does not scale with more batches on these machines (the TG speed is roughly the same across B).
Can you also run the following memcpy benchmark to check what is the memory bandwidth:

git clone https://github.com/ggerganov/whisper.cpp && cd whisper.cpp && make -j bench && ./bench -w 1 -t 8

It can take about a minute or two to run.

yanndupis · 2023-11-27T22:49:43Z

Thanks for the explanation @ggerganov and for continuing to look into it; it's super helpful.

Here is the output using the same instance:

git clone https://github.com/ggerganov/whisper.cpp && cd whisper.cpp && make -j bench && ./bench -w 1 -t 8

memcpy:    5.46 GB/s (heat-up)
memcpy:    5.47 GB/s ( 1 thread)
memcpy:    5.46 GB/s ( 1 thread)
memcpy:   10.70 GB/s ( 2 thread)
memcpy:   15.60 GB/s ( 3 thread)
memcpy:   21.20 GB/s ( 4 thread)
memcpy:   26.38 GB/s ( 5 thread)
memcpy:   30.72 GB/s ( 6 thread)
memcpy:   30.94 GB/s ( 7 thread)
memcpy:   40.19 GB/s ( 8 thread)
sum:    -5119996794.000000

yanndupis · 2023-11-28T00:48:01Z

And here is the output with OpenBLAS bench using the PR: #4240. The results definitely look better.

make clean && LLAMA_OPENBLAS=1 make -j
./batched-bench mistral-7b-instruct-v0.1.Q4_0.gguf 8192 0 0 0 128,256,512 64,128,256 1,2,3,4

|    PP |     TG |    B |   N_KV |   T_PP s | S_PP t/s |   T_TG s | S_TG t/s |      T s |    S t/s |
|-------|--------|------|--------|----------|----------|----------|----------|----------|----------|
|   128 |     64 |    1 |    192 |   22.274 |     5.75 |    6.203 |    10.32 |   28.477 |     6.74 |
|   128 |     64 |    2 |    384 |   25.250 |    10.14 |   10.749 |    11.91 |   36.000 |    10.67 |
|   128 |     64 |    3 |    576 |   28.381 |    13.53 |   15.625 |    12.29 |   44.006 |    13.09 |
|   128 |     64 |    4 |    768 |   32.481 |    15.76 |   20.318 |    12.60 |   52.799 |    14.55 |
|   128 |    128 |    1 |    256 |   21.942 |     5.83 |   12.378 |    10.34 |   34.320 |     7.46 |
|   128 |    128 |    2 |    512 |   25.361 |    10.09 |   21.520 |    11.90 |   46.882 |    10.92 |
|   128 |    128 |    3 |    768 |   28.353 |    13.54 |   31.315 |    12.26 |   59.668 |    12.87 |
|   128 |    128 |    4 |   1024 |   32.247 |    15.88 |   41.003 |    12.49 |   73.249 |    13.98 |
|   128 |    256 |    1 |    384 |   22.347 |     5.73 |   24.747 |    10.34 |   47.094 |     8.15 |
|   128 |    256 |    2 |    768 |   24.880 |    10.29 |   43.299 |    11.82 |   68.178 |    11.26 |
|   128 |    256 |    3 |   1152 |   28.059 |    13.69 |   63.231 |    12.15 |   91.290 |    12.62 |
|   128 |    256 |    4 |   1536 |   32.309 |    15.85 |   83.177 |    12.31 |  115.486 |    13.30 |
|   256 |     64 |    1 |    320 |   25.565 |    10.01 |    6.271 |    10.21 |   31.836 |    10.05 |
|   256 |     64 |    2 |    640 |   32.120 |    15.94 |   10.925 |    11.72 |   43.045 |    14.87 |
|   256 |     64 |    3 |    960 |   59.151 |    12.98 |   16.043 |    11.97 |   75.195 |    12.77 |
|   256 |     64 |    4 |   1280 |   66.605 |    15.37 |   21.009 |    12.19 |   87.614 |    14.61 |
|   256 |    128 |    1 |    384 |   25.329 |    10.11 |   12.443 |    10.29 |   37.773 |    10.17 |
|   256 |    128 |    2 |    768 |   32.508 |    15.75 |   21.944 |    11.67 |   54.452 |    14.10 |
|   256 |    128 |    3 |   1152 |   59.150 |    12.98 |   32.070 |    11.97 |   91.220 |    12.63 |
|   256 |    128 |    4 |   1536 |   66.429 |    15.41 |   42.258 |    12.12 |  108.687 |    14.13 |
|   256 |    256 |    1 |    512 |   25.014 |    10.23 |   24.961 |    10.26 |   49.976 |    10.24 |
|   256 |    256 |    2 |   1024 |   32.018 |    15.99 |   44.029 |    11.63 |   76.047 |    13.47 |
|   256 |    256 |    3 |   1536 |   59.238 |    12.96 |   64.758 |    11.86 |  123.996 |    12.39 |
|   256 |    256 |    4 |   2048 |   66.678 |    15.36 |   85.529 |    11.97 |  152.207 |    13.46 |
|   512 |     64 |    1 |    576 |   32.695 |    15.66 |    6.345 |    10.09 |   39.040 |    14.75 |
|   512 |     64 |    2 |   1152 |   66.773 |    15.34 |   11.294 |    11.33 |   78.067 |    14.76 |
|   512 |     64 |    3 |   1728 |  105.135 |    14.61 |   16.727 |    11.48 |  121.862 |    14.18 |
|   512 |     64 |    4 |   2304 |  141.769 |    14.45 |   22.249 |    11.51 |  164.018 |    14.05 |
|   512 |    128 |    1 |    640 |   32.249 |    15.88 |   12.661 |    10.11 |   44.910 |    14.25 |
|   512 |    128 |    2 |   1280 |   66.232 |    15.46 |   22.596 |    11.33 |   88.828 |    14.41 |
|   512 |    128 |    3 |   1920 |  103.211 |    14.88 |   33.480 |    11.47 |  136.691 |    14.05 |
|   512 |    128 |    4 |   2560 |  142.096 |    14.41 |   44.732 |    11.45 |  186.828 |    13.70 |
|   512 |    256 |    1 |    768 |   32.677 |    15.67 |   25.374 |    10.09 |   58.052 |    13.23 |
|   512 |    256 |    2 |   1536 |   66.787 |    15.33 |   45.389 |    11.28 |  112.176 |    13.69 |
|   512 |    256 |    3 |   2304 |  103.819 |    14.80 |   67.621 |    11.36 |  171.440 |    13.44 |
|   512 |    256 |    4 |   3072 |  141.340 |    14.49 |   90.657 |    11.30 |  231.997 |    13.24 |

ggerganov · 2023-11-28T09:10:21Z

If you disable mmap and use 16 threads without OpenBLAS, it seems you can get the best performance on that instance:

diff --git a/examples/batched-bench/batched-bench.cpp b/examples/batched-bench/batched-bench.cpp
index 533c55c..277c901 100644
--- a/examples/batched-bench/batched-bench.cpp
+++ b/examples/batched-bench/batched-bench.cpp
@@ -89,6 +89,7 @@ int main(int argc, char ** argv) {
     llama_model_params model_params = llama_model_default_params();
 
     model_params.n_gpu_layers = n_gpu_layers;
+    model_params.use_mmap = false;
 
     llama_model * model = llama_load_model_from_file(params.model.c_str(), model_params);
 
@@ -104,8 +105,8 @@ int main(int argc, char ** argv) {
     ctx_params.n_batch   = 512;
     ctx_params.mul_mat_q = mmq;
 
-    ctx_params.n_threads       = params.n_threads;
-    ctx_params.n_threads_batch = params.n_threads_batch == -1 ? params.n_threads : params.n_threads_batch;
+    ctx_params.n_threads       = 16;
+    ctx_params.n_threads_batch = 16;
 
     llama_context * ctx = llama_new_context_with_model(model, ctx_params);

make -j batched-bench && ./batched-bench ./models/openhermes-2.5-mistral-7b.Q4_0.gguf 8192 0 0 0 256 64 1,2,3,4

|    PP |     TG |    B |   N_KV |   T_PP s | S_PP t/s |   T_TG s | S_TG t/s |      T s |    S t/s |
|-------|--------|------|--------|----------|----------|----------|----------|----------|----------|
|   256 |     64 |    1 |    320 |   16.815 |    15.22 |    4.977 |    12.86 |   21.792 |    14.68 |
|   256 |     64 |    2 |    640 |   34.168 |    14.98 |    9.319 |    13.74 |   43.486 |    14.72 |
|   256 |     64 |    3 |    960 |   51.462 |    14.92 |   13.814 |    13.90 |   65.276 |    14.71 |
|   256 |     64 |    4 |   1280 |   69.378 |    14.76 |   18.535 |    13.81 |   87.912 |    14.56 |

Though I would have expected it to scale better with the batch size. Not sure -- maybe I'm still missing something.

Btw, I also tried a similar Arm-based instance: c6gn.4xlarge instance and it seems faster:

make -j batched-bench && ./batched-bench ./models/openhermes-2.5-mistral-7b.Q4_0.gguf 8192 0 0 0 256 64 1,2,3,4

|    PP |     TG |    B |   N_KV |   T_PP s | S_PP t/s |   T_TG s | S_TG t/s |      T s |    S t/s |
|-------|--------|------|--------|----------|----------|----------|----------|----------|----------|
|   256 |     64 |    1 |    320 |   12.630 |    20.27 |    3.608 |    17.74 |   16.238 |    19.71 |
|   256 |     64 |    2 |    640 |   25.660 |    19.95 |    7.022 |    18.23 |   32.683 |    19.58 |
|   256 |     64 |    3 |    960 |   38.759 |    19.81 |   10.527 |    18.24 |   49.286 |    19.48 |
|   256 |     64 |    4 |   1280 |   52.128 |    19.64 |   14.108 |    18.15 |   66.236 |    19.32 |

For comparison, here is how it scales on my AMD Ryzen 9 5950X 16-Core Processor at home:

make -j batched-bench && ./batched-bench ./models/openhermes-2.5-mistral-7b.Q4_0.gguf 8192 0 0 0 256 64 1,2,3,4

|    PP |     TG |    B |   N_KV |   T_PP s | S_PP t/s |   T_TG s | S_TG t/s |      T s |    S t/s |
|-------|--------|------|--------|----------|----------|----------|----------|----------|----------|
|   256 |     64 |    1 |    320 |    5.474 |    46.77 |    9.402 |     6.81 |   14.876 |    21.51 |
|   256 |     64 |    2 |    640 |   11.282 |    45.38 |    9.494 |    13.48 |   20.776 |    30.80 |
|   256 |     64 |    3 |    960 |   16.953 |    45.30 |    9.932 |    19.33 |   26.884 |    35.71 |
|   256 |     64 |    4 |   1280 |   23.334 |    43.88 |   10.460 |    24.47 |   33.794 |    37.88 |

Note how the TG time for 1,2,3,4 batches is almost constant - this is what we normally want.

yanndupis · 2023-11-28T17:55:16Z

Excellent, thank you @ggerganov, for sharing these findings. I will then focus my efforts on Arm-bases instances.

staviq added enhancement New feature or request server/webui labels Oct 6, 2023

ggerganov changed the title ~~will a batch inference endpoint be added to server.cpp?~~ llama : add batched inference endpoint to server Oct 11, 2023

ggerganov added this to ggml : roadmap Oct 11, 2023

ggerganov moved this to Todo in ggml : roadmap Oct 11, 2023

ggerganov added the help wanted Extra attention is needed label Oct 11, 2023

yourbuddyconner mentioned this issue Oct 13, 2023

add support for batched inference abetlen/llama-cpp-python#818

Closed

ggerganov moved this from Todo to In Progress in ggml : roadmap Oct 18, 2023

ggerganov closed this as completed Oct 24, 2023

ggerganov moved this from In Progress to Done in ggml : roadmap Oct 24, 2023

ggerganov mentioned this issue Nov 25, 2023

server : improvements and maintenance #4216

Open

10 tasks

ziedbha mentioned this issue Nov 27, 2023

Add single-client multi-prompt support #4232

Merged

ggerganov mentioned this issue Nov 29, 2023

[Feature Request] parallel decoding on mobile #4064

Closed

4 tasks

nguyenhoangthuan99 mentioned this issue Jul 30, 2024

Discussion: batching benchmark and improvement janhq/cortex.llamacpp#164

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

llama : add batched inference endpoint to server #3478

llama : add batched inference endpoint to server #3478

niubi-AI commented Oct 4, 2023

PenutChen commented Oct 5, 2023

niubi-AI commented Oct 5, 2023

IridiumMaster commented Oct 11, 2023

ggerganov commented Oct 24, 2023

yanndupis commented Nov 22, 2023

brucethemoose commented Nov 22, 2023 •

edited

Loading

ggerganov commented Nov 23, 2023

yanndupis commented Nov 27, 2023

ggerganov commented Nov 27, 2023

yanndupis commented Nov 27, 2023

ggerganov commented Nov 27, 2023

yanndupis commented Nov 27, 2023

yanndupis commented Nov 28, 2023

ggerganov commented Nov 28, 2023

yanndupis commented Nov 28, 2023

llama : add batched inference endpoint to server #3478

llama : add batched inference endpoint to server #3478

Comments

niubi-AI commented Oct 4, 2023

PenutChen commented Oct 5, 2023

niubi-AI commented Oct 5, 2023

IridiumMaster commented Oct 11, 2023

ggerganov commented Oct 24, 2023

yanndupis commented Nov 22, 2023

brucethemoose commented Nov 22, 2023 • edited Loading

ggerganov commented Nov 23, 2023

yanndupis commented Nov 27, 2023

ggerganov commented Nov 27, 2023

yanndupis commented Nov 27, 2023

ggerganov commented Nov 27, 2023

yanndupis commented Nov 27, 2023

yanndupis commented Nov 28, 2023

ggerganov commented Nov 28, 2023

yanndupis commented Nov 28, 2023

brucethemoose commented Nov 22, 2023 •

edited

Loading