
llama : add batched inference endpoint to server #3478

Closed
niubi-AI opened this issue Oct 4, 2023 · 15 comments
Labels
enhancement New feature or request help wanted Extra attention is needed server/webui

Comments

@niubi-AI

niubi-AI commented Oct 4, 2023

For those not familiar with C, like me, it would be great if a new endpoint were added to server.cpp for batch inference.
For example:
endpoint: /completions
post: {"prompts":["promptA","promptB","promptC"]}
response: {"results":["sequenceA","sequenceB","sequenceC"]}

It is easy to do this with Hugging Face Transformers (as I do right now), but it's quite inefficient. I hope to use llama.cpp to improve the efficiency one day. Since I am not familiar with C, I cannot use baby llama; I can only use JavaScript to exchange data with server.cpp.
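For illustration, here is a minimal Python sketch of what a client for such an endpoint could look like. To be clear, the /completions route and the prompts/results fields are just the hypothetical example above; they are not implemented in server.cpp.

# Client sketch for the *proposed* batched endpoint (hypothetical, not implemented).
# It assumes server.cpp would accept POST /completions with {"prompts": [...]}
# and return {"results": [...]}, as in the example above.
import requests

def batched_completions(prompts, url="http://localhost:8080/completions"):
    # One request carries all prompts; the server would decode them as a single batch.
    resp = requests.post(url, json={"prompts": prompts})
    resp.raise_for_status()
    return resp.json()["results"]

prompts = ["promptA", "promptB", "promptC"]
for prompt, result in zip(prompts, batched_completions(prompts)):
    print(prompt, "->", result)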

@PenutChen

Since there are many efficient quantization levels in llama.cpp, adding batch inference and continuous batching to the server will make it highly competitive with other inference frameworks like vllm or hf-tgi.

@niubi-AI
Author

niubi-AI commented Oct 5, 2023

> Since there are many efficient quantization levels in llama.cpp, adding batch inference and continuous batching to the server will make it highly competitive with other inference frameworks like vllm or hf-tgi.

Yes, vllm and tgi don't seem to be available on Windows. With Transformers, a batch of 10 sequences takes about 25 seconds; I think it would take only about 15 seconds with llama.cpp, but I'm not sure since I haven't managed to test it.

@staviq staviq added enhancement New feature or request server/webui labels Oct 6, 2023
@IridiumMaster

I would also be interested in this one

@ggerganov ggerganov changed the title will a batch inference endpoint be added to server.cpp? llama : add batched inference endpoint to server Oct 11, 2023
@ggerganov ggerganov added the help wanted Extra attention is needed label Oct 11, 2023
@ggerganov
Owner

Fixed via #3589 #3677

@yanndupis

Thank you @ggerganov for adding this feature.

Maybe I am missing something: when I pass an array of prompts in the request as described in the README, I get a response only for the last element of the array instead of one for each prompt. Here is an example to reproduce:

Server:

./server --model mistral-7b-instruct-v0.1.Q4_0.gguf --port 8080 -c 8092 -cb

Client:

curl --request POST \
--url http://localhost:8080/completion \
--header "Content-Type: application/json" \
--data '{"prompt": ["<s>[INST] What is the capital of the US? [/INST]", "<s>[INST] What is the capital of France? [/INST]"], "n_predict": 2048}'

Output:

{"content":" The capital of France is Paris.","generation_settings":{"frequency_penalty":0.0,"grammar":"","ignore_eos":false,"logit_bias":[],"min_p":0.05000000074505806,"mirostat":0,"mirostat_eta":0.10000000149011612,"mirostat_tau":5.0,"model":"mistral-7b-instruct-v0.1.Q4_0.gguf","n_ctx":8092,"n_keep":0,"n_predict":2048,"n_probs":0,"penalize_nl":true,"presence_penalty":0.0,"repeat_last_n":64,"repeat_penalty":1.100000023841858,"seed":4294967295,"stop":[],"stream":false,"temp":0.800000011920929,"tfs_z":1.0,"top_k":40,"top_p":0.949999988079071,"typical_p":1.0},"model":"mistral-7b-instruct-v0.1.Q4_0.gguf","prompt":["<s>[INST] What is the capital of the US? [/INST]","<s>[INST] What is the capital of France? [/INST]"],"slot_id":0,"stop":true,"stopped_eos":true,"stopped_limit":false,"stopped_word":false,"stopping_word":"","timings":{"predicted_ms":167.096,"predicted_n":7,"predicted_per_second":41.892085986498785,"predicted_per_token_ms":23.870857142857144,"prompt_ms":1600.153,"prompt_n":36,"prompt_per_second":22.49784864322349,"prompt_per_token_ms":44.44869444444444},"tokens_cached":43,"tokens_evaluated":36,"tokens_predicted":7,"truncated":false}

I am on main (8e672ef)

Any idea? thank you!

@brucethemoose

brucethemoose commented Nov 22, 2023

> curl --request POST \
> --url http://localhost:8080/completion \
> --header "Content-Type: application/json" \
> --data '{"prompt": ["<s>[INST] What is the capital of the US? [/INST]", "<s>[INST] What is the capital of France? [/INST]"], "n_predict": 2048}'

Did you ever solve this? I'm running into the same issue; I assume it's a formatting error?

@ggerganov
Owner

Ah, I think I got confused. We solved serving clients in parallel, but not processing prompts in parallel.
A workaround is to submit the prompts in separate requests.
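As a reference, here is a minimal Python sketch of that workaround, submitting one prompt per request concurrently (assuming the server was started with -np/-cb as in the examples above); the /completion route and the prompt, n_predict and content fields match the curl example and output earlier in this thread.

# Workaround sketch: one request per prompt, sent concurrently so that the
# server's parallel slots (-np) and continuous batching (-cb) can serve them together.
import requests
from concurrent.futures import ThreadPoolExecutor

URL = "http://localhost:8080/completion"
prompts = [
    "<s>[INST] What is the capital of the US? [/INST]",
    "<s>[INST] What is the capital of France? [/INST]",
]

def complete(prompt):
    resp = requests.post(URL, json={"prompt": prompt, "n_predict": 128})
    resp.raise_for_status()
    return resp.json()["content"]

with ThreadPoolExecutor(max_workers=len(prompts)) as pool:
    for prompt, content in zip(prompts, pool.map(complete, prompts)):
        print(prompt, "->", content)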

@yanndupis

> Ah, I think I got confused. We solved serving clients in parallel but not processing prompts in parallel.
> A workaround is to submit the prompts in separate requests.

Thank you for your response, that makes sense. One last question: I ran some benchmarks early last week using the workaround you described (submitting prompts in separate requests). The benchmarks were done on CPU only with OpenBLAS on c5.4xlarge and m6i.12xlarge instances. I found it was faster, or at best equivalent, to run the requests in sequence rather than in parallel. Is that expected on CPU only, or did I miss some configuration?

I observed the following results running a Mistral model as follows:

./server --model mistral-7b-instruct-v0.1.Q4_0.gguf --port 8080  -np <nb parallel requests> -cb --ctx-size 8092

Thanks again.

@ggerganov
Owner

@yanndupis

I think for quantum models, using OpenBLAS should be slower.
Could you post the results from the following commands:

make clean && make -j
./batched-bench mistral-7b-instruct-v0.1.Q4_0.gguf 8192 0 0 0 128,256,512 64,128,256 1,2,3,4

make clean && LLAMA_OPENBLAS=1 make -j
./batched-bench mistral-7b-instruct-v0.1.Q4_0.gguf 8192 0 0 0 128,256,512 64,128,256 1,2,3,4

@yanndupis

@ggerganov

Here are the results with a c5.4xlarge instance.

make clean && make -j
./batched-bench mistral-7b-instruct-v0.1.Q4_0.gguf 8192 0 0 0 128,256,512 64,128,256 1,2,3,4


|    PP |     TG |    B |   N_KV |   T_PP s | S_PP t/s |   T_TG s | S_TG t/s |      T s |    S t/s |
|-------|--------|------|--------|----------|----------|----------|----------|----------|----------|
|   128 |     64 |    1 |    192 |    9.134 |    14.01 |    6.032 |    10.61 |   15.166 |    12.66 |
|   128 |     64 |    2 |    384 |   18.403 |    13.91 |   10.567 |    12.11 |   28.970 |    13.26 |
|   128 |     64 |    3 |    576 |   27.853 |    13.79 |   15.411 |    12.46 |   43.264 |    13.31 |
|   128 |     64 |    4 |    768 |   37.396 |    13.69 |   20.212 |    12.67 |   57.607 |    13.33 |
|   128 |    128 |    1 |    256 |    9.119 |    14.04 |   12.110 |    10.57 |   21.229 |    12.06 |
|   128 |    128 |    2 |    512 |   18.384 |    13.92 |   21.292 |    12.02 |   39.676 |    12.90 |
|   128 |    128 |    3 |    768 |   27.812 |    13.81 |   30.970 |    12.40 |   58.783 |    13.07 |
|   128 |    128 |    4 |   1024 |   37.379 |    13.70 |   40.625 |    12.60 |   78.005 |    13.13 |
|   128 |    256 |    1 |    384 |    9.121 |    14.03 |   24.325 |    10.52 |   33.446 |    11.48 |
|   128 |    256 |    2 |    768 |   18.385 |    13.92 |   42.826 |    11.96 |   61.212 |    12.55 |
|   128 |    256 |    3 |   1152 |   27.812 |    13.81 |   62.700 |    12.25 |   90.512 |    12.73 |
|   128 |    256 |    4 |   1536 |   37.365 |    13.70 |   82.518 |    12.41 |  119.883 |    12.81 |
|   256 |     64 |    1 |    320 |   18.373 |    13.93 |    6.100 |    10.49 |   24.473 |    13.08 |
|   256 |     64 |    2 |    640 |   37.377 |    13.70 |   10.739 |    11.92 |   48.116 |    13.30 |
|   256 |     64 |    3 |    960 |   56.377 |    13.62 |   15.807 |    12.15 |   72.184 |    13.30 |
|   256 |     64 |    4 |   1280 |   75.979 |    13.48 |   20.796 |    12.31 |   96.775 |    13.23 |
|   256 |    128 |    1 |    384 |   18.377 |    13.93 |   12.262 |    10.44 |   30.638 |    12.53 |
|   256 |    128 |    2 |    768 |   37.374 |    13.70 |   21.536 |    11.89 |   58.910 |    13.04 |
|   256 |    128 |    3 |   1152 |   56.354 |    13.63 |   31.689 |    12.12 |   88.042 |    13.08 |
|   256 |    128 |    4 |   1536 |   75.947 |    13.48 |   41.910 |    12.22 |  117.856 |    13.03 |
|   256 |    256 |    1 |    512 |   18.388 |    13.92 |   24.531 |    10.44 |   42.919 |    11.93 |
|   256 |    256 |    2 |   1024 |   37.364 |    13.70 |   43.535 |    11.76 |   80.899 |    12.66 |
|   256 |    256 |    3 |   1536 |   56.337 |    13.63 |   64.210 |    11.96 |  120.547 |    12.74 |
|   256 |    256 |    4 |   2048 |   75.977 |    13.48 |   85.182 |    12.02 |  161.159 |    12.71 |
|   512 |     64 |    1 |    576 |   37.360 |    13.70 |    6.212 |    10.30 |   43.572 |    13.22 |
|   512 |     64 |    2 |   1152 |   75.952 |    13.48 |   11.064 |    11.57 |   87.015 |    13.24 |
|   512 |     64 |    3 |   1728 |  115.824 |    13.26 |   16.481 |    11.65 |  132.306 |    13.06 |
|   512 |     64 |    4 |   2304 |  156.991 |    13.05 |   21.984 |    11.64 |  178.976 |    12.87 |
|   512 |    128 |    1 |    640 |   37.355 |    13.71 |   12.377 |    10.34 |   49.733 |    12.87 |
|   512 |    128 |    2 |   1280 |   75.980 |    13.48 |   22.245 |    11.51 |   98.225 |    13.03 |
|   512 |    128 |    3 |   1920 |  115.843 |    13.26 |   33.178 |    11.57 |  149.021 |    12.88 |
|   512 |    128 |    4 |   2560 |  156.904 |    13.05 |   44.468 |    11.51 |  201.372 |    12.71 |
|   512 |    256 |    1 |    768 |   37.368 |    13.70 |   25.535 |    10.03 |   62.903 |    12.21 |
|   512 |    256 |    2 |   1536 |   76.349 |    13.41 |   45.810 |    11.18 |  122.159 |    12.57 |
|   512 |    256 |    3 |   2304 |  116.028 |    13.24 |   67.129 |    11.44 |  183.157 |    12.58 |
|   512 |    256 |    4 |   3072 |  156.851 |    13.06 |   89.953 |    11.38 |  246.804 |    12.45 |
make clean && LLAMA_OPENBLAS=1 make -j
./batched-bench mistral-7b-instruct-v0.1.Q4_0.gguf 8192 0 0 0 128,256,512 64,128,256 1,2,3,4

|    PP |     TG |    B |   N_KV |   T_PP s | S_PP t/s |   T_TG s | S_TG t/s |      T s |    S t/s |
|-------|--------|------|--------|----------|----------|----------|----------|----------|----------|
|   128 |     64 |    1 |    192 |   18.174 |     7.04 |    6.118 |    10.46 |   24.292 |     7.90 |
|   128 |     64 |    2 |    384 |   36.632 |     6.99 |   10.612 |    12.06 |   47.244 |     8.13 |
|   128 |     64 |    3 |    576 |   55.341 |     6.94 |   15.441 |    12.43 |   70.783 |     8.14 |
|   128 |     64 |    4 |    768 |   74.304 |     6.89 |   20.213 |    12.66 |   94.517 |     8.13 |
|   128 |    128 |    1 |    256 |   18.149 |     7.05 |   12.244 |    10.45 |   30.392 |     8.42 |
|   128 |    128 |    2 |    512 |   36.582 |     7.00 |   21.313 |    12.01 |   57.894 |     8.84 |
|   128 |    128 |    3 |    768 |   55.278 |     6.95 |   31.022 |    12.38 |   86.300 |     8.90 |
|   128 |    128 |    4 |   1024 |   74.250 |     6.90 |   40.818 |    12.54 |  115.069 |     8.90 |
|   128 |    256 |    1 |    384 |   18.177 |     7.04 |   24.641 |    10.39 |   42.818 |     8.97 |
|   128 |    256 |    2 |    768 |   36.531 |     7.01 |   42.994 |    11.91 |   79.524 |     9.66 |
|   128 |    256 |    3 |   1152 |   55.250 |     6.95 |   62.940 |    12.20 |  118.190 |     9.75 |
|   128 |    256 |    4 |   1536 |   74.210 |     6.90 |   82.951 |    12.34 |  157.161 |     9.77 |
|   256 |     64 |    1 |    320 |   36.591 |     7.00 |    6.152 |    10.40 |   42.743 |     7.49 |
|   256 |     64 |    2 |    640 |   74.140 |     6.91 |   10.775 |    11.88 |   84.915 |     7.54 |
|   256 |     64 |    3 |    960 |  111.936 |     6.86 |   15.903 |    12.07 |  127.839 |     7.51 |
|   256 |     64 |    4 |   1280 |  150.739 |     6.79 |   20.831 |    12.29 |  171.570 |     7.46 |
|   256 |    128 |    1 |    384 |   36.584 |     7.00 |   12.359 |    10.36 |   48.943 |     7.85 |
|   256 |    128 |    2 |    768 |   74.294 |     6.89 |   21.733 |    11.78 |   96.027 |     8.00 |
|   256 |    128 |    3 |   1152 |  112.010 |     6.86 |   31.817 |    12.07 |  143.826 |     8.01 |
|   256 |    128 |    4 |   1536 |  150.994 |     6.78 |   42.019 |    12.18 |  193.013 |     7.96 |
|   256 |    256 |    1 |    512 |   36.614 |     6.99 |   24.766 |    10.34 |   61.380 |     8.34 |
|   256 |    256 |    2 |   1024 |   74.287 |     6.89 |   43.797 |    11.69 |  118.085 |     8.67 |
|   256 |    256 |    3 |   1536 |  112.089 |     6.85 |   64.478 |    11.91 |  176.566 |     8.70 |
|   256 |    256 |    4 |   2048 |  150.856 |     6.79 |   85.181 |    12.02 |  236.037 |     8.68 |
|   512 |     64 |    1 |    576 |   74.310 |     6.89 |    6.246 |    10.25 |   80.556 |     7.15 |
|   512 |     64 |    2 |   1152 |  150.789 |     6.79 |   11.167 |    11.46 |  161.956 |     7.11 |
|   512 |     64 |    3 |   1728 |  229.332 |     6.70 |   16.562 |    11.59 |  245.894 |     7.03 |
|   512 |     64 |    4 |   2304 |  310.210 |     6.60 |   22.083 |    11.59 |  332.294 |     6.93 |
|   512 |    128 |    1 |    640 |   74.204 |     6.90 |   12.524 |    10.22 |   86.728 |     7.38 |
|   512 |    128 |    2 |   1280 |  150.649 |     6.80 |   22.385 |    11.44 |  173.034 |     7.40 |
|   512 |    128 |    3 |   1920 |  229.219 |     6.70 |   33.261 |    11.54 |  262.480 |     7.31 |
|   512 |    128 |    4 |   2560 |  310.168 |     6.60 |   44.466 |    11.51 |  354.634 |     7.22 |
|   512 |    256 |    1 |    768 |   74.255 |     6.90 |   25.159 |    10.18 |   99.414 |     7.73 |
|   512 |    256 |    2 |   1536 |  150.639 |     6.80 |   45.051 |    11.36 |  195.690 |     7.85 |
|   512 |    256 |    3 |   2304 |  229.249 |     6.70 |   67.291 |    11.41 |  296.540 |     7.77 |
|   512 |    256 |    4 |   3072 |  310.155 |     6.60 |   90.179 |    11.36 |  400.334 |     7.67 |

For the use case I was benchmarking, my prompt was much longer than the generated response, so it might be similar to the scenario PP=256, TG=64 or PP=512, TG=64. I didn't realize it would be overall slower for a quantized model with OpenBLAS.

@ggerganov
Copy link
Owner

You can try the OpenBLAS bench with PR #4240, because currently on master it is not being called at all. But my expectation is that it would still be slower, because with quantum models we need to dequantize to F32, which is expensive.

So, based on the results, the performance indeed does not scale with more batches on these machines (the TG speed is roughly the same across B).
Can you also run the following memcpy benchmark to check the memory bandwidth:

git clone https://github.com/ggerganov/whisper.cpp && cd whisper.cpp && make -j bench && ./bench -w 1 -t 8

It can take about a minute or two to run.

@yanndupis

Thanks for the explanation @ggerganov and for continuing to look into it; it's super helpful.

Here is the output using the same instance:

git clone https://github.com/ggerganov/whisper.cpp && cd whisper.cpp && make -j bench && ./bench -w 1 -t 8

memcpy:    5.46 GB/s (heat-up)
memcpy:    5.47 GB/s ( 1 thread)
memcpy:    5.46 GB/s ( 1 thread)
memcpy:   10.70 GB/s ( 2 thread)
memcpy:   15.60 GB/s ( 3 thread)
memcpy:   21.20 GB/s ( 4 thread)
memcpy:   26.38 GB/s ( 5 thread)
memcpy:   30.72 GB/s ( 6 thread)
memcpy:   30.94 GB/s ( 7 thread)
memcpy:   40.19 GB/s ( 8 thread)
sum:    -5119996794.000000
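As a rough back-of-envelope check, assuming single-sequence token generation is bound by reading the full set of model weights from memory once per token (and taking the Q4_0 7B model to be roughly 4 GB, an approximation), the memcpy bandwidth above lines up with the ~10 t/s TG seen at B=1 in the tables:

# Back-of-envelope only: if each generated token streams the full model weights
# from memory, the memory bandwidth caps single-sequence TG speed.
mem_bandwidth_gb_s = 40.0  # measured above with 8 threads
model_size_gb = 4.1        # approximate size of mistral-7b Q4_0 (assumption)

tg_ceiling = mem_bandwidth_gb_s / model_size_gb
print(f"approx. TG ceiling at B=1: {tg_ceiling:.1f} tokens/s")  # ~9.8, close to the measured ~10.5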

@yanndupis

And here is the output of the OpenBLAS bench using PR #4240. The results definitely look better.

make clean && LLAMA_OPENBLAS=1 make -j
./batched-bench mistral-7b-instruct-v0.1.Q4_0.gguf 8192 0 0 0 128,256,512 64,128,256 1,2,3,4

|    PP |     TG |    B |   N_KV |   T_PP s | S_PP t/s |   T_TG s | S_TG t/s |      T s |    S t/s |
|-------|--------|------|--------|----------|----------|----------|----------|----------|----------|
|   128 |     64 |    1 |    192 |   22.274 |     5.75 |    6.203 |    10.32 |   28.477 |     6.74 |
|   128 |     64 |    2 |    384 |   25.250 |    10.14 |   10.749 |    11.91 |   36.000 |    10.67 |
|   128 |     64 |    3 |    576 |   28.381 |    13.53 |   15.625 |    12.29 |   44.006 |    13.09 |
|   128 |     64 |    4 |    768 |   32.481 |    15.76 |   20.318 |    12.60 |   52.799 |    14.55 |
|   128 |    128 |    1 |    256 |   21.942 |     5.83 |   12.378 |    10.34 |   34.320 |     7.46 |
|   128 |    128 |    2 |    512 |   25.361 |    10.09 |   21.520 |    11.90 |   46.882 |    10.92 |
|   128 |    128 |    3 |    768 |   28.353 |    13.54 |   31.315 |    12.26 |   59.668 |    12.87 |
|   128 |    128 |    4 |   1024 |   32.247 |    15.88 |   41.003 |    12.49 |   73.249 |    13.98 |
|   128 |    256 |    1 |    384 |   22.347 |     5.73 |   24.747 |    10.34 |   47.094 |     8.15 |
|   128 |    256 |    2 |    768 |   24.880 |    10.29 |   43.299 |    11.82 |   68.178 |    11.26 |
|   128 |    256 |    3 |   1152 |   28.059 |    13.69 |   63.231 |    12.15 |   91.290 |    12.62 |
|   128 |    256 |    4 |   1536 |   32.309 |    15.85 |   83.177 |    12.31 |  115.486 |    13.30 |
|   256 |     64 |    1 |    320 |   25.565 |    10.01 |    6.271 |    10.21 |   31.836 |    10.05 |
|   256 |     64 |    2 |    640 |   32.120 |    15.94 |   10.925 |    11.72 |   43.045 |    14.87 |
|   256 |     64 |    3 |    960 |   59.151 |    12.98 |   16.043 |    11.97 |   75.195 |    12.77 |
|   256 |     64 |    4 |   1280 |   66.605 |    15.37 |   21.009 |    12.19 |   87.614 |    14.61 |
|   256 |    128 |    1 |    384 |   25.329 |    10.11 |   12.443 |    10.29 |   37.773 |    10.17 |
|   256 |    128 |    2 |    768 |   32.508 |    15.75 |   21.944 |    11.67 |   54.452 |    14.10 |
|   256 |    128 |    3 |   1152 |   59.150 |    12.98 |   32.070 |    11.97 |   91.220 |    12.63 |
|   256 |    128 |    4 |   1536 |   66.429 |    15.41 |   42.258 |    12.12 |  108.687 |    14.13 |
|   256 |    256 |    1 |    512 |   25.014 |    10.23 |   24.961 |    10.26 |   49.976 |    10.24 |
|   256 |    256 |    2 |   1024 |   32.018 |    15.99 |   44.029 |    11.63 |   76.047 |    13.47 |
|   256 |    256 |    3 |   1536 |   59.238 |    12.96 |   64.758 |    11.86 |  123.996 |    12.39 |
|   256 |    256 |    4 |   2048 |   66.678 |    15.36 |   85.529 |    11.97 |  152.207 |    13.46 |
|   512 |     64 |    1 |    576 |   32.695 |    15.66 |    6.345 |    10.09 |   39.040 |    14.75 |
|   512 |     64 |    2 |   1152 |   66.773 |    15.34 |   11.294 |    11.33 |   78.067 |    14.76 |
|   512 |     64 |    3 |   1728 |  105.135 |    14.61 |   16.727 |    11.48 |  121.862 |    14.18 |
|   512 |     64 |    4 |   2304 |  141.769 |    14.45 |   22.249 |    11.51 |  164.018 |    14.05 |
|   512 |    128 |    1 |    640 |   32.249 |    15.88 |   12.661 |    10.11 |   44.910 |    14.25 |
|   512 |    128 |    2 |   1280 |   66.232 |    15.46 |   22.596 |    11.33 |   88.828 |    14.41 |
|   512 |    128 |    3 |   1920 |  103.211 |    14.88 |   33.480 |    11.47 |  136.691 |    14.05 |
|   512 |    128 |    4 |   2560 |  142.096 |    14.41 |   44.732 |    11.45 |  186.828 |    13.70 |
|   512 |    256 |    1 |    768 |   32.677 |    15.67 |   25.374 |    10.09 |   58.052 |    13.23 |
|   512 |    256 |    2 |   1536 |   66.787 |    15.33 |   45.389 |    11.28 |  112.176 |    13.69 |
|   512 |    256 |    3 |   2304 |  103.819 |    14.80 |   67.621 |    11.36 |  171.440 |    13.44 |
|   512 |    256 |    4 |   3072 |  141.340 |    14.49 |   90.657 |    11.30 |  231.997 |    13.24 |

@ggerganov
Owner

If you disable mmap and use 16 threads without OpenBLAS, it seems you can get the best performance on that instance:

diff --git a/examples/batched-bench/batched-bench.cpp b/examples/batched-bench/batched-bench.cpp
index 533c55c..277c901 100644
--- a/examples/batched-bench/batched-bench.cpp
+++ b/examples/batched-bench/batched-bench.cpp
@@ -89,6 +89,7 @@ int main(int argc, char ** argv) {
     llama_model_params model_params = llama_model_default_params();
 
     model_params.n_gpu_layers = n_gpu_layers;
+    model_params.use_mmap = false;
 
     llama_model * model = llama_load_model_from_file(params.model.c_str(), model_params);
 
@@ -104,8 +105,8 @@ int main(int argc, char ** argv) {
     ctx_params.n_batch   = 512;
     ctx_params.mul_mat_q = mmq;
 
-    ctx_params.n_threads       = params.n_threads;
-    ctx_params.n_threads_batch = params.n_threads_batch == -1 ? params.n_threads : params.n_threads_batch;
+    ctx_params.n_threads       = 16;
+    ctx_params.n_threads_batch = 16;
 
     llama_context * ctx = llama_new_context_with_model(model, ctx_params);
 
make -j batched-bench && ./batched-bench ./models/openhermes-2.5-mistral-7b.Q4_0.gguf 8192 0 0 0 256 64 1,2,3,4

|    PP |     TG |    B |   N_KV |   T_PP s | S_PP t/s |   T_TG s | S_TG t/s |      T s |    S t/s |
|-------|--------|------|--------|----------|----------|----------|----------|----------|----------|
|   256 |     64 |    1 |    320 |   16.815 |    15.22 |    4.977 |    12.86 |   21.792 |    14.68 |
|   256 |     64 |    2 |    640 |   34.168 |    14.98 |    9.319 |    13.74 |   43.486 |    14.72 |
|   256 |     64 |    3 |    960 |   51.462 |    14.92 |   13.814 |    13.90 |   65.276 |    14.71 |
|   256 |     64 |    4 |   1280 |   69.378 |    14.76 |   18.535 |    13.81 |   87.912 |    14.56 |

Though I would have expected it to scale better with the batch size. Not sure -- maybe I'm still missing something.

Btw, I also tried a similar Arm-based instance (c6gn.4xlarge) and it seems faster:

make -j batched-bench && ./batched-bench ./models/openhermes-2.5-mistral-7b.Q4_0.gguf 8192 0 0 0 256 64 1,2,3,4

|    PP |     TG |    B |   N_KV |   T_PP s | S_PP t/s |   T_TG s | S_TG t/s |      T s |    S t/s |
|-------|--------|------|--------|----------|----------|----------|----------|----------|----------|
|   256 |     64 |    1 |    320 |   12.630 |    20.27 |    3.608 |    17.74 |   16.238 |    19.71 |
|   256 |     64 |    2 |    640 |   25.660 |    19.95 |    7.022 |    18.23 |   32.683 |    19.58 |
|   256 |     64 |    3 |    960 |   38.759 |    19.81 |   10.527 |    18.24 |   49.286 |    19.48 |
|   256 |     64 |    4 |   1280 |   52.128 |    19.64 |   14.108 |    18.15 |   66.236 |    19.32 |

For comparison, here is how it scales on my AMD Ryzen 9 5950X 16-Core Processor at home:

make -j batched-bench && ./batched-bench ./models/openhermes-2.5-mistral-7b.Q4_0.gguf 8192 0 0 0 256 64 1,2,3,4

|    PP |     TG |    B |   N_KV |   T_PP s | S_PP t/s |   T_TG s | S_TG t/s |      T s |    S t/s |
|-------|--------|------|--------|----------|----------|----------|----------|----------|----------|
|   256 |     64 |    1 |    320 |    5.474 |    46.77 |    9.402 |     6.81 |   14.876 |    21.51 |
|   256 |     64 |    2 |    640 |   11.282 |    45.38 |    9.494 |    13.48 |   20.776 |    30.80 |
|   256 |     64 |    3 |    960 |   16.953 |    45.30 |    9.932 |    19.33 |   26.884 |    35.71 |
|   256 |     64 |    4 |   1280 |   23.334 |    43.88 |   10.460 |    24.47 |   33.794 |    37.88 |

Note how the TG time for 1,2,3,4 batches is almost constant - this is what we normally want.

@yanndupis

Excellent, thank you @ggerganov for sharing these findings. I will focus my efforts on Arm-based instances, then.
