PoC: server handling multiple clients with custom attention mask api #3490

FSSRepo · 2023-10-05T19:14:23Z

From #3462, I wanted to update my fork to the latest changes from the master branch, but it went wrong :(.

Hello, I know it's something no one asked for, but some of us need it. Here's a proof of concept of a server that handles multiple clients simultaneously, thanks to the new way of working with the KV cache.

Some may wonder why reimplementing this proposal in a separate example. The current implementation of the server is quite complex, and many things could break.

Tested on:

Windows 11 x64
Intel Core i5 11400H 6 C / 12 T
RTX 3050 laptop 4 GB VRAM
16 GB of RAM DDR4 3200MHz

Server.Parallel.Improvements.mp4

This is a proof of concept for now, with some feedback and assistance, we could make it more usable.

Here is the command to start the server:

./server-parallel -m models/7B/ggml-model.gguf --ctx_size 2048 -t 4 -ngl 33 --batch-size 512 --parallel 3 -n 512 --cont-batching

Modify --parallel to the number of slots to process clients requests.

Edit:

New video showing 4 clients at same time, my laptop almost exploded 😂.
Improve the PR note.

Lastest changes:

Improved README
Use a custom system prompt and change it on runtime

Note:

Many people are going to want to kill me when they see how I handle multithreading without using mutex; I never knew what that was for :(.

ggerganov · 2023-10-06T11:35:09Z

Works great! Here is another demo on M1 Pro with F16 7B:

server-parallel-0.mp4

@FSSRepo Can you allow pushes to your branch so I can push some fixes:

$ ▶ git push FSSRepo HEAD:fixes 
error: Authentication error: Authentication required: You must have push access to verify locks
error: failed to push some refs to 'https://github.com/FSSRepo/llama.cpp'

server-parallel : add "--reverse-prompt" + compiler warning fixes

examples/server-parallel/server.cpp

FSSRepo · 2023-10-06T13:44:08Z

@FSSRepo Can you allow pushes to your branch so I can push some fixes:

I don't have some restriction in that branch.

@ggerganov can you test full offloading model in your mac please? for confirm if there is some bug in faster generation scenario.

kiratp · 2023-10-06T23:01:04Z

First off - this is awesome! Thank you @FSSRepo!

I am going to take the shameless opportunity here to request that adding support for speculative execution be considered here - that would make the first and only OSS LLM server I have come across that supports it out of the box!

Seikho · 2023-10-07T03:01:18Z

Can this support the same generation parameters that are used in the existing example server?

ggerganov · 2023-10-08T10:43:54Z

Here is a 30B LLaMA Q4_0 serving 4 clients on M2 Ultra:

server-parallel-1.mp4

(I tried 7B and 13B, but the generation is so fast that I am not able to start the requests in parallel)

FSSRepo · 2023-10-08T18:51:21Z

(I tried 7B and 13B, but the generation is so fast that I am not able to start the requests in parallel)

I want to add the option to cancel the streaming but when I use AbortControl in the frontend it causes error, I'm thinking add a get endpoint stop?slot=id to request release the sequence.

ggerganov · 2023-10-08T19:31:39Z

And one more example using the original server UI with 70B Q8_0 model:

server-parallel-2.mp4

FSSRepo · 2023-10-08T19:41:07Z

And one more example using the original server UI with 70B Q8_0

Are you working in implement that ui or just reuse the endpoints?

ggerganov · 2023-10-08T19:49:49Z

Are you working in implement that ui

No, I just noticed it works and gave it a try. I thought you implemented it.

I want to add the option to cancel the streaming but when I use AbortControl in the frontend it causes error, I'm thinking add a get endpoint stop?slot=id to request release the sequence.

What kind of error?

AdityaSher · 2023-10-08T20:06:15Z

How is this working? Are the instances sharing the same weights or the model needs to be loaded N times?

FSSRepo · 2023-10-08T20:12:23Z

How is this working? Are the instances sharing the same weights or the model needs to be loaded N times?

Just divides the kv cache (context size) to a number of sequences, the limit is the context size because it is shared between clients.

This not reload the model, just once.

FSSRepo · 2023-10-08T20:53:30Z

No, I just noticed it works and gave it a try. I thought you implemented it.

Only had reused the completion.js function. 😂

FSSRepo · 2023-10-08T21:34:05Z

What kind of error?

 controller = new AbortController();
    const response = await fetch("http://localhost:8080/completion", {
      method: "POST",
      body: JSON.stringify(options),
      headers: {
        Connection: "keep-alive",
        "Content-Type": "application/json",
        Accept: "text/event-stream",
      },
      signal: controller.signal,
    });

function cancel() {
  if(controller) {
/* Anyway, even though I aborted it, the slot doesn't release, it continues to generate because the stream doesn't receive the signal that the connection was closed. 
Easy fix: create a stop endpoint to notify to slot release
*/
    controller.abort(); // when i call this function i get DOMException
  }
}

KerfuffleV2 · 2023-10-08T23:16:12Z

when I use AbortControl in the frontend it causes error,

"Note: When abort() is called, the fetch() promise rejects with a DOMException named AbortError." — https://developer.mozilla.org/en-US/docs/Web/API/AbortController

So that's probably normal. You can possibly try to catch the exception, like in the example there. I wouldn't think that would cause the connection not to abort properly though, so it not stopping generation on the server side is probably a different problem.

FSSRepo · 2023-10-08T23:22:14Z

so it not stopping generation on the server side is probably a different problem.

I will fix it adding a GET endpoint stop_completion to notify the server release the slot.

FSSRepo · 2023-10-09T02:48:02Z

@ggerganov can you merge the latest changes of master in this branch, please. I'm afraid to make a mistake again.

examples/server-parallel/server.cpp

ggerganov · 2023-10-11T07:11:25Z

This example is very nice to have, but I'm thinking if we should try to directly implement the functionality in the existing server example. I'm not sure if I'll have time to do it myself though and I don't want to delay this super long. Also, the current implementation keeps one CPU core at 100% all the time which is not desirable.

I'll give this PR some more time to see if people would be interested in improving this. If not, we will probably merge this and add an item on the roadmap for the future.

FSSRepo · 2023-10-11T12:09:56Z

This example is very nice to have, but I'm thinking if we should try to directly implement the functionality in the existing server example. I'm not sure if I'll have time to do it myself though and I don't want to delay this super long. Also, the current implementation keeps one CPU core at 100% all the time which is not desirable.

I'll give this PR some more time to see if people would be interested in improving this. If not, we will probably merge this and add an item on the roadmap for the future.

I will do it, i just want that you push the latest changes of master in my branch fixes, I don't know how to do that. Or if you teach me, I will be glad.

cebtenzzre · 2023-10-11T13:45:10Z

Or if you teach me, I will be glad.

You can add this repo as a remote in your fork (git remote add upstream https://github.com/ggerganov/llama.cpp.git) and then pull from it with your 'fixes' branch checked out (git pull --no-rebase upstream master).

…fixes

FSSRepo · 2023-10-11T17:04:59Z

@cebtenzzre Thank you so much!

FSSRepo · 2023-10-13T18:35:11Z

@ggerganov "Do you want me to close this PR? I initially proposed this as a simple example to avoid the complexity of the server example.

ggerganov · 2023-10-17T16:32:58Z

@FSSRepo Let's see if we can make the other PR work and if so, we will close this one

ggerganov · 2023-10-22T19:54:00Z

Superseded #3677

FSSRepo and others added 4 commits October 5, 2023 15:12

server handling multiple clients with cam

a7a6ceb

remove trail whitespace

eb75395

fix json format README

afc09db

server-parallel : add "--reverse-prompt" + compiler warning fixes

5ab6c21

Merge pull request #4 from ggerganov/server-parallel

bb093eb

server-parallel : add "--reverse-prompt" + compiler warning fixes

ggerganov reviewed Oct 6, 2023

View reviewed changes

examples/server-parallel/server.cpp Outdated Show resolved Hide resolved

httplib.h json.hpp -> common lib

c12e18f

FSSRepo added 7 commits October 6, 2023 09:53

ci: wrong indent style fixed

c71d933

added cors middleware

cdceda3

fix makefile server build

f0c646f

log sys - build info + rnd seed

6a5d673

example added to makefile

2fdc181

improve README + more questions

c1ac53f

improved token gen logic and limits

a8435c3

nivibilla mentioned this pull request Oct 7, 2023

Parallel decoding turboderp/exllamav2#95

Closed

Merge branch 'master' into HEAD

2f7f634

gitignore : server-parallel

f861ff9

FSSRepo mentioned this pull request Oct 8, 2023

Fix mirostat state when using multiple sequences #3543

Merged

nivibilla mentioned this pull request Oct 8, 2023

does vllm support call generate concurrent in multithreading? vllm-project/vllm#1285

Closed

FSSRepo added 2 commits October 8, 2023 22:30

fix cors + regen + cancel funcs

8a8535b

remove useless line

c8d7b1b

jhen0409 reviewed Oct 9, 2023

View reviewed changes

examples/server-parallel/server.cpp Outdated Show resolved Hide resolved

FSSRepo added 3 commits October 9, 2023 07:53

fixed cancel + removed useless code

59e7c0c

refactored some issues

8d3681d

fix unexpected behavior when multiple requests are canceled

6a2e064

FSSRepo added 2 commits October 11, 2023 12:22

Merge branch 'master' of https://github.com/ggerganov/llama.cpp into …

32b237a

…fixes

avoid 100% cpu usage all time

e86a7d2

ggerganov closed this Oct 22, 2023

FSSRepo deleted the fixes branch October 22, 2023 20:01

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

PoC: server handling multiple clients with custom attention mask api #3490

PoC: server handling multiple clients with custom attention mask api #3490

FSSRepo commented Oct 5, 2023 •

edited

Loading

ggerganov commented Oct 6, 2023

FSSRepo commented Oct 6, 2023 •

edited

Loading

kiratp commented Oct 6, 2023 •

edited

Loading

Seikho commented Oct 7, 2023

ggerganov commented Oct 8, 2023

FSSRepo commented Oct 8, 2023

ggerganov commented Oct 8, 2023

FSSRepo commented Oct 8, 2023

ggerganov commented Oct 8, 2023 •

edited

Loading

AdityaSher commented Oct 8, 2023

FSSRepo commented Oct 8, 2023

FSSRepo commented Oct 8, 2023

FSSRepo commented Oct 8, 2023 •

edited

Loading

KerfuffleV2 commented Oct 8, 2023

FSSRepo commented Oct 8, 2023

FSSRepo commented Oct 9, 2023 •

edited

Loading

ggerganov commented Oct 11, 2023

FSSRepo commented Oct 11, 2023 •

edited

Loading

cebtenzzre commented Oct 11, 2023

FSSRepo commented Oct 11, 2023

FSSRepo commented Oct 13, 2023

ggerganov commented Oct 17, 2023

ggerganov commented Oct 22, 2023

PoC: server handling multiple clients with custom attention mask api #3490

PoC: server handling multiple clients with custom attention mask api #3490

Conversation

FSSRepo commented Oct 5, 2023 • edited Loading

ggerganov commented Oct 6, 2023

FSSRepo commented Oct 6, 2023 • edited Loading

kiratp commented Oct 6, 2023 • edited Loading

Seikho commented Oct 7, 2023

ggerganov commented Oct 8, 2023

FSSRepo commented Oct 8, 2023

ggerganov commented Oct 8, 2023

FSSRepo commented Oct 8, 2023

ggerganov commented Oct 8, 2023 • edited Loading

AdityaSher commented Oct 8, 2023

FSSRepo commented Oct 8, 2023

FSSRepo commented Oct 8, 2023

FSSRepo commented Oct 8, 2023 • edited Loading

KerfuffleV2 commented Oct 8, 2023

FSSRepo commented Oct 8, 2023

FSSRepo commented Oct 9, 2023 • edited Loading

ggerganov commented Oct 11, 2023

FSSRepo commented Oct 11, 2023 • edited Loading

cebtenzzre commented Oct 11, 2023

FSSRepo commented Oct 11, 2023

FSSRepo commented Oct 13, 2023

ggerganov commented Oct 17, 2023

ggerganov commented Oct 22, 2023

FSSRepo commented Oct 5, 2023 •

edited

Loading

FSSRepo commented Oct 6, 2023 •

edited

Loading

kiratp commented Oct 6, 2023 •

edited

Loading

ggerganov commented Oct 8, 2023 •

edited

Loading

FSSRepo commented Oct 8, 2023 •

edited

Loading

FSSRepo commented Oct 9, 2023 •

edited

Loading

FSSRepo commented Oct 11, 2023 •

edited

Loading