
feat: HTTP server for streaming inferences #37

Closed
wants to merge 10 commits

Conversation

@jack-michaud commented Mar 18, 2023

llama-http

This follows the pattern in llamacord; a model is instantiated in a thread and inference requests are sent to this thread to be served sequentially. Tokens are sent back to requests one at a time.
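
(As a rough illustration of that pattern, and not the actual llama-http code: a minimal sketch using std channels, with a hypothetical InferenceRequest type and a fake token loop standing in for the real model call.)

```rust
use std::sync::mpsc;
use std::thread;

// Hypothetical request type: a prompt plus a channel to stream tokens back on.
struct InferenceRequest {
    prompt: String,
    token_tx: mpsc::Sender<String>,
}

// The model lives on a single thread for its whole lifetime; requests are
// queued onto it and served sequentially.
fn spawn_inference_thread() -> mpsc::Sender<InferenceRequest> {
    let (request_tx, request_rx) = mpsc::channel::<InferenceRequest>();
    thread::spawn(move || {
        // let model = load_model(...); // real code would instantiate the model here
        for request in request_rx {
            // Stand-in for inference: echo the prompt word by word. The real
            // loop would send each generated token as the model produces it.
            for token in request.prompt.split_inclusive(' ') {
                if request.token_tx.send(token.to_string()).is_err() {
                    break; // the caller went away, stop generating for this request
                }
            }
        }
    });
    request_tx
}

fn main() {
    let requests = spawn_inference_thread();
    let (token_tx, token_rx) = mpsc::channel();
    requests
        .send(InferenceRequest { prompt: "How are you?".into(), token_tx })
        .unwrap();
    // Tokens arrive one at a time, as they would over the HTTP stream.
    for token in token_rx {
        print!("{token}");
    }
}
```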

There are some issues with this -- namely, the model context does not seem to be reset between requests, so if you cancel your current request and make another one, the previous chain of thought continues into the new request. Not sure if there's a way around this yet; it's possible it's a bug in my code.

Instructions

To run llama-http, run the following from the root of the repository:

cargo r --release --bin llama-http -- --model-path <path-to-your-ggml-weights>

Some (but not all) of the usual CLI arguments are supported. Use --help to see all of them. -P can be used to set the port (e.g. -P 8080).

To use the /stream endpoint, send a POST request with a body containing an InferenceHttpRequest in JSON format. Here's an example curl command that does this:

$ curl -X POST http://localhost:8080/stream --data '{"prompt": "How are you?", "num_predict": 20}'

How
 are
 you
?
 I
’
ve
 missed
 you
!


I
 haven
’
t
 had
 much
 to
 say
 about
 the
 blog
 l
ately

@philpax (Collaborator) commented Mar 18, 2023

Huh, odd that the entire stream needs to be consumed. Not sure what's going on there - I thought perhaps wrap_stream might not do what we want here, but from what I can tell it should.

For what it's worth, manually iterating over the stream works for me in llamacord (it updates as new tokens come in):

https://github.com/philpax/llamacord/blob/bd63d6b6f689790d03b04672d5a6e87808aba9f7/src/main.rs#L401-L415

@jack-michaud (Author)

> Huh, odd that the entire stream needs to be consumed. Not sure what's going on there - I thought perhaps wrap_stream might not do what we want here, but from what I can tell it should.
>
> For what it's worth, manually iterating over the stream works for me in llamacord (it updates as new tokens come in):
>
> https://github.com/philpax/llamacord/blob/bd63d6b6f689790d03b04672d5a6e87808aba9f7/src/main.rs#L401-L415

Funny! It must be the wrap_stream method. It worked with an async mpsc receiver, so maybe I can make a shim with that.
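
(For reference, a minimal sketch of that kind of shim, assuming the handler is built on hyper 0.14's Body with the "stream" feature, since that's where wrap_stream comes from, plus tokio_stream's ReceiverStream; the streaming_body name is made up.)

```rust
use hyper::Body;
use tokio::sync::mpsc;
use tokio_stream::{wrappers::ReceiverStream, StreamExt as _};

// Turn an async mpsc receiver of tokens into a streaming response body.
// Body::wrap_stream wants a Stream of Result<impl Into<Bytes>, E>.
fn streaming_body(token_rx: mpsc::Receiver<String>) -> Body {
    let stream = ReceiverStream::new(token_rx)
        .map(Ok::<String, std::convert::Infallible>);
    Body::wrap_stream(stream)
}
```

The returned Body goes straight into the Response; the sending half of the channel is handed to the inference thread, which sends each token as it is generated.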

@hlhr202 (Contributor) commented Mar 19, 2023

> Huh, odd that the entire stream needs to be consumed. Not sure what's going on there - I thought perhaps wrap_stream might not do what we want here, but from what I can tell it should.
>
> For what it's worth, manually iterating over the stream works for me in llamacord (it updates as new tokens come in):
>
> https://github.com/philpax/llamacord/blob/bd63d6b6f689790d03b04672d5a6e87808aba9f7/src/main.rs#L401-L415

I'm using flume with the actix async executor (if I'm not wrong, it should be tokio behind it).
I followed your approach by spawning a new actix task that loops over the flume receiver, but the tokens are not arriving in real time.
I still can't make it work.

[screenshot]
Can you help me figure out the problem, please? Thanks!

@jack-michaud (Author)

@hlhr202 is your first actix executor putting tokens onto the stream?

It would help if I wasn't blocking the thread by synchronously waiting for the model generation to stop 😆 Async is tough.

@hlhr202 (Contributor) commented Mar 19, 2023

> @hlhr202 is your first actix executor putting tokens onto the stream?
>
> It would help if I wasn't blocking the thread by synchronously waiting for the model generation to stop 😆 Async is tough.

@jack-michaud Yes, I create the llama instance in my first actix executor, and I give it a receiver so I can pass in inference queries from another thread. I also give the llama instance a sender so it can push inference messages out to other threads. But I still can't receive the inference messages in real time...
It looks like llama-rs only puts the string into a buffer during inference instead of emitting it; that's why we have to flush stdout to see real-time generation in the console.
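
(A minimal sketch of the shape I think you're describing, assuming actix-web 4, flume's "async" feature, and the futures crate; run_inference and the reduced InferenceHttpRequest here are simplified stand-ins, with the model loop kept on its own thread so it can't block the executor.)

```rust
use actix_web::{web, HttpResponse};
use bytes::Bytes;
use futures::StreamExt;

#[derive(serde::Deserialize)]
struct InferenceHttpRequest {
    prompt: String,
}

// Stand-in for the real llama-rs call: sends each "token" as it is produced.
fn run_inference(prompt: String, tx: flume::Sender<String>) {
    for token in prompt.split_inclusive(' ') {
        let _ = tx.send(token.to_string());
    }
}

async fn stream_handler(req: web::Json<InferenceHttpRequest>) -> HttpResponse {
    let (tx, rx) = flume::unbounded::<String>();

    // Keep the blocking model loop off the async executor; if it runs on the
    // same executor, the streaming response can't make progress until it ends.
    let prompt = req.into_inner().prompt;
    std::thread::spawn(move || run_inference(prompt, tx));

    // flume's into_stream() turns the receiver into a futures Stream, and
    // actix's .streaming() wants Result<Bytes, E> items.
    let body = rx
        .into_stream()
        .map(|token| Ok::<_, std::convert::Infallible>(Bytes::from(token)));
    HttpResponse::Ok().streaming(body)
}
```

If tokens still only show up at the end when testing with curl, it may just be curl's output buffering (try -N / --no-buffer) rather than the server.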

@jack-michaud (Author) commented Mar 19, 2023

That sounds similar to the discord bot's architecture. Maybe it's the executor that we are using. I tried creating a session in a tokio executor, and it still wasn't sending tokens immediately. The discord bot uses std::thread::spawn.

In my case, it looks like I'm receiving tokens in the main thread, but Body::wrap_stream doesn't want to send tokens live from this stream. For me, printing out the tokens from the stream directly does show the new tokens in real time, but they do not get sent over HTTP.

@jack-michaud (Author)

Oh my god. I added a newline to the tokens, and it started sending over the network. The lack of a newline is why you need std::io::stdout().flush() in the first place too. All my time in Python land is like brain rot for my low-level programming.

@jack-michaud changed the title from "WIP HTTP server for streaming inferences" to "feat: HTTP server for streaming inferences" on Mar 19, 2023
@jack-michaud marked this pull request as ready for review on March 19, 2023 14:24
@setzer22 (Collaborator) left a comment

LGTM 😄 Other than a few very minor comments. But I don't have that much experience creating Rust server backends, so I'll let others with more experience comment on that.

I think having an HTTP server like this is really important for enabling new use cases for llama-rs; thanks a lot for working on this!

Review comments (outdated, resolved): llama-http/src/cli_args.rs, llama-http/src/inference.rs

@tarruda commented Mar 24, 2023

> There are some issues with this -- namely, the model context does not seem to be reset between requests, so if you cancel your current request and make another one, the previous chain of thought continues into the new request. Not sure if there's a way around this yet

I'm also not sure if there's a way around this at the ggml API level, but a very simple workaround is to serve each request in a forked process instead of a thread. The drawback is that win32 is not supported, but it can work via WSL2.

@setzer22 (Collaborator)

> Namely, the model context does not seem to be reset between requests

The best way to handle this is to create one new InferenceSession per request. An inference session entirely encapsulates the mutable state of the transformer, so managing these is the key to keeping per-user state. You can create a new session every time and it will "forget" all previous input, or even offer an API where users can create sessions and submit multiple requests to them.

And we will probably need some sort of Mutex to ensure that not too many requests are started in parallel. Maybe our HTTP framework already supports this? For most machines, the limit would be serving one inference request at a time.

Also keep in mind each inference session adds ~500MB to memory consumption (it stores the full context window tensors), so we need to make sure to clean them up and to not start too many of them.
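
(One possible shape for that, sketched with a tokio Semaphore rather than a Mutex; Model, InferenceSession, and start_session here are placeholders, not the actual llama-rs API, and SessionManager/SessionGuard are made-up names.)

```rust
use std::sync::Arc;
use tokio::sync::{OwnedSemaphorePermit, Semaphore};

// Placeholders for the real llama-rs types.
struct Model;
struct InferenceSession;
impl Model {
    fn start_session(&self) -> InferenceSession {
        InferenceSession
    }
}

/// A fresh session per request, so no state leaks in from previous prompts.
/// The permit is held for the session's lifetime: dropping the guard frees
/// both the context tensors and the concurrency slot.
struct SessionGuard {
    pub session: InferenceSession,
    _permit: OwnedSemaphorePermit,
}

struct SessionManager {
    model: Arc<Model>,
    slots: Arc<Semaphore>,
}

impl SessionManager {
    fn new(model: Arc<Model>, max_concurrent: usize) -> Self {
        Self {
            model,
            slots: Arc::new(Semaphore::new(max_concurrent)),
        }
    }

    /// Waits until one of the `max_concurrent` slots is free, then creates a
    /// brand-new session for this request.
    async fn acquire(&self) -> SessionGuard {
        let permit = self
            .slots
            .clone()
            .acquire_owned()
            .await
            .expect("semaphore closed");
        SessionGuard {
            session: self.model.start_session(),
            _permit: permit,
        }
    }
}
```

With max_concurrent set to 1, this degenerates to exactly the "one inference request at a time" case described above.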

@jack-michaud (Author)

I've incorporated your feedback, @setzer22. I also now have a separate inference session for each request.

I've created an InferenceSessionManager whose purpose is to limit the number of sessions created (i.e. lock until one is available). This is created in the "Inference Thread" (done), then an inference session for each "Inference Request Thread" will be created from that manager. Currently, there is no separate inference request thread that handles each request; I haven't gotten to that point yet, since ggml tensors cannot be sent between threads safely (wrapping with an Arc + Mutex didn't seem to help, but I wouldn't be surprised if there was user error here).

[architecture diagram]

(This is my imagined architecture: again, "Inference Request Thread" does not exist yet.)

I'd suggest merging in the single-threaded state (since most machines will only handle one session at a time) and then adding this in a later PR, along with a max_requests option in CLI_ARGS:

/// Maximum concurrent requests.
/// Keep in mind that each concurrent request will have
/// its own model instance, so this will affect memory usage.
#[arg(long, default_value_t = 1)]
pub max_requests: usize,

@KerfuffleV2 (Contributor)

Just so it doesn't come out of the blue, I've been looking at doing something related. I've been considering creating a backend that would be transport-agnostic (the idea is just to use Tokio channels or something and then other stuff could potentially use it for implementing HTTP or whatever).

My idea is just to have a task that keeps track of jobs or sessions for the model and can interpret them in a round-robin kind of way (at least initially), which would allow having multiple inference sessions/models going at the same time.

The big thing I want to do is have a way to easily pause/snapshot sessions, specifically when the model runs into certain tokens. This should allow stuff like setting bookmarks and rewinding, which I think one can do some pretty interesting things with...

This is a sketch of the kind of interface I was thinking of: https://github.com/KerfuffleV2/llm-backend/blob/main/src/types.rs

I'm pretty sure it can be done as an independent project. Anyway, hope this doesn't seem like it's trying to steal your thunder or anything like that. (Also, if there's anything you can find a use for in that repo, feel free.)
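
(To make the round-robin idea concrete, a reduced, hypothetical sketch of that kind of channel-driven backend task, not the interface from the linked repo; the token generation is a stand-in for a real inference step.)

```rust
use tokio::sync::mpsc;

// Reduced, hypothetical job type: any frontend (HTTP, Discord, ...) submits
// one of these over a channel and reads tokens back off `tokens`. When the
// sender is dropped, the frontend knows the job is finished.
struct Job {
    prompt: String,
    tokens: mpsc::UnboundedSender<String>,
}

// Round-robin scheduler: each pass produces one "token" per active job, so
// several sessions make progress without a thread per request.
async fn backend_task(mut new_jobs: mpsc::UnboundedReceiver<Job>) {
    struct Active {
        job: Job,
        remaining: Vec<String>,
    }
    let mut active: Vec<Active> = Vec::new();

    loop {
        // Pick up newly submitted jobs without blocking generation.
        while let Ok(job) = new_jobs.try_recv() {
            let remaining = job.prompt.split_inclusive(' ').rev().map(str::to_owned).collect();
            active.push(Active { job, remaining });
        }

        if active.is_empty() {
            // Idle: block until the next job arrives, or exit once every
            // frontend has dropped its sender.
            match new_jobs.recv().await {
                Some(job) => {
                    let remaining =
                        job.prompt.split_inclusive(' ').rev().map(str::to_owned).collect();
                    active.push(Active { job, remaining });
                }
                None => return,
            }
        }

        // Stand-in for one real inference step per job per pass.
        active.retain_mut(|a| match a.remaining.pop() {
            Some(tok) => a.job.tokens.send(tok).is_ok(),
            None => false, // finished (or cancelled) jobs drop out here
        });

        // Let other tasks run between steps; a real implementation would be
        // doing actual (likely blocking) model work here instead.
        tokio::task::yield_now().await;
    }
}
```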

@philpax (Collaborator) commented May 5, 2023

Hiya! I'm going to close this PR as there have been some pretty major changes in the meantime. However, I think there's still very much a place for an open-source HTTP inference server built on top of llm.

If you're still interested in working on this, I'd love to see this reimplemented as a separate repository that targets llm.
