
feat: HTTP server for streaming inferences #37

Closed
wants to merge 10 commits

Conversation

@jack-michaud commented Mar 18, 2023

llama-http

This follows the pattern in llamacord; a model is instantiated in a thread and inference requests are sent to this thread to be served sequentially. Tokens are sent back to requests one at a time.
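
(As a rough illustration of that pattern, and not the actual llama-http code: a minimal sketch using std channels, with a hypothetical InferenceRequest type and a fake token loop standing in for the real model call.)

```rust
use std::sync::mpsc;
use std::thread;

// Hypothetical request type: a prompt plus a channel to stream tokens back on.
struct InferenceRequest {
    prompt: String,
    token_tx: mpsc::Sender<String>,
}

// The model lives on a single thread for its whole lifetime; requests are
// queued onto it and served sequentially.
fn spawn_inference_thread() -> mpsc::Sender<InferenceRequest> {
    let (request_tx, request_rx) = mpsc::channel::<InferenceRequest>();
    thread::spawn(move || {
        // let model = load_model(...); // real code would instantiate the model here
        for request in request_rx {
            // Stand-in for inference: echo the prompt word by word. The real
            // loop would send each generated token as the model produces it.
            for token in request.prompt.split_inclusive(' ') {
                if request.token_tx.send(token.to_string()).is_err() {
                    break; // the caller went away, stop generating for this request
                }
            }
        }
    });
    request_tx
}

fn main() {
    let requests = spawn_inference_thread();
    let (token_tx, token_rx) = mpsc::channel();
    requests
        .send(InferenceRequest { prompt: "How are you?".into(), token_tx })
        .unwrap();
    // Tokens arrive one at a time, as they would over the HTTP stream.
    for token in token_rx {
        print!("{token}");
    }
}
```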

There are some issues with this -- namely, the model context does not seem to be reset between requests, so if you cancel your current request and make another one, the previous chain of thought continues into the new request. Not sure if there's a way around this yet; it's possible it's a bug in my code.

Instructions

To run llama-http, run the following from the root of the repository:

cargo r --release --bin llama-http -- --model-path <path-to-your-ggml-weights>

Some (but not all) of the usual CLI arguments are supported. Use --help to see all of them. -P can be used to set the port (e.g. -P 8080).

To use the /stream endpoint, send a POST request with a body containing an InferenceHttpRequest in JSON format. Here's an example curl command that does this:

$ curl -X POST http://localhost:8080/stream --data '{"prompt": "How are you?", "num_predict": 20}'

How
 are
 you
?
 I
’
ve
 missed
 you
!


I
 haven
’
t
 had
 much
 to
 say
 about
 the
 blog
 l
ately

@philpax (Collaborator) commented Mar 18, 2023

Huh, odd that the entire stream needs to be consumed. Not sure what's going on there - I thought perhaps wrap_stream might not do what we want here, but from what I can tell it should.

For what it's worth, manually iterating over the stream works for me in llamacord (it updates as new tokens come in):

https://github.com/philpax/llamacord/blob/bd63d6b6f689790d03b04672d5a6e87808aba9f7/src/main.rs#L401-L415

@jack-michaud (Author)

> Huh, odd that the entire stream needs to be consumed. Not sure what's going on there - I thought perhaps wrap_stream might not do what we want here, but from what I can tell it should.
>
> For what it's worth, manually iterating over the stream works for me in llamacord (it updates as new tokens come in):
>
> https://github.com/philpax/llamacord/blob/bd63d6b6f689790d03b04672d5a6e87808aba9f7/src/main.rs#L401-L415

Funny! It must be the wrap_stream method. It worked with an async mpsc receiver, so maybe I can make a shim with that.
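
(For reference, a minimal sketch of that kind of shim, assuming the handler is built on hyper 0.14's Body with the "stream" feature, since that's where wrap_stream comes from, plus tokio_stream's ReceiverStream; the streaming_body name is made up.)

```rust
use hyper::Body;
use tokio::sync::mpsc;
use tokio_stream::{wrappers::ReceiverStream, StreamExt as _};

// Turn an async mpsc receiver of tokens into a streaming response body.
// Body::wrap_stream wants a Stream of Result<impl Into<Bytes>, E>.
fn streaming_body(token_rx: mpsc::Receiver<String>) -> Body {
    let stream = ReceiverStream::new(token_rx)
        .map(Ok::<String, std::convert::Infallible>);
    Body::wrap_stream(stream)
}
```

The returned Body goes straight into the Response; the sending half of the channel is handed to the inference thread, which sends each token as it is generated.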

@hlhr202 (Contributor) commented Mar 19, 2023

> Huh, odd that the entire stream needs to be consumed. Not sure what's going on there - I thought perhaps wrap_stream might not do what we want here, but from what I can tell it should.
>
> For what it's worth, manually iterating over the stream works for me in llamacord (it updates as new tokens come in):
>
> https://github.com/philpax/llamacord/blob/bd63d6b6f689790d03b04672d5a6e87808aba9f7/src/main.rs#L401-L415

I'm using flume with the actix async executor (if I'm not wrong, it should be tokio behind it).
I followed your approach by spawning a new actix task that loops over the flume receiver, but the tokens are not arriving in real time.
I still can't make it work.

[screenshot]
Can you help me figure out the problem, please? Thanks!

@jack-michaud (Author)

@hlhr202 is your first actix executor putting tokens onto the stream?

It would help if I wasn't blocking the thread by synchronously waiting for the model generation to stop 😆 Async is tough.

@hlhr202 (Contributor) commented Mar 19, 2023

> @hlhr202 is your first actix executor putting tokens onto the stream?
>
> It would help if I wasn't blocking the thread by synchronously waiting for the model generation to stop 😆 Async is tough.

@jack-michaud Yes, I create the llama instance in my first actix executor, and I give it a receiver so I can pass in inference queries from another thread. I also give the llama instance a sender so it can push inference messages out to other threads. But I still can't receive the inference messages in real time...
It looks like llama-rs only puts the string into a buffer during inference instead of emitting it; that's why we have to flush stdout to see real-time generation in the console.
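
(A minimal sketch of the shape I think you're describing, assuming actix-web 4, flume's "async" feature, and the futures crate; run_inference and the reduced InferenceHttpRequest here are simplified stand-ins, with the model loop kept on its own thread so it can't block the executor.)

```rust
use actix_web::{web, HttpResponse};
use bytes::Bytes;
use futures::StreamExt;

#[derive(serde::Deserialize)]
struct InferenceHttpRequest {
    prompt: String,
}

// Stand-in for the real llama-rs call: sends each "token" as it is produced.
fn run_inference(prompt: String, tx: flume::Sender<String>) {
    for token in prompt.split_inclusive(' ') {
        let _ = tx.send(token.to_string());
    }
}

async fn stream_handler(req: web::Json<InferenceHttpRequest>) -> HttpResponse {
    let (tx, rx) = flume::unbounded::<String>();

    // Keep the blocking model loop off the async executor; if it runs on the
    // same executor, the streaming response can't make progress until it ends.
    let prompt = req.into_inner().prompt;
    std::thread::spawn(move || run_inference(prompt, tx));

    // flume's into_stream() turns the receiver into a futures Stream, and
    // actix's .streaming() wants Result<Bytes, E> items.
    let body = rx
        .into_stream()
        .map(|token| Ok::<_, std::convert::Infallible>(Bytes::from(token)));
    HttpResponse::Ok().streaming(body)
}
```

If tokens still only show up at the end when testing with curl, it may just be curl's output buffering (try -N / --no-buffer) rather than the server.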

@jack-michaud (Author) commented Mar 19, 2023

That sounds similar to the discord bot's architecture. Maybe it's the executor that we are using. I tried creating a session in a tokio executor, and it still wasn't sending tokens immediately. The discord bot uses std::thread::spawn.

In my case, it looks like I'm receiving tokens in the main thread, but Body::wrap_stream doesn't want to send tokens live from this stream. For me, printing out the tokens from the stream directly does show the new tokens in real time, but they do not get sent over HTTP.

@jack-michaud (Author)

Oh my god. I added a newline to the tokens, and it started sending over the network. The lack of a newline is why you need std::io::stdout().flush() in the first place too. All my time in Python land is like brain rot for my low-level programming.

@jack-michaud changed the title from "WIP HTTP server for streaming inferences" to "feat: HTTP server for streaming inferences" on Mar 19, 2023
@jack-michaud marked this pull request as ready for review on March 19, 2023 14:24
@setzer22 (Collaborator) left a comment

LGTM 😄 Other than a few very minor comments. But I don't have that much experience creating Rust server backends, so I'll let others with more experience comment on that.

I think having an HTTP server like this is really important for enabling new use cases for llama-rs; thanks a lot for working on this!

Review comments (outdated, resolved): llama-http/src/cli_args.rs, llama-http/src/inference.rs

@tarruda commented Mar 24, 2023

> There are some issues with this -- namely, the model context does not seem to be reset between requests, so if you cancel your current request and make another one, the previous chain of thought continues into the new request. Not sure if there's a way around this yet

I'm also not sure if there's a way around this at the ggml API level, but a very simple workaround is to serve each request in a forked process instead of a thread. The drawback is that win32 is not supported, but it can work via WSL2.

@setzer22 (Collaborator)

> Namely, the model context does not seem to be reset between requests

The best way to handle this is to create one new InferenceSession per request. An inference session entirely encapsulates the mutable state of the transformer, so managing these is the key to keeping per-user state. You can create a new session every time and it will "forget" all previous input, or even offer an API where users can create sessions and submit multiple requests to them.

And we will probably need some sort of Mutex to ensure that not too many requests are started in parallel. Maybe our HTTP framework already supports this? For most machines, the limit would be serving one inference request at a time.

Also keep in mind each inference session adds ~500MB to memory consumption (it stores the full context window tensors), so we need to make sure to clean them up and to not start too many of them.
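
(One possible shape for that, sketched with a tokio Semaphore rather than a Mutex; Model, InferenceSession, and start_session here are placeholders, not the actual llama-rs API, and SessionManager/SessionGuard are made-up names.)

```rust
use std::sync::Arc;
use tokio::sync::{OwnedSemaphorePermit, Semaphore};

// Placeholders for the real llama-rs types.
struct Model;
struct InferenceSession;
impl Model {
    fn start_session(&self) -> InferenceSession {
        InferenceSession
    }
}

/// A fresh session per request, so no state leaks in from previous prompts.
/// The permit is held for the session's lifetime: dropping the guard frees
/// both the context tensors and the concurrency slot.
struct SessionGuard {
    pub session: InferenceSession,
    _permit: OwnedSemaphorePermit,
}

struct SessionManager {
    model: Arc<Model>,
    slots: Arc<Semaphore>,
}

impl SessionManager {
    fn new(model: Arc<Model>, max_concurrent: usize) -> Self {
        Self {
            model,
            slots: Arc::new(Semaphore::new(max_concurrent)),
        }
    }

    /// Waits until one of the `max_concurrent` slots is free, then creates a
    /// brand-new session for this request.
    async fn acquire(&self) -> SessionGuard {
        let permit = self
            .slots
            .clone()
            .acquire_owned()
            .await
            .expect("semaphore closed");
        SessionGuard {
            session: self.model.start_session(),
            _permit: permit,
        }
    }
}
```

With max_concurrent set to 1, this degenerates to exactly the "one inference request at a time" case described above.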

@jack-michaud (Author)

I've incorporated your feedback, @setzer22. I also now have a separate inference session for each request.

I've created an InferenceSessionManager whose purpose is to limit the number of sessions created (i.e. lock until one is available). This is created in the "Inference Thread" (done), then an inference session for each "Inference Request Thread" will be created from that manager. Currently, there is no separate inference request thread that handles each request; I haven't gotten to that point yet, since ggml tensors cannot be sent between threads safely (wrapping with an Arc + Mutex didn't seem to help, but I wouldn't be surprised if there was user error here).

[architecture diagram]

(This is my imagined architecture: again, "Inference Request Thread" does not exist yet.)

I'd suggest merging in the single-threaded state (since most machines will only handle one session at a time) and then adding this in a later PR, along with a max_requests option in CLI_ARGS:

/// Maximum concurrent requests.
/// Keep in mind that each concurrent request will have
/// its own model instance, so this will affect memory usage.
#[arg(long, default_value_t = 1)]
pub max_requests: usize,

@KerfuffleV2 (Contributor)

Just so it doesn't come out of the blue, I've been looking at doing something related. I've been considering creating a backend that would be transport-agnostic (the idea is just to use Tokio channels or something and then other stuff could potentially use it for implementing HTTP or whatever).

My idea is just to have a task that keeps track of jobs or sessions for the model and can interpret them in a round-robin kind of way (at least initially), which would allow having multiple inference sessions/models going at the same time.

The big thing I want to do is have a way to easily pause/snapshot sessions, specifically when the model runs into certain tokens. This should allow stuff like setting bookmarks and rewinding, which I think one can do some pretty interesting things with...

This is a sketch of the kind of interface I was thinking of: https://github.com/KerfuffleV2/llm-backend/blob/main/src/types.rs

I'm pretty sure it can be done as an independent project. Anyway, hope this doesn't seem like it's trying to steal your thunder or anything like that. (Also, if there's anything you can find a use for in that repo, feel free.)
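
(To make the round-robin idea concrete, a reduced, hypothetical sketch of that kind of channel-driven backend task, not the interface from the linked repo; the token generation is a stand-in for a real inference step.)

```rust
use tokio::sync::mpsc;

// Reduced, hypothetical job type: any frontend (HTTP, Discord, ...) submits
// one of these over a channel and reads tokens back off `tokens`. When the
// sender is dropped, the frontend knows the job is finished.
struct Job {
    prompt: String,
    tokens: mpsc::UnboundedSender<String>,
}

// Round-robin scheduler: each pass produces one "token" per active job, so
// several sessions make progress without a thread per request.
async fn backend_task(mut new_jobs: mpsc::UnboundedReceiver<Job>) {
    struct Active {
        job: Job,
        remaining: Vec<String>,
    }
    let mut active: Vec<Active> = Vec::new();

    loop {
        // Pick up newly submitted jobs without blocking generation.
        while let Ok(job) = new_jobs.try_recv() {
            let remaining = job.prompt.split_inclusive(' ').rev().map(str::to_owned).collect();
            active.push(Active { job, remaining });
        }

        if active.is_empty() {
            // Idle: block until the next job arrives, or exit once every
            // frontend has dropped its sender.
            match new_jobs.recv().await {
                Some(job) => {
                    let remaining =
                        job.prompt.split_inclusive(' ').rev().map(str::to_owned).collect();
                    active.push(Active { job, remaining });
                }
                None => return,
            }
        }

        // Stand-in for one real inference step per job per pass.
        active.retain_mut(|a| match a.remaining.pop() {
            Some(tok) => a.job.tokens.send(tok).is_ok(),
            None => false, // finished (or cancelled) jobs drop out here
        });

        // Let other tasks run between steps; a real implementation would be
        // doing actual (likely blocking) model work here instead.
        tokio::task::yield_now().await;
    }
}
```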

@philpax (Collaborator) commented May 5, 2023

Hiya! I'm going to close this PR as there have been some pretty major changes in the meantime. However, I think there's still very much a place for an open-source HTTP inference server built on top of llm.

If you're still interested in working on this, I'd love to see this reimplemented as a separate repository that targets llm.
