feat: HTTP server for streaming inferences #37
Conversation
Huh, odd that the entire stream needs to be consumed. Not sure what's going on there. For what it's worth, manually iterating over the stream works for me.
Funny! It must be the wrap_stream method. It worked with an async mpsc receiver, so maybe I can make a shim with that.
I'm using flume with the actix async executor (if I'm not wrong, it should be tokio behind it).
@hlhr202 is your first actix executor putting tokens onto the stream? It would help if I wasn't blocking the thread by synchronously waiting for the model generation to stop 😆 Async is tough.
@jack-michaud yes, I create the llama instance in my first actix executor and give it a receiver so I can pass in inference queries from another thread. I also give the llama instance a sender so it can push inference messages to other threads. But I still can't receive the inference messages in real time...
That sounds similar to the discord bot's architecture. Maybe it's the executor that we are using. I tried creating a session in a tokio executor, and it still wasn't sending tokens immediately. In my case, it looks like I'm receiving tokens in the main thread, but they aren't going out over the network right away.
Oh my god. I added a newline to the tokens, and it started sending over the network. The lack of a newline is why the output wasn't being flushed.
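For reference, here's a minimal sketch of the streaming pattern under discussion; this is not the PR's actual code, and the token source is faked here (in the real server it would be the model's per-token callback). Route registration, channel capacity, and the use of tokio's mpsc are all assumptions.

```rust
// Sketch: stream tokens over HTTP with actix-web as they are produced.
use actix_web::{web, HttpResponse};
use tokio_stream::{wrappers::ReceiverStream, StreamExt};

async fn stream_tokens() -> HttpResponse {
    let (tx, rx) = tokio::sync::mpsc::channel::<String>(32);

    // Run generation on a blocking thread so the async executor isn't starved.
    tokio::task::spawn_blocking(move || {
        for token in ["Hello", ", ", "world", "\n"] {
            // blocking_send is used because this closure is not async.
            if tx.blocking_send(token.to_string()).is_err() {
                break; // receiver dropped: the client disconnected
            }
        }
    });

    // Each chunk is written to the response as soon as it arrives.
    let body = ReceiverStream::new(rx)
        .map(|tok| Ok::<_, std::convert::Infallible>(web::Bytes::from(tok)));
    HttpResponse::Ok().streaming(body)
}
```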
LGTM 😄 Other than a few very minor comments. But I don't have that much experience creating Rust server backends, so I'll let others with more experience comment on that.
I think having an HTTP server like this is really important to allow new use cases for llama-rs. Thanks a lot for working on this!
Also, not sure if there's a way around this at the ggml API level, but a very simple workaround is to serve each request in a forked process instead of a thread. The drawback is that win32 is not supported, but it can work via WSL2.
The best way to handle this is to create one new inference session per request. And we will probably need some sort of Mutex to ensure not too many requests are started in parallel; maybe our HTTP framework already supports this? For most machines, the limit would be handling 1 inference request at a time. Also keep in mind each inference session adds ~500MB to memory consumption (it stores the full context window tensors), so we need to make sure to clean them up and not start too many of them.
I've incorporated your feedback, @setzer22. I also now have a separate inference session for each request. I've also sketched out my imagined architecture (again, the "Inference Request Thread" does not exist yet). I'd suggest merging in the single-threaded state (since most machines will only handle one session at a time) and then adding this in a later PR along with a `max_requests` option:

```rust
/// Maximum concurrent requests.
/// Keep in mind that each concurrent request will have
/// its own model instance, so this will affect memory usage.
#[arg(long, default_value_t = 1)]
pub max_requests: usize,
```
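To make the concurrency limit concrete, here is a rough sketch of how `max_requests` could be enforced; the `ServerState` type, the use of a tokio `Semaphore` (standing in for the Mutex idea above), and the placeholder inference call are all my own assumptions, not code from the PR.

```rust
use std::sync::Arc;
use tokio::sync::Semaphore;

struct ServerState {
    // One permit per allowed concurrent inference session.
    inference_permits: Arc<Semaphore>,
}

impl ServerState {
    fn new(max_requests: usize) -> Self {
        Self {
            inference_permits: Arc::new(Semaphore::new(max_requests)),
        }
    }

    async fn run_inference(&self, prompt: String) -> String {
        // Waits here if `max_requests` sessions are already running.
        let _permit = self
            .inference_permits
            .acquire()
            .await
            .expect("semaphore closed");

        // Each request gets its own session (and its own context memory),
        // which is dropped when this future completes.
        tokio::task::spawn_blocking(move || {
            // ...create an inference session and feed `prompt` here...
            format!("(generated text for: {prompt})")
        })
        .await
        .expect("inference thread panicked")
    }
}
```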
Just so it doesn't come out of the blue, I've been looking at doing something related. I've been considering creating a backend that would be transport-agnostic (the idea is just to use Tokio channels or something and then other stuff could potentially use it for implementing HTTP or whatever). My idea is just to have a task that keeps track of jobs or sessions for the model and can interpret them in a round robin kind of way (initially at first), which would allow having multiple inference sessions/models going at the same time. The big thing I want to do is have a way to easily pause/snapsnot sessions, specifically when it runs into certain tokens. This should allow stuff like setting bookmarks and rewinding which I think one can do some pretty interesting things with... This is a sketch of the kind of interface I was thinking of: https://github.com/KerfuffleV2/llm-backend/blob/main/src/types.rs I'm pretty sure it can be done as an independent project. Anyway, hope this doesn't seem like it's trying to steal your thunder or anything like that. (Also, if there's anything you can find a use for in that repo, feel free.) |
Hiya! I'm going to close this PR as there have been some pretty major changes in the meantime. However, I think there's still very much a place for an open-source HTTP inference server built on top of the library. If you're still interested in working on this, I'd love to see it reimplemented as a separate repository that targets the current version of the library.
llama-http
This follows the pattern in llamacord: a model is instantiated in a thread, and inference requests are sent to this thread to be served sequentially. Tokens are sent back to the requester one at a time.
There are some issues with this. Namely, the model context does not seem to be reset between requests, so if you cancel your current request and make another one, the previous chain of thought continues into the new request. Not sure if there's a way around this yet; it's possible it's a bug in my code.
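As a rough illustration of that pattern (a sketch under assumptions, not the actual llama-http or llamacord code), the model thread owns the model and pulls requests off a channel, with each request carrying its own sender for tokens:

```rust
use std::sync::mpsc;
use std::thread;

struct InferenceRequest {
    prompt: String,
    // Tokens are streamed back to the requester one at a time.
    token_tx: mpsc::Sender<String>,
}

fn spawn_model_thread() -> mpsc::Sender<InferenceRequest> {
    let (request_tx, request_rx) = mpsc::channel::<InferenceRequest>();
    thread::spawn(move || {
        // The model would be loaded once here and reused for every request.
        for request in request_rx {
            // Requests are served sequentially, in arrival order.
            for token in request.prompt.split_whitespace() {
                // Stand-in for the real per-token callback from the model.
                if request.token_tx.send(token.to_string()).is_err() {
                    break; // requester went away (e.g. cancelled HTTP request)
                }
            }
        }
    });
    request_tx
}
```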
Instructions
To run `llama-http`, run the `llama-http` binary from the root of the repository. Some (but not all) CLI arguments are supported; use `--help` to see help for all arguments. `-P` can be used to set the port (e.g. `-P 8080`).

To use the `/stream` endpoint, send a POST request with a body containing an `InferenceHttpRequest` in JSON format. Here's an example curl command that does this:
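A minimal sketch of what such a request might look like, assuming the server is listening on port 8080 (per the `-P 8080` example above); the `prompt` field is a hypothetical field name, not the confirmed shape of `InferenceHttpRequest`.

```sh
# Hypothetical request body: "prompt" is an assumed field name.
# -N disables curl's output buffering so streamed tokens appear immediately.
curl -N -X POST http://localhost:8080/stream \
  -H 'Content-Type: application/json' \
  -d '{"prompt": "Tell me about llamas."}'
```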