llama : create llamax library #5215
I recently did the same thing on my personal project, so I'd like to share the list of functions that I decided to keep on my side:

More info on my project: my implementation is basically a web server that takes a JSON as input, for example:
It would be great if we could generalize grammar-based sampling into a callback-based approach. This would open up downstream use cases to adjust the logic in arbitrary ways. (In Tabby's case, we would really like to integrate a tree-sitter grammar for a similar goal.) Something like:
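As an illustration of what such a callback hook could look like, here is a hypothetical sketch; the `llamax_*` names are invented for illustration and are not an existing API. The application is handed the raw logits each decoding step and can mask or rescore tokens before sampling:

```cpp
#include <cmath>
#include <cstdint>
#include <vector>

// Hypothetical sketch: a callback that lets the application adjust or mask logits
// before sampling, e.g. driven by an external grammar / tree-sitter parser state.
using llamax_logits_callback = void (*)(float * logits, int32_t n_vocab, void * user_data);

struct llamax_sampling_params {
    llamax_logits_callback on_logits = nullptr; // invoked once per decoding step, before sampling
    void *                 user_data = nullptr;
};

// Example callback: ban a set of token ids by pushing their logits to -INFINITY,
// which is the same effect a grammar constraint has for disallowed tokens.
static void mask_banned_tokens(float * logits, int32_t n_vocab, void * user_data) {
    const auto * banned = static_cast<const std::vector<int32_t> *>(user_data);
    for (int32_t id : *banned) {
        if (id >= 0 && id < n_vocab) {
            logits[id] = -INFINITY;
        }
    }
}
```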
Ok, thanks for the suggestions - these are useful. I'm thinking the API should also support the multi-sequence batched use case, where the user can dynamically insert new requests for processing (something like the current slots in the server example):

```cpp
llamax_context * ctx = llamax_context_init(...);

// thread 0 or 1
ctx->add_request(...);
ctx->add_request(...);
...

// main thread
while (true) {
    ctx->process(...);
}

llamax_context_free(ctx);
```
Yeah, I haven't yet considered multi-sequence in my implementation. As a first step, I designed my API to be readable from top to bottom, something like:

```py
load(...)
eos_token = lookup_token("</s>")
input_tokens = tokenize("Hello, my name is")
eval(input_tokens)
while True:
    next_token, next_piece = decode_logits()
    if next_token == eos_token:
        break
    print(next_piece)
    sampling_accept(next_token)
    eval([next_token])
exit()
```

With multi-sequence, it may become:

```py
...
input_tokens = tokenize("Hello, my name is")
seq_id = new_seq()                          # added
eval(input_tokens, seq_id)                  # add seq_id
while True:
    next_token, next_piece = decode_logits()
    if next_token == eos_token:
        break
    print(next_piece)
    sampling_accept(next_token, seq_id)     # add seq_id
    eval([next_token], seq_id)              # add seq_id
delete_seq(seq_id)
exit()
```

It would also be nice if llamax could be thread-safe, for example when running the code above concurrently from multiple threads.
@wsxiaoys I'm not quite sure whether modifying logits is suitable for a high-level API, but maybe llamax can just expose the underlying llama_context, so you can use the low-level API to interact with the low-level context.
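A minimal sketch of that escape-hatch idea, assuming a hypothetical accessor (the `llamax_*` names are illustrative; `llama_get_logits` is the existing low-level call):

```cpp
// Hypothetical escape hatch: llamax exposes its internal low-level context so callers
// can drop down to the llama API (e.g. to inspect or modify logits) when needed.
struct llama_context;   // low-level context from llama.h
struct llamax_context;  // hypothetical high-level wrapper

extern "C" llama_context * llamax_get_llama_context(llamax_context * ctx);

// usage sketch:
//   llama_context * lctx   = llamax_get_llama_context(ctx);
//   float *         logits = llama_get_logits(lctx); // real low-level call from llama.h
```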
I think this is a great idea. Currently, I am using llama.cpp with LlamaSharp, but it does not work with the latest version of llama.cpp because of llama.cpp changes. Ideally, I would like to drop the latest llama.dll directly into my .NET project.

It would be great to have a super high-level API that does NOT have breaking changes, something like a subset of the llama-cpp-python high-level API: https://llama-cpp-python.readthedocs.io/en/latest/api-reference/#high-level-api

A super high-level API contract that does not change from version to version - I think this should cover 90% of use cases.

There can be another, second level of API with things like Tokenize, etc., and a low-level API.
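For illustration, a hypothetical sketch of what such a stable, minimal contract could look like; the names here are invented and are not an existing llama.cpp or llamax API:

```cpp
#include <stddef.h>

// Hypothetical "super high-level" contract: the idea is that these few calls stay
// stable across versions, while lower-level details keep evolving underneath.
#ifdef __cplusplus
extern "C" {
#endif

typedef struct llamax_model llamax_model;                    // opaque handle

llamax_model * llamax_load(const char * model_path);         // load a GGUF model
void           llamax_free(llamax_model * model);

// One-off completion: returns a heap-allocated string owned by the caller.
char * llamax_complete(llamax_model * model, const char * prompt, size_t max_tokens);

// Chat completion from a JSON list of messages (OpenAI-style roles).
char * llamax_chat(llamax_model * model, const char * messages_json, size_t max_tokens);

#ifdef __cplusplus
}
#endif
```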
@AshD yeah right, seems like the … The … However, the …
@ngxson I was thinking of prompt caching. In our app, Fusion Quill, calls to llama.cpp are either chat-type calls or one-off calls for things like summarization.

For the chat use case, the messages list is [system, usermsg1] for the 1st call, then [system, usermsg1, assistant1, usermsg2, ...] for the following calls.

For the other use case, caching the tokens for the system message will make sense. This way the super high-level API is kept simple.
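A minimal sketch of the prefix-reuse bookkeeping behind such prompt caching, under the assumption that the wrapper tracks which tokens are already in the KV cache (the types and helper names are placeholders, not real llama.cpp calls):

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical token type; real code would use llama_token and the llama.cpp decode API.
using token = std::int32_t;

// Number of leading tokens shared by the cached prompt and the new prompt.
static size_t common_prefix(const std::vector<token> & cached, const std::vector<token> & fresh) {
    size_t n = 0;
    while (n < cached.size() && n < fresh.size() && cached[n] == fresh[n]) {
        n++;
    }
    return n;
}

// cached_tokens: tokens currently in the KV cache (e.g. [system, usermsg1, assistant1, ...]).
// new_tokens   : tokens of the incoming request.
// The caller only needs to (1) drop KV entries past the shared prefix and
// (2) evaluate new_tokens[n_keep:].
struct reuse_plan { size_t n_keep; size_t n_eval; };

static reuse_plan plan_prompt_reuse(const std::vector<token> & cached_tokens,
                                    const std::vector<token> & new_tokens) {
    const size_t n_keep = common_prefix(cached_tokens, new_tokens);
    return { n_keep, new_tokens.size() - n_keep };
}
```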
@AshD yeah, I actually have a bonus idea, but I haven't had time to implement it: in the chat API, some systems may remove the oldest messages to be able to fit the history into the context window. On the server side, we can detect this change and then shift the KV cache accordingly instead of re-evaluating the whole prompt. This kind of behavior already exists elsewhere in llama.cpp.

My idea is to detect the change and calculate the number of KV cells to shift just by comparing the list of messages from the last request with the new request. This is just plain logic code and has nothing to do with inference, though.
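A hypothetical sketch of that comparison logic (plain bookkeeping, no inference; the types and names are invented for illustration):

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Compare the previous and new message lists and work out how many KV cache cells could
// be discarded and shifted when the client dropped some of the oldest messages.
struct chat_message {
    std::string role;
    std::string content;
    size_t      n_tokens; // token count of this message when it was evaluated
};

// Returns the number of KV cells to remove/shift, or -1 if the new list is not the old
// list with a contiguous block of old messages removed (then fall back to a full re-eval).
static long long kv_cells_to_shift(const std::vector<chat_message> & prev,
                                   const std::vector<chat_message> & next) {
    // 1. common prefix (typically just the system message)
    size_t pre = 0;
    while (pre < prev.size() && pre < next.size() &&
           prev[pre].role == next[pre].role && prev[pre].content == next[pre].content) {
        pre++;
    }
    if (pre == prev.size()) {
        return 0; // nothing was dropped, the new list only appends messages
    }

    // 2. find where the remainder of `next` re-aligns with `prev`
    for (size_t skip = pre; skip < prev.size(); ++skip) {
        size_t i = skip, j = pre;
        while (i < prev.size() && j < next.size() &&
               prev[i].role == next[j].role && prev[i].content == next[j].content) {
            i++; j++;
        }
        if (i == prev.size()) { // all remaining old messages matched
            long long shift = 0;
            for (size_t k = pre; k < skip; ++k) shift += (long long) prev[k].n_tokens;
            return shift; // cells occupied by the dropped messages
        }
    }
    return -1;
}
```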
This issue is stale because it has been open for 30 days with no activity.
Not stale. |
I hope someone picks this up soon. Our app, Fusion Quill, uses llama.cpp via LlamaSharp. With a stable high-level API, this problem should go away, and it would simplify downstream llama.cpp libraries.
Thanks a lot for this awesome project!
FYI, I started a demo implementation on my fork (however, no intent to finish or merge it, as I'm quite busy for now): ngxson/llama.cpp@master...ngxson:llama.cpp:xsn/llamax-demo

The idea is to take mostly the same infrastructure from the server example and wrap it in C-style calls:

```cpp
auto req = llamax_default_cmpl_request();
req.content = "My name is Bob and I am";
req.stream = true; // get tokens one by one in real-time

llamax_cmpl_id id = llamax_create_cmpl(ctx, req);

while (true) {
    auto res = llamax_cmpl_response(ctx, id);
    if (res->end) {
        llamax_free_response(res);
        break; // stop
    } else {
        printf("%s", res->content);
        llamax_free_response(res);
    }
}
```

One problem is that there is no C-style API for sampling. Hopefully it will be resolved in #8508.
Thanks. Currently the plan is pretty vague, but I guess we will try to simplify the context management (i.e. the KV cache API) and improve the sampling API.
@ggerganov hi! The gpt4all project is one third party that would be very much interested in this new API. Currently, we are removing support for our own one-off GPT-J arch that utilized ggml but not the llama API. This one-off arch support has resulted in us maintaining antiquated code for context management and sampling. See here for the code we're hoping to remove and replace with this new llamax API: https://github.com/nomic-ai/gpt4all/blob/main/gpt4all-backend/llmodel_shared.cpp

As such, we'd be happy to help with this effort. We could really use the sampling code in common, as well as a lot of the code that handles the context in examples/main/main.cpp and so on. Let us know if we can help in any way.
@ggerganov I have implemented a high-level library here: https://github.com/undreamai/LlamaLib
@amakropoulos yes please 🙏
If it is of any interest to anyone, or as a starting point for llamax: a few months ago I created a high-level C++ API: https://gitlab.com/auksys/llama-xpp/-/tree/dev/1?ref_type=heads The API was modeled after some other language-binding APIs (a mix of one from Python and one from Rust).
Depends on: #5214
The `llamax` library will wrap `llama` and expose common high-level functionality. The main goal is to ease the integration of `llama.cpp` into 3rd-party projects. Ideally, most projects would interface through the `llamax` API for all common use cases, while still having the option to use the low-level `llama` API for more uncommon applications that require finer control of the state.

A simple way to think about `llamax` is that it will simplify all of the existing examples in `llama.cpp` by hiding the low-level stuff, such as managing the KV cache and batching requests.

Roughly, `llamax` will require its own state object and a run-loop function.

The specifics of the API are yet to be determined - suggestions are welcome.
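To make the "state object + run-loop" idea concrete, here is a minimal hypothetical sketch; none of these names are an actual `llamax` API, they only mirror the shape discussed in this thread:

```cpp
// Hypothetical llamax shapes only -- not a real API.
struct llamax_params;   // model path, n_ctx, sampling defaults, ...
struct llamax_context;  // opaque state: model, KV cache, active requests

extern "C" {
    llamax_context * llamax_init(const llamax_params * params);
    void             llamax_free(llamax_context * ctx);

    // enqueue a new completion request; batching and KV cache management are internal
    int llamax_add_request(llamax_context * ctx, const char * prompt);

    // run one decoding step for all active requests; returns how many are still active
    int llamax_process(llamax_context * ctx);
}

// Usage sketch: a single run-loop drives every queued request to completion.
static void run(llamax_context * ctx) {
    llamax_add_request(ctx, "Hello, my name is");
    while (llamax_process(ctx) > 0) {
        // stream partial results to the user here
    }
}
```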