[Documentation] C API examples #384
Comments
Agreed. I'm planning to write some wrappers to port llama.cpp to other languages using the new llama.h, and documentation would be helpful. |
@SpeedyCraftah go for it, here is a rough overview:

const std::string prompt = " This is the story of a man named ";

llama_context* ctx;
auto lparams = llama_context_default_params();

// load model
ctx = llama_init_from_file("../../llama.cpp/models/7B/ggml-model-q4_0.bin", lparams);

// determine the required inference memory per token:
// TODO: this is a hack copied from main.cpp; not sure what is going on here
{
    const std::vector<llama_token> tmp = { 0, 1, 2, 3 };
    llama_eval(ctx, tmp.data(), tmp.size(), 0, N_THREADS);
}

// tokenize the prompt
std::vector<llama_token> embd_inp(prompt.size() + 1);
auto n_of_tok = llama_tokenize(ctx, prompt.c_str(), embd_inp.data(), embd_inp.size(), true);
embd_inp.resize(n_of_tok);

// evaluate the prompt, one token at a time (batch size 1)
for (size_t i = 0; i < embd_inp.size(); i++) {
    llama_eval(ctx, embd_inp.data() + i, 1, i, N_THREADS);
}

std::string prediction;
std::vector<llama_token> embd = embd_inp;

const int n_predict = 128; // however many tokens you want to generate
for (int i = 0; i < n_predict; i++) {
    llama_token id = llama_sample_top_p_top_k(ctx, nullptr, 0, 40, 0.8f, 0.2f, 1.f/0.85f);

    // TODO: break here if EOS

    // add it to the context (all tokens, prompt + predict)
    embd.push_back(id);

    // add to string
    prediction += llama_token_to_str(ctx, id);

    // eval next token
    llama_eval(ctx, &embd.back(), 1, embd.size(), N_THREADS);
}

llama_free(ctx); // cleanup
edit: removed the -1 from last eval |
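For the EOS TODO above, a minimal sketch of the check, assuming the llama_token_eos() helper that llama.h exposed at the time, would go right after the sampling call:

// hedged sketch: stop once the model emits the end-of-sequence token
if (id == llama_token_eos()) {
    break;
}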
See the whisper.cpp examples structure for reference. |
Absolutely wonderful! This example alone was enough to make me understand how to use the API 👍 |
:) yea i was just throwing stuff together from my own experiments and main.cpp |
Can confirm it works, thank you. I will try to cook something simple and helpful and submit it. |
Thank you for the instructions. It would be super helpful to have a minimal example of how to fire up the API and import it from Python as a package, so one can send requests (together with the generation parameters) to the API. |
Definitely, much better than the janky command-line approach of running the binary and then extracting the outputs. |
@Green-Sky If you don't mind me asking, how do I go about increasing the batch size of the prompt?
Did I do something wrong, or rather, is it something I didn't do? EDIT: Just realised I didn't divide the loop length by 8 (yes, I will handle remainders, don't worry). |
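For reference, a minimal sketch of batched prompt evaluation, assuming the same llama_eval signature as in the example above and a hypothetical batch size of 8 (std::min needs <algorithm>):

// evaluate the prompt n_batch tokens at a time instead of one by one
const size_t n_batch = 8; // hypothetical batch size
for (size_t i = 0; i < embd_inp.size(); i += n_batch) {
    const size_t n_eval = std::min(n_batch, embd_inp.size() - i);
    // n_past is the number of tokens already evaluated, i.e. i
    llama_eval(ctx, embd_inp.data() + i, (int) n_eval, (int) i, N_THREADS);
}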
@SpeedyCraftah any update on this? |
Going well! I am finished with the final mock-up; it just needs some polishing, fixes for the size_t conversion warnings, and comments, then it's ready to go. It should probably be split up into multiple parts, such as "example of barebones generation" and "example of generation with a stop sequence", so it isn't so complex right off the bat. The only issue is that the time from loading the model to generating the first token is noticeably longer than when running the same parameters & prompt with the main.exe CLI. Would be great if you could look it over first! |
Hey!
There should be a simple example of how to use the new C API (e.g. one that simply takes a hardcoded string and runs llama on it until \n, or something like that).
Not sure if the /examples/ directory is appropriate for this. Thanks
Niansa
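For the "runs llama on it until \n" idea, a minimal sketch of the stop check, assuming llama_token_to_str as used in the example above (and <cstring> for std::strchr), placed in the generation loop right after sampling:

// hypothetical stop condition: break once the generated piece contains a newline
const char * piece = llama_token_to_str(ctx, id);
if (piece != nullptr && std::strchr(piece, '\n') != nullptr) {
    break;
}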