[Documentation] C API examples #384


Closed
niansa opened this issue Mar 22, 2023 · 11 comments
Labels
documentation Improvements or additions to documentation

Comments

@niansa
Contributor

niansa commented Mar 22, 2023

Hey!

There should be a simple example of how to use the new C API (e.g. one that takes a hardcoded string and runs llama on it until it emits \n, or something like that).
Not sure whether the /examples/ directory is appropriate for this.

Thanks
Niansa

@gjmulder added the documentation (Improvements or additions to documentation) label on Mar 22, 2023
@SpeedyCraftah

SpeedyCraftah commented Mar 22, 2023

Agreed. I'm planning to write some wrappers that port llama.cpp to other languages using the new llama.h, and documentation would be helpful.
I am happy to look into writing an example for it if @ggerganov or anyone else isn't planning to do so.

@Green-Sky
Collaborator

Green-Sky commented Mar 22, 2023

@SpeedyCraftah go for it, here is a rough overview:

const std::string prompt = " This is the story of a man named ";
llama_context* ctx;

auto lparams = llama_context_default_params();

// load model
ctx = llama_init_from_file("../../llama.cpp/models/7B/ggml-model-q4_0.bin", lparams);

// determine the required inference memory per token:
// TODO: this is a hack copied from main.cpp idk whats up here
{
    const std::vector<llama_token> tmp = { 0, 1, 2, 3 };
    llama_eval(ctx, tmp.data(), tmp.size(), 0, N_THREADS);
}

// convert the prompt into tokens (embd_inp holds llama_token ids)
std::vector<llama_token> embd_inp(prompt.size()+1);
auto n_of_tok = llama_tokenize(ctx, prompt.c_str(), embd_inp.data(), embd_inp.size(), true);
embd_inp.resize(n_of_tok);

// evaluate the prompt
for (size_t i = 0; i < embd_inp.size(); i++) {
    // batch size 1
    llama_eval(ctx, embd_inp.data() + i, 1, i, N_THREADS);
}

std::string prediction;
std::vector<llama_token> embd = embd_inp;

for (int i = 0; i < N_PREDICT; i++) { // N_PREDICT = number of tokens to predict (caller-defined, like N_THREADS)
    const llama_token id = llama_sample_top_p_top_k(ctx, nullptr, 0, 40, 0.8f, 0.2f, 1.f/0.85f);

    // TODO: break here if EOS

    // add it to the context (all tokens, prompt + predict)
    embd.push_back(id);

    // add to string
    prediction += llama_token_to_str(ctx, id);

    // eval next token
    llama_eval(ctx, &embd.back(), 1, embd.size(), N_THREADS);
}

llama_free(ctx); // cleanup

edit: removed the -1 from last eval
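
For anyone copying this as-is: the snippet assumes a bit of surrounding glue that isn't shown. Roughly something like the following (the constant names are placeholders, not part of llama.h):

// hypothetical glue around the snippet above
#include "llama.h"   // llama.cpp C API header

#include <cstdio>
#include <string>
#include <vector>

static const int N_THREADS = 4;  // CPU threads passed to llama_eval
static const int N_PREDICT = 64; // how many tokens the prediction loop samples

int main() {
    // ... snippet from above goes here ...
    // after the loop, `prediction` holds the generated text:
    // printf("%s%s\n", prompt.c_str(), prediction.c_str());
    return 0;
}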

@ggerganov
Member

The ./examples folder should contain all programs generated by the project.
For example, main.cpp has to become an example in ./examples/main.
The utils.h and utils.cpp have to be moved to the ./examples folder and be shared across all examples.

See whisper.cpp examples structure for reference.
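
Roughly, the layout would look like this (a sketch only; names beyond ./examples/main and the utils files are illustrative):

llama.cpp/
├── examples/
│   ├── main/          # main.cpp becomes the example program here
│   │   └── main.cpp
│   ├── utils.h        # shared across all examples
│   └── utils.cpp
└── ...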

@niansa
Contributor Author

niansa commented Mar 23, 2023

@SpeedyCraftah go for it, here is a rough overview:
...

edit: removed the -1 from last eval

Absolutely wonderful! This example alone was enough to make me understand how to use the API 👍
I can verify this works; note, though, that you've mixed up tok and id.

@Green-Sky
Collaborator

Absolutely wonderful! This example alone was enough to make me understand how to use the API 👍 I can verify this works; note, though, that you've mixed up tok and id.

:) Yeah, I was just throwing stuff together from my own experiments and main.cpp.

@SpeedyCraftah

SpeedyCraftah commented Mar 23, 2023

@SpeedyCraftah go for it, here is a rough overview:
...

edit: removed the -1 from last eval

Can confirm it works, thank you.
I was wondering why it generated tokens so slowly, but enabling compiler release optimisations today fixed that.
It is a CPU machine learning framework after all.

I will try to cook something simple and helpful and submit it.

@ghost

ghost commented Mar 23, 2023

Thank you for the instructions. It would be super helpful to have a minimal example of how to fire up the API and import it from Python as a package, so one can send requests (together with the generation parameters) to the API.

@SpeedyCraftah

Thank you for the instructions. It would be super helpful to have a minimal example of how to fire up the API and import it from Python as a package, so one can send requests (together with the generation parameters) to the API.

Definitely, instead of using the janky command-line method and then extracting the outputs.
I am planning to write a node-gyp binding for it so that you can run it directly from Node.js.

@SpeedyCraftah

SpeedyCraftah commented Mar 23, 2023

@Green-Sky If you don't mind me asking, how do I go about increasing the batch size of the prompt?
I tried something naive, but it just seems to result in undefined behaviour (I tried to set a batch size of 8):

for (size_t i = 0; i < embd_inp.size(); i++) {
    llama_eval(ctx, embd_inp.data() + (i * 8), 8, i * 8, N_THREADS);
}

Did I do something wrong, or rather, is there something I failed to do?

EDIT: Just realised I didn't also divide the loop length by 8 (yes, I will handle remainders, don't worry).
It seems to be working now!
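
For reference, the corrected prompt-evaluation loop looks roughly like this (BATCH_SIZE is just a local constant here, and the final partial batch is handled explicitly):

// evaluate the prompt in batches instead of one token at a time
const size_t BATCH_SIZE = 8;

for (size_t i = 0; i < embd_inp.size(); i += BATCH_SIZE) {
    // clamp the last batch to the remaining tokens
    size_t n_batch = embd_inp.size() - i;
    if (n_batch > BATCH_SIZE) {
        n_batch = BATCH_SIZE;
    }

    // n_past (the 4th argument) is the number of tokens already evaluated
    llama_eval(ctx, embd_inp.data() + i, (int) n_batch, (int) i, N_THREADS);
}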

@Green-Sky
Collaborator

@SpeedyCraftah any update on this?

@SpeedyCraftah

@SpeedyCraftah any update on this?

Going well! I have finished the final mock-up; it just needs some polishing, fixes for size_t conversion warnings, and comments, and then it's ready to go. It should probably be split into multiple parts, such as "example of barebones generation" and "example of generation with stop sequences", so it isn't so complex right off the bat.
I also added stop sequences, similar to how OpenAI does them: tokens that appear to match the start of a stop sequence are withheld from printing/saving, and once it's confirmed they aren't actually part of a stop sequence, all the withheld tokens are replayed.
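
Roughly, the buffering works like this (just a sketch with made-up names, not the actual code from the mock-up):

#include <string>

struct StopFilter {
    std::string stop;      // the stop sequence, e.g. "\nUser:"
    std::string held_text; // decoded tokens withheld because they might start the stop sequence
};

// Feed one decoded token. Returns true once the stop sequence has been produced.
// Text that is confirmed safe is appended to `out` (print/save it); possible
// stop-sequence prefixes stay withheld until they are resolved either way.
bool feed(StopFilter & f, const std::string & token_text, std::string & out) {
    f.held_text += token_text;

    if (f.held_text.compare(0, f.stop.size(), f.stop) == 0) {
        return true; // the withheld text begins with the full stop sequence
    }
    if (f.stop.compare(0, f.held_text.size(), f.held_text) == 0) {
        return false; // still a prefix of the stop sequence, keep withholding
    }

    // confirmed not a stop sequence: replay everything that was withheld
    out += f.held_text;
    f.held_text.clear();
    return false;
}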

The only issue is that the time from loading the model to generating the first token is noticeably longer than when running the same parameters & prompt with the main.exe CLI.
I'm also not sure whether I implemented batching correctly; I kind of took a guess at how it might be implemented, so I should probably look at the main CLI for that.

Would be great if you could look over it first!
https://paste.gg/p/anonymous/4440251201fd45d49d051a4d8661fee5
