[Documentation] C API examples #384


Closed
niansa opened this issue Mar 22, 2023 · 11 comments
Labels
documentation Improvements or additions to documentation

Comments

@niansa
Contributor

niansa commented Mar 22, 2023

Hey!

There should be a simple example of how to use the new C API (e.g. one that takes a hardcoded string and runs llama on it until it emits \n, or something like that).
Not sure whether the /examples/ directory is appropriate for this.

Thanks
Niansa

@gjmulder added the documentation (Improvements or additions to documentation) label on Mar 22, 2023
@SpeedyCraftah

SpeedyCraftah commented Mar 22, 2023

Agreed. I'm planning to write some wrappers that port llama.cpp to other languages using the new llama.h, and documentation would be helpful.
I am happy to look into writing an example for it if @ggerganov or anyone else isn't planning to do so.

@Green-Sky
Collaborator

Green-Sky commented Mar 22, 2023

@SpeedyCraftah go for it, here is a rough overview:

const std::string prompt = " This is the story of a man named ";
llama_context* ctx;

auto lparams = llama_context_default_params();

// load model
ctx = llama_init_from_file("../../llama.cpp/models/7B/ggml-model-q4_0.bin", lparams);

// determine the required inference memory per token:
// TODO: this is a hack copied from main.cpp idk whats up here
{
    const std::vector<llama_token> tmp = { 0, 1, 2, 3 };
    llama_eval(ctx, tmp.data(), tmp.size(), 0, N_THREADS);
}

// convert the prompt into tokens (embd_inp holds llama_token ids)
std::vector<llama_token> embd_inp(prompt.size()+1);
auto n_of_tok = llama_tokenize(ctx, prompt.c_str(), embd_inp.data(), embd_inp.size(), true);
embd_inp.resize(n_of_tok);

// evaluate the prompt
for (size_t i = 0; i < embd_inp.size(); i++) {
    // batch size 1
    llama_eval(ctx, embd_inp.data() + i, 1, i, N_THREADS);
}

std::string prediction;
std::vector<llama_token> embd = embd_inp;

for (int i = 0; i < N_PREDICT; i++) { // N_PREDICT = number of tokens to predict (caller-defined, like N_THREADS)
    const llama_token id = llama_sample_top_p_top_k(ctx, nullptr, 0, 40, 0.8f, 0.2f, 1.f/0.85f);

    // TODO: break here if EOS

    // add it to the context (all tokens, prompt + predict)
    embd.push_back(id);

    // add to string
    prediction += llama_token_to_str(ctx, id);

    // eval next token
    llama_eval(ctx, &embd.back(), 1, embd.size(), N_THREADS);
}

llama_free(ctx); // cleanup

edit: removed the -1 from last eval
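
For anyone copying this as-is: the snippet assumes a bit of surrounding glue that isn't shown. Roughly something like the following (the constant names are placeholders, not part of llama.h):

// hypothetical glue around the snippet above
#include "llama.h"   // llama.cpp C API header

#include <cstdio>
#include <string>
#include <vector>

static const int N_THREADS = 4;  // CPU threads passed to llama_eval
static const int N_PREDICT = 64; // how many tokens the prediction loop samples

int main() {
    // ... snippet from above goes here ...
    // after the loop, `prediction` holds the generated text:
    // printf("%s%s\n", prompt.c_str(), prediction.c_str());
    return 0;
}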

@ggerganov
Member

The ./examples folder should contain all programs generated by the project.
For example, main.cpp has to become an example in ./examples/main.
The utils.h and utils.cpp have to be moved to the ./examples folder and be shared across all examples.

See whisper.cpp examples structure for reference.
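
Roughly, the layout would look like this (a sketch only; names beyond ./examples/main and the utils files are illustrative):

llama.cpp/
├── examples/
│   ├── main/          # main.cpp becomes the example program here
│   │   └── main.cpp
│   ├── utils.h        # shared across all examples
│   └── utils.cpp
└── ...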

@niansa
Contributor Author

niansa commented Mar 23, 2023

@SpeedyCraftah go for it, here is a rough overview:
...

edit: removed the -1 from last eval

Absolutely wonderful! This example alone was enough to make me understand how to use the API 👍
I can verify this works; note, though, that you've mixed up tok and id.

@Green-Sky
Collaborator

Absolutely wonderful! This example alone was enough to make me understand how to use the API 👍 I can verify this works; note, though, that you've mixed up tok and id.

:) Yeah, I was just throwing stuff together from my own experiments and main.cpp.

@SpeedyCraftah

SpeedyCraftah commented Mar 23, 2023

@SpeedyCraftah go for it, here is a rough overview:
...

edit: removed the -1 from last eval

Can confirm it works, thank you.
I was wondering why it generated tokens so slowly, but enabling compiler release optimisations today fixed that.
It is a CPU machine learning framework after all.

I will try to cook something simple and helpful and submit it.

@ghost

ghost commented Mar 23, 2023

Thank you for the instructions. It would be super helpful to have a minimal example of how to fire up the API and import it from Python as a package, so one can send requests (together with the generation parameters) to the API.

@SpeedyCraftah

Thank you for the instructions. It would be super helpful to have a minimal example of how to fire up the API and import it from Python as a package, so one can send requests (together with the generation parameters) to the API.

Definitely, instead of using the janky command-line method and then extracting the outputs.
I am planning to write a node-gyp binding for it so that you can run it directly from Node.js.

@SpeedyCraftah

SpeedyCraftah commented Mar 23, 2023

@Green-Sky If you don't mind me asking, how do I go about increasing the batch size of the prompt?
I tried something naive, but it just seems to result in undefined behaviour (I tried to set a batch size of 8):

for (size_t i = 0; i < embd_inp.size(); i++) {
    llama_eval(ctx, embd_inp.data() + (i * 8), 8, i * 8, N_THREADS);
}

Did I do something wrong, or rather, is there something I failed to do?

EDIT: Just realised I didn't also divide the loop length by 8 (yes, I will handle remainders, don't worry).
It seems to be working now!
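
For reference, the corrected prompt-evaluation loop looks roughly like this (BATCH_SIZE is just a local constant here, and the final partial batch is handled explicitly):

// evaluate the prompt in batches instead of one token at a time
const size_t BATCH_SIZE = 8;

for (size_t i = 0; i < embd_inp.size(); i += BATCH_SIZE) {
    // clamp the last batch to the remaining tokens
    size_t n_batch = embd_inp.size() - i;
    if (n_batch > BATCH_SIZE) {
        n_batch = BATCH_SIZE;
    }

    // n_past (the 4th argument) is the number of tokens already evaluated
    llama_eval(ctx, embd_inp.data() + i, (int) n_batch, (int) i, N_THREADS);
}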

@Green-Sky
Collaborator

@SpeedyCraftah any update on this?

@SpeedyCraftah

@SpeedyCraftah any update on this?

Going well! I have finished the final mock-up; it just needs some polishing, fixes for size_t conversion warnings, and comments, and then it's ready to go. It should probably be split into multiple parts, such as "example of barebones generation" and "example of generation with stop sequences", so it isn't so complex right off the bat.
I also added stop sequences, similar to how OpenAI does them: tokens that appear to match the start of a stop sequence are withheld from printing/saving, and once it's confirmed they aren't actually part of a stop sequence, all the withheld tokens are replayed.
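
Roughly, the buffering works like this (just a sketch with made-up names, not the actual code from the mock-up):

#include <string>

struct StopFilter {
    std::string stop;      // the stop sequence, e.g. "\nUser:"
    std::string held_text; // decoded tokens withheld because they might start the stop sequence
};

// Feed one decoded token. Returns true once the stop sequence has been produced.
// Text that is confirmed safe is appended to `out` (print/save it); possible
// stop-sequence prefixes stay withheld until they are resolved either way.
bool feed(StopFilter & f, const std::string & token_text, std::string & out) {
    f.held_text += token_text;

    if (f.held_text.compare(0, f.stop.size(), f.stop) == 0) {
        return true; // the withheld text begins with the full stop sequence
    }
    if (f.stop.compare(0, f.held_text.size(), f.held_text) == 0) {
        return false; // still a prefix of the stop sequence, keep withholding
    }

    // confirmed not a stop sequence: replay everything that was withheld
    out += f.held_text;
    f.held_text.clear();
    return false;
}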

The only issue is that the time from loading the model to generating the first token is noticeably longer than when running the same parameters & prompt with the main.exe CLI.
I'm also not sure whether I implemented batching correctly; I kind of took a guess at how it might be implemented, so I should probably look at the main CLI for that.

Would be great if you could look over it first!
https://paste.gg/p/anonymous/4440251201fd45d49d051a4d8661fee5
