
llama : (proposal) implement cpp-first library llama-cpp.h #10583

Draft · wants to merge 1 commit into master

Conversation

@ngxson ngxson (Collaborator) commented Nov 29, 2024

Motivation

I was rethinking the idea of llamax (#5215) today. While we're still far from having llamax, we need a step-by-step plan to get there.

Currently, one of the main challenges for developers who want to adopt llama.cpp is that we don't expose C++ functions in llama.h (ref: this discussion).

I think it would be nice to take this as the first step toward llamax: we should first have a library that uses C++ types like std::string and std::vector instead of passing raw C pointers.

Implementation

llama-cpp.h seems to be a good starting point. It's already used by a single example (llama-run), so it can serve as the work-in-progress home for now.

My implementation in this PR is a rough draft, mostly for demo purposes.

The final goal is to have all functions be "cpp-first", with the C-only llama.h acting as a wrapper around them. (Note: this is the reverse of what we do with common today, where the C++ common code is a wrapper around the C functions in llama.h.)

I'm opening this proposal for discussion, and I kindly invite @slaren and @ggerganov to join in.

Thank you.


@slaren slaren (Member) commented Nov 29, 2024

I like the proposal overall, but I have two concerns. First, if we do this, we would need to maintain two different APIs and commit to avoiding breaking changes unless absolutely necessary, which would increase the maintenance cost. Second, there is a significant risk that the C API would become a second-class citizen, which is something we must absolutely avoid, because it is the only way to have a stable ABI that can be used from other languages.

@ngxson ngxson (Collaborator, Author) commented Nov 29, 2024

What I'm thinking is that, looking at the big picture, this doesn't add much maintenance cost, because:

  • The underlying code of most functions (i.e. llama_*_impl and llama_*_internal) already uses C++ types, and most public API/ABI functions are in fact a layer on top that converts C++ --> C.
  • We are already doing C++ <--> C conversion in the common library, which is itself somewhat costly (in terms of both maintenance and performance). By exposing these C++ definitions directly, we can eliminate the need to maintain functions that go C++ --> C --> C++ (see the sketch after this list).
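
For context, here is a minimal sketch of the kind of round trip meant above, modeled on the grow-and-retry pattern used by common's tokenize helper (simplified, not the verbatim source): a C++-typed helper wraps the C-style llama_tokenize, which itself wraps the internal C++ implementation.

#include <cstdint>
#include <string>
#include <vector>

#include "llama.h"

// C++ --> C --> C++: C++ types in, C API in the middle, C++ types out
static std::vector<llama_token> common_tokenize_sketch(
        const struct llama_model * model,
               const std::string & text,
                            bool   add_special,
                            bool   parse_special) {
    // upper bound: roughly one token per byte, plus room for special tokens
    int32_t n_tokens = (int32_t) text.length() + 2 * add_special;
    std::vector<llama_token> result(n_tokens);
    n_tokens = llama_tokenize(model, text.data(), (int32_t) text.length(),
                              result.data(), (int32_t) result.size(),
                              add_special, parse_special);
    if (n_tokens < 0) {
        // negative return value = required buffer size; grow and retry
        result.resize(-n_tokens);
        n_tokens = llama_tokenize(model, text.data(), (int32_t) text.length(),
                                  result.data(), (int32_t) result.size(),
                                  add_special, parse_special);
    }
    result.resize(n_tokens);
    return result;
}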

In the long term, I think it will be cleaner to separate the library into two layers:

  1. a core that provides C++ functions
  2. a small outer wrapper that converts C++ --> C, to be used for FFI bindings

This will effectively make the C library a second-class citizen, but I think that's somewhat acceptable. My point is that:

  • The C++ --> C conversion must be just a thin wrapper, so the cost of maintaining it should be small (we could also use macros to make it even simpler to maintain).
  • Even though the ABI is C-style, each function will take the same shape as its counterpart in llama-cpp.h.

Let me take an example. Say we have a tokenize function in C++:

std::vector<llama_token> tokenize(
        const llama_cpp::model & model,
             const std::string & raw_text,
                          bool   add_special,
                          bool   parse_special = false);

This would then be wrapped into C:

int32_t llama_tokenize(
    const struct llama_model * model,
                  const char * text,
                     int32_t   text_len,
                 llama_token * tokens,
                     int32_t   n_tokens_max,
                        bool   add_special,
                        bool   parse_special);

But when exposed to a high-level language like Python, end users expect the function to take this shape:

tokenize(
    model: LlamaModel,
    raw_text: str,
    add_special: bool,
    parse_special: bool) -> List[LlamaToken]

So having the C++ function as the reference point should give a more "universal" experience in this case.
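
To make the "thin wrapper" claim above concrete, here is a minimal sketch of what the C shim around the C++ tokenize could look like. The unwrap helper that maps the opaque C handle to the C++ object is an assumption for illustration; only the wrapping pattern is the point.

#include <algorithm>
#include <cstdint>
#include <string>
#include <vector>

// hypothetical accessor mapping the opaque C handle to the C++ model object
const llama_cpp::model & unwrap(const struct llama_model * model);

int32_t llama_tokenize(
        const struct llama_model * model,
                      const char * text,
                         int32_t   text_len,
                     llama_token * tokens,
                         int32_t   n_tokens_max,
                            bool   add_special,
                            bool   parse_special) {
    // convert C inputs into C++ types and call the C++-first implementation
    const std::string raw_text(text, text_len);
    const std::vector<llama_token> res =
        llama_cpp::tokenize(unwrap(model), raw_text, add_special, parse_special);

    // keep the existing llama.h convention: return the negated required size
    // when the caller's buffer is too small
    if ((int32_t) res.size() > n_tokens_max) {
        return -((int32_t) res.size());
    }
    std::copy(res.begin(), res.end(), tokens);
    return (int32_t) res.size();
}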

@ggerganov ggerganov (Member) commented:

The C++ API is only usable by C++ and nothing else. The C-style API is universal and can be used by any language, so it has to be the first-class citizen. The examples must use the C-style interface in order to exercise it, even if it were a very thin layer on top of a C++ interface.

> Currently, one of the main challenges for developers who want to adopt llama.cpp is that we don't expose C++ functions in llama.h (ref: #8074 (reply in thread))

We cannot draw conclusions based just on this discussion, so I'm not convinced that this is the main challenge atm.

We can keep extending llama-cpp.h to make it friendlier for writing C++. Ideally, common would not exist and everything would use llama.h (+ llama-cpp.h). The reason to have common is to reduce boilerplate. I also see it as a staging ground for new features (like the grammar, samplers, speculative decoding, etc.) that we are not sure how to put directly into llama and need to iterate on a bit before doing so.

There are various inefficiencies in the existing C-style API, like the tokenization that you pointed out, but we can improve those.

Also, another advantage of not exposing the internal llama.cpp C++ API is that we can refactor it heavily at any time and improve the internal implementation (which I agree is needed) without any breaking changes.

For llamax, I'm thinking it would also expose a C-style interface. It would be higher-level in its logic, not in the language of its implementation. Although I admit I haven't thought deeply yet about what it would look like.

@ngxson ngxson (Collaborator, Author) commented Dec 2, 2024

Hmm, OK, that makes sense. I think for now we can keep the C-style public API as first-class and keep the C++ functions internal-only.

The same can be done with llamax. Although, I'm a bit concerned that, because llamax is built on top of llama.h, it would end up doing C++ (llama.cpp) --> C (llama.h) --> C++ (llamax.cpp) --> C (llamax.h); ideally it should use the C++ functions provided by llama.cpp directly.

> We can keep extending llama-cpp.h to make it friendlier for writing C++. Ideally, common would not exist and everything would use llama.h (+ llama-cpp.h).

I believe that extending llama-cpp.h will greatly reduce the development complexity of llamax. Even better, we can engineer llama-cpp.h so that it calls the internal C++ functions directly, eliminating the C++ (llama.cpp) --> C (llama.h) conversion.

Either way, I think keeping the C++ core and the C-style wrapper separated in llama.cpp will be cleaner. Currently, some C++ functions like llama_token_to_piece_impl do not take advantage of C++ style (i.e. the output buffer is passed as a raw char * instead of a std::string or std::vector<char>). Do you think we should refactor them?
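
For illustration, a minimal sketch of what such a refactor could look like, using the grow-and-retry idiom on top of a stand-in for the current buffer-filling signature (the declaration below is a placeholder, not the exact internal API):

#include <cstdint>
#include <string>

#include "llama.h"

// stand-in for the existing buffer-filling implementation: writes up to
// `length` bytes into `buf` and returns the number of bytes written, or the
// negated required size when the buffer is too small
int32_t token_to_piece_raw(llama_token token, char * buf, int32_t length, bool special);

// hypothetical C++-style variant: the output buffer becomes the return value
std::string token_to_piece(llama_token token, bool special) {
    std::string piece(8, '\0');    // small initial guess
    int32_t n = token_to_piece_raw(token, piece.data(), (int32_t) piece.size(), special);
    if (n < 0) {
        piece.resize(-n);          // grow to the required size and retry
        n = token_to_piece_raw(token, piece.data(), (int32_t) piece.size(), special);
    }
    piece.resize((size_t) n);
    return piece;
}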
