Enhancement: Codellama FIM tokenization #2818
Comments
Can you give a sample API, so I can get a better idea of how FIM is supposed to work?
We don't have one atm. We can add it. Btw, make sure to take into account the changes from #2810 - will merge this soon.
What is the difference between these two formats (PSM and SPM)?
In Fill In Middle (Section 3 of https://arxiv.org/abs/2207.14255), the idea is that you slice up the input into what comes before the point you want to fill and what comes after. The part before is the Prefix, the part after is the Suffix. You tokenize them separately, then concatenate the tokens together with four extra sentinel tokens; a sketch of the layout follows below.
The difference is just the order of the prefix and suffix. It works both ways, and in fact codellama is trained on both and should be able to accept both. At inference, you have the model predict what comes after the <MID> token.
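To make the two orderings concrete, here is a minimal sketch. It is not taken from the paper or the codellama repo: the token IDs are the ones quoted later in this thread, and the SPM layout simply swaps the two blocks as described above, which may differ in detail from the paper's exact rendering.

```c
#include <stddef.h>

typedef int token_t;

/* Special token IDs as quoted elsewhere in this thread (hardcoded only for the sketch). */
enum { TK_PRE = 32007, TK_SUF = 32008, TK_MID = 32009 };

/* Builds the FIM query in either ordering. `out` must have room for
   n_prefix + n_suffix + 3 tokens. Returns the number of tokens written. */
size_t build_fim_query(token_t *out,
                       const token_t *prefix, size_t n_prefix,
                       const token_t *suffix, size_t n_suffix,
                       int use_spm) {
    size_t n = 0;
    if (!use_spm) {
        /* PSM: <PRE> prefix <SUF> suffix <MID> */
        out[n++] = TK_PRE;
        for (size_t i = 0; i < n_prefix; i++) out[n++] = prefix[i];
        out[n++] = TK_SUF;
        for (size_t i = 0; i < n_suffix; i++) out[n++] = suffix[i];
    } else {
        /* SPM, per the description above: same pieces, suffix block first */
        out[n++] = TK_SUF;
        for (size_t i = 0; i < n_suffix; i++) out[n++] = suffix[i];
        out[n++] = TK_PRE;
        for (size_t i = 0; i < n_prefix; i++) out[n++] = prefix[i];
    }
    out[n++] = TK_MID;  /* the model then generates the middle until <EOT> */
    return n;
}
```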
Sure. This is pseudocode, but should get the point across.

```c
#define token_t int

// Special token IDs from the codellama tokenizer.
#define prefix_token 32007
#define middle_token 32009
#define suffix_token 32008
#define eot_token    32010

token_t* llama_fim(llama_context* ctx, char* prefix, char* suffix, size_t max_tokens_generated) {
    // Tokenize the two halves separately.
    token_t* prefix_tokens = tokenize(prefix);
    token_t* suffix_tokens = tokenize(suffix);

    // PSM ordering: <PRE> prefix <SUF> suffix <MID>
    token_t* full_query = concatenate(prefix_token, prefix_tokens, suffix_token, suffix_tokens, middle_token);

    token_t* middle_tokens = (token_t*)malloc(sizeof(token_t) * max_tokens_generated);
    token_t pred_token;
    size_t tokens_generated = 0;

    // Generate until the model emits <EOT> or we hit the cap.
    while ((pred_token = llama_next_token(ctx, full_query, middle_tokens, tokens_generated)) != eot_token) {
        middle_tokens[tokens_generated++] = pred_token;
        if (tokens_generated == max_tokens_generated) break;
    }
    return middle_tokens;
}
```

My question is basically that there's a lot of ways to write this API. The codellama repo example splits the input on a placeholder marker in the query text.
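To make that design question concrete, here are two hypothetical shapes the API could take. The names and signatures are invented purely for illustration, not proposals from the repo:

```c
#include <stddef.h>

struct llama_context;  // opaque, as in the existing API

// (a) Caller passes a single string containing a placeholder marker, and the
//     example splits it into prefix/suffix itself before tokenizing.
int llama_fim_from_marker(struct llama_context *ctx,
                          const char *text,        // e.g. "int main() { ??? }"
                          const char *marker,      // the placeholder to split on
                          char *out, size_t out_len);

// (b) Caller supplies prefix and suffix already split (matches the pseudocode above).
int llama_fim_from_parts(struct llama_context *ctx,
                         const char *prefix,
                         const char *suffix,
                         char *out, size_t out_len);
```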
Thank you for the clear summary - very useful. I think ideally it either has to be a separate example, or we try to fit it into an existing one. Back to a previous question, after some more thought:
My initial suggestion to use …
I'm new to LLMs and I recently built a copilot extension using CodeLLaMA and llama.cpp. I got good code completion results with a prompt like this:

```js
const prompt = `<PRE> ${prefix} <SUF>${suffix} <MID>`
```

What is the difference between FIM and the above prompt?
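For comparison with the token-level pseudocode earlier in the thread, the same string-style prompt in C would look roughly like this. It is a sketch only; whether the literal markers end up mapped to the special token IDs depends entirely on how the text is tokenized:

```c
#include <stdio.h>

// Builds the text-form prompt used above. The "<PRE>"/"<SUF>"/"<MID>" markers are
// plain text here; they only become the special tokens if the tokenizer maps them.
int build_text_prompt(char *buf, size_t buf_len, const char *prefix, const char *suffix) {
    return snprintf(buf, buf_len, "<PRE> %s <SUF>%s <MID>", prefix, suffix);
}
```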
Can you please share the script, idea, or steps you took to achieve this prompt format?
This issue was closed because it has been inactive for 14 days since being marked as stale.
I assume that the project will want to support Fill In Middle (FIM) tokenization to work with the codellama models. How will this be accomplished?
Reading the codellama paper (https://arxiv.org/abs/2308.12950): in the addendum, they suggest using the PSM format over the SPM format, or using SPM with token healing. PSM seems more sensible to me, at least initially.
So, about those four tokens. Here they are.
https://github.com/facebookresearch/codellama/blob/cb51c14ec761370ba2e2bc351374a79265d0465e/llama/tokenizer.py#L28-L31
Their values according to the tokenizer are: ▁<PRE> = 32007, ▁<SUF> = 32008, ▁<MID> = 32009, ▁<EOT> = 32010.
With this, it should be possible to stitch together FIM functionality from the project's existing capabilities. I'm working on it, PR probably forthcoming.
My questions are:
1. Should the example take markers like <MID> and such in the query string, or just a pointer into a list of tokens or characters?
2. Is it fine to hardcode the token values above, or is there a way to look them up from the tokenizer (the equivalent of sp_model.piece_to_id("▁<MID>"))?
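For the second question, a vocabulary scan could look roughly like this; get_piece and n_vocab are stand-ins for whatever accessors the library actually exposes, so treat it as an illustration of the idea rather than working code against the current API:

```c
#include <string.h>

// Hypothetical accessor: returns the text of a vocabulary piece by token id.
typedef const char *(*get_piece_fn)(int token_id);

// Scan the vocabulary for a sentinel piece such as "▁<MID>".
// Returns the token id, or -1 if the piece is not present.
int find_sentinel_token(get_piece_fn get_piece, int n_vocab, const char *piece) {
    for (int id = 0; id < n_vocab; id++) {
        const char *text = get_piece(id);
        if (text != NULL && strcmp(text, piece) == 0) {
            return id;
        }
    }
    return -1;
}
```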