
Enhancement: Codellama FIM tokenization #2818

Closed · apaz-cli opened this issue Aug 26, 2023 · 6 comments

apaz-cli (Contributor) commented Aug 26, 2023

I assume that the project will want to support Fill In Middle (FIM) tokenization to work with the codellama models. How will this be accomplished?

Reading the codellama paper (https://arxiv.org/abs/2308.12950), here's what they say about FIM:

We extend Llama 2’s tokenizer with four special tokens that mark the beginning of the prefix,
the middle part or the suffix, and the end of the infilling span. To limit the distribution shift
between autoregressive and infilling training, we suppress the implicit leading space that
SentencePiece tokenizers add upon encoding the middle part and the suffix (Kudo & Richardson, 2018).
In SPM format, we concatenate the prefix and the middle part before encoding to tokens.
Note that our model doesn’t encounter split subtokens in the SPM format while it does in the PSM format.

In the addendum of the paper, they suggest using PSM format over SPM format, or using SPM format with token healing. PSM seems more sensible to me, at least initially.

So, about those four tokens. Here they are.
https://github.com/facebookresearch/codellama/blob/cb51c14ec761370ba2e2bc351374a79265d0465e/llama/tokenizer.py#L28-L31
Their values according to the tokenizer are:

self.prefix_id = 32007
self.middle_id = 32009
self.suffix_id = 32008
self.eot_id = 32010
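
For what it's worth, here's one way to carry those IDs around in C for now (just a sketch; hard-coding them is a stopgap, and ideally they'd be resolved from the model's vocab at load time instead, which is what the piece_to_id question below is about):

/* Sketch only: hard-coded CodeLlama FIM sentinel IDs, matching the values above.
 * Ideally these would be looked up from the vocab rather than baked in. */
enum codellama_fim_token {
    FIM_PRE_ID = 32007, /* ▁<PRE> */
    FIM_SUF_ID = 32008, /* ▁<SUF> */
    FIM_MID_ID = 32009, /* ▁<MID> */
    FIM_EOT_ID = 32010  /* ▁<EOT> */
};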

With this, it should be possible to stitch together FIM functionality from the project's existing capabilities. I'm working on it; a PR is probably forthcoming.

My questions are:

  • Are there any considerations for what the FIM API should look like?
    • How do you select the middle? Do you include <MID> and such in the query string, or just a pointer into a list of tokens or characters?
  • What is the llama.cpp equivalent of sp_model.piece_to_id("▁<MID>")?
ggerganov (Owner) commented:

Are there any considerations for what the FIM API should look like?
How do you select the middle? Do you include <MID> and such in the query string, or just a pointer into a list of tokens or characters?

Can you give a sample API, so I can get a better idea of how FIM is supposed to work?

What is the llama.cpp equivalent of sp_model.piece_to_id("▁")?

We don't have one atm. We can add it, or I guess you can use llama_tokenize().

Btw, make sure to take into account the changes from #2810 - will merge this soon.

they suggest using PSM format over SPM

What is the difference between these?

apaz-cli (Contributor, Author) commented Aug 26, 2023

What is the difference between these?

PSM is Prefix-Suffix-Middle, and SPM is Suffix-Prefix-Middle.

In Fill In Middle (Section 3 of https://arxiv.org/abs/2207.14255), the idea is that you slice up the input into what comes before where you want to fill and what comes after. The stuff before is the Prefix, the stuff after is the Suffix. Then you tokenize them separately, and concatenate the tokens together with four extra tokens, like so.

[Screenshot of the FIM transformation from the paper: the document is split into (prefix, middle, suffix) and rearranged as <PRE> ∘ Enc(prefix) ∘ <SUF> ∘ Enc(suffix) ∘ <MID> ∘ Enc(middle) <EOT>, where ∘ denotes concatenation and Enc is the tokenizer.]

The difference is just the order of the prefix and suffix. It works both ways, and in fact codellama is trained on both and should be able to accept both.

At inference time, you have the model predict what comes after the <MID> token, and it decides when to stop generating by producing the <EOT> token.
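
To make the two orderings concrete, here's a rough sketch of assembling a query from already-tokenized prefix/suffix buffers, using the sentinel IDs from above (the SPM layout follows my reading of the reference tokenizer's suffix-first format, so treat that part as an assumption):

#include <string.h>

typedef int token_t;

/* Sketch: build a FIM query from pre-tokenized prefix/suffix.
 * PSM: <PRE> prefix <SUF> suffix <MID>   -> model generates the middle
 * SPM: <PRE> <SUF> suffix <MID> prefix   -> model generates the middle
 * `out` must have room for n_prefix + n_suffix + 3 tokens. */
static size_t build_fim_query(token_t *out,
                              const token_t *prefix, size_t n_prefix,
                              const token_t *suffix, size_t n_suffix,
                              int suffix_first) {
    size_t n = 0;
    out[n++] = 32007;                                    /* <PRE> */
    if (suffix_first) {                                  /* SPM */
        out[n++] = 32008;                                /* <SUF> */
        memcpy(out + n, suffix, n_suffix * sizeof *out); n += n_suffix;
        out[n++] = 32009;                                /* <MID> */
        memcpy(out + n, prefix, n_prefix * sizeof *out); n += n_prefix;
    } else {                                             /* PSM */
        memcpy(out + n, prefix, n_prefix * sizeof *out); n += n_prefix;
        out[n++] = 32008;                                /* <SUF> */
        memcpy(out + n, suffix, n_suffix * sizeof *out); n += n_suffix;
        out[n++] = 32009;                                /* <MID> */
    }
    return n;
}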

Can you give a sample API, so I can get a better idea of how FIM is supposed to work?

Sure. This is pseudocode, but should get the point across.

#define token_t int
#define prefix_token 32007
#define middle_token 32009
#define suffix_token 32008
#define eot_token 32010

// Pseudocode: tokenize(), concatenate(), and llama_next_token() are placeholders
// for whatever the real llama.cpp calls end up being.
token_t* llama_fim(llama_context* ctx, const char* prefix, const char* suffix,
                   size_t max_tokens_generated, size_t* n_generated) {
  // Tokenize the two halves separately, then build the PSM query:
  //   <PRE> prefix <SUF> suffix <MID>
  token_t* prefix_tokens = tokenize(prefix);
  token_t* suffix_tokens = tokenize(suffix);
  token_t* full_query = concatenate(prefix_token, prefix_tokens,
                                    suffix_token, suffix_tokens, middle_token);

  token_t* middle_tokens = (token_t*)malloc(sizeof(token_t) * max_tokens_generated);

  // Generate the middle until the model emits <EOT> or we hit the cap.
  token_t pred_token;
  size_t tokens_generated = 0;
  while ((pred_token = llama_next_token(ctx, full_query, middle_tokens, tokens_generated)) != eot_token) {
    middle_tokens[tokens_generated++] = pred_token;
    if (tokens_generated == max_tokens_generated) break;
  }

  *n_generated = tokens_generated;
  return middle_tokens;
}
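
And calling it would look something like this (still pseudocode, same hypothetical helpers; the example strings are just illustrative):

/* Hypothetical caller of the llama_fim() sketch above. */
const char *prefix = "def add(a, b):\n    ";
const char *suffix = "\n\nprint(add(1, 2))\n";

size_t n_mid = 0;
token_t *middle = llama_fim(ctx, prefix, suffix, /*max_tokens_generated=*/256, &n_mid);

/* Detokenize middle[0..n_mid), splice it between prefix and suffix,
 * then free the buffer. */
free(middle);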

My question is basically that there are a lot of ways to write this API. The codellama repo example splits on <FILL>, then tokenizes both sides. You could imagine having users just include the FIM tokens themselves and letting the tokenizer take over the work; that's how the huggingface inference API works. I prefer the two-strings approach, but I'm curious what your thoughts are.

ggerganov (Owner) commented Aug 27, 2023

Thank you for the clear summary - very useful.

I guess llama_fim cannot be part of the C-style API in llama.h. The llama.cpp library offers an interface for computing the logits of a single new token (see llama_eval). Continuous generation of long segments has to be implemented in user code, utilizing llama_eval and optionally any built-in or 3rd-party sampling functions. The main reason is the sampling intricacies: there are many ways to sample tokens during generation, and putting that whole process behind the API would be difficult and limiting.

I think ideally it should be a separate example, or we could try to fit it into main.
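
Roughly, the loop in such an example would look like this (just a sketch, assuming the current llama_eval-style interface and plain greedy sampling to keep it short; a real example would plug in the proper sampling calls):

#include "llama.h"

/* Sketch of the user-side loop: evaluate the FIM prompt once, then sample
 * and feed back one token at a time until <EOT>. Greedy argmax sampling is
 * used only to keep the example small. */
static int generate_infill(struct llama_context * ctx,
                           const llama_token * fim_prompt, int n_prompt,
                           llama_token eot_id, int n_max, int n_threads) {
    // Evaluate the whole PSM/SPM prompt in one go.
    if (llama_eval(ctx, fim_prompt, n_prompt, 0, n_threads) != 0) {
        return -1;
    }

    int n_past = n_prompt;
    for (int i = 0; i < n_max; ++i) {
        // Pick the highest-logit token from the last evaluation.
        const float * logits  = llama_get_logits(ctx);
        const int     n_vocab = llama_n_vocab(ctx);

        llama_token best = 0;
        for (llama_token t = 1; t < n_vocab; ++t) {
            if (logits[t] > logits[best]) best = t;
        }

        if (best == eot_id) break;   /* model finished the middle */

        /* ... detokenize/emit `best` here ... */

        // Feed the sampled token back in for the next step.
        if (llama_eval(ctx, &best, 1, n_past, n_threads) != 0) {
            return -1;
        }
        n_past += 1;
    }
    return 0;
}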

Back to a previous question, after some more thought:

What is the llama.cpp equivalent of sp_model.piece_to_id("▁")?

My initial suggestion to use llama_tokenize() was incorrect. I will add a corresponding call to the API.

BlackGlory commented:

I'm new to LLMs and I recently built a copilot extension using CodeLLaMA and llama.cpp.

I got good code completion results with a prompt like this:

const prompt = `<PRE> ${prefix} <SUF>${suffix} <MID>`

What is the difference between the FIM and the above prompt?

Kushalamummigatti commented:

I got good code completion results with a prompt like this:
const prompt = `<PRE> ${prefix} <SUF>${suffix} <MID>`

Can you please share the script, idea, or steps you took to achieve this prompt format?

github-actions bot added the stale label Mar 25, 2024

github-actions bot commented Apr 9, 2024

This issue was closed because it has been inactive for 14 days since being marked as stale.

github-actions bot closed this as completed Apr 9, 2024