
Enhancement: Codellama FIM tokenization #2818

Closed · apaz-cli opened this issue Aug 26, 2023 · 6 comments

apaz-cli (Contributor) commented Aug 26, 2023

I assume that the project will want to support Fill In Middle (FIM) tokenization to work with the codellama models. How will this be accomplished?

Reading the codellama paper (https://arxiv.org/abs/2308.12950), here's what they say about FIM:

We extend Llama 2’s tokenizer with four special tokens that mark the beginning of the prefix,
the middle part or the suffix, and the end of the infilling span. To limit the distribution shift
between autoregressive and infilling training, we suppress the implicit leading space that
SentencePiece tokenizers add upon encoding the middle part and the suffix (Kudo & Richardson, 2018).
In SPM format, we concatenate the prefix and the middle part before encoding to tokens.
Note that our model doesn’t encounter split subtokens in the SPM format while it does in the PSM format.

In the addendum of the paper, they suggest using PSM format over SPM format, or using SPM format with token healing. PSM seems more sensible to me, at least initially.

So, about those four tokens. Here they are.
https://github.com/facebookresearch/codellama/blob/cb51c14ec761370ba2e2bc351374a79265d0465e/llama/tokenizer.py#L28-L31
Their values according to the tokenizer are:

self.prefix_id = 32007
self.middle_id = 32009
self.suffix_id = 32008
self.eot_id = 32010
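
For what it's worth, here's one way to carry those IDs around in C for now (just a sketch; hard-coding them is a stopgap, and ideally they'd be resolved from the model's vocab at load time instead, which is what the piece_to_id question below is about):

/* Sketch only: hard-coded CodeLlama FIM sentinel IDs, matching the values above.
 * Ideally these would be looked up from the vocab rather than baked in. */
enum codellama_fim_token {
    FIM_PRE_ID = 32007, /* ▁<PRE> */
    FIM_SUF_ID = 32008, /* ▁<SUF> */
    FIM_MID_ID = 32009, /* ▁<MID> */
    FIM_EOT_ID = 32010  /* ▁<EOT> */
};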

With this, it should be possible to stitch together FIM functionality from the project's existing capabilities. I'm working on it; a PR is probably forthcoming.

My questions are:

  • Are there any considerations for what the FIM API should look like?
    • How do you select the middle? Do you include <MID> and such in the query string, or just a pointer into a list of tokens or characters?
  • What is the llama.cpp equivalent of sp_model.piece_to_id("▁<MID>")?
ggerganov (Owner) commented:

Are there any considerations for what the FIM API should look like?
How do you select the middle? Do you include <MID> and such in the query string, or just a pointer into a list of tokens or characters?

Can you give a sample API, so I can get a better idea of how FIM is supposed to work?

What is the llama.cpp equivalent of sp_model.piece_to_id("▁")?

We don't have one atm. We can add it, or I guess you can use llama_tokenize().

Btw, make sure to take into account the changes from #2810 - will merge this soon.

they suggest using PSM format over SPM

What is the difference between these?

apaz-cli (Contributor, Author) commented Aug 26, 2023

What is the difference between these?

PSM is Prefix-Suffix-Middle, and SPM is Suffix-Prefix-Middle.

In Fill In Middle (Section 3 of https://arxiv.org/abs/2207.14255), the idea is that you slice up the input into what comes before where you want to fill and what comes after. The stuff before is the Prefix, the stuff after is the Suffix. Then you tokenize them separately, and concatenate the tokens together with four extra tokens, like so.

[Screenshot of the FIM transformation from the paper: the document is split into (prefix, middle, suffix) and rearranged as <PRE> ∘ Enc(prefix) ∘ <SUF> ∘ Enc(suffix) ∘ <MID> ∘ Enc(middle) <EOT>, where ∘ denotes concatenation and Enc is the tokenizer.]

The difference is just the order of the prefix and suffix. It works both ways, and in fact codellama is trained on both and should be able to accept both.

At inference time, you have the model predict what comes after the <MID> token, and it decides when to stop generating by producing the <EOT> token.
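
To make the two orderings concrete, here's a rough sketch of assembling a query from already-tokenized prefix/suffix buffers, using the sentinel IDs from above (the SPM layout follows my reading of the reference tokenizer's suffix-first format, so treat that part as an assumption):

#include <string.h>

typedef int token_t;

/* Sketch: build a FIM query from pre-tokenized prefix/suffix.
 * PSM: <PRE> prefix <SUF> suffix <MID>   -> model generates the middle
 * SPM: <PRE> <SUF> suffix <MID> prefix   -> model generates the middle
 * `out` must have room for n_prefix + n_suffix + 3 tokens. */
static size_t build_fim_query(token_t *out,
                              const token_t *prefix, size_t n_prefix,
                              const token_t *suffix, size_t n_suffix,
                              int suffix_first) {
    size_t n = 0;
    out[n++] = 32007;                                    /* <PRE> */
    if (suffix_first) {                                  /* SPM */
        out[n++] = 32008;                                /* <SUF> */
        memcpy(out + n, suffix, n_suffix * sizeof *out); n += n_suffix;
        out[n++] = 32009;                                /* <MID> */
        memcpy(out + n, prefix, n_prefix * sizeof *out); n += n_prefix;
    } else {                                             /* PSM */
        memcpy(out + n, prefix, n_prefix * sizeof *out); n += n_prefix;
        out[n++] = 32008;                                /* <SUF> */
        memcpy(out + n, suffix, n_suffix * sizeof *out); n += n_suffix;
        out[n++] = 32009;                                /* <MID> */
    }
    return n;
}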

Can you give a sample API, so I can get a better idea of how FIM is supposed to work?

Sure. This is pseudocode, but should get the point across.

#define token_t int
#define prefix_token 32007
#define middle_token 32009
#define suffix_token 32008
#define eot_token 32010

// Pseudocode: tokenize(), concatenate(), and llama_next_token() are placeholders
// for whatever the real llama.cpp calls end up being.
token_t* llama_fim(llama_context* ctx, const char* prefix, const char* suffix,
                   size_t max_tokens_generated, size_t* n_generated) {
  // Tokenize the two halves separately, then build the PSM query:
  //   <PRE> prefix <SUF> suffix <MID>
  token_t* prefix_tokens = tokenize(prefix);
  token_t* suffix_tokens = tokenize(suffix);
  token_t* full_query = concatenate(prefix_token, prefix_tokens,
                                    suffix_token, suffix_tokens, middle_token);

  token_t* middle_tokens = (token_t*)malloc(sizeof(token_t) * max_tokens_generated);

  // Generate the middle until the model emits <EOT> or we hit the cap.
  token_t pred_token;
  size_t tokens_generated = 0;
  while ((pred_token = llama_next_token(ctx, full_query, middle_tokens, tokens_generated)) != eot_token) {
    middle_tokens[tokens_generated++] = pred_token;
    if (tokens_generated == max_tokens_generated) break;
  }

  *n_generated = tokens_generated;
  return middle_tokens;
}
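
And calling it would look something like this (still pseudocode, same hypothetical helpers; the example strings are just illustrative):

/* Hypothetical caller of the llama_fim() sketch above. */
const char *prefix = "def add(a, b):\n    ";
const char *suffix = "\n\nprint(add(1, 2))\n";

size_t n_mid = 0;
token_t *middle = llama_fim(ctx, prefix, suffix, /*max_tokens_generated=*/256, &n_mid);

/* Detokenize middle[0..n_mid), splice it between prefix and suffix,
 * then free the buffer. */
free(middle);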

My question is basically that there are a lot of ways to write this API. The codellama repo example splits on <FILL>, then tokenizes both sides. You could imagine having users just include the FIM tokens themselves and letting the tokenizer take over the work; that's how the huggingface inference API works. I prefer the two-strings approach, but I'm curious what your thoughts are.

ggerganov (Owner) commented Aug 27, 2023

Thank you for the clear summary - very useful.

I guess llama_fim cannot be part of the C-style API in llama.h. The llama.cpp library offers an interface for computing the logits of a single new token (see llama_eval). Continuous generation of long segments has to be implemented in user code, utilizing llama_eval and optionally any built-in or 3rd-party sampling functions. The main reason is the sampling intricacies: there are many ways to sample tokens during generation, and putting that whole process behind the API would be difficult and limiting.

I think ideally it should be a separate example, or we could try to fit it into main.
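
Roughly, the loop in such an example would look like this (just a sketch, assuming the current llama_eval-style interface and plain greedy sampling to keep it short; a real example would plug in the proper sampling calls):

#include "llama.h"

/* Sketch of the user-side loop: evaluate the FIM prompt once, then sample
 * and feed back one token at a time until <EOT>. Greedy argmax sampling is
 * used only to keep the example small. */
static int generate_infill(struct llama_context * ctx,
                           const llama_token * fim_prompt, int n_prompt,
                           llama_token eot_id, int n_max, int n_threads) {
    // Evaluate the whole PSM/SPM prompt in one go.
    if (llama_eval(ctx, fim_prompt, n_prompt, 0, n_threads) != 0) {
        return -1;
    }

    int n_past = n_prompt;
    for (int i = 0; i < n_max; ++i) {
        // Pick the highest-logit token from the last evaluation.
        const float * logits  = llama_get_logits(ctx);
        const int     n_vocab = llama_n_vocab(ctx);

        llama_token best = 0;
        for (llama_token t = 1; t < n_vocab; ++t) {
            if (logits[t] > logits[best]) best = t;
        }

        if (best == eot_id) break;   /* model finished the middle */

        /* ... detokenize/emit `best` here ... */

        // Feed the sampled token back in for the next step.
        if (llama_eval(ctx, &best, 1, n_past, n_threads) != 0) {
            return -1;
        }
        n_past += 1;
    }
    return 0;
}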

Back to a previous question, after some more thought:

What is the llama.cpp equivalent of sp_model.piece_to_id("▁")?

My initial suggestion to use llama_tokenize() was incorrect. I will add a corresponding call to the API.

BlackGlory commented:

I'm new to LLMs and I recently built a copilot extension using CodeLLaMA and llama.cpp.

I got good code completion results with a prompt like this:

const prompt = `<PRE> ${prefix} <SUF>${suffix} <MID>`

What is the difference between the FIM and the above prompt?

Kushalamummigatti commented:

I got good code completion results with a prompt like this:
const prompt = `<PRE> ${prefix} <SUF>${suffix} <MID>`

Can you please share the script, idea, or steps you took to achieve this prompt format?

github-actions bot added the stale label Mar 25, 2024

github-actions bot commented Apr 9, 2024

This issue was closed because it has been inactive for 14 days since being marked as stale.

github-actions bot closed this as completed Apr 9, 2024