[WIP] Add Fill-In-Middle example #2934
Closed
Changes from all commits (11 commits):
- 21757ee apaz-cli: Added FIM token IDs.
- 93753a8 apaz-cli: Added FIM example.
- 90afd6d apaz-cli: Merge branch 'master' of https://github.com/ggerganov/llama.cpp into …
- 828a43d apaz-cli: Added makefile, better error messages
- 16841ac apaz-cli: Resolved merge conflicts.
- 1e85f6b apaz-cli: Updated gitignore for new example.
- 142d79b apaz-cli: Merge branch 'master' of https://github.com/ggerganov/llama.cpp into …
- 314c29c apaz-cli: Debugging crash.
- 82dcadd apaz-cli: Added -fsanitize=address to the makefile.
- ca588a3 apaz-cli: Added FIM readme.
- 2636a8b apaz-cli: Added string debugging, removed bos token from end, added mlock.
CMakeLists.txt (new file)
@@ -0,0 +1,5 @@
set(TARGET FIM)
add_executable(${TARGET} FIM.c)
install(TARGETS ${TARGET} RUNTIME)
target_link_libraries(${TARGET} PRIVATE llama ${CMAKE_THREAD_LIBS_INIT})
target_compile_features(${TARGET} PRIVATE c_std_11)
FIM.c (new file)
@@ -0,0 +1,192 @@
#include <stdlib.h>
#include <string.h>
#include <stdio.h>
#include <stdbool.h> // for bool
#include <float.h>   // for FLT_MAX (argmax initialization)
#include "../../llama.h"

/*
    The FIM (Fill-In-Middle) objective is useful for generating text conditioned on a prefix and a suffix.
    For a quick summary of what's going on here, see issue #2818.
*/


static inline struct llama_context*
codellama_create_fim_context(const char* model_path, const char** error_message) {
    struct llama_context_params params = llama_context_default_params();
    params.use_mlock = 1;
    struct llama_model* model = llama_load_model_from_file(model_path, params);
    if (!model) {
        *error_message = "Failed to load model.";
        return NULL;
    }

    struct llama_context* context = llama_new_context_with_model(model, params);
    if (!context) {
        *error_message = "Failed to create context.";
        llama_free_model(model);
        return NULL;
    }

    return context;
}

static inline char*
codellama_fill_in_middle(struct llama_context* ctx, const char* prefix, const char* suffix, size_t n_max_tokens, int n_threads, bool spm, const char** error_message) {

    int num_tokens;
    size_t combined_len = strlen(prefix) + strlen(suffix) + 3;
    size_t initial_size = sizeof(llama_token) * combined_len;
    llama_token* tokens_end = (llama_token*)malloc(initial_size);
    llama_token* tokens = tokens_end;
    if (!tokens) {
        *error_message = "Failed to allocate memory for tokens.";
        return NULL;
    }

    // Append first part of the prompt: <PRE> + prefix (PSM) or <SUF> + suffix (SPM).
    *tokens_end++ = spm ? llama_token_suffix(ctx) : llama_token_prefix(ctx);
    num_tokens = llama_tokenize(ctx, spm ? suffix : prefix, tokens_end,
                                (int)(combined_len - (size_t)(tokens_end - tokens)), 0);
    if (num_tokens < 0) {
        *error_message = "Failed to tokenize the prompt.";
        free(tokens);
        return NULL;
    }
    tokens_end += num_tokens;

    // Append second part of the prompt: <SUF> + suffix (PSM) or <PRE> + prefix (SPM).
    *tokens_end++ = spm ? llama_token_prefix(ctx) : llama_token_suffix(ctx);
    num_tokens = llama_tokenize(ctx, spm ? prefix : suffix, tokens_end,
                                (int)(combined_len - (size_t)(tokens_end - tokens)), 0);
    if (num_tokens < 0) {
        *error_message = "Failed to tokenize the prompt.";
        free(tokens);
        return NULL;
    }
    tokens_end += num_tokens;

    // Append the middle token; generation continues from here.
    *tokens_end++ = llama_token_middle(ctx);

    // Grow to accommodate the prompt and the max amount of generated tokens
    size_t prompt_len = (size_t)(tokens_end - tokens);
    size_t min_len = (prompt_len + n_max_tokens);
    if (min_len > combined_len) {
        llama_token* new_tokens = (llama_token*)realloc(tokens, sizeof(llama_token) * min_len);
        if (!new_tokens) {
            *error_message = "Failed to allocate memory for tokens.";
            free(tokens);
            return NULL;
        }
        tokens = new_tokens;
    }

    // Evaluate the LM on the prompt.
    if (llama_eval(ctx, tokens, prompt_len, 0, n_threads)) {
        *error_message = "Failed to evaluate the prompt.";
        free(tokens);
        return NULL;
    }

    // Generate tokens until n_max_tokens or the <EOT> token is generated.
    // The logits for the last position of the prompt are already available
    // from the llama_eval() call above.
    llama_token* generated_tokens = tokens + prompt_len;
    size_t num_generated_tokens = 0;
    int vocab_size = llama_n_vocab(ctx);
    for (size_t i = 0; i < n_max_tokens; i++) {
        // From the logits of the last evaluated position, select the most likely token (greedy).
        float* logits = llama_get_logits(ctx);
        float highest_logit = -FLT_MAX;
        llama_token likeliest_token = -1;
        for (llama_token token_id = 0; token_id < vocab_size; token_id++) {
            if (logits[token_id] > highest_logit) {
                highest_logit = logits[token_id];
                likeliest_token = token_id;
            }
        }

        // Don't add the token if it's <EOT>; generation is finished.
        if (likeliest_token == llama_token_eot(ctx)) {
            break;
        }

        // Append the token, so it's there for subsequent evaluations.
        generated_tokens[num_generated_tokens++] = likeliest_token;

        // Translate the token to a string (progress/debug output).
        char cs[20] = {0};
        int token_length = llama_token_to_piece(ctx, likeliest_token, cs, sizeof(cs) - 1);
        if (token_length > 0) {
            cs[token_length] = '\0';
            printf("%s\n", cs);
        }

        // Evaluate the LM on the newly generated token so the next iteration
        // sees logits conditioned on it. The past length is the prompt plus
        // everything generated before this token.
        if (llama_eval(ctx, &generated_tokens[num_generated_tokens - 1], 1,
                       (int)(prompt_len + num_generated_tokens - 1), n_threads)) {
            *error_message = "Failed to evaluate the generated token.";
            free(tokens);
            return NULL;
        }
    }

    // Allocate memory for the final result
    size_t result_length = 0;
    size_t result_capacity = 4096;
    char* result = (char*)malloc(sizeof(char) * result_capacity);
    if (!result) {
        *error_message = "Failed to allocate memory for result.";
        free(tokens);
        return NULL;
    }

    // Translate tokens to string, growing the allocation if it's too small.
    for (size_t i = 0; i < num_generated_tokens; i++) {
        int appended = llama_token_to_piece(ctx, generated_tokens[i], result + result_length,
                                            (int)(result_capacity - result_length));
        if (appended < 0) {
            // Not enough room: grow the buffer and retry this token.
            size_t new_capacity = result_capacity * 2;
            char* new_result = (char*)realloc(result, sizeof(char) * new_capacity);
            if (!new_result) {
                *error_message = "Failed to allocate memory for result.";
                free(tokens);
                free(result);
                return NULL;
            }
            result = new_result;
            result_capacity = new_capacity;
            i--; // retry the token with a larger buffer
            continue;
        }

        result_length += appended;
    }

    // NUL-terminate the result, growing by one byte if the buffer is exactly full.
    if (result_length + 1 > result_capacity) {
        char* new_result = (char*)realloc(result, result_length + 1);
        if (!new_result) {
            *error_message = "Failed to allocate memory for result.";
            free(tokens);
            free(result);
            return NULL;
        }
        result = new_result;
    }
    result[result_length] = '\0';

    free(tokens);
    *error_message = NULL;
    return result;
}

int main(int argc, char** argv) {
    if (argc != 6) {
        fprintf(stderr, "Usage: %s <model> <prefix> <suffix> <n_max_tokens> <n_threads>\n", argv[0]);
        return 1;
    }

    char* model = argv[1];
    char* prefix = argv[2];
    char* suffix = argv[3];
    size_t n_max_tokens = atoi(argv[4]) > 0 ? atoi(argv[4]) : 64;
    int n_threads = atoi(argv[5]);
    bool spm = false; // PSM ordering (prefix, then suffix); set to true for SPM (suffix, then prefix).
    const char* error_message = NULL;

    puts("Loading the model. This could take quite a while...");
    struct llama_context* ctx = codellama_create_fim_context(model, &error_message);
    if (error_message) {
        fprintf(stderr, "Error: %s\n", error_message);
        return 1;
    }

    puts("Model loaded. Generating text...");
    char* result = codellama_fill_in_middle(ctx, prefix, suffix, n_max_tokens, n_threads, spm, &error_message);
    if (error_message) {
        fprintf(stderr, "Error: %s\n", error_message);
        return 1;
    }

    puts("Generated text:");
    printf("%s%s%s\n", prefix, result, suffix);

    free(result);
    llama_free(ctx);
    return 0;
}
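A side note on the sampling strategy: the generation loop above picks the next token with a hand-rolled argmax over the raw logits. A minimal sketch of the same greedy choice using llama.cpp's built-in sampler is shown below; it assumes the `llama_token_data_array` / `llama_sample_token_greedy` API that llama.cpp exposed around the time of this PR, and it is not part of the diff.

```c
#include <stdlib.h>
#include "llama.h"

// Hypothetical helper, not part of the PR: greedy next-token selection via
// llama.cpp's sampling API instead of a manual argmax over the logits.
static llama_token pick_next_token_greedy(struct llama_context* ctx) {
    const int n_vocab = llama_n_vocab(ctx);
    float*    logits  = llama_get_logits(ctx);

    // Build a candidate list covering the whole vocabulary.
    llama_token_data* candidates =
        (llama_token_data*)malloc(sizeof(llama_token_data) * n_vocab);
    for (llama_token id = 0; id < n_vocab; id++) {
        candidates[id].id    = id;
        candidates[id].logit = logits[id];
        candidates[id].p     = 0.0f;
    }

    llama_token_data_array candidates_arr = { candidates, (size_t)n_vocab, false };
    llama_token best = llama_sample_token_greedy(ctx, &candidates_arr);

    free(candidates);
    return best;
}
```

Functionally this is the same greedy decision the example already makes; the upside of going through the sampler is that switching to temperature or top-k sampling later only means replacing the final call.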
README.md (new file)
@@ -0,0 +1,35 @@
# Example

The FIM (Fill-In-Middle) objective is useful for generating text conditioned on a prefix and a suffix.
This example is intended for use with codellama, which supports exactly that.

For a quick summary of what's going on here, see issue #2818, and/or read [the FIM paper](https://arxiv.org/abs/2207.14255).

```
Usage: ./fill-in-middle <model> <prefix> <suffix> <n_max_tokens> <n_threads>
```
```sh
./fill-in-middle \
    CodeLlama-34B-GGUF/codellama-34b.Q4_K_S.gguf \
    $'def add(a, b):\n' \
    $'\n' \
    64 \
    4
```

With prefix:
```py
def add(a, b):

```

And a newline as suffix:
```py

```

We can expect it to generate something like:
```py
    return a + b
```
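For reference, the prompt fed to the model is built from codellama's special FIM tokens rather than plain text. Below is a condensed sketch of the token ordering used by `FIM.c` (PSM by default, SPM when `spm` is set); error handling is omitted and the include path may need adjusting, so treat it as an illustration rather than a drop-in function.

```c
#include <stdbool.h>
#include <stddef.h>
#include "llama.h"

// Sketch of the prompt ordering in FIM.c (error handling omitted).
// PSM: <PRE> prefix-tokens <SUF> suffix-tokens <MID>
// SPM: <SUF> suffix-tokens <PRE> prefix-tokens <MID>
// The model then generates the "middle" until it emits <EOT>.
static size_t build_fim_prompt(struct llama_context* ctx,
                               const char* prefix, const char* suffix,
                               llama_token* buf, int buf_size, bool spm) {
    llama_token* p   = buf;
    llama_token* end = buf + buf_size;
    *p++ = spm ? llama_token_suffix(ctx) : llama_token_prefix(ctx);
    p += llama_tokenize(ctx, spm ? suffix : prefix, p, (int)(end - p), 0);
    *p++ = spm ? llama_token_prefix(ctx) : llama_token_suffix(ctx);
    p += llama_tokenize(ctx, spm ? prefix : suffix, p, (int)(end - p), 0);
    *p++ = llama_token_middle(ctx);
    return (size_t)(p - buf);
}
```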
Reviewer:
Curious, does Code Llama 34B have these special tokens?
If it does not, then how would FIM work with it?
PR author:
It does, yeah. These are new, and I think only in codellama. I don't think they're in llama2. To get the token ids themselves, the codellama people run the tokenizer, and these are the values that came out.
https://github.com/facebookresearch/codellama/blob/cb51c14ec761370ba2e2bc351374a79265d0465e/llama/tokenizer.py#L28-L31
It should work. But I've been busy with my day job, and haven't gotten a chance to test it yet. Definitely not going to suggest merging until I'm certain.
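If it helps while testing, a small sketch like the one below (hypothetical, not part of this PR) dumps the FIM special-token IDs and the pieces they map to for a loaded model, using only calls that `FIM.c` already relies on. For codellama the pieces should decode to something like `<PRE>`, `<SUF>`, `<MID>`, and `<EOT>`; a model without FIM support would not map these IDs to anything meaningful.

```c
#include <stdio.h>
#include "llama.h"

// Hypothetical debugging helper, not part of this PR: print the FIM special
// token ids and the text pieces they map to for a loaded model.
static void print_fim_tokens(struct llama_context* ctx) {
    const llama_token ids[4] = {
        llama_token_prefix(ctx), llama_token_suffix(ctx),
        llama_token_middle(ctx), llama_token_eot(ctx),
    };
    const char* names[4] = { "prefix", "suffix", "middle", "eot" };

    for (int i = 0; i < 4; i++) {
        char piece[32] = { 0 };
        int n = llama_token_to_piece(ctx, ids[i], piece, (int)sizeof(piece) - 1);
        printf("%-6s id=%d piece=%s\n", names[i], ids[i], n > 0 ? piece : "(none)");
    }
}
```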