Commit 384fc72

Merge branch 'master' of github.com:ggerganov/llama.cpp
* 'master' of github.com:ggerganov/llama.cpp:
  fix embeddings when using CUDA (ggml-org#3657)
  llama : avoid fprintf in favor of LLAMA_LOG (ggml-org#3538)
  readme : update hot-topics & models, detail windows release in usage (ggml-org#3615)
  CLBlast: Fix temporary buffer size for f16 conversion (wsize)
  train-text-from-scratch : fix assert failure in ggml-alloc (ggml-org#3618)
  editorconfig : remove trailing spaces
  server : documentation of JSON return value of /completion endpoint (ggml-org#3632)
  save-load-state : fix example + add ci test (ggml-org#3655)
  readme : add Aquila2 links (ggml-org#3610)
  tokenizer : special token handling (ggml-org#3538)
  k-quants : fix quantization ranges (ggml-org#3646)
  llava : fix tokenization to not add bos between image embeddings and user prompt (ggml-org#3645)
  MPT : support GQA for replit-code-v1.5 (ggml-org#3627)
  Honor -ngl option for Cuda offloading in llava (ggml-org#3621)
2 parents: 9faa285 + cb33f43

File tree

17 files changed: +500 -151 lines


README.md

+20 -6

@@ -10,7 +10,7 @@
 Inference of [LLaMA](https://arxiv.org/abs/2302.13971) model in pure C/C++

 ### Hot topics
+- ‼️ BPE tokenizer update: existing Falcon and Starcoder `.gguf` models will need to be reconverted: [#3252](https://github.com/ggerganov/llama.cpp/pull/3252)
 - ‼️ Breaking change: `rope_freq_base` and `rope_freq_scale` must be set to zero to use the model default values: [#3401](https://github.com/ggerganov/llama.cpp/pull/3401)
 - Parallel decoding + continuous batching support added: [#3228](https://github.com/ggerganov/llama.cpp/pull/3228) \
   **Devs should become familiar with the new API**

@@ -89,15 +89,17 @@ as the main playground for developing new features for the [ggml](https://github
 - [X] [Vicuna](https://github.com/ggerganov/llama.cpp/discussions/643#discussioncomment-5533894)
 - [X] [Koala](https://bair.berkeley.edu/blog/2023/04/03/koala/)
 - [X] [OpenBuddy 🐶 (Multilingual)](https://github.com/OpenBuddy/OpenBuddy)
-- [X] [Pygmalion 7B / Metharme 7B](#using-pygmalion-7b--metharme-7b)
+- [X] [Pygmalion/Metharme](#using-pygmalion-7b--metharme-7b)
 - [X] [WizardLM](https://github.com/nlpxucan/WizardLM)
-- [X] [Baichuan-7B](https://huggingface.co/baichuan-inc/baichuan-7B) and its derivations (such as [baichuan-7b-sft](https://huggingface.co/hiyouga/baichuan-7b-sft))
-- [X] [Aquila-7B](https://huggingface.co/BAAI/Aquila-7B) / [AquilaChat-7B](https://huggingface.co/BAAI/AquilaChat-7B)
+- [X] [Baichuan 1 & 2](https://huggingface.co/models?search=baichuan-inc/Baichuan) + [derivations](https://huggingface.co/hiyouga/baichuan-7b-sft)
+- [X] [Aquila 1 & 2](https://huggingface.co/models?search=BAAI/Aquila)
 - [X] [Starcoder models](https://github.com/ggerganov/llama.cpp/pull/3187)
 - [X] [Mistral AI v0.1](https://huggingface.co/mistralai/Mistral-7B-v0.1)
 - [X] [Refact](https://huggingface.co/smallcloudai/Refact-1_6B-fim)
-- [X] [Bloom](https://github.com/ggerganov/llama.cpp/pull/3553)
+- [X] [Persimmon 8B](https://github.com/ggerganov/llama.cpp/pull/3410)
 - [X] [MPT](https://github.com/ggerganov/llama.cpp/pull/3417)
+- [X] [Bloom](https://github.com/ggerganov/llama.cpp/pull/3553)
+

 **Bindings:**

@@ -206,7 +208,7 @@ https://user-images.githubusercontent.com/1991296/224442907-7693d4be-acaa-4e01-8

 ## Usage

-Here are the steps for the LLaMA-7B model.
+Here are the end-to-end binary build and model conversion steps for the LLaMA-7B model.

 ### Get the Code

@@ -573,6 +575,18 @@ python3 convert.py models/7B/

 When running the larger models, make sure you have enough disk space to store all the intermediate files.

+### Running on Windows with prebuilt binaries
+
+You will find prebuilt Windows binaries on the release page.
+
+Simply download and extract the latest zip package of choice (e.g. `llama-b1380-bin-win-avx2-x64.zip`).
+
+From the unzipped folder, open a terminal/cmd window, place a pre-converted `.gguf` model file next to the binaries, and test out the main example like so:
+
+```
+.\main -m llama-2-7b.Q4_0.gguf -n 128
+```
+
 ### Memory/Disk Requirements

 As the models are currently fully loaded into memory, you will need adequate disk space to save them and sufficient RAM to load them. At the moment, memory and disk requirements are the same.

ci/run.sh

+6

@@ -208,6 +208,8 @@ function gg_run_open_llama_3b_v2 {
     (time ./bin/perplexity --model ${model_q5_k} -f ${wiki_test_60} -c 128 -b 128 --chunks 2 ) 2>&1 | tee -a $OUT/${ci}-tg-q5_k.log
     (time ./bin/perplexity --model ${model_q6_k} -f ${wiki_test_60} -c 128 -b 128 --chunks 2 ) 2>&1 | tee -a $OUT/${ci}-tg-q6_k.log

+    (time ./bin/save-load-state --model ${model_q4_0} ) 2>&1 | tee -a $OUT/${ci}-save-load-state.log
+
     function check_ppl {
         qnt="$1"
         ppl=$(echo "$2" | grep -oE "[0-9]+\.[0-9]+" | tail -n 1)

@@ -296,6 +298,7 @@ function gg_sum_open_llama_3b_v2 {
     gg_printf '- q4_k:\n```\n%s\n```\n' "$(cat $OUT/${ci}-tg-q4_k.log)"
     gg_printf '- q5_k:\n```\n%s\n```\n' "$(cat $OUT/${ci}-tg-q5_k.log)"
     gg_printf '- q6_k:\n```\n%s\n```\n' "$(cat $OUT/${ci}-tg-q6_k.log)"
+    gg_printf '- save-load-state: \n```\n%s\n```\n' "$(cat $OUT/${ci}-save-load-state.log)"
     gg_printf '- shakespeare (f16):\n```\n%s\n```\n' "$(cat $OUT/${ci}-ppl-shakespeare-f16.log)"
     gg_printf '- shakespeare (f16 lora):\n```\n%s\n```\n' "$(cat $OUT/${ci}-ppl-shakespeare-lora-f16.log)"
     gg_printf '- shakespeare (q8_0):\n```\n%s\n```\n' "$(cat $OUT/${ci}-ppl-shakespeare-q8_0.log)"

@@ -382,6 +385,8 @@ function gg_run_open_llama_7b_v2 {
     (time ./bin/perplexity --model ${model_q5_k} -f ${wiki_test} -t 1 -ngl 999 -c 2048 -b 512 --chunks 4 ) 2>&1 | tee -a $OUT/${ci}-tg-q5_k.log
     (time ./bin/perplexity --model ${model_q6_k} -f ${wiki_test} -t 1 -ngl 999 -c 2048 -b 512 --chunks 4 ) 2>&1 | tee -a $OUT/${ci}-tg-q6_k.log

+    (time ./bin/save-load-state --model ${model_q4_0} ) 2>&1 | tee -a $OUT/${ci}-save-load-state.log
+
     function check_ppl {
         qnt="$1"
         ppl=$(echo "$2" | grep -oE "[0-9]+\.[0-9]+" | tail -n 1)

@@ -470,6 +475,7 @@ function gg_sum_open_llama_7b_v2 {
     gg_printf '- q4_k:\n```\n%s\n```\n' "$(cat $OUT/${ci}-tg-q4_k.log)"
     gg_printf '- q5_k:\n```\n%s\n```\n' "$(cat $OUT/${ci}-tg-q5_k.log)"
     gg_printf '- q6_k:\n```\n%s\n```\n' "$(cat $OUT/${ci}-tg-q6_k.log)"
+    gg_printf '- save-load-state: \n```\n%s\n```\n' "$(cat $OUT/${ci}-save-load-state.log)"
     gg_printf '- shakespeare (f16):\n```\n%s\n```\n' "$(cat $OUT/${ci}-ppl-shakespeare-f16.log)"
     gg_printf '- shakespeare (f16 lora):\n```\n%s\n```\n' "$(cat $OUT/${ci}-ppl-shakespeare-lora-f16.log)"
     #gg_printf '- shakespeare (q8_0):\n```\n%s\n```\n' "$(cat $OUT/${ci}-ppl-shakespeare-q8_0.log)"

common/common.cpp

+7 -5

@@ -879,21 +879,23 @@ std::tuple<struct llama_model *, struct llama_context *> llama_init_from_gpt_par
 std::vector<llama_token> llama_tokenize(
         const struct llama_context * ctx,
         const std::string & text,
-        bool add_bos) {
-    return llama_tokenize(llama_get_model(ctx), text, add_bos);
+        bool add_bos,
+        bool special) {
+    return llama_tokenize(llama_get_model(ctx), text, add_bos, special);
 }

 std::vector<llama_token> llama_tokenize(
         const struct llama_model * model,
         const std::string & text,
-        bool add_bos) {
+        bool add_bos,
+        bool special) {
     // upper limit for the number of tokens
     int n_tokens = text.length() + add_bos;
     std::vector<llama_token> result(n_tokens);
-    n_tokens = llama_tokenize(model, text.data(), text.length(), result.data(), result.size(), add_bos);
+    n_tokens = llama_tokenize(model, text.data(), text.length(), result.data(), result.size(), add_bos, special);
     if (n_tokens < 0) {
         result.resize(-n_tokens);
-        int check = llama_tokenize(model, text.data(), text.length(), result.data(), result.size(), add_bos);
+        int check = llama_tokenize(model, text.data(), text.length(), result.data(), result.size(), add_bos, special);
         GGML_ASSERT(check == -n_tokens);
     } else {
         result.resize(n_tokens);

common/common.h

+4 -2

@@ -137,12 +137,14 @@ struct llama_context_params llama_context_params_from_gpt_params(const gpt_param
 std::vector<llama_token> llama_tokenize(
         const struct llama_context * ctx,
         const std::string & text,
-        bool add_bos);
+        bool add_bos,
+        bool special = false);

 std::vector<llama_token> llama_tokenize(
         const struct llama_model * model,
         const std::string & text,
-        bool add_bos);
+        bool add_bos,
+        bool special = false);

 // tokenizes a token into a piece
 // should work similar to Python's `tokenizer.id_to_piece`
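
For orientation, here is a minimal usage sketch of the extended helper (not part of the commit; the prompt string and function name are invented for illustration). With `special = true`, special tokens spelled out in the text, e.g. an EOS marker such as `</s>`, are parsed into their token ids; with the default `special = false` they are treated as plain text:

```cpp
#include "common.h"
#include "llama.h"

#include <cstdio>
#include <string>
#include <vector>

// Sketch only: assumes `ctx` is an already-created llama_context whose
// vocabulary defines "</s>" as a special (EOS) token.
static void tokenize_demo(llama_context * ctx) {
    const std::string text = "Hello</s>";

    // default (special = false): "</s>" is split into ordinary text pieces
    const std::vector<llama_token> plain  = ::llama_tokenize(ctx, text, /*add_bos*/ true);

    // special = true: "</s>" is matched and mapped to its single special token id
    const std::vector<llama_token> parsed = ::llama_tokenize(ctx, text, /*add_bos*/ true, /*special*/ true);

    printf("plain: %zu tokens, with special handling: %zu tokens\n", plain.size(), parsed.size());
}
```

The call sites updated further down follow this split: fixed prompt templates, input prefixes/suffixes and reverse prompts are tokenized with `special = true`, while free-form interactive input keeps `special = false`.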

common/train.cpp

+4 -4

@@ -863,7 +863,7 @@ size_t tokenize_file(
             (int) buf.size(),
             out_tokens.data(),
             (int) out_tokens.size(),
-            false);
+            false, false);
         if (n_tokens < 0) {
             out_tokens.resize(-n_tokens);
             n_tokens = llama_tokenize(

@@ -872,7 +872,7 @@ size_t tokenize_file(
                 (int) buf.size(),
                 out_tokens.data(),
                 (int) out_tokens.size(),
-                false);
+                false, false);
         }
         if (n_tokens >= 0) {
             out_tokens.resize(n_tokens);

@@ -966,15 +966,15 @@ size_t tokenize_file(
                 (int) buf_sample.size(),
                 tok_sample.data(),
                 (int) tok_sample.size(),
-                false);
+                false, false);
             if (n_tokens < 0) {
                 tok_sample.resize(-n_tokens);
                 n_tokens = llama_tokenize(llama_get_model(lctx),
                     buf_sample.data(),
                     (int) buf_sample.size(),
                     tok_sample.data(),
                     (int) tok_sample.size(),
-                    false);
+                    false, false);
                 GGML_ASSERT(n_tokens >= 0);
             }
             GGML_ASSERT(n_tokens <= (int) tok_sample.size());

convert-mpt-hf-to-gguf.py

+2

@@ -98,6 +98,8 @@ def parse_args() -> argparse.Namespace:
 gguf_writer.add_block_count(block_count)
 gguf_writer.add_feed_forward_length(4 * hparams["d_model"])
 gguf_writer.add_head_count(hparams["n_heads"])
+if kv_n_heads := hparams["attn_config"].get("kv_n_heads"):
+    gguf_writer.add_head_count_kv(kv_n_heads)
 gguf_writer.add_layer_norm_eps(1e-05)
 if hparams["attn_config"]["clip_qkv"] is not None:
     gguf_writer.add_clamp_kqv(hparams["attn_config"]["clip_qkv"])

examples/batched.swift/Sources/main.swift

+1 -1

@@ -209,7 +209,7 @@ llama_print_timings(context)
 private func tokenize(text: String, add_bos: Bool) -> [llama_token] {
     let n_tokens = text.count + (add_bos ? 1 : 0)
     let tokens = UnsafeMutablePointer<llama_token>.allocate(capacity: n_tokens)
-    let tokenCount = llama_tokenize(model, text, Int32(text.count), tokens, Int32(n_tokens), add_bos)
+    let tokenCount = llama_tokenize(model, text, Int32(text.count), tokens, Int32(n_tokens), add_bos, /*special tokens*/ false)
     var swiftTokens: [llama_token] = []
     for i in 0 ..< tokenCount {
         swiftTokens.append(tokens[Int(i)])

examples/llava/llava-utils.h

+2 -2

@@ -49,9 +49,9 @@ inline bool eval_id(struct llama_context * ctx_llama, int id, int * n_past) {
     return eval_tokens(ctx_llama, tokens, 1, n_past);
 }

-inline bool eval_string(struct llama_context * ctx_llama, const char* str, int n_batch, int * n_past){
+inline bool eval_string(struct llama_context * ctx_llama, const char* str, int n_batch, int * n_past, bool add_bos){
     std::string str2 = str;
-    std::vector<llama_token> embd_inp = ::llama_tokenize(ctx_llama, str2, true);
+    std::vector<llama_token> embd_inp = ::llama_tokenize(ctx_llama, str2, add_bos);
     eval_tokens(ctx_llama, embd_inp, n_batch, n_past);
     return true;
 }

examples/llava/llava.cpp

+14 -6

@@ -79,7 +79,13 @@ int main(int argc, char ** argv) {

     llama_backend_init(params.numa);

-    llama_model_params model_params = llama_model_default_params();
+    llama_model_params model_params = llama_model_default_params();
+    model_params.n_gpu_layers = params.n_gpu_layers;
+    model_params.main_gpu = params.main_gpu;
+    model_params.tensor_split = params.tensor_split;
+    model_params.use_mmap = params.use_mmap;
+    model_params.use_mlock = params.use_mlock;
+
     llama_model * model = llama_load_model_from_file(params.model.c_str(), model_params);
     if (model == NULL) {
         fprintf(stderr , "%s: error: unable to load model\n" , __func__);

@@ -91,6 +97,7 @@ int main(int argc, char ** argv) {
     ctx_params.n_ctx = params.n_ctx < 2048 ? 2048 : params.n_ctx; // we need a longer context size to process image embeddings
     ctx_params.n_threads = params.n_threads;
     ctx_params.n_threads_batch = params.n_threads_batch == -1 ? params.n_threads : params.n_threads_batch;
+    ctx_params.seed = params.seed;

     llama_context * ctx_llama = llama_new_context_with_model(model, ctx_params);

@@ -100,7 +107,8 @@ int main(int argc, char ** argv) {
     }

     // make sure that the correct mmproj was used, i.e., compare apples to apples
-    int n_llama_embd = llama_n_embd(llama_get_model(ctx_llama));
+    const int n_llama_embd = llama_n_embd(llama_get_model(ctx_llama));
+
     if (n_img_embd != n_llama_embd) {
         printf("%s: embedding dim of the multimodal projector (%d) is not equal to that of LLaMA (%d). Make sure that you use the correct mmproj file.\n", __func__, n_img_embd, n_llama_embd);

@@ -119,14 +127,14 @@ int main(int argc, char ** argv) {

     const int max_tgt_len = params.n_predict < 0 ? 256 : params.n_predict;

-    // GG: are we sure that the should be a trailing whitespace at the end of this string?
-    eval_string(ctx_llama, "A chat between a curious human and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the human's questions.\nUSER: ", params.n_batch, &n_past);
+    eval_string(ctx_llama, "A chat between a curious human and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the human's questions.\nUSER:", params.n_batch, &n_past, true);
     eval_image_embd(ctx_llama, image_embd, n_img_pos, params.n_batch, &n_past);
-    eval_string(ctx_llama, params.prompt.c_str(), params.n_batch, &n_past);
-    eval_string(ctx_llama, "\nASSISTANT:", params.n_batch, &n_past);
+    eval_string(ctx_llama, (params.prompt + "\nASSISTANT:").c_str(), params.n_batch, &n_past, false);

     // generate the response

+    printf("\n");
+    printf("prompt: '%s'\n", params.prompt.c_str());
     printf("\n");

     for (int i = 0; i < max_tgt_len; i++) {
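
Condensed into one place, here is a sketch of the offload forwarding added above (assuming the `gpt_params` fields from `common/common.h`; the helper name is invented, and the clip/mmproj setup and error handling are omitted). This forwarding is what makes `-ngl` take effect for the LLaMA part of LLaVA:

```cpp
#include "common.h"
#include "llama.h"

// Sketch only: mirrors the change above. With llama_model_default_params()
// alone, n_gpu_layers keeps its default of 0, so every layer runs on the CPU
// regardless of the -ngl value parsed into gpt_params.
static llama_model * load_llava_language_model(const gpt_params & params) {
    llama_model_params model_params = llama_model_default_params();

    model_params.n_gpu_layers = params.n_gpu_layers; // from -ngl / --n-gpu-layers
    model_params.main_gpu     = params.main_gpu;
    model_params.tensor_split = params.tensor_split;
    model_params.use_mmap     = params.use_mmap;
    model_params.use_mlock    = params.use_mlock;

    return llama_load_model_from_file(params.model.c_str(), model_params);
}
```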

examples/main/main.cpp

+30 -10

@@ -238,7 +238,7 @@ int main(int argc, char ** argv) {

     if (params.interactive_first || params.instruct || !params.prompt.empty() || session_tokens.empty()) {
         LOG("tokenize the prompt\n");
-        embd_inp = ::llama_tokenize(ctx, params.prompt, add_bos);
+        embd_inp = ::llama_tokenize(ctx, params.prompt, add_bos, true);
     } else {
         LOG("use session tokens\n");
         embd_inp = session_tokens;

@@ -260,10 +260,10 @@ int main(int argc, char ** argv) {
     if (ctx_guidance) {
         LOG("cfg_negative_prompt: \"%s\"\n", log_tostr(sparams.cfg_negative_prompt));

-        guidance_inp = ::llama_tokenize(ctx_guidance, sparams.cfg_negative_prompt, add_bos);
+        guidance_inp = ::llama_tokenize(ctx_guidance, sparams.cfg_negative_prompt, add_bos, true);
         LOG("guidance_inp tokenized: %s\n", LOG_TOKENS_TOSTR_PRETTY(ctx_guidance, guidance_inp));

-        std::vector<llama_token> original_inp = ::llama_tokenize(ctx, params.prompt, add_bos);
+        std::vector<llama_token> original_inp = ::llama_tokenize(ctx, params.prompt, add_bos, true);
         LOG("original_inp tokenized: %s\n", LOG_TOKENS_TOSTR_PRETTY(ctx, original_inp));

         original_prompt_len = original_inp.size();

@@ -320,8 +320,8 @@ int main(int argc, char ** argv) {
     }

     // prefix & suffix for instruct mode
-    const auto inp_pfx = ::llama_tokenize(ctx, "\n\n### Instruction:\n\n", add_bos);
-    const auto inp_sfx = ::llama_tokenize(ctx, "\n\n### Response:\n\n", false);
+    const auto inp_pfx = ::llama_tokenize(ctx, "\n\n### Instruction:\n\n", add_bos, true);
+    const auto inp_sfx = ::llama_tokenize(ctx, "\n\n### Response:\n\n", false, true);

     LOG("inp_pfx: %s\n", LOG_TOKENS_TOSTR_PRETTY(ctx, inp_pfx));
     LOG("inp_sfx: %s\n", LOG_TOKENS_TOSTR_PRETTY(ctx, inp_sfx));

@@ -383,6 +383,12 @@ int main(int argc, char ** argv) {
     if (!params.antiprompt.empty()) {
         for (const auto & antiprompt : params.antiprompt) {
             LOG_TEE("Reverse prompt: '%s'\n", antiprompt.c_str());
+            if (params.verbose_prompt) {
+                auto tmp = ::llama_tokenize(ctx, antiprompt, false, true);
+                for (int i = 0; i < (int) tmp.size(); i++) {
+                    LOG_TEE("%6d -> '%s'\n", tmp[i], llama_token_to_piece(ctx, tmp[i]).c_str());
+                }
+            }
         }
     }

@@ -392,10 +398,22 @@ int main(int argc, char ** argv) {

     if (!params.input_prefix.empty()) {
         LOG_TEE("Input prefix: '%s'\n", params.input_prefix.c_str());
+        if (params.verbose_prompt) {
+            auto tmp = ::llama_tokenize(ctx, params.input_prefix, true, true);
+            for (int i = 0; i < (int) tmp.size(); i++) {
+                LOG_TEE("%6d -> '%s'\n", tmp[i], llama_token_to_piece(ctx, tmp[i]).c_str());
+            }
+        }
     }

     if (!params.input_suffix.empty()) {
         LOG_TEE("Input suffix: '%s'\n", params.input_suffix.c_str());
+        if (params.verbose_prompt) {
+            auto tmp = ::llama_tokenize(ctx, params.input_suffix, false, true);
+            for (int i = 0; i < (int) tmp.size(); i++) {
+                LOG_TEE("%6d -> '%s'\n", tmp[i], llama_token_to_piece(ctx, tmp[i]).c_str());
+            }
+        }
     }
 }
 LOG_TEE("sampling: repeat_last_n = %d, repeat_penalty = %f, presence_penalty = %f, frequency_penalty = %f, top_k = %d, tfs_z = %f, top_p = %f, typical_p = %f, temp = %f, mirostat = %d, mirostat_lr = %f, mirostat_ent = %f\n",

@@ -717,7 +735,7 @@ int main(int argc, char ** argv) {
         if (params.interactive) {
             if (!params.antiprompt.empty()) {
                 // tokenize and inject first reverse prompt
-                const auto first_antiprompt = ::llama_tokenize(ctx, params.antiprompt.front(), false);
+                const auto first_antiprompt = ::llama_tokenize(ctx, params.antiprompt.front(), false, true);
                 embd_inp.insert(embd_inp.end(), first_antiprompt.begin(), first_antiprompt.end());
                 is_antiprompt = true;
             }

@@ -744,8 +762,7 @@ int main(int argc, char ** argv) {
             std::string buffer;
             if (!params.input_prefix.empty()) {
                 LOG("appending input prefix: '%s'\n", params.input_prefix.c_str());
-                buffer += params.input_prefix;
-                printf("%s", buffer.c_str());
+                printf("%s", params.input_prefix.c_str());
             }

             // color user input only

@@ -767,7 +784,6 @@ int main(int argc, char ** argv) {
             // append input suffix if any
             if (!params.input_suffix.empty()) {
                 LOG("appending input suffix: '%s'\n", params.input_suffix.c_str());
-                buffer += params.input_suffix;
                 printf("%s", params.input_suffix.c_str());
             }

@@ -782,10 +798,14 @@ int main(int argc, char ** argv) {
                 embd_inp.insert(embd_inp.end(), inp_pfx.begin(), inp_pfx.end());
             }

-            const auto line_inp = ::llama_tokenize(ctx, buffer, false);
+            const auto line_pfx = ::llama_tokenize(ctx, params.input_prefix, false, true);
+            const auto line_inp = ::llama_tokenize(ctx, buffer, false, false);
+            const auto line_sfx = ::llama_tokenize(ctx, params.input_suffix, false, true);
             LOG("input tokens: %s\n", LOG_TOKENS_TOSTR_PRETTY(ctx, line_inp));

+            embd_inp.insert(embd_inp.end(), line_pfx.begin(), line_pfx.end());
             embd_inp.insert(embd_inp.end(), line_inp.begin(), line_inp.end());
+            embd_inp.insert(embd_inp.end(), line_sfx.begin(), line_sfx.end());

             // instruct mode: insert response suffix
             if (params.instruct) {

0 commit comments

Comments
 (0)