Merge branch 'master' of github.com:ggerganov/llama.cpp
* 'master' of github.com:ggerganov/llama.cpp:
fix embeddings when using CUDA (ggml-org#3657)
llama : avoid fprintf in favor of LLAMA_LOG (ggml-org#3538)
readme : update hot-topics & models, detail windows release in usage (ggml-org#3615)
CLBlast: Fix temporary buffer size for f16 conversion (wsize)
train-text-from-scratch : fix assert failure in ggml-alloc (ggml-org#3618)
editorconfig : remove trailing spaces
server : documentation of JSON return value of /completion endpoint (ggml-org#3632)
save-load-state : fix example + add ci test (ggml-org#3655)
readme : add Aquila2 links (ggml-org#3610)
tokenizer : special token handling (ggml-org#3538)
k-quants : fix quantization ranges (ggml-org#3646)
llava : fix tokenization to not add bos between image embeddings and user prompt (ggml-org#3645)
MPT : support GQA for replit-code-v1.5 (ggml-org#3627)
Honor -ngl option for Cuda offloading in llava (ggml-org#3621)
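One of the changes merged above adds documentation for the JSON returned by the server's `/completion` endpoint. As a rough usage sketch (not part of this diff): it assumes a locally built `server` binary listening on the default `127.0.0.1:8080`, and the model path is illustrative:

```
# start the HTTP server with a local model
./server -m models/llama-2-7b.Q4_0.gguf -c 2048

# in another terminal, request a completion and inspect the JSON reply
curl --request POST \
     --url http://127.0.0.1:8080/completion \
     --header "Content-Type: application/json" \
     --data '{"prompt": "Building a website can be done in 10 simple steps:", "n_predict": 64}'
```

The reply is a JSON object whose `content` field holds the generated text, alongside the generation settings and timing information; the linked PR documents the full set of fields.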
README.md (+20 -6)
@@ -10,7 +10,7 @@
Inference of [LLaMA](https://arxiv.org/abs/2302.13971) model in pure C/C++

### Hot topics
-
+
- ‼️ BPE tokenizer update: existing Falcon and Starcoder `.gguf` models will need to be reconverted: [#3252](https://github.com/ggerganov/llama.cpp/pull/3252)
- ‼️ Breaking change: `rope_freq_base` and `rope_freq_scale` must be set to zero to use the model default values: [#3401](https://github.com/ggerganov/llama.cpp/pull/3401)
- Parallel decoding + continuous batching support added: [#3228](https://github.com/ggerganov/llama.cpp/pull/3228)\
  **Devs should become familiar with the new API**
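The parallel decoding / continuous batching work referenced in the hot topic above also shipped a `parallel` example that exercises the new batching API from the command line. A minimal sketch, assuming the `-np` (number of parallel sequences) and `-cb` (continuous batching) flags available around this revision; the model path is illustrative:

```
# simulate 4 clients decoding against one model with continuous batching
./parallel -m models/llama-2-7b.Q4_0.gguf -np 4 -cb -n 128
```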
@@ -89,15 +89,17 @@ as the main playground for developing new features for the [ggml](https://github
- [X] [Baichuan-7B](https://huggingface.co/baichuan-inc/baichuan-7B) and its derivations (such as [baichuan-7b-sft](https://huggingface.co/hiyouga/baichuan-7b-sft))
When running the larger models, make sure you have enough disk space to store all the intermediate files.

+### Running on Windows with prebuilt binaries
+
+You will find prebuilt Windows binaries on the release page.
+
+Simply download and extract the latest zip package of choice: (e.g. `llama-b1380-bin-win-avx2-x64.zip`)
+
+From the unzipped folder, open a terminal/cmd window here and place a pre-converted `.gguf` model file. Test out the main example like so:
+
+```
+.\main -m llama-2-7b.Q4_0.gguf -n 128
+```
+
### Memory/Disk Requirements

As the models are currently fully loaded into memory, you will need adequate disk space to save them and sufficient RAM to load them. At the moment, memory and disk requirements are the same.
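The `-ngl` option touched by the llava commit in this merge is a general flag shared by the examples. As a sketch, with a GPU-enabled build (e.g. one of the cuBLAS or CLBlast Windows packages rather than the plain AVX2 zip), the same `main` invocation can offload layers to the GPU; the layer count is illustrative and depends on available VRAM:

```
.\main -m llama-2-7b.Q4_0.gguf -n 128 -ngl 32
```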
printf("%s: embedding dim of the multimodal projector (%d) is not equal to that of LLaMA (%d). Make sure that you use the correct mmproj file.\n", __func__, n_img_embd, n_llama_embd);
// GG: are we sure that there should be a trailing whitespace at the end of this string?
-eval_string(ctx_llama, "A chat between a curious human and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the human's questions.\nUSER: ", params.n_batch, &n_past);
+eval_string(ctx_llama, "A chat between a curious human and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the human's questions.\nUSER:", params.n_batch, &n_past, true);
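For context on the two llava-related commits in this merge (tokenization/BOS handling around the image embeddings, and honoring `-ngl` for CUDA offloading), a rough invocation of the llava example looks like the following. Treat the binary name and the `--mmproj`/`--image` flags as assumptions about the example at this revision; the model and image paths are illustrative:

```
# describe an image with a LLaVA model, offloading layers to the GPU
./llava -m models/llava-v1.5-7b/ggml-model-q4_k.gguf \
        --mmproj models/llava-v1.5-7b/mmproj-model-f16.gguf \
        --image ./a-photo.jpg -ngl 32
```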