
Commit f7b9bf1

Merge branch 'master' of github.com:ggerganov/llama.cpp into grammar-example

* 'master' of github.com:ggerganov/llama.cpp:
  py : change version of numpy requirement to 1.24.4 (ggml-org#3515)
  quantize : fail fast on write errors (ggml-org#3521)
  metal : support default.metallib load & reuse code for swift package (ggml-org#3522)
  llm : support Adept Persimmon 8B (ggml-org#3410)
  Fix for ggml-org#3454 (ggml-org#3455)
  readme : update models, cuda + ppl instructions (ggml-org#3510)
  server : docs fix default values and add n_probs (ggml-org#3506)
2 parents 8369908 + c47066d commit f7b9bf1

10 files changed: +915, -136 lines

.gitignore (+1)

@@ -10,6 +10,7 @@
 *.gcno
 *.gcda
 *.dot
+*.metallib
 .DS_Store
 .build/
 .cache/

Package.swift (+6, -2)

@@ -10,15 +10,18 @@ let platforms: [SupportedPlatform]? = [
     .tvOS(.v14)
 ]
 let exclude: [String] = []
-let additionalSources: [String] = ["ggml-metal.m", "ggml-metal.metal"]
+let resources: [Resource] = [
+    .process("ggml-metal.metal")
+]
+let additionalSources: [String] = ["ggml-metal.m"]
 let additionalSettings: [CSetting] = [
     .unsafeFlags(["-fno-objc-arc"]),
-    .define("GGML_SWIFT"),
     .define("GGML_USE_METAL")
 ]
 #else
 let platforms: [SupportedPlatform]? = nil
 let exclude: [String] = ["ggml-metal.metal"]
+let resources: [Resource] = []
 let additionalSources: [String] = []
 let additionalSettings: [CSetting] = []
 #endif

@@ -40,6 +43,7 @@ let package = Package(
            "ggml-alloc.c",
            "k_quants.c",
        ] + additionalSources,
+       resources: resources,
        publicHeadersPath: "spm-headers",
        cSettings: [
            .unsafeFlags(["-Wno-shorten-64-to-32"]),
README.md (+14, -13)

@@ -95,6 +95,7 @@ as the main playground for developing new features for the [ggml](https://github.com/ggerganov/ggml) library
 - [X] [Aquila-7B](https://huggingface.co/BAAI/Aquila-7B) / [AquilaChat-7B](https://huggingface.co/BAAI/AquilaChat-7B)
 - [X] [Starcoder models](https://github.com/ggerganov/llama.cpp/pull/3187)
 - [X] [Mistral AI v0.1](https://huggingface.co/mistralai/Mistral-7B-v0.1)
+- [X] [Refact](https://huggingface.co/smallcloudai/Refact-1_6B-fim)

 **Bindings:**

@@ -377,7 +378,7 @@ Building the program with BLAS support may lead to some performance improvements

 - #### cuBLAS

-  This provides BLAS acceleration using the CUDA cores of your Nvidia GPU. Make sure to have the CUDA toolkit installed. You can download it from your Linux distro's package manager or from here: [CUDA Toolkit](https://developer.nvidia.com/cuda-downloads).
+  This provides BLAS acceleration using the CUDA cores of your Nvidia GPU. Make sure to have the CUDA toolkit installed. You can download it from your Linux distro's package manager (e.g. `apt install nvidia-cuda-toolkit`) or from here: [CUDA Toolkit](https://developer.nvidia.com/cuda-downloads).
   - Using `make`:
     ```bash
     make LLAMA_CUBLAS=1
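
The hunk above only shows the `make` route to cuBLAS. For CMake users, the analogous toggle is the `LLAMA_CUBLAS` option; the commands below are a hedged sketch (the option name mirrors the Makefile variable and is not quoted from this diff):

```bash
# Hedged sketch: out-of-tree CMake build with cuBLAS enabled
mkdir build
cd build
cmake .. -DLLAMA_CUBLAS=ON
cmake --build . --config Release
```
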
@@ -613,6 +614,18 @@ For more information, see [https://huggingface.co/docs/transformers/perplexity](https://huggingface.co/docs/transformers/perplexity)
 The perplexity measurements in table above are done against the `wikitext2` test dataset (https://paperswithcode.com/dataset/wikitext-2), with context length of 512.
 The time per token is measured on a MacBook M1 Pro 32GB RAM using 4 and 8 threads.

+#### How to run
+
+1. Download/extract: https://s3.amazonaws.com/research.metamind.io/wikitext/wikitext-2-raw-v1.zip?ref=salesforce-research
+2. Run `./perplexity -m models/7B/ggml-model-q4_0.gguf -f wiki.test.raw`
+3. Output:
+```
+perplexity : calculating perplexity over 655 chunks
+24.43 seconds per pass - ETA 4.45 hours
+[1]4.5970,[2]5.1807,[3]6.0382,...
+```
+And after 4.45 hours, you will have the final perplexity.
+
 ### Interactive mode

 If you want a more ChatGPT-like experience, you can run in interactive mode by passing `-i` as a parameter.
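
Step 1 of the relocated "How to run" section leaves the download and extraction commands implicit. A minimal sketch, assuming `wget` and `unzip` are installed and that the archive unpacks into `wikitext-2-raw/`:

```bash
# Fetch and unpack the wikitext-2 raw test set used for the perplexity table
wget "https://s3.amazonaws.com/research.metamind.io/wikitext/wikitext-2-raw-v1.zip?ref=salesforce-research" -O wikitext-2-raw-v1.zip
unzip wikitext-2-raw-v1.zip
# wiki.test.raw is the file passed to ./perplexity in step 2
ls wikitext-2-raw/wiki.test.raw
```
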
@@ -775,18 +788,6 @@ If your issue is with model generation quality, then please at least scan the following links and papers to understand the limitations of LLaMA models
 - [Aligning language models to follow instructions](https://openai.com/research/instruction-following)
 - [Training language models to follow instructions with human feedback](https://arxiv.org/abs/2203.02155)

-#### How to run
-
-1. Download/extract: https://s3.amazonaws.com/research.metamind.io/wikitext/wikitext-2-raw-v1.zip?ref=salesforce-research
-2. Run `./perplexity -m models/7B/ggml-model-q4_0.gguf -f wiki.test.raw`
-3. Output:
-```
-perplexity : calculating perplexity over 655 chunks
-24.43 seconds per pass - ETA 4.45 hours
-[1]4.5970,[2]5.1807,[3]6.0382,...
-```
-And after 4.45 hours, you will have the final perplexity.
-
 ### Android

 #### Building the Project using Android NDK

convert-persimmon-to-gguf.py (+130, new file)

@@ -0,0 +1,130 @@
+import torch
+import os
+from pprint import pprint
+import sys
+import argparse
+from pathlib import Path
+from sentencepiece import SentencePieceProcessor
+if 'NO_LOCAL_GGUF' not in os.environ:
+    sys.path.insert(1, str(Path(__file__).parent / 'gguf-py' / 'gguf'))
+import gguf
+
+def _flatten_dict(dct, tensors, prefix=None):
+    assert isinstance(dct, dict)
+    for key in dct.keys():
+        new_prefix = prefix + '.' + key if prefix is not None else key
+        if isinstance(dct[key], torch.Tensor):
+            tensors[new_prefix] = dct[key]
+        elif isinstance(dct[key], dict):
+            _flatten_dict(dct[key], tensors, new_prefix)
+        else:
+            raise ValueError(type(dct[key]))
+    return None
+
+def _get_sentencepiece_tokenizer_info(dir_model: Path):
+    tokenizer_path = dir_model / 'adept_vocab.model'
+    print('gguf: getting sentencepiece tokenizer from', tokenizer_path)
+    tokenizer = SentencePieceProcessor(str(tokenizer_path))
+    print('gguf: adding tokens')
+    tokens: list[bytes] = []
+    scores: list[float] = []
+    toktypes: list[int] = []
+
+    for i in range(tokenizer.vocab_size()):
+        text: bytes
+        score: float
+
+        piece = tokenizer.id_to_piece(i)
+        text = piece.encode("utf-8")
+        score = tokenizer.get_score(i)
+
+        toktype = 1
+        if tokenizer.is_unknown(i):
+            toktype = 2
+        if tokenizer.is_control(i):
+            toktype = 3
+        if tokenizer.is_unused(i):
+            toktype = 5
+        if tokenizer.is_byte(i):
+            toktype = 6
+
+        tokens.append(text)
+        scores.append(score)
+        toktypes.append(toktype)
+        pass
+    return tokens, scores, toktypes
+
+def main():
+    parser = argparse.ArgumentParser(description="Convert a Persimmon model from Adept (e.g. Persimmon 8b chat) to a GGML compatible file")
+    parser.add_argument("--outfile", type=Path, help="path to write to; default: based on input")
+    parser.add_argument("--ckpt-path", type=Path, help="path to persimmon checkpoint .pt file")
+    parser.add_argument("--model-dir", type=Path, help="directory containing model e.g. 8b_chat_model_release")
+    parser.add_argument("--adept-inference-dir", type=str, help="path to adept-inference code directory")
+    args = parser.parse_args()
+    sys.path.append(str(args.adept_inference_dir))
+    persimmon_model = torch.load(args.ckpt_path)
+    hparams = persimmon_model['args']
+    pprint(hparams)
+    tensors = {}
+    _flatten_dict(persimmon_model['model'], tensors, None)
+
+    arch = gguf.MODEL_ARCH.PERSIMMON
+    gguf_writer = gguf.GGUFWriter(args.outfile, gguf.MODEL_ARCH_NAMES[arch])
+
+    block_count = hparams.num_layers
+    head_count = hparams.num_attention_heads
+    head_count_kv = head_count
+    ctx_length = hparams.seq_length
+    hidden_size = hparams.hidden_size
+
+    gguf_writer.add_name('persimmon-8b-chat')
+    gguf_writer.add_context_length(ctx_length)
+    gguf_writer.add_embedding_length(hidden_size)
+    gguf_writer.add_block_count(block_count)
+    gguf_writer.add_feed_forward_length(hparams.ffn_hidden_size)
+    gguf_writer.add_rope_dimension_count(hidden_size // head_count)
+    gguf_writer.add_head_count(head_count)
+    gguf_writer.add_head_count_kv(head_count_kv)
+    gguf_writer.add_rope_freq_base(hparams.rotary_emb_base)
+    gguf_writer.add_layer_norm_eps(hparams.layernorm_epsilon)
+
+    tokens, scores, toktypes = _get_sentencepiece_tokenizer_info(args.model_dir)
+    gguf_writer.add_tokenizer_model('llama')
+    gguf_writer.add_token_list(tokens)
+    gguf_writer.add_token_scores(scores)
+    gguf_writer.add_token_types(toktypes)
+    gguf_writer.add_bos_token_id(71013)
+    gguf_writer.add_eos_token_id(71013)
+
+    tensor_map = gguf.get_tensor_name_map(arch, block_count)
+    print(tensor_map)
+    for name in tensors.keys():
+        data = tensors[name]
+        if name.endswith(".self_attention.rotary_emb.inv_freq"):
+            continue
+        old_dtype = data.dtype
+        # TODO: FP16 conversion produces garbage outputs. (Q8_0 does not, so..?)
+        data = data.to(torch.float32).squeeze().numpy()
+        new_name = tensor_map.get_name(name, try_suffixes = (".weight", ".bias"))
+        if new_name is None:
+            print("Can not map tensor '" + name + "'")
+            sys.exit()
+        n_dims = len(data.shape)
+        print(new_name + ", n_dims = " + str(n_dims) + ", " + str(old_dtype) + " --> " + str(data.dtype))
+        gguf_writer.add_tensor(new_name, data)
+    print("gguf: write header")
+    gguf_writer.write_header_to_file()
+    print("gguf: write metadata")
+    gguf_writer.write_kv_data_to_file()
+    print("gguf: write tensors")
+    gguf_writer.write_tensors_to_file()
+
+    gguf_writer.close()
+
+    print(f"gguf: model successfully exported to '{args.outfile}'")
+    print("")
+
+
+
+if __name__ == '__main__':
+    main()
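
For orientation, a hypothetical invocation of the new converter built from its argparse flags; every path below is a placeholder, since the commit itself does not prescribe a directory layout:

```bash
# Placeholder paths -- point these at the unpacked Adept release and code drop
python3 convert-persimmon-to-gguf.py \
    --ckpt-path /path/to/persimmon/checkpoint.pt \
    --model-dir /path/to/8b_chat_model_release \
    --adept-inference-dir /path/to/adept-inference \
    --outfile models/persimmon-8b-chat.gguf
```
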

examples/server/README.md (+4, -2)

@@ -114,9 +114,9 @@ node index.js

     `top_k`: Limit the next token selection to the K most probable tokens (default: 40).

-    `top_p`: Limit the next token selection to a subset of tokens with a cumulative probability above a threshold P (default: 0.9).
+    `top_p`: Limit the next token selection to a subset of tokens with a cumulative probability above a threshold P (default: 0.95).

-    `n_predict`: Set the number of tokens to predict when generating text. **Note:** May exceed the set limit slightly if the last token is a partial multibyte character. When 0, no tokens will be generated but the prompt is evaluated into the cache. (default: 128, -1 = infinity).
+    `n_predict`: Set the number of tokens to predict when generating text. **Note:** May exceed the set limit slightly if the last token is a partial multibyte character. When 0, no tokens will be generated but the prompt is evaluated into the cache. (default: -1, -1 = infinity).

     `n_keep`: Specify the number of tokens from the initial prompt to retain when the model resets its internal context.
     By default, this value is set to 0 (meaning no tokens are kept). Use `-1` to retain all tokens from the initial prompt.

@@ -156,6 +156,8 @@ node index.js

     `logit_bias`: Modify the likelihood of a token appearing in the generated text completion. For example, use `"logit_bias": [[15043,1.0]]` to increase the likelihood of the token 'Hello', or `"logit_bias": [[15043,-1.0]]` to decrease its likelihood. Setting the value to false, `"logit_bias": [[15043,false]]` ensures that the token `Hello` is never produced (default: []).

+    `n_probs`: If greater than 0, the response also contains the probabilities of top N tokens for each generated token (default: 0)
+
 - **POST** `/tokenize`: Tokenize a given text.

     *Options:*
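
To see how the corrected defaults and the new `n_probs` field combine, here is a hedged example request against a locally running server; the endpoint and field names follow the documentation in the hunks above, while host, port, and prompt text are made-up placeholders:

```bash
# Request 128 tokens, keep the documented top_p default of 0.95,
# and ask for the top-2 token probabilities per generated token
curl --request POST \
    --url http://localhost:8080/completion \
    --header "Content-Type: application/json" \
    --data '{
        "prompt": "Building a website can be done in 10 simple steps:",
        "n_predict": 128,
        "top_p": 0.95,
        "n_probs": 2
    }'
```
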
