
Commit 6381d4e

ggerganov, monatis, klosax, goerch, and slaren authored
gguf : new file format with flexible meta data (beta) (#2398)
* gguf : first API pass * gguf : read header + meta data * gguf : read tensor info * gguf : initial model loading - not tested * gguf : add gguf_get_tensor_name() * gguf : do not support passing existing ggml_context to gguf_init * gguf : simplify gguf_get_val * gguf : gguf.c is now part of ggml.c * gguf : read / write sample models * gguf : add comments * refactor : reduce code duplication and better API (#2415) * gguf : expose the gguf_type enum through the API for now * gguf : add array support * gguf.py : some code style changes * convert.py : start a new simplified implementation by removing old stuff * convert.py : remove GGML vocab + other obsolete stuff * GGUF : write tensor (#2426) * WIP: Write tensor * GGUF : Support writing tensors in Python * refactor : rm unused import and upd todos * fix : fix errors upd writing example * rm example.gguf * gitignore *.gguf * undo formatting * gguf : add gguf_find_key (#2438) * gguf.cpp : find key example * ggml.h : add gguf_find_key * ggml.c : add gguf_find_key * gguf : fix writing tensors * gguf : do not hardcode tensor names to read * gguf : write sample tensors to read * gguf : add tokenization constants * quick and dirty conversion example * gguf : fix writing gguf arrays * gguf : write tensors one by one and code reuse * gguf : fix writing gguf arrays * gguf : write tensors one by one * gguf : write tensors one by one * gguf : write tokenizer data * gguf : upd gguf conversion script * Update convert-llama-h5-to-gguf.py * gguf : handle already encoded string * ggml.h : get array str and f32 * ggml.c : get arr str and f32 * gguf.py : support any type * Update convert-llama-h5-to-gguf.py * gguf : fix set is not subscriptable * gguf : update convert-llama-h5-to-gguf.py * constants.py : add layer norm eps * gguf.py : add layer norm eps and merges * ggml.h : increase GGML_MAX_NAME to 64 * ggml.c : add gguf_get_arr_n * Update convert-llama-h5-to-gguf.py * add gptneox gguf example * Makefile : add gptneox gguf example * Update convert-llama-h5-to-gguf.py * add gptneox gguf example * Update convert-llama-h5-to-gguf.py * Update convert-gptneox-h5-to-gguf.py * Update convert-gptneox-h5-to-gguf.py * Update convert-llama-h5-to-gguf.py * gguf : support custom alignment value * gguf : fix typo in function call * gguf : mmap tensor data example * fix : update convert-llama-h5-to-gguf.py * Update convert-llama-h5-to-gguf.py * convert-gptneox-h5-to-gguf.py : Special tokens * gptneox-main.cpp : special tokens * Update gptneox-main.cpp * constants.py : special tokens * gguf.py : accumulate kv and tensor info data + special tokens * convert-gptneox-h5-to-gguf.py : accumulate kv and ti + special tokens * gguf : gguf counterpart of llama-util.h * gguf-util.h : update note * convert-llama-h5-to-gguf.py : accumulate kv / ti + special tokens * convert-llama-h5-to-gguf.py : special tokens * Delete gptneox-common.cpp * Delete gptneox-common.h * convert-gptneox-h5-to-gguf.py : gpt2bpe tokenizer * gptneox-main.cpp : gpt2 bpe tokenizer * gpt2 bpe tokenizer (handles merges and unicode) * Makefile : remove gptneox-common * gguf.py : bytesarray for gpt2bpe tokenizer * cmpnct_gpt2bpe.hpp : comments * gguf.py : use custom alignment if present * gguf : minor stuff * Update gptneox-main.cpp * map tensor names * convert-gptneox-h5-to-gguf.py : map tensor names * convert-llama-h5-to-gguf.py : map tensor names * gptneox-main.cpp : map tensor names * gguf : start implementing libllama in GGUF (WIP) * gguf : start implementing libllama in GGUF (WIP) * rm binary commited by 
mistake * upd .gitignore * gguf : calculate n_mult * gguf : inference with 7B model working (WIP) * gguf : rm deprecated function * gguf : start implementing gguf_file_saver (WIP) * gguf : start implementing gguf_file_saver (WIP) * gguf : start implementing gguf_file_saver (WIP) * gguf : add gguf_get_kv_type * gguf : add gguf_get_kv_type * gguf : write metadata in gguf_file_saver (WIP) * gguf : write metadata in gguf_file_saver (WIP) * gguf : write metadata in gguf_file_saver * gguf : rm references to old file formats * gguf : shorter name for member variable * gguf : rm redundant method * gguf : get rid of n_mult, read n_ff from file * Update gguf_tensor_map.py * Update gptneox-main.cpp * gguf : rm references to old file magics * gguf : start implementing quantization (WIP) * gguf : start implementing quantization (WIP) * gguf : start implementing quantization (WIP) * gguf : start implementing quantization (WIP) * gguf : start implementing quantization (WIP) * gguf : start implementing quantization (WIP) * gguf : quantization is working * gguf : roper closing of file * gguf.py : no need to convert tensors twice * convert-gptneox-h5-to-gguf.py : no need to convert tensors twice * convert-llama-h5-to-gguf.py : no need to convert tensors twice * convert-gptneox-h5-to-gguf.py : simplify nbytes * convert-llama-h5-to-gguf.py : simplify nbytes * gptneox-main.cpp : n_layer --> n_block * constants.py : n_layer --> n_block * gguf.py : n_layer --> n_block * convert-gptneox-h5-to-gguf.py : n_layer --> n_block * convert-llama-h5-to-gguf.py : n_layer --> n_block * gptneox-main.cpp : n_layer --> n_block * Update gguf_tensor_map.py * convert-gptneox-h5-to-gguf.py : load model in parts to save memory * convert-llama-h5-to-gguf.py : load model in parts to save memory * convert : write more metadata for LLaMA * convert : rm quantization version * convert-gptneox-h5-to-gguf.py : add file_type key * gptneox-main.cpp : add file_type key * fix conflicts * gguf : add todos and comments * convert-gptneox-h5-to-gguf.py : tensor name map changes * Create gguf_namemap.py : tensor name map changes * Delete gguf_tensor_map.py * gptneox-main.cpp : tensor name map changes * convert-llama-h5-to-gguf.py : fixes * gguf.py : dont add empty strings * simple : minor style changes * gguf : use UNIX line ending * Create convert-llama-7b-pth-to-gguf.py * llama : sync gguf-llama.cpp with latest llama.cpp (#2608) * llama : sync gguf-llama.cpp with latest llama.cpp * minor : indentation + assert * llama : refactor gguf_buffer and gguf_ctx_buffer * llama : minor * gitignore : add gptneox-main * llama : tokenizer fixes (#2549) * Merge tokenizer fixes into the gguf branch. * Add test vocabularies * convert : update convert-new.py with tokenizer fixes (#2614) * Merge tokenizer fixes into the gguf branch. 
* Add test vocabularies * Adapt convert-new.py (and fix a clang-cl compiler error on windows) * llama : sync gguf-llama with llama (#2613) * llama : sync gguf-llama with llama * tests : fix build + warnings (test-tokenizer-1 still fails) * tests : fix wstring_convert * convert : fix layer names * llama : sync gguf-llama.cpp * convert : update HF converter to new tokenizer voodoo magics * llama : update tokenizer style * convert-llama-h5-to-gguf.py : add token types * constants.py : add token types * gguf.py : add token types * convert-llama-7b-pth-to-gguf.py : add token types * gguf-llama.cpp : fix n_head_kv * convert-llama-h5-to-gguf.py : add 70b gqa support * gguf.py : add tensor data layout * convert-llama-h5-to-gguf.py : add tensor data layout * convert-llama-7b-pth-to-gguf.py : add tensor data layout * gptneox-main.cpp : add tensor data layout * convert-llama-h5-to-gguf.py : clarify the reverse permute * llama : refactor model loading code (#2620) * llama : style formatting + remove helper methods * llama : fix quantization using gguf tool * llama : simplify gguf_file_saver * llama : fix method names * llama : simplify write_header() * llama : no need to pass full file loader to the file saver just gguf_ctx * llama : gguf_file_saver write I32 * llama : refactor tensor names (#2622) * gguf: update tensor names searched in quantization * gguf : define tensor names as constants * gguf : initial write API (not tested yet) * gguf : write to file API (not tested) * gguf : initial write API ready + example * gguf : fix header write * gguf : fixes + simplify example + add ggml_nbytes_pad() * gguf : minor * llama : replace gguf_file_saver with new gguf write API * gguf : streaming support when writing files * gguf : remove oboslete write methods * gguf : remove obosolete gguf_get_arr_xxx API * llama : simplify gguf_file_loader * llama : move hparams and vocab from gguf_file_loader to llama_model_loader * llama : merge gguf-util.h in llama.cpp * llama : reorder definitions in .cpp to match .h * llama : minor simplifications * llama : refactor llama_model_loader (WIP) wip : remove ggml_ctx from llama_model_loader wip : merge gguf_file_loader in llama_model_loader * llama : fix shape prints * llama : fix Windows build + fix norm_rms_eps key * llama : throw error on missing KV paris in model meta data * llama : improve printing + log meta data * llama : switch print order of meta data --------- Co-authored-by: M. 
Yusuf Sarıgöz <yusufsarigoz@gmail.com> * gguf : deduplicate (#2629) * gguf : better type names * dedup : CPU + Metal is working * ggml : fix warnings about unused results * llama.cpp : fix line feed and compiler warning * llama : fix strncpy warning + note token_to_str does not write null * llama : restore the original load/save session implementation Will migrate this to GGUF in the future * convert-llama-h5-to-gguf.py : support alt ctx param name * ggml : assert when using ggml_mul with non-F32 src1 * examples : dedup simple --------- Co-authored-by: klosax <131523366+klosax@users.noreply.github.com> * gguf.py : merge all files in gguf.py * convert-new.py : pick #2427 for HF 70B support * examples/gguf : no need to keep q option for quantization any more * llama.cpp : print actual model size * llama.cpp : use ggml_elements() * convert-new.py : output gguf (#2635) * convert-new.py : output gguf (WIP) * convert-new.py : add gguf key-value pairs * llama : add hparams.ctx_train + no longer print ftype * convert-new.py : minor fixes * convert-new.py : vocab-only option should work now * llama : fix tokenizer to use llama_char_to_byte * tests : add new ggml-vocab-llama.gguf * convert-new.py : tensor name mapping * convert-new.py : add map for skipping tensor serialization * convert-new.py : convert script now works * gguf.py : pick some of the refactoring from #2644 * convert-new.py : minor fixes * convert.py : update to support GGUF output * Revert "ci : disable CI temporary to not waste energy" This reverts commit 7e82d25. * convert.py : n_head_kv optional and .gguf file extension * convert.py : better always have n_head_kv and default it to n_head * llama : sync with recent PRs on master * editorconfig : ignore models folder ggml-ci * ci : update ".bin" to ".gguf" extension ggml-ci * llama : fix llama_model_loader memory leak * gptneox : move as a WIP example * llama : fix lambda capture ggml-ci * ggml : fix bug in gguf_set_kv ggml-ci * common.h : .bin --> .gguf * quantize-stats.cpp : .bin --> .gguf * convert.py : fix HF tensor permuting / unpacking ggml-ci * llama.cpp : typo * llama : throw error if gguf fails to init from file ggml-ci * llama : fix tensor name grepping during quantization ggml-ci * gguf.py : write tensors in a single pass (#2644) * gguf : single pass for writing tensors + refactoring writer * gguf : single pass for writing tensors + refactoring writer * gguf : single pass for writing tensors + refactoring writer * gguf : style fixes in simple conversion script * gguf : refactor gptneox conversion script * gguf : rename h5 to hf (for HuggingFace) * gguf : refactor pth to gguf conversion script * gguf : rm file_type key and method * gguf.py : fix vertical alignment * gguf.py : indentation --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * convert-gptneox-hf-to-gguf.py : fixes * gguf.py : gptneox mapping * convert-llama-hf-to-gguf.py : fixes * convert-llama-7b-pth-to-gguf.py : fixes * ggml.h : reverse GGUF_MAGIC * gguf.py : reverse GGUF_MAGIC * test-tokenizer-0.cpp : fix warning * llama.cpp : print kv general.name * llama.cpp : get special token kv and linefeed token id * llama : print number of tensors per type + print arch + style * tests : update vocab file with new magic * editorconfig : fix whitespaces * llama : re-order functions * llama : remove C++ API + reorganize common source in /common dir * llama : minor API updates * llama : avoid hardcoded special tokens * llama : fix MPI build ggml-ci * llama : introduce enum llama_vocab_type + remove hardcoded 
string constants * convert-falcon-hf-to-gguf.py : falcon HF --> gguf conversion, not tested * falcon-main.cpp : falcon inference example * convert-falcon-hf-to-gguf.py : remove extra kv * convert-gptneox-hf-to-gguf.py : remove extra kv * convert-llama-7b-pth-to-gguf.py : remove extra kv * convert-llama-hf-to-gguf.py : remove extra kv * gguf.py : fix for falcon 40b * falcon-main.cpp : fix for falcon 40b * convert-falcon-hf-to-gguf.py : update ref * convert-falcon-hf-to-gguf.py : add tensor data layout * cmpnct_gpt2bpe.hpp : fixes * falcon-main.cpp : fixes * gptneox-main.cpp : fixes * cmpnct_gpt2bpe.hpp : remove non-general stuff * Update examples/server/README.md Co-authored-by: slaren <slarengh@gmail.com> * cmpnct_gpt2bpe.hpp : cleanup * convert-llama-hf-to-gguf.py : special tokens * convert-llama-7b-pth-to-gguf.py : special tokens * convert-permute-debug.py : permute debug print * convert-permute-debug-master.py : permute debug for master * convert-permute-debug.py : change permute type of attn_q * convert.py : 70b model working (change attn_q permute) * Delete convert-permute-debug-master.py * Delete convert-permute-debug.py * convert-llama-hf-to-gguf.py : fix attn_q permute * gguf.py : fix rope scale kv * convert-llama-hf-to-gguf.py : rope scale and added tokens * convert-llama-7b-pth-to-gguf.py : rope scale and added tokens * llama.cpp : use rope scale kv * convert-llama-7b-pth-to-gguf.py : rope scale fix * convert-llama-hf-to-gguf.py : rope scale fix * py : fix whitespace * gguf : add Python script to convert GGMLv3 LLaMA models to GGUF (#2682) * First pass at converting GGMLv3 LLaMA models to GGUF * Cleanups, better output during conversion * Fix vocab space conversion logic * More vocab conversion fixes * Add description to converted GGUF files * Improve help text, expand warning * Allow specifying name and description for output GGUF * Allow overriding vocab and hyperparams from original model metadata * Use correct params override var name * Fix wrong type size for Q8_K Better handling of original style metadata * Set default value for gguf add_tensor raw_shape KW arg * llama : improve token type support (#2668) * Merge tokenizer fixes into the gguf branch. * Add test vocabularies * Adapt convert-new.py (and fix a clang-cl compiler error on windows) * Improved tokenizer test But does it work on MacOS? * Improve token type support - Added @klosax code to convert.py - Improved token type support in vocabulary * Exclude platform dependent tests * More sentencepiece compatibility by eliminating magic numbers * Restored accidentally removed comment * llama : add API for token type ggml-ci * tests : use new tokenizer type API (#2692) * Merge tokenizer fixes into the gguf branch. * Add test vocabularies * Adapt convert-new.py (and fix a clang-cl compiler error on windows) * Improved tokenizer test But does it work on MacOS? * Improve token type support - Added @klosax code to convert.py - Improved token type support in vocabulary * Exclude platform dependent tests * More sentencepiece compatibility by eliminating magic numbers * Restored accidentally removed comment * Improve commentary * Use token type API in test-tokenizer-1.cpp * py : cosmetics * readme : add notice about new file format ggml-ci --------- Co-authored-by: M. 
Yusuf Sarıgöz <yusufsarigoz@gmail.com> Co-authored-by: klosax <131523366+klosax@users.noreply.github.com> Co-authored-by: goerch <jhr.walter@t-online.de> Co-authored-by: slaren <slarengh@gmail.com> Co-authored-by: Kerfuffle <44031344+KerfuffleV2@users.noreply.github.com>
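For context, the sketch below shows how the new GGUF read API introduced by this commit can be used to inspect the meta data of a `.gguf` file. It is a minimal illustration based on the function names that appear in the commit log (`gguf_init_from_file`, `gguf_find_key`, `gguf_get_tensor_name`, ...) and on the `examples/gguf` program added here; exact signatures may differ between revisions, so treat it as a sketch rather than the authoritative API.

```c
// inspect_gguf.c -- minimal sketch: enumerate the key-value meta data and
// tensor infos of a GGUF file. At this commit the gguf reader lives in ggml.h
// (gguf.c was merged into ggml.c), so a single header is enough.
#include <stdio.h>

#include "ggml.h"

int main(int argc, char ** argv) {
    if (argc < 2) {
        fprintf(stderr, "usage: %s model.gguf\n", argv[0]);
        return 1;
    }

    // only read the meta data - do not allocate or load tensor data
    struct gguf_init_params params = {
        /*.no_alloc =*/ true,
        /*.ctx      =*/ NULL,
    };

    struct gguf_context * ctx = gguf_init_from_file(argv[1], params);
    if (ctx == NULL) {
        fprintf(stderr, "failed to open %s\n", argv[1]);
        return 1;
    }

    // enumerate the key-value pairs
    const int n_kv = gguf_get_n_kv(ctx);
    for (int i = 0; i < n_kv; ++i) {
        printf("kv[%d]: %s\n", i, gguf_get_key(ctx, i));
    }

    // look up a single key by name ("general.name" is one of the keys the converters write)
    const int idx = gguf_find_key(ctx, "general.name");
    if (idx >= 0) {
        printf("general.name = %s\n", gguf_get_val_str(ctx, idx));
    }

    // enumerate the tensor infos
    const int n_tensors = gguf_get_n_tensors(ctx);
    for (int i = 0; i < n_tensors; ++i) {
        printf("tensor[%d]: %s\n", i, gguf_get_tensor_name(ctx, i));
    }

    gguf_free(ctx);
    return 0;
}
```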
1 parent dadbed9 commit 6381d4e


54 files changed (+10090, −3025 lines)

Diff for: .gitignore

+3 −1

@@ -1,6 +1,7 @@
 *.o
 *.a
 *.so
+*.gguf
 *.bin
 .DS_Store
 .build/
@@ -47,6 +48,8 @@ models-mnt
 /server
 /Pipfile
 /embd-input-test
+/gguf
+/gguf-llama-simple
 /libllama.so
 /llama-bench
 build-info.h
@@ -65,7 +68,6 @@ perf-*.txt
 
 examples/jeopardy/results.txt
 
-
 pyproject.toml
 poetry.lock
 poetry.toml

Diff for: CMakeLists.txt

+11 −2

@@ -497,9 +497,11 @@ else()
 endif()
 
 #
-# Build libraries
+# libraries
 #
 
+# ggml
+
 add_library(ggml OBJECT
             ggml.c
             ggml.h
@@ -524,10 +526,11 @@ if (BUILD_SHARED_LIBS)
     install(TARGETS ggml_shared LIBRARY)
 endif()
 
+# llama
+
 add_library(llama
             llama.cpp
             llama.h
-            llama-util.h
             )
 
 target_include_directories(llama PUBLIC .)
@@ -546,6 +549,10 @@ if (BUILD_SHARED_LIBS)
     install(TARGETS llama LIBRARY)
 endif()
 
+#
+# install
+#
+
 include(GNUInstallDirs)
 install(
     FILES convert.py
@@ -584,6 +591,8 @@ endif()
 # programs, examples and tests
 #
 
+add_subdirectory(common)
+
 if (LLAMA_BUILD_TESTS AND NOT CMAKE_JS_VERSION)
     include(CTest)
     add_subdirectory(tests)

Diff for: Makefile

+13 −10

@@ -1,5 +1,5 @@
 # Define the default target now so that it is always the first target
-BUILD_TARGETS = main quantize quantize-stats perplexity embedding vdot train-text-from-scratch convert-llama2c-to-ggml simple server embd-input-test llama-bench
+BUILD_TARGETS = main quantize quantize-stats perplexity embedding vdot train-text-from-scratch convert-llama2c-to-ggml simple server embd-input-test gguf llama-bench
 
 # Binaries only useful for tests
 TEST_TARGETS = tests/test-llama-grammar tests/test-grammar-parser tests/test-double-float tests/test-grad0 tests/test-opt tests/test-quantize-fns tests/test-quantize-perf tests/test-sampling tests/test-tokenizer-0
@@ -45,8 +45,8 @@ OPT = -Ofast
 else
 OPT = -O3
 endif
-CFLAGS   = -I. $(OPT) -std=c11 -fPIC
-CXXFLAGS = -I. -I./examples $(OPT) -std=c++11 -fPIC
+CFLAGS   = -I. $(OPT) -std=c11 -fPIC
+CXXFLAGS = -I. -I./common $(OPT) -std=c++11 -fPIC
 LDFLAGS  =
 
 ifdef LLAMA_DEBUG
@@ -329,23 +329,23 @@ ggml-alloc.o: ggml-alloc.c ggml.h ggml-alloc.h
 
 OBJS += ggml-alloc.o
 
-llama.o: llama.cpp ggml.h ggml-alloc.h ggml-cuda.h ggml-metal.h llama.h llama-util.h
+llama.o: llama.cpp ggml.h ggml-alloc.h ggml-cuda.h ggml-metal.h llama.h
 	$(CXX) $(CXXFLAGS) -c $< -o $@
 
-common.o: examples/common.cpp examples/common.h
+common.o: common/common.cpp common/common.h
 	$(CXX) $(CXXFLAGS) -c $< -o $@
 
-console.o: examples/console.cpp examples/console.h
+console.o: common/console.cpp common/console.h
 	$(CXX) $(CXXFLAGS) -c $< -o $@
 
-grammar-parser.o: examples/grammar-parser.cpp examples/grammar-parser.h
+grammar-parser.o: common/grammar-parser.cpp common/grammar-parser.h
 	$(CXX) $(CXXFLAGS) -c $< -o $@
 
 libllama.so: llama.o ggml.o $(OBJS)
 	$(CXX) $(CXXFLAGS) -shared -fPIC -o $@ $^ $(LDFLAGS)
 
 clean:
-	rm -vf *.o *.so *.dll main quantize quantize-stats perplexity embedding benchmark-matmult save-load-state server simple vdot train-text-from-scratch convert-llama2c-to-ggml embd-input-test llama-bench build-info.h $(TEST_TARGETS)
+	rm -vf *.o *.so *.dll main quantize quantize-stats perplexity embedding benchmark-matmult save-load-state server simple vdot train-text-from-scratch convert-llama2c-to-ggml embd-input-test gguf llama-bench build-info.h $(TEST_TARGETS)
 
 #
 # Examples
@@ -385,7 +385,10 @@ $(LIB_PRE)embdinput$(DSO_EXT): examples/embd-input/embd-input.h examples/embd-in
 embd-input-test: $(LIB_PRE)embdinput$(DSO_EXT) examples/embd-input/embd-input-test.cpp build-info.h ggml.o llama.o common.o $(OBJS)
 	$(CXX) $(CXXFLAGS) $(filter-out %$(DSO_EXT),$(filter-out %.h,$(filter-out %.hpp,$^))) -o $@ $(LDFLAGS) -L. -lembdinput
 
-train-text-from-scratch: examples/train-text-from-scratch/train-text-from-scratch.cpp build-info.h ggml.o llama.o $(OBJS)
+gguf: examples/gguf/gguf.cpp build-info.h ggml.o llama.o $(OBJS)
+	$(CXX) $(CXXFLAGS) $(filter-out %.h,$^) -o $@ $(LDFLAGS)
+
+train-text-from-scratch: examples/train-text-from-scratch/train-text-from-scratch.cpp build-info.h ggml.o llama.o common.o $(OBJS)
 	$(CXX) $(CXXFLAGS) $(filter-out %.h,$^) -o $@ $(LDFLAGS)
 
 convert-llama2c-to-ggml: examples/convert-llama2c-to-ggml/convert-llama2c-to-ggml.cpp build-info.h ggml.o llama.o $(OBJS)
@@ -418,7 +421,7 @@ vdot: pocs/vdot/vdot.cpp ggml.o $(OBJS)
 tests/test-llama-grammar: tests/test-llama-grammar.cpp build-info.h ggml.o llama.o common.o $(OBJS)
 	$(CXX) $(CXXFLAGS) $(filter-out %.txt,$^) -o $@ $(LDFLAGS)
 
-tests/test-grammar-parser: tests/test-grammar-parser.cpp examples/grammar-parser.cpp build-info.h ggml.o llama.o common.o $(OBJS)
+tests/test-grammar-parser: tests/test-grammar-parser.cpp build-info.h ggml.o llama.o common.o $(OBJS)
 	$(CXX) $(CXXFLAGS) $(filter-out %.txt,$^) -o $@ $(LDFLAGS)
 
 tests/test-double-float: tests/test-double-float.cpp build-info.h ggml.o llama.o common.o $(OBJS)

Diff for: README.md

+21 −13

@@ -9,11 +9,17 @@
 
 Inference of [LLaMA](https://arxiv.org/abs/2302.13971) model in pure C/C++
 
-### 🚧 Incoming breaking change + refactoring:
+### Hot topics
 
-See PR https://github.com/ggerganov/llama.cpp/pull/2398 for more info.
+A new file format has been introduced: [GGUF](https://github.com/ggerganov/llama.cpp/pull/2398)
 
-To devs: avoid making big changes to `llama.h` / `llama.cpp` until merged
+Last revision compatible with the old format: [dadbed9](https://github.com/ggerganov/llama.cpp/commit/dadbed99e65252d79f81101a392d0d6497b86caa)
+
+### Current `master` should be considered in Beta - expect some issues for a few days!
+
+### Be prepared to re-convert and / or re-quantize your GGUF models while this notice is up!
+
+### Issues with non-GGUF models will be considered with low priority!
 
 ----
 
@@ -291,7 +297,7 @@ When built with Metal support, you can enable GPU inference with the `--gpu-laye
 Any value larger than 0 will offload the computation to the GPU. For example:
 
 ```bash
-./main -m ./models/7B/ggml-model-q4_0.bin -n 128 -ngl 1
+./main -m ./models/7B/ggml-model-q4_0.gguf -n 128 -ngl 1
 ```
 
 ### MPI Build
@@ -330,7 +336,7 @@ The above will distribute the computation across 2 processes on the first host a
 Finally, you're ready to run a computation using `mpirun`:
 
 ```bash
-mpirun -hostfile hostfile -n 3 ./main -m ./models/7B/ggml-model-q4_0.bin -n 128
+mpirun -hostfile hostfile -n 3 ./main -m ./models/7B/ggml-model-q4_0.gguf -n 128
 ```
 
 ### BLAS Build
@@ -513,10 +519,10 @@ python3 convert.py models/7B/
 python convert.py models/7B/ --vocabtype bpe
 
 # quantize the model to 4-bits (using q4_0 method)
-./quantize ./models/7B/ggml-model-f16.bin ./models/7B/ggml-model-q4_0.bin q4_0
+./quantize ./models/7B/ggml-model-f16.gguf ./models/7B/ggml-model-q4_0.gguf q4_0
 
 # run the inference
-./main -m ./models/7B/ggml-model-q4_0.bin -n 128
+./main -m ./models/7B/ggml-model-q4_0.gguf -n 128
 ```
 
 When running the larger models, make sure you have enough disk space to store all the intermediate files.
@@ -572,7 +578,7 @@ Here is an example of a few-shot interaction, invoked with the command
 ./examples/chat-13B.sh
 
 # custom arguments using a 13B model
-./main -m ./models/13B/ggml-model-q4_0.bin -n 256 --repeat_penalty 1.0 --color -i -r "User:" -f prompts/chat-with-bob.txt
+./main -m ./models/13B/ggml-model-q4_0.gguf -n 256 --repeat_penalty 1.0 --color -i -r "User:" -f prompts/chat-with-bob.txt
 ```
 
 Note the use of `--color` to distinguish between user input and generated text. Other parameters are explained in more detail in the [README](examples/main/README.md) for the `main` example program.
@@ -635,6 +641,8 @@ OpenLLaMA is an openly licensed reproduction of Meta's original LLaMA model. It
 
 ### Using [GPT4All](https://github.com/nomic-ai/gpt4all)
 
+*Note: these instructions are likely obsoleted by the GGUF update*
+
 - Obtain the `tokenizer.model` file from LLaMA model and put it to `models`
 - Obtain the `added_tokens.json` file from Alpaca model and put it to `models`
 - Obtain the `gpt4all-lora-quantized.bin` file from GPT4All model and put it to `models/gpt4all-7B`
@@ -710,7 +718,7 @@ If your issue is with model generation quality, then please at least scan the fo
 #### How to run
 
 1. Download/extract: https://s3.amazonaws.com/research.metamind.io/wikitext/wikitext-2-raw-v1.zip?ref=salesforce-research
-2. Run `./perplexity -m models/7B/ggml-model-q4_0.bin -f wiki.test.raw`
+2. Run `./perplexity -m models/7B/ggml-model-q4_0.gguf -f wiki.test.raw`
 3. Output:
 ```
 perplexity : calculating perplexity over 655 chunks
@@ -809,13 +817,13 @@ docker run -v /path/to/models:/models ghcr.io/ggerganov/llama.cpp:full --all-in-
 On completion, you are ready to play!
 
 ```bash
-docker run -v /path/to/models:/models ghcr.io/ggerganov/llama.cpp:full --run -m /models/7B/ggml-model-q4_0.bin -p "Building a website can be done in 10 simple steps:" -n 512
+docker run -v /path/to/models:/models ghcr.io/ggerganov/llama.cpp:full --run -m /models/7B/ggml-model-q4_0.gguf -p "Building a website can be done in 10 simple steps:" -n 512
 ```
 
 or with a light image:
 
 ```bash
-docker run -v /path/to/models:/models ghcr.io/ggerganov/llama.cpp:light -m /models/7B/ggml-model-q4_0.bin -p "Building a website can be done in 10 simple steps:" -n 512
+docker run -v /path/to/models:/models ghcr.io/ggerganov/llama.cpp:light -m /models/7B/ggml-model-q4_0.gguf -p "Building a website can be done in 10 simple steps:" -n 512
 ```
 
 ### Docker With CUDA
@@ -846,8 +854,8 @@ The resulting images, are essentially the same as the non-CUDA images:
 After building locally, Usage is similar to the non-CUDA examples, but you'll need to add the `--gpus` flag. You will also want to use the `--n-gpu-layers` flag.
 
 ```bash
-docker run --gpus all -v /path/to/models:/models local/llama.cpp:full-cuda --run -m /models/7B/ggml-model-q4_0.bin -p "Building a website can be done in 10 simple steps:" -n 512 --n-gpu-layers 1
-docker run --gpus all -v /path/to/models:/models local/llama.cpp:light-cuda -m /models/7B/ggml-model-q4_0.bin -p "Building a website can be done in 10 simple steps:" -n 512 --n-gpu-layers 1
+docker run --gpus all -v /path/to/models:/models local/llama.cpp:full-cuda --run -m /models/7B/ggml-model-q4_0.gguf -p "Building a website can be done in 10 simple steps:" -n 512 --n-gpu-layers 1
+docker run --gpus all -v /path/to/models:/models local/llama.cpp:light-cuda -m /models/7B/ggml-model-q4_0.gguf -p "Building a website can be done in 10 simple steps:" -n 512 --n-gpu-layers 1
 ```
 
 ### Contributing

Diff for: ci/run.sh

+22 −22

@@ -159,17 +159,17 @@ function gg_run_open_llama_3b_v2 {
 
     python3 ../convert.py ${path_models}
 
-    model_f16="${path_models}/ggml-model-f16.bin"
-    model_q8_0="${path_models}/ggml-model-q8_0.bin"
-    model_q4_0="${path_models}/ggml-model-q4_0.bin"
-    model_q4_1="${path_models}/ggml-model-q4_1.bin"
-    model_q5_0="${path_models}/ggml-model-q5_0.bin"
-    model_q5_1="${path_models}/ggml-model-q5_1.bin"
-    model_q2_k="${path_models}/ggml-model-q2_k.bin"
-    model_q3_k="${path_models}/ggml-model-q3_k.bin"
-    model_q4_k="${path_models}/ggml-model-q4_k.bin"
-    model_q5_k="${path_models}/ggml-model-q5_k.bin"
-    model_q6_k="${path_models}/ggml-model-q6_k.bin"
+    model_f16="${path_models}/ggml-model-f16.gguf"
+    model_q8_0="${path_models}/ggml-model-q8_0.gguf"
+    model_q4_0="${path_models}/ggml-model-q4_0.gguf"
+    model_q4_1="${path_models}/ggml-model-q4_1.gguf"
+    model_q5_0="${path_models}/ggml-model-q5_0.gguf"
+    model_q5_1="${path_models}/ggml-model-q5_1.gguf"
+    model_q2_k="${path_models}/ggml-model-q2_k.gguf"
+    model_q3_k="${path_models}/ggml-model-q3_k.gguf"
+    model_q4_k="${path_models}/ggml-model-q4_k.gguf"
+    model_q5_k="${path_models}/ggml-model-q5_k.gguf"
+    model_q6_k="${path_models}/ggml-model-q6_k.gguf"
 
     wiki_test_60="${path_wiki}/wiki.test-60.raw"
 
@@ -285,17 +285,17 @@ function gg_run_open_llama_7b_v2 {
 
     python3 ../convert.py ${path_models}
 
-    model_f16="${path_models}/ggml-model-f16.bin"
-    model_q8_0="${path_models}/ggml-model-q8_0.bin"
-    model_q4_0="${path_models}/ggml-model-q4_0.bin"
-    model_q4_1="${path_models}/ggml-model-q4_1.bin"
-    model_q5_0="${path_models}/ggml-model-q5_0.bin"
-    model_q5_1="${path_models}/ggml-model-q5_1.bin"
-    model_q2_k="${path_models}/ggml-model-q2_k.bin"
-    model_q3_k="${path_models}/ggml-model-q3_k.bin"
-    model_q4_k="${path_models}/ggml-model-q4_k.bin"
-    model_q5_k="${path_models}/ggml-model-q5_k.bin"
-    model_q6_k="${path_models}/ggml-model-q6_k.bin"
+    model_f16="${path_models}/ggml-model-f16.gguf"
+    model_q8_0="${path_models}/ggml-model-q8_0.gguf"
+    model_q4_0="${path_models}/ggml-model-q4_0.gguf"
+    model_q4_1="${path_models}/ggml-model-q4_1.gguf"
+    model_q5_0="${path_models}/ggml-model-q5_0.gguf"
+    model_q5_1="${path_models}/ggml-model-q5_1.gguf"
+    model_q2_k="${path_models}/ggml-model-q2_k.gguf"
+    model_q3_k="${path_models}/ggml-model-q3_k.gguf"
+    model_q4_k="${path_models}/ggml-model-q4_k.gguf"
+    model_q5_k="${path_models}/ggml-model-q5_k.gguf"
+    model_q6_k="${path_models}/ggml-model-q6_k.gguf"
 
     wiki_test="${path_wiki}/wiki.test.raw"

Diff for: common/CMakeLists.txt

+20 −0

@@ -0,0 +1,20 @@
+# common
+
+set(TARGET common)
+
+add_library(${TARGET} OBJECT
+    common.h
+    common.cpp
+    console.h
+    console.cpp
+    grammar-parser.h
+    grammar-parser.cpp
+    )
+
+if (BUILD_SHARED_LIBS)
+    set_target_properties(${TARGET} PROPERTIES POSITION_INDEPENDENT_CODE ON)
+endif()
+
+target_include_directories(${TARGET} PUBLIC .)
+target_compile_features(${TARGET} PUBLIC cxx_std_11)
+target_link_libraries(${TARGET} PRIVATE llama)
