Cherry 1118 #5

Merged 581 commits on Nov 19, 2024

Commits
794ab95
ggml: unify backend logging mechanism (#9709)
bandoti Oct 3, 2024
417a60f
ggml-backend : add device description to CPU backend (#9720)
slaren Oct 3, 2024
e49b066
metal : fix compute pass descriptor autorelease crash (#9718)
jmousseau Oct 3, 2024
1db66e4
ggml: refactor cross entropy loss CPU impl. (ggml/976)
JohannesGaessler Oct 2, 2024
88c1e20
ggml/ex: calculate accuracy in graph, adapt MNIST (ggml/980)
JohannesGaessler Oct 3, 2024
a6ddb63
sync : ggml
ggerganov Oct 3, 2024
57cc9ed
metal : remove abort (skip) (ggml/0)
ggerganov Oct 3, 2024
0ea000d
Fixed RNG seed docs (#9723)
d-kleine Oct 4, 2024
052a4f5
ci : fine-grant permission (#9710)
ngxson Oct 4, 2024
41b26c1
ggml : fixes after sync (ggml/983)
slaren Oct 4, 2024
be46474
ggml : fix typo in example usage ggml_gallocr_new (ggml/984)
danbev Oct 4, 2024
c814a97
sync : ggml
ggerganov Oct 4, 2024
85a4276
Add Llama Assistant (#9744)
vietanhdev Oct 4, 2024
4254a77
metal : zero-init buffer contexts (whisper/0)
ggerganov Oct 5, 2024
3d10f6a
sync : ggml
ggerganov Oct 5, 2024
3ee6af2
rerank : use [SEP] token instead of [BOS] (#9737)
ggerganov Oct 5, 2024
9bc55e4
vulkan : retry allocation with fallback flags (whisper/2451)
SRHMorris Oct 6, 2024
f14574e
sync : llama.cpp
ggerganov Oct 6, 2024
2cf7ec1
readme : fix typo [no ci]
ggerganov Oct 6, 2024
e81a97b
contrib : simplify + minor edits [no ci]
ggerganov Oct 6, 2024
ae65769
metal : single allocation of encode_async block (#9747)
ptsochantaris Oct 7, 2024
e73fcdf
ggml : add metal backend registry / device (#9713)
ggerganov Oct 7, 2024
69c1556
flake.lock: Update (#9753)
ggerganov Oct 7, 2024
5fdda9c
Update building for Android (#9672)
amqdn Oct 7, 2024
817e714
ggml : add backend registry / device interfaces to BLAS backend (#9752)
slaren Oct 7, 2024
62bc3ec
scripts : fix spelling typo in messages and comments (#9782)
standby24x7 Oct 8, 2024
fde264c
server : better security control for public deployments (#9776)
ngxson Oct 8, 2024
5dc351a
ggml : fix BLAS with unsupported types (#9775)
slaren Oct 8, 2024
4f12202
examples : remove llama.vim
ggerganov Oct 9, 2024
77279b7
perplexity : fix integer overflow (#9783)
ggerganov Oct 9, 2024
9116c63
cmake : do not build common library by default when standalone (#9804)
slaren Oct 9, 2024
1946d00
examples : do not use common library in simple example (#9803)
slaren Oct 10, 2024
0e9d3c9
musa: add docker image support (#9685)
yeahdongcn Oct 10, 2024
5aeb71f
rpc : add backend registry / device interfaces (#9812)
slaren Oct 10, 2024
bb7dd65
common : use common_ prefix for common library functions (#9805)
slaren Oct 10, 2024
f069615
ggml : move more prints to the ggml log system (#9839)
slaren Oct 11, 2024
aa09924
musa : update doc (#9856)
yeahdongcn Oct 12, 2024
9eb4ece
llama : improve infill support and special token detection (#9798)
ggerganov Oct 12, 2024
1b6a607
server : remove legacy system_prompt feature (#9857)
ggerganov Oct 12, 2024
e0c4a58
server : remove self-extend features (#9860)
ggerganov Oct 12, 2024
6722600
server : add option to time limit the generation phase (#9865)
ggerganov Oct 12, 2024
6b3c60e
flake.lock: Update (#9870)
ggerganov Oct 13, 2024
494e83c
server : reuse cached context chunks (#9866)
ggerganov Oct 13, 2024
0c2bdf4
server : accept extra_context for the infill endpoint (#9874)
ggerganov Oct 13, 2024
1d9ab05
Vectorize load instructions in dmmv f16 CUDA kernel (#9816)
agray3 Oct 14, 2024
bf9e483
server : handle "logprobs" field with false value (#9871)
VoidIsVoid Oct 14, 2024
1f5033b
readme : update bindings list (#9889)
srgtuszy Oct 15, 2024
ba04b1e
server : update preact (#9895)
ggerganov Oct 15, 2024
74dd761
sampling : add XTC sampler (#9742)
MaggotHATE Oct 15, 2024
3441e94
server : improve infill context reuse (#9894)
ggerganov Oct 15, 2024
51f742e
llama : add infill sampler (#9896)
ggerganov Oct 15, 2024
df726a7
[CANN] Fix cann compilation error (#9891)
leo-pony Oct 16, 2024
1fbf140
ggml-alloc : remove buffer_id from leaf_alloc (ggml/987)
danbev Oct 9, 2024
e50b506
sync : ggml
ggerganov Oct 16, 2024
99640be
server : fix the disappearance of the end of the text (#9867)
z80maniac Oct 16, 2024
9fe3692
llama : add tensor name for "result_norm" (#9907)
MollySophia Oct 16, 2024
5c7f4a7
grammar : fix JSON Schema for string regex with top-level alt. (#9903)
jemc Oct 16, 2024
02c35e5
llava : fix typo in error message [no ci] (#9884)
danbev Oct 16, 2024
4c8564b
llama : suppress conversion from 'size_t' to 'int' (#9046)
danbev Oct 16, 2024
d916edf
fix: use `vm_allocate` to allocate CPU backend buffer on macOS (#9875)
giladgd Oct 16, 2024
6d9c4c3
fix: allocating CPU buffer with size `0` (#9917)
giladgd Oct 16, 2024
b76d028
vulkan : add backend registry / device interfaces (#9721)
slaren Oct 17, 2024
f3a2a9c
readme : update bindings list (#9918)
ShenghaiWang Oct 17, 2024
be7c08c
llama : infill sampling handle very long tokens (#9924)
ggerganov Oct 17, 2024
04cea3e
llama : change warning to debug log
ggerganov Oct 17, 2024
4b6517a
readme : remove --memory-f32 references (#9925)
ggerganov Oct 17, 2024
e2d2676
llama : rename batch_all to batch (#8881)
danbev Oct 17, 2024
aa9a7ef
server : add n_indent parameter for line indentation requirement (#9929)
ggerganov Oct 18, 2024
e9bdc88
add amx kernel for gemm (#8998)
mingfeima Oct 18, 2024
ccb5cbc
[SYCL] Add SYCL Backend registry, device and Event Interfaces (#9705)
OuadiElfarouki Oct 18, 2024
3d82c31
rpc : backend refactoring (#9912)
rgerganov Oct 18, 2024
4ca970e
llama : remove all_pos_0, all_pos_1, all_seq_id from llama_batch (#9745)
ngxson Oct 18, 2024
8afc59e
readme : update infra list (#9942)
icppWorld Oct 20, 2024
1fb651d
readme : update bindings list (#9951)
lcarrere Oct 20, 2024
f2b7df5
fix mul_mat_vec_q and *_vec_q error (#9939)
NeoZhangJianyu Oct 21, 2024
7e98ad6
fix error
arthw Nov 18, 2024
513628c
llama : default sampling changes + greedy update (#9897)
ggerganov Oct 21, 2024
dae14d8
rpc : pack only RPC structs (#9959)
rgerganov Oct 21, 2024
42cae13
ggml : add asserts for type conversion in fattn kernels (#9971)
ggerganov Oct 21, 2024
4797f3f
llama.vim : plugin for Neovim (#9787)
ggerganov Oct 21, 2024
913c941
arg : fix attention non-causal arg value hint (#9985)
danbev Oct 21, 2024
1253a86
readme : update UI list (#9972)
a-ghorbani Oct 21, 2024
9080044
llama.vim : move info to the right of screen [no ci] (#9787)
ggerganov Oct 21, 2024
4ad7cac
llama.vim : fix info text display [no ci] (#9787)
ggerganov Oct 21, 2024
c348050
arg : fix typo in embeddings argument help [no ci] (#9994)
danbev Oct 22, 2024
fe14ce7
[CANN] Adapt to dynamically loadable backends mechanism (#9970)
leo-pony Oct 22, 2024
e49ffeb
llama : add chat template for RWKV-World + fix EOT (#9968)
MollySophia Oct 22, 2024
bf62ff1
lora : warn user if new token is added in the adapter (#9948)
ngxson Oct 22, 2024
1c822cb
Rwkv chat template fix (#10001)
MollySophia Oct 22, 2024
798d274
llama : rename batch to ubatch (#9950)
danbev Oct 22, 2024
5f8b759
llama : fix empty batch causing llama_batch_allocr to crash (#9966)
ngxson Oct 22, 2024
bb38faa
flake.lock: Update
github-actions[bot] Oct 20, 2024
1f09871
metal : add POOL2D and fix IM2COL (#9943)
junhee-yoo Oct 23, 2024
7331682
llama.vim : add classic vim support (#9995)
m18coppola Oct 23, 2024
c1a832b
ggml : remove redundant set of contexts used field (ggml/978)
danbev Oct 16, 2024
4f13540
CUDA: fix 1D im2col, add tests (ggml/993)
JohannesGaessler Oct 18, 2024
8be40e3
llama.vim : bump generation time limit to 3s [no ci]
ggerganov Oct 23, 2024
ef3299e
sync : ggml
ggerganov Oct 23, 2024
9267d7f
server : samplers accept the prompt correctly (#10019)
wwoodsTM Oct 23, 2024
70f23e5
CUDA: fix MMQ for non-contiguous src0, add tests (#10021)
JohannesGaessler Oct 24, 2024
9136c32
CUDA: fix insufficient buffer clearing for MMQ (#10032)
JohannesGaessler Oct 24, 2024
a5f5aad
ci : fix cmake flags for SYCL
ggerganov Oct 24, 2024
e201e05
server : refactor slot input data, move tokenizer to HTTP thread (#10…
ngxson Oct 24, 2024
02fa87c
server : check that the prompt fits in the slot's context (#10030)
ggerganov Oct 25, 2024
745afd6
llamafile : extend sgemm.cpp support for Q5_0 models (#10010)
Srihari-mcw Oct 25, 2024
3df5c8f
llama: string_split fix (#10022)
Xarbirus Oct 25, 2024
45442a8
llama : add DRY sampler (#9702)
wwoodsTM Oct 25, 2024
d16d9c6
metal : support permuted matrix multiplicaions (#10033)
ggerganov Oct 25, 2024
426f785
scripts : fix amx sync [no ci]
ggerganov Oct 26, 2024
f29d8ae
increase cuda_cpy block size (ggml/996)
bssrdf Oct 23, 2024
fc74110
sync : ggml
ggerganov Oct 26, 2024
5aea5e3
llama : switch KQ multiplication to F32 precision by default (#10015)
ggerganov Oct 27, 2024
4d8bf20
server : don't overfill the batch during infill (#10018)
ggerganov Oct 28, 2024
96ed96e
musa: workaround for Guilty Lockup in cleaning src0 (#10042)
yeahdongcn Oct 28, 2024
0e06717
flake.lock: Update (#10063)
ggerganov Oct 28, 2024
b20b8bf
llama : Add IBM granite template (#10013)
arch-btw Oct 28, 2024
10d0baf
llama : remove Tail-Free sampling (#10071)
ggerganov Oct 29, 2024
75f60ed
ggml: Add POOL2D OP for GPU acceleration to the Vulkan backend in the…
cyzero-kim Oct 29, 2024
9925572
llama : refactor model loader with backend registry (#10026)
slaren Oct 30, 2024
cb9d909
ggml : add Q4_0_8_8 RISC-V GEMV and GEMM kernels (#10029)
xctan Oct 30, 2024
2b65c36
convert : more detailed convert lora usage docs (#10065)
richdougherty Oct 30, 2024
04b5a05
readme : more lora detail in main example readme (#10064)
richdougherty Oct 30, 2024
a2df405
ggml : fix memory leaks when loading invalid gguf files (#10094)
slaren Oct 30, 2024
16860f3
kompute: add backend registry / device interfaces (#10045)
slp Oct 30, 2024
a88efa9
kompute: add mul_mat_q4_k shader (#10097)
slp Oct 31, 2024
2f89a20
ggml : check tensor name lengths in gguf files (#10100)
slaren Oct 31, 2024
963f06e
server : include scheme when printing URL (#10106)
bakkot Oct 31, 2024
83ad0ab
loader: refactor tensor weights storage (#9935)
kylo5aby Oct 31, 2024
b9110bb
llama : fix buffer checks for mamba and rwk (#10111)
slaren Oct 31, 2024
6c7075d
quantize : fix --keep-split (#10114)
slaren Oct 31, 2024
ff9ae48
llama : improve output buffer type selection (#10098)
slaren Oct 31, 2024
b5ec3ef
build: fix build error in Windows env with OneAPI setup (#10107)
kylo5aby Nov 1, 2024
30042ea
ggml : alloc ggml_contexts on the heap (whisper/2525)
ggerganov Nov 1, 2024
48faae1
sync : ggml
ggerganov Nov 1, 2024
a0a4c1a
ggml : remove ggml_scratch (#10121)
ggerganov Nov 1, 2024
92640ff
server : fix smart selection of available slot (#10120)
sasha0552 Nov 1, 2024
e0c054d
readme : update hot topics
ggerganov Nov 1, 2024
e05a98b
vulkan : improve ggml_vk_create_buffer error handling (#9898)
FanShupei Nov 1, 2024
9a7aa8f
llama : use smart pointers for ggml resources (#10117)
slaren Nov 1, 2024
cfeaa88
llama : add simple-chat example (#10124)
slaren Nov 1, 2024
b113fc2
convert-lora : make `--base` optional (#10110)
ngxson Nov 2, 2024
1bfe53e
simple-chat : only add bos on first prompt (#10129)
slaren Nov 2, 2024
6bbd992
llama : adjust default context size + print warnings (#10136)
ggerganov Nov 2, 2024
6a745da
server : fix endpoint checks (#10135)
ggerganov Nov 2, 2024
2bbe74d
server : fix slot selection by lru (#10126)
sasha0552 Nov 2, 2024
4535527
Add apple arm to presets (#10134)
kohnech Nov 2, 2024
037d06f
flake.lock: Update (#10146)
ggerganov Nov 3, 2024
eaea67b
metal : minor fixup in FA kernel (#10143)
ggerganov Nov 3, 2024
303f773
ggml : move CPU backend to a separate file (#10144)
slaren Nov 3, 2024
32ceae9
metal : fix minor string leaks (ggml/1004)
pminev Nov 1, 2024
bd2e5ca
cmake : make it possible linking ggml as external lib (ggml/1003)
ykhrustalev Nov 2, 2024
56a24a1
sync : ggml
ggerganov Nov 4, 2024
9b9b062
CANN: adjust backend registry refactor. (#10158)
leo-pony Nov 4, 2024
b1f5e04
metal : move dequantize templates to beginning of MSL source (#0)
ggerganov Nov 4, 2024
c6cd98d
metal : simplify f16 and f32 dequant kernels (#0)
ggerganov Nov 4, 2024
b7a5f05
cuda : clear error after changing peer access (#10153)
slaren Nov 4, 2024
e4a831f
fix build break on arm64 linux (#10166)
snadampal Nov 4, 2024
2cbc3b3
server : clarify /slots endpoint, add is_processing (#10162)
ngxson Nov 4, 2024
be58fd8
ggml : fix q4xx mat mul, increase ggml_aligned_malloc alignment (#10167)
slaren Nov 4, 2024
b579fd6
ggml : fix gelu tables initialization (#10172)
slaren Nov 4, 2024
2489ae6
Q6_K AVX improvements (#10118)
netrunnereve Nov 4, 2024
6aaed11
ggml : fix arch check in bf16_to_fp32 (#10164)
slaren Nov 4, 2024
782b6c2
llama : add <|tool_call|> formatting to Granite template (#10177)
gabe-l-hart Nov 5, 2024
465390d
metal : add quantized FA support (#10149)
ggerganov Nov 6, 2024
b373c74
ggml : adjust is_first_call init value (#10193)
ggerganov Nov 6, 2024
702830e
metal : fix from ptr buffer name (#10189)
slaren Nov 6, 2024
5f4d30b
server : remove hack for extra parallel slot (#10187)
ggerganov Nov 6, 2024
915562e
metal : add BF16 support (#8439)
ggerganov Nov 6, 2024
4c7310d
Optimize RWKV6 Operator Naming and Implement Multi-core CPU/ SYCL Acc…
uniartisan Nov 7, 2024
ec9ad50
fix q4_0_8_8 format for corrupted tokens issue (#10198)
snadampal Nov 7, 2024
918e8c9
DRY: Fixes clone functionality (#10192)
wwoodsTM Nov 7, 2024
c95fbb4
Remove identical wte/etw logic for jais (#10203)
fmz Nov 7, 2024
00ce276
ggml : add ggml-cpu.h to the public headers (#10204)
slaren Nov 7, 2024
0bc267e
scripts : sync update
ggerganov Nov 7, 2024
9709520
sync : ggml
ggerganov Nov 7, 2024
7ddf853
scripts : add amx to sync-ggml.sh [no ci]
ggerganov Nov 7, 2024
63af0fe
server : revamp chat UI with vuejs and daisyui (#10175)
ngxson Nov 7, 2024
1d9d5b6
server : minor UI fix (#10207)
ngxson Nov 7, 2024
9c963b4
swift : exclude ggml-metal-embed.metal (#10211)
jhen0409 Nov 8, 2024
9fd4708
metal : optimize FA kernels (#10171)
ggerganov Nov 8, 2024
639012b
metal : improve clarity (minor) (#10171)
ggerganov Nov 8, 2024
5d78ec1
metal : opt-in compile flag for BF16 (#10218)
ggerganov Nov 8, 2024
ac6f39c
scripts : fix pattern and get n_tokens in one go (#10221)
lhpqaq Nov 9, 2024
26e073f
ggml : optimize llamafile cpu matrix multiplication for ppc64le (#10156)
amritahs-ibm Nov 9, 2024
c92611b
ggml: fix zero division in ‘dne’ calculation in CUDA COUNT_EQUAL oper…
SongXiaoXi Nov 9, 2024
e8f5048
metal : hide debug messages from normal log
ggerganov Nov 9, 2024
67c9362
llama : fix Qwen model type strings
ggerganov Nov 9, 2024
5cabf58
metal : fix F32 accumulation in FA vec kernel (#10232)
ggerganov Nov 9, 2024
08d5ccb
metal : fix build and some more comments (#10229)
ggerganov Nov 9, 2024
4958107
metal : reorder write loop in mul mat kernel + style (#10231)
ggerganov Nov 9, 2024
e9197b6
vulkan: Fix newly added tests for permuted mul_mat and 1D im2col (#10…
jeffbolznv Nov 10, 2024
4205194
server : (web UI) Add back sampler settings (#10239)
MaggotHATE Nov 10, 2024
c7d7613
flake.lock: Update (#10243)
ggerganov Nov 10, 2024
0e21835
server : enable KV cache defrag by default (#10233)
ggerganov Nov 11, 2024
2d1d2f7
metal : more precise Q*K in FA vec kernel (#10247)
ggerganov Nov 11, 2024
e408282
vulkan: Throttle the number of shader compiles during the build step.…
jeffbolznv Nov 11, 2024
9850180
vulkan: Optimize contiguous copies (#10254)
jeffbolznv Nov 13, 2024
d916751
metadata: Detailed Dataset Authorship Metadata (#8875)
mofosyne Nov 13, 2024
e2161bd
server : fix incorrect res in validate_model_chat_template (#10272)
jhen0409 Nov 13, 2024
2dee43f
server : add missing docs (#10269)
z80maniac Nov 13, 2024
1aa6397
docs : update bindings list (#10261)
xuegao-tzx Nov 13, 2024
05a28e8
sync : ggml
ggerganov Nov 13, 2024
0fea840
llama : propagate the results of `graph_compute` (#9525)
Xarbirus Nov 13, 2024
38ae270
vulkan: Use macros to make the mat mul pipeline creation more concise…
jeffbolznv Nov 13, 2024
b850d04
vulkan: Optimize binary ops (#10270)
jeffbolznv Nov 14, 2024
1a2ed55
speculative : fix out-of-bounds access (#10289)
ggerganov Nov 14, 2024
e8cfa65
CUDA: no -sm row for very small matrices (#10185)
JohannesGaessler Nov 14, 2024
b3abcd3
ggml : build backends as libraries (#10256)
slaren Nov 14, 2024
136c8fd
backend cpu: add online flow for aarch64 Q4_0 GEMV/GEMM kernels (#9921)
chaxu01 Nov 15, 2024
e0dc41a
sycl: Use syclcompat::dp4a (#10267)
Rbiessy Nov 15, 2024
dfb85d3
scripts : fix regex in sync [no ci]
ggerganov Nov 15, 2024
5d67b5c
cann: dockerfile and doc adjustment (#10302)
noemotiovon Nov 15, 2024
9c6f9aa
server : (web UI) add copy button for code block, fix api key (#10242)
ngxson Nov 15, 2024
6b559cc
sycl: Update Intel docker images to use DPC++ 2025.0 (#10305)
Rbiessy Nov 15, 2024
985f2c9
ci: build test musa with cmake (#10298)
yeahdongcn Nov 15, 2024
7c9cd7f
AVX BF16 and single scale quant optimizations (#10212)
netrunnereve Nov 15, 2024
824a84b
sync : ggml
ggerganov Nov 15, 2024
016573e
ggml : vulkan logs (whisper/2547)
thewh1teagle Nov 15, 2024
c477f05
cmake : fix ppc64 check (whisper/0)
ggerganov Nov 15, 2024
890a1c0
ggml : fix some build issues
slaren Nov 15, 2024
59245e9
scripts: update compare-llama-bench.py (#10319)
JohannesGaessler Nov 15, 2024
157e7f3
Make updates to fix issues with clang-cl builds while using AVX512 fl…
Srihari-mcw Nov 15, 2024
0d12d86
llama : save number of parameters and the size in llama_model (#10286)
FirstTimeEZ Nov 16, 2024
4d50702
ggml : optimize Q4_0 into Q4_0_X_Y repack (#10324)
eddnjjn Nov 16, 2024
28a69ac
vulkan : add cmake preset debug/release (#10306)
FirstTimeEZ Nov 16, 2024
0c29495
vulkan: Optimize some mat-vec mul quant shaders (#10296)
jeffbolznv Nov 16, 2024
e6f31f7
scripts : fix missing key in compare-llama-bench.py (#10332)
ggerganov Nov 16, 2024
cc34591
server: (web UI) Add samplers sequence customization (#10255)
MaggotHATE Nov 16, 2024
bb7a75b
make : auto-determine dependencies (#0)
ggerganov Nov 16, 2024
30beaa4
llamafile : fix include path (#0)
ggerganov Nov 16, 2024
5766f90
llama/ex: remove --logdir argument (#10339)
JohannesGaessler Nov 16, 2024
d150fad
docs : vulkan build instructions to use git bash mingw64 (#10303)
FirstTimeEZ Nov 16, 2024
59c887c
scripts : update sync
ggerganov Nov 16, 2024
2a2523d
ggml: new optimization interface (ggml/988)
JohannesGaessler Nov 16, 2024
29dd008
ggml : fix compile warnings (#0)
ggerganov Nov 16, 2024
6107bd1
tests : remove test-grad0
ggerganov Nov 16, 2024
2d1d5c4
make : add ggml-opt (#0)
ggerganov Nov 16, 2024
ac4cea5
ggml : adapt AMX to tensor->grad removal (#0)
ggerganov Nov 16, 2024
3e16a42
ggml : inttypes.h -> cinttypes (#0)
ggerganov Nov 16, 2024
05475dd
ggml : fix possible buffer use after free in sched reserve (#9930)
slaren Nov 17, 2024
d84419b
CMake: default to -arch=native for CUDA build (#10320)
JohannesGaessler Nov 17, 2024
2a2d6aa
CUDA: remove DMMV, consolidate F16 mult mat vec (#10318)
JohannesGaessler Nov 17, 2024
cc1e405
ggml : fix undefined reference to 'getcpu' (#10354)
FirstTimeEZ Nov 17, 2024
32dc5aa
metal : refactor kernel args into structs (#10238)
ggerganov Nov 17, 2024
57cbdbc
gitignore : ignore local run scripts [no ci]
ggerganov Nov 17, 2024
e0e284e
llama : only use default buffer types for the KV cache (#10358)
slaren Nov 17, 2024
7af08cb
CMake: fix typo in comment [no ci] (#10360)
JohannesGaessler Nov 17, 2024
1af6853
CUDA: fix MMV kernel being used for FP16 src1 (#10357)
JohannesGaessler Nov 17, 2024
0ed117a
docker: use GGML_NATIVE=OFF (#10368)
JohannesGaessler Nov 17, 2024
a979201
fix for windows building
arthw Nov 19, 2024
Diff view
25 changes: 11 additions & 14 deletions .devops/full-cuda.Dockerfile
@@ -1,18 +1,16 @@
ARG UBUNTU_VERSION=22.04

# This needs to generally match the container host's environment.
ARG CUDA_VERSION=11.7.1

ARG CUDA_VERSION=12.6.0
# Target the CUDA build image
ARG BASE_CUDA_DEV_CONTAINER=nvidia/cuda:${CUDA_VERSION}-devel-ubuntu${UBUNTU_VERSION}

FROM ${BASE_CUDA_DEV_CONTAINER} AS build

# Unless otherwise specified, we make a fat build.
ARG CUDA_DOCKER_ARCH=all
# CUDA architecture to build for (defaults to all supported archs)
ARG CUDA_DOCKER_ARCH=default

RUN apt-get update && \
apt-get install -y build-essential python3 python3-pip git libcurl4-openssl-dev libgomp1
apt-get install -y build-essential cmake python3 python3-pip git libcurl4-openssl-dev libgomp1

COPY requirements.txt requirements.txt
COPY requirements requirements
@@ -24,13 +22,12 @@ WORKDIR /app

COPY . .

# Set nvcc architecture
ENV CUDA_DOCKER_ARCH=${CUDA_DOCKER_ARCH}
# Enable CUDA
ENV GGML_CUDA=1
# Enable cURL
ENV LLAMA_CURL=1

RUN make -j$(nproc)
# Use the default CUDA archs if not specified
RUN if [ "${CUDA_DOCKER_ARCH}" != "default" ]; then \
export CMAKE_ARGS="-DCMAKE_CUDA_ARCHITECTURES=${CUDA_DOCKER_ARCH}"; \
fi && \
cmake -B build -DGGML_NATIVE=OFF -DGGML_CUDA=ON -DLLAMA_CURL=ON ${CMAKE_ARGS} -DCMAKE_EXE_LINKER_FLAGS=-Wl,--allow-shlib-undefined . && \
cmake --build build --config Release -j$(nproc) && \
cp build/bin/* .

ENTRYPOINT ["/app/.devops/tools.sh"]
26 changes: 26 additions & 0 deletions .devops/full-musa.Dockerfile
@@ -0,0 +1,26 @@
ARG UBUNTU_VERSION=22.04
# This needs to generally match the container host's environment.
ARG MUSA_VERSION=rc3.1.0
# Target the MUSA build image
ARG BASE_MUSA_DEV_CONTAINER=mthreads/musa:${MUSA_VERSION}-devel-ubuntu${UBUNTU_VERSION}

FROM ${BASE_MUSA_DEV_CONTAINER} AS build

RUN apt-get update && \
apt-get install -y build-essential cmake python3 python3-pip git libcurl4-openssl-dev libgomp1

COPY requirements.txt requirements.txt
COPY requirements requirements

RUN pip install --upgrade pip setuptools wheel \
&& pip install -r requirements.txt

WORKDIR /app

COPY . .

RUN cmake -B build -DGGML_NATIVE=OFF -DGGML_MUSA=ON -DLLAMA_CURL=ON ${CMAKE_ARGS} -DCMAKE_EXE_LINKER_FLAGS=-Wl,--allow-shlib-undefined . && \
cmake --build build --config Release -j$(nproc) && \
cp build/bin/* .

ENTRYPOINT ["/app/.devops/tools.sh"]
6 changes: 3 additions & 3 deletions .devops/full-rocm.Dockerfile
@@ -11,7 +11,7 @@ FROM ${BASE_ROCM_DEV_CONTAINER} AS build
# Unless otherwise specified, we make a fat build.
# List from https://github.com/ggerganov/llama.cpp/pull/1087#issuecomment-1682807878
# This is mostly tied to rocBLAS supported archs.
ARG ROCM_DOCKER_ARCH=\
ARG ROCM_DOCKER_ARCH="\
gfx803 \
gfx900 \
gfx906 \
@@ -21,7 +21,7 @@ ARG ROCM_DOCKER_ARCH=\
gfx1030 \
gfx1100 \
gfx1101 \
gfx1102
gfx1102"

COPY requirements.txt requirements.txt
COPY requirements requirements
@@ -34,7 +34,7 @@ WORKDIR /app
COPY . .

# Set nvcc architecture
ENV GPU_TARGETS=${ROCM_DOCKER_ARCH}
ENV AMDGPU_TARGETS=${ROCM_DOCKER_ARCH}
# Enable ROCm
ENV GGML_HIPBLAS=1
ENV CC=/opt/rocm/llvm/bin/clang
44 changes: 44 additions & 0 deletions .devops/llama-cli-cann.Dockerfile
@@ -0,0 +1,44 @@
ARG ASCEND_VERSION=8.0.rc2.alpha003-910b-openeuler22.03-py3.8

FROM ascendai/cann:$ASCEND_VERSION AS build

WORKDIR /app

COPY . .

RUN yum install -y gcc g++ cmake make
ENV ASCEND_TOOLKIT_HOME=/usr/local/Ascend/ascend-toolkit/latest
ENV LIBRARY_PATH=${ASCEND_TOOLKIT_HOME}/lib64:$LIBRARY_PATH
ENV LD_LIBRARY_PATH=${ASCEND_TOOLKIT_HOME}/lib64:${ASCEND_TOOLKIT_HOME}/lib64/plugin/opskernel:${ASCEND_TOOLKIT_HOME}/lib64/plugin/nnengine:${ASCEND_TOOLKIT_HOME}/opp/built-in/op_impl/ai_core/tbe/op_tiling:${LD_LIBRARY_PATH}
ENV PYTHONPATH=${ASCEND_TOOLKIT_HOME}/python/site-packages:${ASCEND_TOOLKIT_HOME}/opp/built-in/op_impl/ai_core/tbe:${PYTHONPATH}
ENV PATH=${ASCEND_TOOLKIT_HOME}/bin:${ASCEND_TOOLKIT_HOME}/compiler/ccec_compiler/bin:${PATH}
ENV ASCEND_AICPU_PATH=${ASCEND_TOOLKIT_HOME}
ENV ASCEND_OPP_PATH=${ASCEND_TOOLKIT_HOME}/opp
ENV TOOLCHAIN_HOME=${ASCEND_TOOLKIT_HOME}/toolkit
ENV ASCEND_HOME_PATH=${ASCEND_TOOLKIT_HOME}

# find libascend_hal.so, because the drive hasn`t been mounted.
ENV LD_LIBRARY_PATH=${ASCEND_TOOLKIT_HOME}/runtime/lib64/stub:$LD_LIBRARY_PATH

RUN echo "Building with static libs" && \
source /usr/local/Ascend/ascend-toolkit/set_env.sh --force && \
cmake -B build -DGGML_NATIVE=OFF -DGGML_CANN=ON -DBUILD_SHARED_LIBS=OFF && \
cmake --build build --config Release --target llama-cli

# TODO: use image with NNRT
FROM ascendai/cann:$ASCEND_VERSION AS runtime
COPY --from=build /app/build/bin/llama-cli /llama-cli

ENV LC_ALL=C.utf8

ENV ASCEND_TOOLKIT_HOME=/usr/local/Ascend/ascend-toolkit/latest
ENV LIBRARY_PATH=${ASCEND_TOOLKIT_HOME}/lib64:$LIBRARY_PATH
ENV LD_LIBRARY_PATH=${ASCEND_TOOLKIT_HOME}/lib64:${ASCEND_TOOLKIT_HOME}/lib64/plugin/opskernel:${ASCEND_TOOLKIT_HOME}/lib64/plugin/nnengine:${ASCEND_TOOLKIT_HOME}/opp/built-in/op_impl/ai_core/tbe/op_tiling:${LD_LIBRARY_PATH}
ENV PYTHONPATH=${ASCEND_TOOLKIT_HOME}/python/site-packages:${ASCEND_TOOLKIT_HOME}/opp/built-in/op_impl/ai_core/tbe:${PYTHONPATH}
ENV PATH=${ASCEND_TOOLKIT_HOME}/bin:${ASCEND_TOOLKIT_HOME}/compiler/ccec_compiler/bin:${PATH}
ENV ASCEND_AICPU_PATH=${ASCEND_TOOLKIT_HOME}
ENV ASCEND_OPP_PATH=${ASCEND_TOOLKIT_HOME}/opp
ENV TOOLCHAIN_HOME=${ASCEND_TOOLKIT_HOME}/toolkit
ENV ASCEND_HOME_PATH=${ASCEND_TOOLKIT_HOME}

ENTRYPOINT ["/llama-cli" ]
25 changes: 14 additions & 11 deletions .devops/llama-cli-cuda.Dockerfile
@@ -1,35 +1,38 @@
ARG UBUNTU_VERSION=22.04
# This needs to generally match the container host's environment.
ARG CUDA_VERSION=11.7.1
ARG CUDA_VERSION=12.6.0
# Target the CUDA build image
ARG BASE_CUDA_DEV_CONTAINER=nvidia/cuda:${CUDA_VERSION}-devel-ubuntu${UBUNTU_VERSION}
# Target the CUDA runtime image
ARG BASE_CUDA_RUN_CONTAINER=nvidia/cuda:${CUDA_VERSION}-runtime-ubuntu${UBUNTU_VERSION}

FROM ${BASE_CUDA_DEV_CONTAINER} AS build

# Unless otherwise specified, we make a fat build.
ARG CUDA_DOCKER_ARCH=all
# CUDA architecture to build for (defaults to all supported archs)
ARG CUDA_DOCKER_ARCH=default

RUN apt-get update && \
apt-get install -y build-essential git
apt-get install -y build-essential git cmake

WORKDIR /app

COPY . .

# Set nvcc architecture
ENV CUDA_DOCKER_ARCH=${CUDA_DOCKER_ARCH}
# Enable CUDA
ENV GGML_CUDA=1

RUN make -j$(nproc) llama-cli
# Use the default CUDA archs if not specified
RUN if [ "${CUDA_DOCKER_ARCH}" != "default" ]; then \
export CMAKE_ARGS="-DCMAKE_CUDA_ARCHITECTURES=${CUDA_DOCKER_ARCH}"; \
fi && \
cmake -B build -DGGML_NATIVE=OFF -DGGML_CUDA=ON ${CMAKE_ARGS} -DCMAKE_EXE_LINKER_FLAGS=-Wl,--allow-shlib-undefined . && \
cmake --build build --config Release --target llama-cli -j$(nproc) && \
mkdir -p /app/lib && \
find build -name "*.so" -exec cp {} /app/lib \;

FROM ${BASE_CUDA_RUN_CONTAINER} AS runtime

RUN apt-get update && \
apt-get install -y libgomp1

COPY --from=build /app/llama-cli /llama-cli
COPY --from=build /app/lib/ /
COPY --from=build /app/build/bin/llama-cli /

ENTRYPOINT [ "/llama-cli" ]
4 changes: 2 additions & 2 deletions .devops/llama-cli-intel.Dockerfile
@@ -1,4 +1,4 @@
ARG ONEAPI_VERSION=2024.1.1-devel-ubuntu22.04
ARG ONEAPI_VERSION=2025.0.0-0-devel-ubuntu22.04

FROM intel/oneapi-basekit:$ONEAPI_VERSION AS build

@@ -15,7 +15,7 @@ RUN if [ "${GGML_SYCL_F16}" = "ON" ]; then \
export OPT_SYCL_F16="-DGGML_SYCL_F16=ON"; \
fi && \
echo "Building with static libs" && \
cmake -B build -DGGML_SYCL=ON -DCMAKE_C_COMPILER=icx -DCMAKE_CXX_COMPILER=icpx \
cmake -B build -DGGML_NATIVE=OFF -DGGML_SYCL=ON -DCMAKE_C_COMPILER=icx -DCMAKE_CXX_COMPILER=icpx \
${OPT_SYCL_F16} -DBUILD_SHARED_LIBS=OFF && \
cmake --build build --config Release --target llama-cli

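
The GGML_SYCL_F16 argument checked above toggles the optional FP16 SYCL path; enabling it at build time might look like this (the tag is a placeholder):

docker build -t local/llama.cpp:light-intel \
    --build-arg GGML_SYCL_F16=ON \
    -f .devops/llama-cli-intel.Dockerfile .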
31 changes: 31 additions & 0 deletions .devops/llama-cli-musa.Dockerfile
@@ -0,0 +1,31 @@
ARG UBUNTU_VERSION=22.04
# This needs to generally match the container host's environment.
ARG MUSA_VERSION=rc3.1.0
# Target the MUSA build image
ARG BASE_MUSA_DEV_CONTAINER=mthreads/musa:${MUSA_VERSION}-devel-ubuntu${UBUNTU_VERSION}
# Target the MUSA runtime image
ARG BASE_MUSA_RUN_CONTAINER=mthreads/musa:${MUSA_VERSION}-runtime-ubuntu${UBUNTU_VERSION}

FROM ${BASE_MUSA_DEV_CONTAINER} AS build

RUN apt-get update && \
apt-get install -y build-essential git cmake

WORKDIR /app

COPY . .

RUN cmake -B build -DGGML_NATIVE=OFF -DGGML_MUSA=ON ${CMAKE_ARGS} -DCMAKE_EXE_LINKER_FLAGS=-Wl,--allow-shlib-undefined . && \
cmake --build build --config Release --target llama-cli -j$(nproc) && \
mkdir -p /app/lib && \
find build -name "*.so" -exec cp {} /app/lib \;

FROM ${BASE_MUSA_RUN_CONTAINER} AS runtime

RUN apt-get update && \
apt-get install -y libgomp1

COPY --from=build /app/lib/ /
COPY --from=build /app/build/bin/llama-cli /llama-cli

ENTRYPOINT [ "/llama-cli" ]
6 changes: 3 additions & 3 deletions .devops/llama-cli-rocm.Dockerfile
@@ -11,7 +11,7 @@ FROM ${BASE_ROCM_DEV_CONTAINER} AS build
# Unless otherwise specified, we make a fat build.
# List from https://github.com/ggerganov/llama.cpp/pull/1087#issuecomment-1682807878
# This is mostly tied to rocBLAS supported archs.
ARG ROCM_DOCKER_ARCH=\
ARG ROCM_DOCKER_ARCH="\
gfx803 \
gfx900 \
gfx906 \
@@ -21,7 +21,7 @@ ARG ROCM_DOCKER_ARCH=\
gfx1030 \
gfx1100 \
gfx1101 \
gfx1102
gfx1102"

COPY requirements.txt requirements.txt
COPY requirements requirements
@@ -34,7 +34,7 @@ WORKDIR /app
COPY . .

# Set nvcc architecture
ENV GPU_TARGETS=${ROCM_DOCKER_ARCH}
ENV AMDGPU_TARGETS=${ROCM_DOCKER_ARCH}
# Enable ROCm
ENV GGML_HIPBLAS=1
ENV CC=/opt/rocm/llvm/bin/clang
2 changes: 1 addition & 1 deletion .devops/llama-cli-vulkan.Dockerfile
@@ -14,7 +14,7 @@ RUN wget -qO - https://packages.lunarg.com/lunarg-signing-key-pub.asc | apt-key
# Build it
WORKDIR /app
COPY . .
RUN cmake -B build -DGGML_VULKAN=1 && \
RUN cmake -B build -DGGML_NATIVE=OFF -DGGML_VULKAN=1 && \
cmake --build build --config Release --target llama-cli

# Clean up
30 changes: 17 additions & 13 deletions .devops/llama-server-cuda.Dockerfile
@@ -1,38 +1,42 @@
ARG UBUNTU_VERSION=22.04
# This needs to generally match the container host's environment.
ARG CUDA_VERSION=11.7.1
ARG CUDA_VERSION=12.6.0
# Target the CUDA build image
ARG BASE_CUDA_DEV_CONTAINER=nvidia/cuda:${CUDA_VERSION}-devel-ubuntu${UBUNTU_VERSION}
# Target the CUDA runtime image
ARG BASE_CUDA_RUN_CONTAINER=nvidia/cuda:${CUDA_VERSION}-runtime-ubuntu${UBUNTU_VERSION}

FROM ${BASE_CUDA_DEV_CONTAINER} AS build

# Unless otherwise specified, we make a fat build.
ARG CUDA_DOCKER_ARCH=all
# CUDA architecture to build for (defaults to all supported archs)
ARG CUDA_DOCKER_ARCH=default

RUN apt-get update && \
apt-get install -y build-essential git libcurl4-openssl-dev
apt-get install -y build-essential git cmake libcurl4-openssl-dev

WORKDIR /app

COPY . .

# Set nvcc architecture
ENV CUDA_DOCKER_ARCH=${CUDA_DOCKER_ARCH}
# Enable CUDA
ENV GGML_CUDA=1
# Enable cURL
ENV LLAMA_CURL=1

RUN make -j$(nproc) llama-server
# Use the default CUDA archs if not specified
RUN if [ "${CUDA_DOCKER_ARCH}" != "default" ]; then \
export CMAKE_ARGS="-DCMAKE_CUDA_ARCHITECTURES=${CUDA_DOCKER_ARCH}"; \
fi && \
cmake -B build -DGGML_NATIVE=OFF -DGGML_CUDA=ON -DLLAMA_CURL=ON ${CMAKE_ARGS} -DCMAKE_EXE_LINKER_FLAGS=-Wl,--allow-shlib-undefined . && \
cmake --build build --config Release --target llama-server -j$(nproc) && \
mkdir -p /app/lib && \
find build -name "*.so" -exec cp {} /app/lib \;

FROM ${BASE_CUDA_RUN_CONTAINER} AS runtime

RUN apt-get update && \
apt-get install -y libcurl4-openssl-dev libgomp1 curl

COPY --from=build /app/llama-server /llama-server
COPY --from=build /app/lib/ /
COPY --from=build /app/build/bin/llama-server /llama-server

# Must be set to 0.0.0.0 so it can listen to requests from host machine
ENV LLAMA_ARG_HOST=0.0.0.0

HEALTHCHECK CMD [ "curl", "-f", "http://localhost:8080/health" ]

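
With LLAMA_ARG_HOST baked in as 0.0.0.0, the container only needs a published port to be reachable from the host; a sketch of a run command, where the tag and model path are placeholders and 8080 matches the health check above:

docker run --gpus all -p 8080:8080 -v /path/to/models:/models local/llama.cpp:server-cuda \
    -m /models/model.gguf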
6 changes: 4 additions & 2 deletions .devops/llama-server-intel.Dockerfile
@@ -1,4 +1,4 @@
ARG ONEAPI_VERSION=2024.1.1-devel-ubuntu22.04
ARG ONEAPI_VERSION=2025.0.0-0-devel-ubuntu22.04

FROM intel/oneapi-basekit:$ONEAPI_VERSION AS build

@@ -15,7 +15,7 @@ RUN if [ "${GGML_SYCL_F16}" = "ON" ]; then \
export OPT_SYCL_F16="-DGGML_SYCL_F16=ON"; \
fi && \
echo "Building with dynamic libs" && \
cmake -B build -DGGML_SYCL=ON -DCMAKE_C_COMPILER=icx -DCMAKE_CXX_COMPILER=icpx -DLLAMA_CURL=ON ${OPT_SYCL_F16} && \
cmake -B build -DGGML_NATIVE=OFF -DGGML_SYCL=ON -DCMAKE_C_COMPILER=icx -DCMAKE_CXX_COMPILER=icpx -DLLAMA_CURL=ON ${OPT_SYCL_F16} && \
cmake --build build --config Release --target llama-server

FROM intel/oneapi-basekit:$ONEAPI_VERSION AS runtime
@@ -26,6 +26,8 @@ RUN apt-get update && \
COPY --from=build /app/build/bin/llama-server /llama-server

ENV LC_ALL=C.utf8
# Must be set to 0.0.0.0 so it can listen to requests from host machine
ENV LLAMA_ARG_HOST=0.0.0.0

HEALTHCHECK CMD [ "curl", "-f", "http://localhost:8080/health" ]

36 changes: 36 additions & 0 deletions .devops/llama-server-musa.Dockerfile
@@ -0,0 +1,36 @@
ARG UBUNTU_VERSION=22.04
# This needs to generally match the container host's environment.
ARG MUSA_VERSION=rc3.1.0
# Target the MUSA build image
ARG BASE_MUSA_DEV_CONTAINER=mthreads/musa:${MUSA_VERSION}-devel-ubuntu${UBUNTU_VERSION}
# Target the MUSA runtime image
ARG BASE_MUSA_RUN_CONTAINER=mthreads/musa:${MUSA_VERSION}-runtime-ubuntu${UBUNTU_VERSION}

FROM ${BASE_MUSA_DEV_CONTAINER} AS build

RUN apt-get update && \
apt-get install -y build-essential git cmake libcurl4-openssl-dev

WORKDIR /app

COPY . .

RUN cmake -B build -DGGML_NATIVE=OFF -DGGML_MUSA=ON -DLLAMA_CURL=ON ${CMAKE_ARGS} -DCMAKE_EXE_LINKER_FLAGS=-Wl,--allow-shlib-undefined . && \
cmake --build build --config Release --target llama-server -j$(nproc) && \
mkdir -p /app/lib && \
find build -name "*.so" -exec cp {} /app/lib \;

FROM ${BASE_MUSA_RUN_CONTAINER} AS runtime

RUN apt-get update && \
apt-get install -y libcurl4-openssl-dev libgomp1 curl

COPY --from=build /app/lib/ /
COPY --from=build /app/build/bin/llama-server /llama-server

# Must be set to 0.0.0.0 so it can listen to requests from host machine
ENV LLAMA_ARG_HOST=0.0.0.0

HEALTHCHECK CMD [ "curl", "-f", "http://localhost:8080/health" ]

ENTRYPOINT [ "/llama-server" ]