
Temp #26

Merged: 72 commits, Dec 1, 2024

Changes from 1 commit (of 72)
b756441
metal : minor code formatting
ggerganov Nov 25, 2024
f6d12e7
tests : fix compile warning
ggerganov Nov 25, 2024
5931c1f
ggml : add support for dynamic loading of backends (#10469)
slaren Nov 25, 2024
9ca2e67
server : add speculative decoding support (#10455)
ggerganov Nov 25, 2024
a9a678a
Add download chat feature to server chat (#10481)
brucepro Nov 25, 2024
1f92225
Github: update issue templates [no ci] (#10489)
JohannesGaessler Nov 25, 2024
10bce04
llama : accept a list of devices to use to offload a model (#10497)
slaren Nov 25, 2024
80acb7b
Rename Olmo1124 to Olmo2 (#10500)
2015aroras Nov 25, 2024
106964e
metal : enable mat-vec kernels for bs <= 4 (#10491)
ggerganov Nov 25, 2024
47f931c
server : enable cache_prompt by default (#10501)
ggerganov Nov 25, 2024
9fd8c26
server : add more information about error (#10455)
ggerganov Nov 25, 2024
50d5cec
ci : build docker images only once daily (#10503)
slaren Nov 25, 2024
0cc6375
Introduce llama-run (#10291)
ericcurtin Nov 25, 2024
0eb4e12
vulkan: Fix a vulkan-shaders-gen argument parsing error (#10484)
sparkleholic Nov 26, 2024
7066b4c
CANN: RoPE and CONCAT operator optimization (#10488)
noemotiovon Nov 26, 2024
9a4b79b
CANN: Improve the Inferencing Performance for Ascend NPU Device (#10454)
shen-shanshan Nov 26, 2024
811872a
speculative : simplify the implementation (#10504)
ggerganov Nov 26, 2024
84e1c33
server : fix parallel speculative decoding (#10513)
ggerganov Nov 26, 2024
25669aa
ggml-cpu: cmake add arm64 cpu feature check for macos (#10487)
chaxu01 Nov 26, 2024
c6807b3
ci : add ubuntu cuda build, build with one arch on windows (#10456)
slaren Nov 26, 2024
7db3846
ci : publish the docker images created during scheduled runs (#10515)
slaren Nov 26, 2024
ab96610
cmake : enable warnings in llama (#10474)
ggerganov Nov 26, 2024
0bbd226
restore the condition to build & update package when merge (#10507)
NeoZhangJianyu Nov 26, 2024
45abe0f
server : replace behave with pytest (#10416)
ngxson Nov 26, 2024
904109e
vulkan: fix group_norm (#10496)
jeffbolznv Nov 26, 2024
249cd93
mtgpu: Add MUSA_DOCKER_ARCH in Dockerfiles && update cmake and make (…
yeahdongcn Nov 26, 2024
be0e350
Fix HIP flag inconsistency & build docs (#10524)
tristandruyen Nov 26, 2024
30ec398
llama : disable warnings for 3rd party sha1 dependency (#10527)
slaren Nov 26, 2024
5a349f2
ci : remove nix workflows (#10526)
slaren Nov 26, 2024
de50973
Add OLMo 2 model in docs (#10530)
2015aroras Nov 26, 2024
c9b00a7
ci : fix cuda releases (#10532)
slaren Nov 26, 2024
4a57d36
vulkan: optimize Q2_K and Q3_K mul_mat_vec (#10459)
jeffbolznv Nov 27, 2024
71a6498
vulkan: skip integer div/mod in get_offsets for batch_idx==0 (#10506)
jeffbolznv Nov 27, 2024
249a790
vulkan: further optimize q5_k mul_mat_vec (#10479)
jeffbolznv Nov 27, 2024
5b3466b
vulkan: Handle GPUs with less shared memory (#10468)
jeffbolznv Nov 27, 2024
c31ed2a
vulkan: define all quant data structures in types.comp (#10440)
jeffbolznv Nov 27, 2024
9150f8f
Do not include arm_neon.h when compiling CUDA code (ggml/1028)
frankier Nov 26, 2024
fee824a
sync : ggml
ggerganov Nov 27, 2024
9e2301f
metal : fix group_norm support condition (#0)
ggerganov Nov 27, 2024
46c69e0
ci : faster CUDA toolkit installation method and use ccache (#10537)
slaren Nov 27, 2024
3ad5451
Add some minimal optimizations for CDNA (#10498)
IMbackK Nov 27, 2024
9f91251
common : fix duplicated file name with hf_repo and hf_file (#10550)
ngxson Nov 27, 2024
b742013
CANN: ROPE operator optimization (#10540)
noemotiovon Nov 28, 2024
605fa66
CANN: Fix SOC_TYPE compile bug (#10519)
leo-pony Nov 28, 2024
c6bc739
CANN: Update cann.md to display correctly in CLion (#10538)
HRXWEB Nov 28, 2024
2025fa6
kompute : improve backend to pass test_backend_ops (#10542)
slp Nov 28, 2024
c202cef
ggml-cpu: support IQ4_NL_4_4 by runtime repack (#10541)
FanShupei Nov 28, 2024
eea986f
cmake : fix ARM feature detection (#10543)
ggerganov Nov 28, 2024
76b27d2
ggml : fix row condition for i8mm kernels (#10561)
ggerganov Nov 28, 2024
e90688e
ci : fix tag name in cuda and hip releases (#10566)
slaren Nov 28, 2024
7281cf1
docs: fix outdated usage of llama-simple (#10565)
rand-fly Nov 28, 2024
8907193
common: fix warning message when no GPU found (#10564)
JohannesGaessler Nov 28, 2024
6c59567
server : (tests) don't use thread for capturing stdout/stderr, bump o…
ngxson Nov 28, 2024
4c0a95b
llama : add missing model types
ggerganov Nov 28, 2024
dc22344
ggml : remove redundant copyright notice + update authors
ggerganov Nov 28, 2024
678d799
llava: return false instead of exit (#10546)
tinglou Nov 29, 2024
f095a64
vulkan: get the first command buffer submitted sooner (#10499)
jeffbolznv Nov 29, 2024
938f608
CANN: RoPE operator optimization (#10563)
noemotiovon Nov 29, 2024
266b851
sycl : Reroute permuted mul_mats through oneMKL (#10408)
Alcpz Nov 29, 2024
0f77aae
sycl : offload of get_rows set to 0 (#10432)
Alcpz Nov 29, 2024
4b3242b
ggml-cpu: fix typo in gemv/gemm iq4_nl_4_4 (#10580)
FanShupei Nov 29, 2024
f0678c5
ggml : fix I8MM Q4_1 scaling factor conversion (#10562)
ggerganov Nov 29, 2024
a3a3048
cleanup UI link list (#10577)
slaren Nov 29, 2024
3a8e9af
imatrix : support combine-only (#10492)
robbiemu Nov 29, 2024
b782e5c
server : add more test cases (#10569)
ngxson Nov 29, 2024
7cc2d2c
ggml : move AMX to the CPU backend (#10570)
slaren Nov 29, 2024
0533e7f
vulkan: Dynamic subgroup size support for Q6_K mat_vec (#10536)
netrunnereve Nov 30, 2024
abadba0
readme : refresh (#10587)
ggerganov Nov 30, 2024
3e0ba0e
readme : remove old badge
ggerganov Nov 30, 2024
0c39f44
ggml-cpu: replace AArch64 NEON assembly with intrinsics in ggml_gemv_…
angt Nov 30, 2024
43957ef
build: update Makefile comments for C++ version change (#10598)
wangqin0 Dec 1, 2024
cf80952
Merge branch 'master' into Temp
apicalshark Dec 1, 2024
server : enable cache_prompt by default (ggml-org#10501)
ggml-ci
ggerganov authored Nov 25, 2024
commit 47f931c8f9a26c072d71224bc8013cc66ea9e445
2 changes: 1 addition & 1 deletion examples/server/README.md
@@ -412,7 +412,7 @@ node index.js

`id_slot`: Assign the completion task to a specific slot. If `-1`, the task will be assigned to an idle slot. Default: `-1`

- `cache_prompt`: Re-use KV cache from a previous request if possible. This way the common prefix does not have to be re-processed, only the suffix that differs between the requests. Because (depending on the backend) the logits are **not** guaranteed to be bit-for-bit identical for different batch sizes (prompt processing vs. token generation) enabling this option can cause nondeterministic results. Default: `false`
+ `cache_prompt`: Re-use KV cache from a previous request if possible. This way the common prefix does not have to be re-processed, only the suffix that differs between the requests. Because (depending on the backend) the logits are **not** guaranteed to be bit-for-bit identical for different batch sizes (prompt processing vs. token generation) enabling this option can cause nondeterministic results. Default: `true`

`samplers`: The order the samplers should be applied in. An array of strings representing sampler type names. If a sampler is not set, it will not be used. If a sampler is specified more than once, it will be applied multiple times. Default: `["dry", "top_k", "typ_p", "top_p", "min_p", "xtc", "temperature"]` - these are all the available values.
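With the default flipped to `true`, a client that relied on the old behavior can still opt out per request. A minimal sketch of such a request body, using the field names from the README above (the prompt and parameter values are illustrative, not from the source):

```python
import json

# Hypothetical /completion request body for llama-server. With the new
# default, "cache_prompt" can simply be omitted to get prefix caching;
# sending it explicitly as false restores the old, non-caching behavior.
payload = {
    "prompt": "Once upon a time",   # illustrative prompt
    "n_predict": 16,
    "cache_prompt": False,          # opt out of the new default
}
body = json.dumps(payload)
print(body)
```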

4 changes: 2 additions & 2 deletions examples/server/server.cpp
@@ -111,7 +111,7 @@ struct server_static_file {

struct slot_params {
bool stream = true;
-    bool cache_prompt = false; // remember the prompt to avoid reprocessing all prompt
+    bool cache_prompt = true;  // remember the prompt to avoid reprocessing all prompt

int32_t n_keep = 0; // number of tokens to keep from initial prompt
int32_t n_discard = 0; // number of tokens after n_keep that may be discarded when shifting context, 0 defaults to half
@@ -883,7 +883,7 @@ struct server_context {
}

slot.params.stream = json_value(data, "stream", false);
-        slot.params.cache_prompt = json_value(data, "cache_prompt", false);
+        slot.params.cache_prompt = json_value(data, "cache_prompt", true);
slot.params.n_predict = json_value(data, "n_predict", json_value(data, "max_tokens", defaults.n_predict));
slot.params.n_indent = json_value(data, "n_indent", defaults.n_indent);
slot.params.n_keep = json_value(data, "n_keep", defaults.n_keep);
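In the hunk above, the last argument to `json_value` is the fallback a request gets when it omits the field, so changing it is what makes the new default take effect. A minimal Python sketch of that lookup-with-default semantics (not the real C++ helper, just its behavior as used here):

```python
def json_value(data, key, default):
    # Mirror of the server's json_value helper as used above: return the
    # field from the request JSON if present, otherwise the server default.
    return data.get(key, default)

# A request that omits cache_prompt now gets the new default, True;
# an explicit false still disables prompt caching for that request.
print(json_value({}, "cache_prompt", True))                       # True
print(json_value({"cache_prompt": False}, "cache_prompt", True))  # False
```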