| `--props` | enable changing global properties via POST /props (default: disabled)<br/>(env: LLAMA_ARG_ENDPOINT_PROPS) |
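For anyone trying the flag, a minimal client sketch (assuming a server built from this branch listening at `http://localhost:8080`; the binary invocation and port are assumptions, not part of this PR):

```python
# Sketch: exercising /props, assuming a server started with --props,
# e.g.  llama-server -m model.gguf --props   (invocation is an assumption)
import requests

BASE = "http://localhost:8080"

# GET /props is public and read-only by default
print(requests.get(f"{BASE}/props").json())

# POST /props is only accepted when the server was started with --props;
# after this change there are no writable properties left ("None yet"),
# so an empty object is the only sensible payload
r = requests.post(f"{BASE}/props", json={})
print(r.status_code)
```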
@@ -320,7 +319,6 @@ node index.js
 
 - The prompt is a string or an array with the first element given as a string
 - The model's `tokenizer.ggml.add_bos_token` metadata is `true`
-- The system prompt is empty
 
 `temperature`: Adjust the randomness of the generated text. Default: `0.8`
 
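After this change, BOS insertion depends only on the two remaining conditions. A rough Python sketch of the resulting rule (a hypothetical helper; the real check lives in the C++ server):

```python
# Hypothetical sketch of the BOS-insertion rule after this change;
# the actual logic is inside the C++ server, not exposed like this.
def should_insert_bos(prompt, add_bos_token: bool) -> bool:
    # condition 1: the prompt is a string, or an array whose first element is a string
    prompt_ok = isinstance(prompt, str) or (
        isinstance(prompt, list) and len(prompt) > 0 and isinstance(prompt[0], str)
    )
    # condition 2: the model's tokenizer.ggml.add_bos_token metadata is true;
    # the old third condition ("the system prompt is empty") is removed by this PR
    return prompt_ok and add_bos_token
```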
@@ -536,14 +534,12 @@ This endpoint is public (no API key check). By default, it is read-only. To make
 
 ```json
 {
-  "system_prompt": "",
   "default_generation_settings": { ... },
   "total_slots": 1,
   "chat_template": ""
 }
 ```
 
-- `system_prompt` - the system prompt (initial prompt of all slots). Please note that this does not take into account the chat template. It will append the prompt at the beginning of formatted prompt.
 - `default_generation_settings` - the default generation settings for the `/completion` endpoint, which has the same fields as the `generation_settings` response object from the `/completion` endpoint.
 - `total_slots` - the total number of slots for process requests (defined by `--parallel` option)
 - `chat_template` - the model's original Jinja2 prompt template
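A short sketch of reading the trimmed-down response (the URL is an assumption):

```python
import requests

props = requests.get("http://localhost:8080/props").json()
assert "system_prompt" not in props          # removed by this PR
print(props["total_slots"])                  # set via --parallel
print(props["default_generation_settings"])  # same fields as /completion's generation_settings
print(props["chat_template"])                # the model's original Jinja2 template
```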
@@ -554,7 +550,7 @@ To use this endpoint with POST method, you need to start server with `--props`
 
 *Options:*
 
-- `system_prompt`: Change the system prompt (initial prompt of all slots). Please note that this does not take into account the chat template. It will append the prompt at the beginning of formatted prompt.
+- None yet
 
 
 ### POST `/v1/chat/completions`: OpenAI-compatible Chat Completions API
     // if context shift is disabled, we make sure prompt size is smaller than KV size
-    if ((int) system_tokens.size() + slot.n_prompt_tokens >= slot.n_ctx) {
+    if (slot.n_prompt_tokens >= slot.n_ctx) {
         slot.release();
         send_error(slot, "the request exceeds the available context size. try increasing the context size or enable context shift", ERROR_TYPE_INVALID_REQUEST);
         continue;
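From a client's point of view, the simplified check means an oversized prompt is rejected outright rather than being offset by system-prompt tokens. A hedged example (the URL, port, and deliberately small context are assumptions):

```python
# Assumes a server started with a small context (e.g. -c 256) and
# context shift disabled; URL and port are assumptions.
import requests

r = requests.post(
    "http://localhost:8080/completion",
    json={"prompt": "word " * 10_000, "n_predict": 8},
)
# expect an invalid-request error suggesting a larger context size
# or enabling context shift
print(r.status_code, r.text)
```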
@@ -2138,22 +2081,19 @@ struct server_context {
     }
 
     // keep only the common part
-    int p0 = (int) system_tokens.size() + slot.n_past;
+    int p0 = slot.n_past;
+
     if (!llama_kv_cache_seq_rm(ctx, slot.id + 1, p0, -1)) {
         // could not partially delete (likely using a non-Transformer model)
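Conceptually, dropping `system_tokens` moves the cache cut point to the bare `n_past`. A pure-Python model of the step, not the llama.cpp API:

```python
# Toy model of "keep only the common part": entries [0, p0) of a slot's
# cache are kept, everything after p0 is evicted.
def keep_common_part(cached: list[int], n_past: int) -> list[int]:
    # before this PR the cut point was len(system_tokens) + n_past;
    # with system prompts removed it is simply n_past
    p0 = n_past
    return cached[:p0]
```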