Squashed commit of the following:
commit e18a046
Author: kabachuha <artemkhrapov2001@yandex.ru>
Date:   Sat Nov 4 22:12:51 2023 +0300

    fix openai extension not working because of absent new defaults (oobabooga#4477)

commit b7a409e
Merge: b5c5304 fb3bd02
Author: oobabooga <112222186+oobabooga@users.noreply.github.com>
Date:   Sat Nov 4 15:04:43 2023 -0300

    Merge pull request oobabooga#4476 from oobabooga/dev

    Merge dev branch

commit fb3bd02
Author: oobabooga <112222186+oobabooga@users.noreply.github.com>
Date:   Sat Nov 4 11:02:24 2023 -0700

    Update docs

commit 1d8c7c1
Author: oobabooga <112222186+oobabooga@users.noreply.github.com>
Date:   Sat Nov 4 11:01:15 2023 -0700

    Update docs

commit b5c5304
Merge: 262f8ae 40f7f37
Author: oobabooga <112222186+oobabooga@users.noreply.github.com>
Date:   Sat Nov 4 14:19:55 2023 -0300

    Merge pull request oobabooga#4475 from oobabooga/dev

    Merge dev branch

commit 40f7f37
Author: oobabooga <112222186+oobabooga@users.noreply.github.com>
Date:   Sat Nov 4 10:12:06 2023 -0700

    Update requirements

commit 2081f43
Author: Orang <51061118+Soefati@users.noreply.github.com>
Date:   Sun Nov 5 00:00:24 2023 +0700

    Bump transformers to 4.35.* (oobabooga#4474)

commit 4766a57
Author: feng lui <3090641@qq.com>
Date:   Sun Nov 5 00:59:33 2023 +0800

    transformers: add use_flash_attention_2 option (oobabooga#4373)

commit add3593
Author: wouter van der plas <2423856+wvanderp@users.noreply.github.com>
Date:   Sat Nov 4 17:41:42 2023 +0100

    fixed two links in the ui (oobabooga#4452)

commit cfbd108
Author: Casper <casperbh.96@gmail.com>
Date:   Sat Nov 4 17:09:41 2023 +0100

    Bump AWQ to 0.1.6 (oobabooga#4470)

commit aa5d671
Author: oobabooga <112222186+oobabooga@users.noreply.github.com>
Date:   Sat Nov 4 13:09:07 2023 -0300

    Add temperature_last parameter (oobabooga#4472)

commit 1ab8700
Author: oobabooga <112222186+oobabooga@users.noreply.github.com>
Date:   Fri Nov 3 17:38:19 2023 -0700

    Change frequency/presence penalty ranges

commit 45fcb60
Author: oobabooga <112222186+oobabooga@users.noreply.github.com>
Date:   Fri Nov 3 11:29:31 2023 -0700

    Make truncation_length_max apply to max_seq_len/n_ctx

commit 7f9c1cb
Author: oobabooga <112222186+oobabooga@users.noreply.github.com>
Date:   Fri Nov 3 08:25:22 2023 -0700

    Change min_p default to 0.0

commit 4537853
Author: oobabooga <112222186+oobabooga@users.noreply.github.com>
Date:   Fri Nov 3 08:13:50 2023 -0700

    Change min_p default to 1.0

commit 367e5e6
Author: kalomaze <66376113+kalomaze@users.noreply.github.com>
Date:   Thu Nov 2 14:32:51 2023 -0500

    Implement Min P as a sampler option in HF loaders (oobabooga#4449)

commit fcb7017
Author: oobabooga <112222186+oobabooga@users.noreply.github.com>
Date:   Thu Nov 2 12:24:09 2023 -0700

    Remove a checkbox

commit fdcaa95
Author: Julien Chaumond <julien@huggingface.co>
Date:   Thu Nov 2 20:20:54 2023 +0100

    transformers: Add a flag to force load from safetensors (oobabooga#4450)

commit c065547
Author: oobabooga <112222186+oobabooga@users.noreply.github.com>
Date:   Thu Nov 2 11:23:04 2023 -0700

    Add cache_8bit option

commit 42f8163
Merge: 77abd9b a56ef2a
Author: oobabooga <112222186+oobabooga@users.noreply.github.com>
Date:   Thu Nov 2 11:09:26 2023 -0700

    Merge remote-tracking branch 'refs/remotes/origin/dev' into dev

commit 77abd9b
Author: oobabooga <112222186+oobabooga@users.noreply.github.com>
Date:   Thu Nov 2 08:19:42 2023 -0700

    Add no_flash_attn option

commit a56ef2a
Author: Julien Chaumond <julien@huggingface.co>
Date:   Thu Nov 2 18:07:08 2023 +0100

    make torch.load a bit safer (oobabooga#4448)

commit deba039
Author: deevis <darren.hicks@gmail.com>
Date:   Tue Oct 31 22:51:00 2023 -0600

    (fix): OpenOrca-Platypus2 models should use correct instruction_template and custom_stopping_strings (oobabooga#4435)

commit aaf726d
Author: Mehran Ziadloo <mehranziadloo@gmail.com>
Date:   Tue Oct 31 21:29:57 2023 -0700

    Updating the shared settings object when loading a model (oobabooga#4425)

commit 9bd0724
Author: oobabooga <112222186+oobabooga@users.noreply.github.com>
Date:   Tue Oct 31 20:57:56 2023 -0700

    Change frequency/presence penalty ranges

commit 6b7fa45
Author: Orang <51061118+Soefati@users.noreply.github.com>
Date:   Wed Nov 1 05:12:14 2023 +0700

    Update exllamav2 version (oobabooga#4417)

commit 41e159e
Author: Casper <casperbh.96@gmail.com>
Date:   Tue Oct 31 23:11:22 2023 +0100

    Bump AutoAWQ to v0.1.5 (oobabooga#4410)

commit 0707ed7
Author: Meheret <101792782+senadev42@users.noreply.github.com>
Date:   Wed Nov 1 01:09:05 2023 +0300

    updated wiki link (oobabooga#4415)

commit 262f8ae
Author: oobabooga <112222186+oobabooga@users.noreply.github.com>
Date:   Fri Oct 27 06:49:14 2023 -0700

    Use default gr.Dataframe for evaluation table

commit f481ce3
Author: James Braza <jamesbraza@gmail.com>
Date:   Thu Oct 26 21:02:28 2023 -0700

    Adding `platform_system` to `autoawq` (oobabooga#4390)

commit af98587
Author: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Date:   Fri Oct 27 00:46:16 2023 -0300

    Update accelerate requirement from ==0.23.* to ==0.24.* (oobabooga#4400)

commit 839a87b
Author: oobabooga <112222186+oobabooga@users.noreply.github.com>
Date:   Thu Oct 26 20:26:25 2023 -0700

    Fix is_ccl_available & is_xpu_available imports

commit 778a010
Author: Abhilash Majumder <30946547+abhilash1910@users.noreply.github.com>
Date:   Fri Oct 27 08:09:51 2023 +0530

    Intel Gpu support initialization (oobabooga#4340)

commit 317e2c8
Author: GuizzyQC <86683381+GuizzyQC@users.noreply.github.com>
Date:   Thu Oct 26 22:03:21 2023 -0400

    sd_api_pictures: fix Gradio warning message regarding custom value (oobabooga#4391)

commit 92b2f57
Author: oobabooga <112222186+oobabooga@users.noreply.github.com>
Date:   Thu Oct 26 18:57:32 2023 -0700

    Minor metadata bug fix (second attempt)

commit 2d97897
Author: oobabooga <112222186+oobabooga@users.noreply.github.com>
Date:   Wed Oct 25 11:21:18 2023 -0700

    Don't install flash-attention on windows + cuda 11

commit 0ced78f
Author: LightningDragon <lightningdragon96@gmail.com>
Date:   Wed Oct 25 09:15:34 2023 -0600

    Replace hashlib.sha256 with hashlib.file_digest so we don't need to load entire files into ram before hashing them. (oobabooga#4383)

commit 72f6fc6
Author: tdrussell <6509934+tdrussell@users.noreply.github.com>
Date:   Wed Oct 25 10:10:28 2023 -0500

    Rename additive_repetition_penalty to presence_penalty, add frequency_penalty (oobabooga#4376)

commit ef1489c
Author: oobabooga <112222186+oobabooga@users.noreply.github.com>
Date:   Mon Oct 23 20:45:43 2023 -0700

    Remove unused parameter in AutoAWQ

commit 1edf321
Author: oobabooga <112222186+oobabooga@users.noreply.github.com>
Date:   Mon Oct 23 13:09:03 2023 -0700

    Lint

commit 280ae72
Author: oobabooga <112222186+oobabooga@users.noreply.github.com>
Date:   Mon Oct 23 13:07:17 2023 -0700

    Organize

commit 49e5eec
Merge: 82c11be 4bc4113
Author: oobabooga <112222186+oobabooga@users.noreply.github.com>
Date:   Mon Oct 23 12:54:05 2023 -0700

    Merge remote-tracking branch 'refs/remotes/origin/main'

commit 82c11be
Author: oobabooga <112222186+oobabooga@users.noreply.github.com>
Date:   Mon Oct 23 12:49:07 2023 -0700

    Update 04 - Model Tab.md

commit 306d764
Author: oobabooga <112222186+oobabooga@users.noreply.github.com>
Date:   Mon Oct 23 12:46:24 2023 -0700

    Minor metadata bug fix

commit 4bc4113
Author: adrianfiedler <adrian_fiedler@msn.com>
Date:   Mon Oct 23 19:09:57 2023 +0200

    Fix broken links (oobabooga#4367)

    ---------

    Co-authored-by: oobabooga <112222186+oobabooga@users.noreply.github.com>

commit 92691ee
Author: oobabooga <112222186+oobabooga@users.noreply.github.com>
Date:   Mon Oct 23 09:57:44 2023 -0700

    Disable trust_remote_code by default
Begelit committed Nov 6, 2023
1 parent bb59dc3 commit 2273473
Showing 45 changed files with 384 additions and 174 deletions.
1 change: 1 addition & 0 deletions .gitignore
@@ -26,6 +26,7 @@
.DS_Store
.eslintrc.js
.idea
.env
.venv
venv
*.bak
11 changes: 7 additions & 4 deletions README.md
@@ -18,8 +18,8 @@
* 4-bit, 8-bit, and CPU inference through the transformers library
* Use llama.cpp models with transformers samplers (`llamacpp_HF` loader)
* [Multimodal pipelines, including LLaVA and MiniGPT-4](https://github.com/oobabooga/text-generation-webui/tree/main/extensions/multimodal)
* [Extensions framework](docs/Extensions.md)
* [Custom chat characters](docs/Chat-mode.md)
* [Extensions framework](https://github.com/oobabooga/text-generation-webui/wiki/07-%E2%80%90-Extensions)
* [Custom chat characters](https://github.com/oobabooga/text-generation-webui/wiki/03-%E2%80%90-Parameters-Tab#character)
* Very efficient text streaming
* Markdown output with LaTeX rendering, to use for instance with [GALACTICA](https://github.com/paperswithcode/galai)
* API, including endpoints for websocket streaming ([see the examples](https://github.com/oobabooga/text-generation-webui/blob/main/api-examples))
@@ -60,7 +60,7 @@
#### Other info

* There is no need to run any of those scripts as admin/root.
* For additional instructions about AMD setup, WSL setup, and nvcc installation, consult [this page](https://github.com/oobabooga/text-generation-webui/blob/main/docs/One-Click-Installers.md).
* For additional instructions about AMD setup, WSL setup, and nvcc installation, consult [the documentation](https://github.com/oobabooga/text-generation-webui/wiki).
* The installer has been tested mostly on NVIDIA GPUs. If you can find a way to improve it for your AMD/Intel Arc/Mac Metal GPU, you are highly encouraged to submit a PR to this repository. The main file to be edited is `one_click.py`.
* For automated installation, you can use the `GPU_CHOICE`, `USE_CUDA118`, `LAUNCH_AFTER_INSTALL`, and `INSTALL_EXTENSIONS` environment variables. For instance: `GPU_CHOICE=A USE_CUDA118=FALSE LAUNCH_AFTER_INSTALL=FALSE INSTALL_EXTENSIONS=FALSE ./start_linux.sh`.

@@ -170,7 +170,7 @@ cp docker/.env.example .env
docker compose up --build
```

* You need to have docker compose v2.17 or higher installed. See [this guide](https://github.com/oobabooga/text-generation-webui/blob/main/docs/Docker.md) for instructions.
* You need to have docker compose v2.17 or higher installed. See [this guide](https://github.com/oobabooga/text-generation-webui/wiki/09-%E2%80%90-Docker) for instructions.
* For additional docker files, check out [this repository](https://github.com/Atinoda/text-generation-webui-docker).

### Updating the requirements
@@ -300,6 +300,7 @@ Optionally, you can use the following command-line flags:
| `--sdp-attention` | Use PyTorch 2.0's SDP attention. Same as above. |
| `--trust-remote-code` | Set `trust_remote_code=True` while loading the model. Necessary for some models. |
| `--use_fast` | Set `use_fast=True` while loading the tokenizer. |
| `--use_flash_attention_2` | Set `use_flash_attention_2=True` while loading the model. |

#### Accelerate 4-bit

@@ -336,6 +337,8 @@ Optionally, you can use the following command-line flags:
|`--gpu-split` | Comma-separated list of VRAM (in GB) to use per GPU device for model layers. Example: 20,7,7. |
|`--max_seq_len MAX_SEQ_LEN` | Maximum sequence length. |
|`--cfg-cache` | ExLlama_HF: Create an additional cache for CFG negative prompts. Necessary to use CFG with that loader, but not necessary for CFG with base ExLlama. |
|`--no_flash_attn` | Force flash-attention to not be used. |
|`--cache_8bit` | Use 8-bit cache to save VRAM. |

#### AutoGPTQ

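For context on the flags added above (`--use_flash_attention_2` for the transformers loader, `--no_flash_attn` and `--cache_8bit` for ExLlamav2), here is a minimal sketch of what the first one roughly corresponds to at the transformers level. The model name and dtype are placeholders, not values from this commit, and newer transformers releases expose the same switch as `attn_implementation="flash_attention_2"`:

```python
# Rough equivalent of launching with --use_flash_attention_2 (transformers 4.34/4.35 era).
# Model name and dtype are placeholders for illustration only.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-hf"  # placeholder

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,
    device_map="auto",
    use_flash_attention_2=True,  # requires the flash-attn package and a supported GPU
)
```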
3 changes: 2 additions & 1 deletion api-examples/api-example-chat-stream.py
@@ -52,7 +52,8 @@ async def run(user_input, history):
'tfs': 1,
'top_a': 0,
'repetition_penalty': 1.18,
'additive_repetition_penalty': 0,
'presence_penalty': 0,
'frequency_penalty': 0,
'repetition_penalty_range': 0,
'top_k': 40,
'min_length': 0,
3 changes: 2 additions & 1 deletion api-examples/api-example-chat.py
@@ -46,7 +46,8 @@ def run(user_input, history):
'tfs': 1,
'top_a': 0,
'repetition_penalty': 1.18,
'additive_repetition_penalty': 0,
'presence_penalty': 0,
'frequency_penalty': 0,
'repetition_penalty_range': 0,
'top_k': 40,
'min_length': 0,
3 changes: 2 additions & 1 deletion api-examples/api-example-stream.py
@@ -35,7 +35,8 @@ async def run(context):
'tfs': 1,
'top_a': 0,
'repetition_penalty': 1.18,
'additive_repetition_penalty': 0,
'presence_penalty': 0,
'frequency_penalty': 0,
'repetition_penalty_range': 0,
'top_k': 40,
'min_length': 0,
3 changes: 2 additions & 1 deletion api-examples/api-example.py
@@ -27,7 +27,8 @@ def run(prompt):
'tfs': 1,
'top_a': 0,
'repetition_penalty': 1.18,
'additive_repetition_penalty': 0,
'presence_penalty': 0,
'frequency_penalty': 0,
'repetition_penalty_range': 0,
'top_k': 40,
'min_length': 0,
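All four API examples above change the same way: `additive_repetition_penalty` becomes `presence_penalty`, and a separate `frequency_penalty` is added. A minimal request sketch against the legacy blocking API; the host, port, and endpoint are the usual defaults of that API, assumed here rather than taken from this commit:

```python
# Minimal request body using the renamed sampling keys.
import requests

URI = "http://127.0.0.1:5000/api/v1/generate"  # assumed default for the blocking API

payload = {
    "prompt": "Write a haiku about autumn.",
    "max_new_tokens": 80,
    "temperature": 0.7,
    "repetition_penalty": 1.18,
    "presence_penalty": 0,    # was 'additive_repetition_penalty'
    "frequency_penalty": 0,   # new: scales with how often a token already appeared
}

response = requests.post(URI, json=payload)
print(response.json()["results"][0]["text"])
```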
8 changes: 0 additions & 8 deletions css/main.css
@@ -648,11 +648,3 @@ div.svelte-362y77>*, div.svelte-362y77>.form>* {
.options {
z-index: 100 !important;
}

/* ----------------------------------------------
Increase the height of the evaluation table
---------------------------------------------- */
#evaluation-table table {
max-height: none !important;
overflow-y: auto !important;
}
7 changes: 5 additions & 2 deletions docs/03 ‐ Parameters Tab.md
@@ -33,9 +33,11 @@
* **max_new_tokens**: Maximum number of tokens to generate. Don't set it higher than necessary: it is used in the truncation calculation through the formula `(prompt_length) = min(truncation_length - max_new_tokens, prompt_length)`, so your prompt will get truncated if you set it too high.
* **temperature**: Primary factor to control the randomness of outputs. 0 = deterministic (only the most likely token is used). Higher value = more randomness.
* **top_p**: If not set to 1, select tokens with probabilities adding up to less than this number. Higher value = higher range of possible random results.
* **min_p**: Tokens with probability smaller than `(min_p) * (probability of the most likely token)` are discarded. This is the same as top_a but without squaring the probability.
* **top_k**: Similar to top_p, but select instead only the top_k most likely tokens. Higher value = higher range of possible random results.
* **repetition_penalty**: Penalty factor for repeating prior tokens. 1 means no penalty, higher value = less repetition, lower value = more repetition.
* **additive_repetition_penalty**: Similar to repetition_penalty, but with an additive offset on the raw token scores instead of a multiplicative factor. It may generate better results. 0 means no penalty, higher value = less repetition, lower value = more repetition.
* **presence_penalty**: Similar to repetition_penalty, but with an additive offset on the raw token scores instead of a multiplicative factor. It may generate better results. 0 means no penalty, higher value = less repetition, lower value = more repetition. Previously called "additive_repetition_penalty".
* **frequency_penalty**: Repetition penalty that scales based on how many times the token has appeared in the context. Be careful with this; there's no limit to how much a token can be penalized.
* **repetition_penalty_range**: The number of most recent tokens to consider for repetition penalty. 0 makes all tokens be used.
* **typical_p**: If not set to 1, select only tokens that are at least this much more likely to appear than random tokens, given the prior text.
* **tfs**: Tries to detect a tail of low-probability tokens in the distribution and removes those tokens. See [this blog post](https://www.trentonbricken.com/Tail-Free-Sampling/) for details. The closer to 0, the more discarded tokens.
@@ -47,7 +49,8 @@
* **penalty_alpha**: Contrastive Search is enabled by setting this to greater than zero and unchecking "do_sample". It should be used with a low value of top_k, for instance, top_k = 4.
* **mirostat_mode**: Activates the Mirostat sampling technique. It aims to control perplexity during sampling. See the [paper](https://arxiv.org/abs/2007.14966).
* **mirostat_tau**: No idea, see the paper for details. According to the Preset Arena, 8 is a good value.
* **mirostat_tau**: No idea, see the paper for details. According to the Preset Arena, 0.1 is a good value.
* **mirostat_eta**: No idea, see the paper for details. According to the Preset Arena, 0.1 is a good value.
* **temperature_last**: Makes temperature the last sampler instead of the first. With this, you can remove low probability tokens with a sampler like min_p and then use a high temperature to make the model creative without losing coherency.
* **do_sample**: When unchecked, sampling is entirely disabled, and greedy decoding is used instead (the most likely token is always picked).
* **Seed**: Set the Pytorch seed to this number. Note that some loaders do not use Pytorch (notably llama.cpp), and others are not deterministic (notably ExLlama v1 and v2). For these loaders, the seed has no effect.
* **encoder_repetition_penalty**: Also known as the "Hallucinations filter". Used to penalize tokens that are *not* in the prior text. Higher value = more likely to stay in context, lower value = more likely to diverge.
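The min_p and temperature_last entries above are easiest to see in code. This is an illustrative numpy sketch of the two ideas, not the webui's actual sampler implementation:

```python
# Illustrative sketch: min_p filtering followed by temperature applied as the *last* step.
import numpy as np

def sample(logits: np.ndarray, min_p: float = 0.05, temperature: float = 1.5) -> int:
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()

    # min_p: drop tokens whose probability is below min_p * p(most likely token)
    keep = probs >= min_p * probs.max()
    filtered = np.where(keep, probs, 0.0)

    # temperature_last: rescale only the surviving tokens, so a high temperature
    # adds variety without re-admitting the low-probability tail
    scaled = filtered ** (1.0 / temperature)
    scaled /= scaled.sum()
    return int(np.random.choice(len(scaled), p=scaled))

logits = np.array([4.0, 3.2, 1.0, -2.0, -5.0])
print(sample(logits))
```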
5 changes: 4 additions & 1 deletion docs/04 ‐ Model Tab.md
@@ -29,6 +29,7 @@ Options:
* **load-in-4bit**: Load the model in 4-bit precision using bitsandbytes.
* **trust-remote-code**: Some models use custom Python code to load the model or the tokenizer. For such models, this option needs to be set. It doesn't download any remote content: all it does is execute the .py files that get downloaded with the model. Those files can potentially include malicious code; I have never seen it happen, but it is in principle possible.
* **use_fast**: Use the "fast" version of the tokenizer. Especially useful for Llama models, which originally had a "slow" tokenizer that received an update. If your local files are in the old "slow" format, checking this option may trigger a conversion that takes several minutes. The fast tokenizer is mostly useful if you are generating 50+ tokens/second using ExLlama_HF or if you are tokenizing a huge dataset for training.
* **use_flash_attention_2**: Set use_flash_attention_2=True while loading the model. Possibly useful for training.
* **disable_exllama**: Only applies when you are loading a GPTQ model through the transformers loader. It needs to be checked if you intend to train LoRAs with the model.

### ExLlama_HF
@@ -42,6 +43,8 @@
* **gpu-split**: If you have multiple GPUs, the amount of memory to allocate per GPU should be set in this field. Make sure to set a lower value for the first GPU, as that's where the cache is allocated.
* **max_seq_len**: The maximum sequence length for the model. In ExLlama, the cache is preallocated, so the higher this value, the higher the VRAM. It is automatically set to the maximum sequence length for the model based on its metadata, but you may need to lower this value be able to fit the model into your GPU. After loading the model, the "Truncate the prompt up to this length" parameter under "Parameters" > "Generation" is automatically set to your chosen "max_seq_len" so that you don't have to set the same thing twice.
* **cfg-cache**: Creates a second cache to hold the CFG negative prompts. You need to set this if and only if you intend to use CFG in the "Parameters" > "Generation" tab. Checking this parameter doubles the cache VRAM usage.
* **no_flash_attn**: Disables flash attention. Otherwise, it is automatically used as long as the library is installed.
* **cache_8bit**: Create an 8-bit precision cache instead of a 16-bit one. This saves VRAM but increases perplexity (I don't know by how much).

### ExLlamav2_HF

@@ -86,7 +89,7 @@
Example: https://huggingface.co/TheBloke/Llama-2-7b-Chat-GGUF

* **n-gpu-layers**: The number of layers to allocate to the GPU. If set to 0, only the CPU will be used. If you want to offload all layers, you can simply set this to the maximum value.
* **n-ctx**: Context length of the model. In llama.cpp, the context is preallocated, so the higher this value, the higher the RAM/VRAM usage will be. It gets automatically updated with the value in the GGUF metadata for the model when you select it in the Model dropdown.
* **n_ctx**: Context length of the model. In llama.cpp, the cache is preallocated, so the higher this value, the higher the VRAM. It is automatically set to the maximum sequence length for the model based on the metadata inside the GGUF file, but you may need to lower this value to be able to fit the model into your GPU. After loading the model, the "Truncate the prompt up to this length" parameter under "Parameters" > "Generation" is automatically set to your chosen "n_ctx" so that you don't have to set the same thing twice.
* **threads**: Number of threads. Recommended value: your number of physical cores.
* **threads_batch**: Number of threads for batch processing. Recommended value: your total number of cores (physical + virtual).
* **n_batch**: Batch size for prompt processing. Higher values are supposed to make generation faster, but I have never obtained any benefit from changing this value.
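For the llama.cpp options described above (n-gpu-layers, n_ctx, threads, n_batch), here is a hedged sketch using llama-cpp-python directly; the path and numbers are placeholders, and the webui wraps this loader with its own defaults:

```python
# Standalone llama.cpp loading sketch; path and values are placeholders.
from llama_cpp import Llama

llm = Llama(
    model_path="models/llama-2-7b-chat.Q4_K_M.gguf",  # placeholder path
    n_ctx=4096,       # context length; the cache is preallocated, so higher = more RAM/VRAM
    n_gpu_layers=35,  # 0 = CPU only; set high enough to offload all layers if VRAM allows
    n_threads=8,      # roughly your number of physical cores
    n_batch=512,      # prompt-processing batch size
)

out = llm("Q: What is the capital of France? A:", max_tokens=16)
print(out["choices"][0]["text"])
```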
3 changes: 1 addition & 2 deletions download-model.py
@@ -236,8 +236,7 @@ def check_model_files(self, model, branch, links, sha256, output_folder):
continue

with open(output_folder / sha256[i][0], "rb") as f:
bytes = f.read()
file_hash = hashlib.sha256(bytes).hexdigest()
file_hash = hashlib.file_digest(f, "sha256").hexdigest()
if file_hash != sha256[i][1]:
print(f'Checksum failed: {sha256[i][0]} {sha256[i][1]}')
validated = False
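The download-model.py change above swaps a whole-file read for a streaming digest, which is what the "don't load entire files into RAM" commit message refers to. A small sketch of the difference, assuming Python 3.11+ for `hashlib.file_digest`; the chunked fallback is illustrative:

```python
# Streaming SHA-256: file_digest reads the file in chunks instead of all at once.
import hashlib
import sys

def sha256_of(path: str) -> str:
    with open(path, "rb") as f:
        if sys.version_info >= (3, 11):
            return hashlib.file_digest(f, "sha256").hexdigest()
        h = hashlib.sha256()
        for chunk in iter(lambda: f.read(1 << 20), b""):  # 1 MiB at a time
            h.update(chunk)
        return h.hexdigest()

print(sha256_of("model-00001-of-00002.safetensors"))  # placeholder filename
```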
5 changes: 4 additions & 1 deletion extensions/api/util.py
@@ -25,14 +25,17 @@ def build_parameters(body, chat=False):
'max_tokens_second': int(body.get('max_tokens_second', 0)),
'do_sample': bool(body.get('do_sample', True)),
'temperature': float(body.get('temperature', 0.5)),
'temperature_last': bool(body.get('temperature_last', False)),
'top_p': float(body.get('top_p', 1)),
'min_p': float(body.get('min_p', 0)),
'typical_p': float(body.get('typical_p', body.get('typical', 1))),
'epsilon_cutoff': float(body.get('epsilon_cutoff', 0)),
'eta_cutoff': float(body.get('eta_cutoff', 0)),
'tfs': float(body.get('tfs', 1)),
'top_a': float(body.get('top_a', 0)),
'repetition_penalty': float(body.get('repetition_penalty', body.get('rep_pen', 1.1))),
'additive_repetition_penalty': float(body.get('additive_repetition_penalty', body.get('additive_rep_pen', 0))),
'presence_penalty': float(body.get('presence_penalty', body.get('presence_pen', 0))),
'frequency_penalty': float(body.get('frequency_penalty', body.get('frequency_pen', 0))),
'repetition_penalty_range': int(body.get('repetition_penalty_range', 0)),
'encoder_repetition_penalty': float(body.get('encoder_repetition_penalty', 1.0)),
'top_k': int(body.get('top_k', 0)),
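The nested `body.get(...)` calls above let clients keep sending the short legacy key names. A tiny standalone sketch of that lookup pattern; the helper name is hypothetical:

```python
# Accept either the new key or a legacy alias, defaulting to 0 when neither is present.
def get_float(body: dict, *keys: str, default: float = 0.0) -> float:
    for key in keys:
        if key in body:
            return float(body[key])
    return default

body = {"presence_pen": 0.4}  # legacy-style client payload
print(get_float(body, "presence_penalty", "presence_pen"))  # -> 0.4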
3 changes: 2 additions & 1 deletion extensions/multimodal/abstract_pipeline.py
@@ -3,6 +3,7 @@

import torch
from PIL import Image
from transformers import is_torch_xpu_available


class AbstractMultimodalPipeline(ABC):
@@ -55,7 +56,7 @@ def placeholder_embeddings() -> torch.Tensor:

def _get_device(self, setting_name: str, params: dict):
if params[setting_name] is None:
return torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
return torch.device("cuda:0" if torch.cuda.is_available() else "xpu:0" if is_torch_xpu_available() else "cpu")
return torch.device(params[setting_name])

def _get_dtype(self, setting_name: str, params: dict):
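The multimodal pipeline change above adds Intel XPU to the device fallback chain. A standalone sketch of the same resolution order; the function name is illustrative:

```python
# Explicit setting first, then CUDA, then Intel XPU, then CPU.
from typing import Optional

import torch
from transformers import is_torch_xpu_available

def resolve_device(setting: Optional[str]) -> torch.device:
    if setting is not None:
        return torch.device(setting)  # honor an explicit override
    if torch.cuda.is_available():
        return torch.device("cuda:0")
    if is_torch_xpu_available():
        return torch.device("xpu:0")
    return torch.device("cpu")

print(resolve_device(None))
```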
5 changes: 4 additions & 1 deletion extensions/openai/defaults.py
@@ -7,10 +7,13 @@
'auto_max_new_tokens': False,
'max_tokens_second': 0,
'temperature': 1.0,
'temperature_last': False,
'top_p': 1.0,
'min_p': 0,
'top_k': 1, # choose 20 for chat in absence of another default
'repetition_penalty': 1.18,
'additive_repetition_penalty': 0,
'presence_penalty': 0,
'frequency_penalty': 0,
'repetition_penalty_range': 0,
'encoder_repetition_penalty': 1.0,
'suffix': None,
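With the defaults above in place, the OpenAI-compatible extension accepts requests that use the new sampling keys again, which is the point of commit e18a046. A hedged request sketch; the port and path assume the extension's usual defaults of this era rather than anything stated in this commit:

```python
# Hedged example request to the OpenAI-compatible completions endpoint.
import requests

resp = requests.post(
    "http://127.0.0.1:5001/v1/completions",  # assumed default base URL for the extension
    json={
        "prompt": "Once upon a time",
        "max_tokens": 60,
        "temperature": 1.0,
        "presence_penalty": 0.5,   # now has a server-side default, so it no longer breaks
        "frequency_penalty": 0.2,
    },
)
print(resp.json()["choices"][0]["text"])
```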
2 changes: 1 addition & 1 deletion extensions/sd_api_pictures/script.py
@@ -339,7 +339,7 @@ def ui():
height = gr.Slider(64, 2048, value=params['height'], step=64, label='Height')
with gr.Column(variant="compact", elem_id="sampler_col"):
with gr.Row(elem_id="sampler_row"):
sampler_name = gr.Dropdown(value=params['sampler_name'], label='Sampling method', elem_id="sampler_box")
sampler_name = gr.Dropdown(value=params['sampler_name'], allow_custom_value=True, label='Sampling method', elem_id="sampler_box")
create_refresh_button(sampler_name, lambda: None, lambda: {'choices': get_samplers()}, 'refresh-button')
steps = gr.Slider(1, 150, value=params['steps'], step=1, label="Sampling steps", elem_id="steps_box")
with gr.Row():
8 changes: 3 additions & 5 deletions models/config.yaml
@@ -20,8 +20,6 @@
model_type: 'dollyv2'
.*replit:
model_type: 'replit'
.*AWQ:
n_batch: 1
.*(oasst|openassistant-|stablelm-7b-sft-v7-epoch-3):
instruction_template: 'Open Assistant'
skip_special_tokens: false
@@ -47,9 +45,6 @@
.*starchat-beta:
instruction_template: 'Starchat-Beta'
custom_stopping_strings: '"<|end|>"'
.*(openorca-platypus2):
instruction_template: 'OpenOrca-Platypus2'
custom_stopping_strings: '"### Instruction:", "### Response:"'
(?!.*v0)(?!.*1.1)(?!.*1_1)(?!.*stable)(?!.*chinese).*vicuna:
instruction_template: 'Vicuna-v0'
.*vicuna.*v0:
@@ -154,6 +149,9 @@
instruction_template: 'Orca Mini'
.*(platypus|gplatty|superplatty):
instruction_template: 'Alpaca'
.*(openorca-platypus2):
instruction_template: 'OpenOrca-Platypus2'
custom_stopping_strings: '"### Instruction:", "### Response:"'
.*longchat:
instruction_template: 'Vicuna-v1.1'
.*vicuna-33b:
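The config.yaml move above works because entries are matched against the model name in file order and later matches override earlier ones, so the OpenOrca-Platypus2 rule must come after the generic platypus rule to win. A simplified sketch of that behavior, not the webui's exact matching code:

```python
# Later matching patterns overwrite earlier ones (simplified model of config.yaml handling).
import re

config = {
    r".*(platypus|gplatty|superplatty)": {"instruction_template": "Alpaca"},
    r".*(openorca-platypus2)": {"instruction_template": "OpenOrca-Platypus2"},
}

def settings_for(model_name: str) -> dict:
    merged = {}
    for pattern, values in config.items():
        if re.match(pattern.lower(), model_name.lower()):
            merged.update(values)  # later entries win
    return merged

print(settings_for("OpenOrca-Platypus2-13B"))  # -> OpenOrca-Platypus2 template
```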
3 changes: 2 additions & 1 deletion modules/AutoGPTQ_loader.py
@@ -1,5 +1,6 @@
from pathlib import Path

from accelerate.utils import is_xpu_available
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig

import modules.shared as shared
@@ -41,7 +42,7 @@ def load_quantized(model_name):
# Define the params for AutoGPTQForCausalLM.from_quantized
params = {
'model_basename': pt_path.stem,
'device': "cuda:0" if not shared.args.cpu else "cpu",
'device': "xpu:0" if is_xpu_available() else "cuda:0" if not shared.args.cpu else "cpu",
'use_triton': shared.args.triton,
'inject_fused_attention': not shared.args.no_inject_fused_attention,
'inject_fused_mlp': not shared.args.no_inject_fused_mlp,