Optimizing Mistralai_Mistral-Small-24B-Base-2501-Q6_K_L.GGUF Model in Text-Generation-WebUI 2.4
I’ve recently been experimenting with large language models (LLMs) and was looking for an effective model for my system with 24 GB of VRAM.
After some research, I found the Mistralai_Mistral-Small-24B-Base-2501-Q6_K_L.GGUF model, which looks promising. (You can find it here: https://huggingface.co/bartowski/mistralai_Mistral-Small-24B-Base-2501-GGUF.)
Initial Success in LM Studio
In LM Studio, the model worked great right from the start, though responses were a bit short. But that’s not the focus of this post; this is mainly about running the model in Text-Generation-WebUI 2.4.
Issues with Model Performance in Text-Generation-WebUI
In Text-Generation-WebUI 2.4, the model loads fine, but the quality of answers is not ideal. I suspect it might be related to my settings, and I’m looking for advice on how to optimize the setup.
Here’s the initial setup:
Model Loading and Settings:
- n-gpu-layers: 41 (seems fine, I'll keep it)
- n_batch: 512 (this could potentially be increased to 1024 or 2048, but I'm not sure what's best)
- n-ctx: 32768 (the maximum that stays within my VRAM limit)
- cache_type: q8_0 (since this quant uses Q8_0 for the embedding and output weights)
- tensorcores: checked (since the build is not using GGML_CUDA_FORCE_MMQ)
- flash_attn: should I check this, or would that be a mistake?
- mlock: checked (to lock the model's memory and keep it from being swapped out, since memory is almost full)

A rough sketch of how these map onto the llama.cpp loader follows below.
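For reference, here is a minimal sketch of the same settings expressed through llama-cpp-python, the GGUF backend the webui wraps. This is not the webui's actual code, and the model path is a placeholder:

```python
from llama_cpp import Llama
import llama_cpp

# Sketch only: mirrors the loader settings above. Path is a placeholder.
llm = Llama(
    model_path="models/mistralai_Mistral-Small-24B-Base-2501-Q6_K_L.gguf",
    n_gpu_layers=41,      # offload all 41 layers to the GPU
    n_batch=512,          # prompt-processing batch size; 1024/2048 mostly trade VRAM for speed
    n_ctx=32768,          # context window; this is the VRAM ceiling here
    flash_attn=True,      # flash attention; llama.cpp requires it for a quantized V cache
    use_mlock=True,       # lock the weights in RAM so the OS cannot swap them out
    type_k=llama_cpp.GGML_TYPE_Q8_0,  # q8_0 K cache (the webui's cache_type setting)
    type_v=llama_cpp.GGML_TYPE_Q8_0,  # q8_0 V cache; only works with flash_attn enabled
)
```

Note the flash_attn line: as far as I know, llama.cpp only supports a quantized V cache when flash attention is on, so cache_type q8_0 and flash_attn belong together.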
So let's launch the model!
Model Launch Issues
When launching the model, I encountered the following log errors:
```
Exception ignored on calling ctypes callback function: <function llama_log_callback at 0x000002526588A340>
Traceback (most recent call last):
  File "C:\text-generation-webui-2.4\installer_files\env\Lib\site-packages\llama_cpp_cuda_tensorcores\_logger.py", line 39, in llama_log_callback
    print(text.decode("utf-8"), end="", flush=True, file=sys.stderr)
          ^^^^^^^^^^^^^^^^^^^^
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc4 in position 128: invalid continuation byte
```
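From what I can tell, that UnicodeDecodeError is cosmetic: llama.cpp can hand the logging callback a byte string cut mid-way through a multi-byte UTF-8 character, and the strict decode then fails. A hedged sketch of the kind of defensive decode that would silence it (not necessarily the project's actual fix):

```python
import sys

def llama_log_callback(level, text, user_data):
    # llama.cpp may split a multi-byte UTF-8 sequence across callback
    # invocations, so decode with errors="replace" instead of raising.
    print(text.decode("utf-8", errors="replace"), end="", flush=True, file=sys.stderr)
```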
Also, I noticed a lot of tokenizer warnings, such as:

```
init_tokenizer: initializing tokenizer for type 2
load: control token: 475 '<SPECIAL_475>' is not marked as EOG
...
load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
```
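Those load: warnings come from the tokenizer metadata baked into the GGUF file itself, not from the webui. To see what the file actually declares, here is a small sketch using the gguf package (an assumption on my part: pip install gguf; the path is a placeholder):

```python
from gguf import GGUFReader

# List the tokenizer-related metadata keys stored in the GGUF file.
reader = GGUFReader("models/mistralai_Mistral-Small-24B-Base-2501-Q6_K_L.gguf")
for key in reader.fields:
    if key.startswith("tokenizer."):
        print(key)
```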
Additionally, the last tensor was loaded onto the CPU instead of the GPU (as happens with every model I load):

```
load_tensors: tensor 'token_embd.weight' (q8_0) (and 0 others) cannot be used with preferred buffer type CPU_AARCH64, using CPU instead
```
Despite these warnings, the model loaded successfully, but the assistant's responses were suboptimal.
Configuration Check
I checked all the parameters in the settings, including:

- Sampling preset: min_p (looks good)
- Grammar and text streaming: both turned off, which seems fine
- Chat tab: all settings at their defaults, using the default Assistant AI character
Instruction Template
This is where I tend to struggle, as I’m not sure about the best configuration for optimal performance. Normally, the instruction template should load automatically when the model loads, but I’m not sure if I’m missing something.
Here are the settings for the Instruction Template:
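Since this is a Base model, it ships without an official chat template, so I am assuming the standard Mistral [INST] format here. A minimal sketch of that template rendered with Jinja2 (the same templating the webui's instruction templates use); the template string is my assumption, not something shipped with the model:

```python
from jinja2 import Template

# Assumed template: the standard Mistral [INST] instruct format. The base
# model has no official chat template, so this is a guess, not gospel.
MISTRAL_INSTRUCT = (
    "{%- for message in messages -%}"
    "{%- if message['role'] == 'user' -%}"
    "[INST] {{ message['content'] }} [/INST]"
    "{%- elif message['role'] == 'assistant' -%}"
    "{{ message['content'] }}</s>"
    "{%- endif -%}"
    "{%- endfor -%}"
)

messages = [{"role": "user", "content": "Summarize this issue in one line."}]
print(Template(MISTRAL_INSTRUCT).render(messages=messages))
# -> [INST] Summarize this issue in one line. [/INST]
```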
Command for Chat-Instruct Mode
The command for chat-instruct mode is standard:
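For reference, the default chat-instruct command in recent webui versions looks like this (quoting from memory, so double-check against your own Parameters tab):

```
Continue the chat dialogue below. Write a single reply for the character "<|character|>".

<|prompt|>
```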
Request for Help
Could anyone help me understand the best setup to optimize this model's performance?
I would greatly appreciate any advice or insights, particularly regarding:
- Instruction template setup
- Chat template configuration
- Custom system messages
- The command for chat-instruct mode
- Grammar settings
- Character setup for a good assistant or roleplay
Many thanks in advance to anyone willing to help out!
Note 1: We recommend using a relatively low temperature, such as temperature=0.15.
Maybe that can help you get better results.
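A quick way to test that without editing presets is the webui's OpenAI-compatible API. A rough sketch, assuming you launched with --api on the default port 5000 (the min_p value is just an example):

```python
import requests

# Query text-generation-webui's OpenAI-compatible endpoint with a low
# temperature. Port 5000 is the default for --api; adjust if needed.
resp = requests.post(
    "http://127.0.0.1:5000/v1/chat/completions",
    json={
        "messages": [{"role": "user", "content": "Say hello in one sentence."}],
        "temperature": 0.15,  # the low temperature suggested above
        "min_p": 0.05,        # example value; the webui accepts extra sampler params
        "max_tokens": 128,
    },
    timeout=120,
)
print(resp.json()["choices"][0]["message"]["content"])
```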
Personally, I found other models more likeable for RP / assistant use, e.g. Cohere For AI's Command-R. You mentioned you have a 24 GB VRAM GPU, so the Q4_K_M or Q4_K_S GGUF should run fine.