
Help with Setting Up Text-Generation-WebUI for Mistralai_Mistral-Small-24B-Base-2501-Q6_K_L.GGUF Model #6735

Open
P1LeFR opened this issue Feb 5, 2025 · 1 comment


P1LeFR commented Feb 5, 2025

Optimizing Mistralai_Mistral-Small-24B-Base-2501-Q6_K_L.GGUF Model in Text-Generation-WebUI 2.4

I’ve recently been experimenting with large language models (LLMs) and was looking for an effective model for my system with 24 GB of VRAM.
After some research, I found the Mistralai_Mistral-Small-24B-Base-2501-Q6_K_L.GGUF model, which looks promising. (You can find it here https://huggingface.co/bartowski/mistralai_Mistral-Small-24B-Base-2501-GGUF).

Initial Success in LM Studio
In LM Studio, the model worked great right from the start, though responses were a bit short. But that’s not the focus of this post; this is mainly about running the model in Text-Generation-WebUI 2.4.

Issues with Model Performance in Text-Generation-WebUI

In Text-Generation-WebUI 2.4, the model loads fine, but the quality of answers is not ideal. I suspect it might be related to my settings, and I’m looking for advice on how to optimize the setup.

Here’s the initial setup:
Model Loading and Settings (a rough llama-cpp-python equivalent is sketched after the list):

Image

  • n-gpu-layers: Set to 41 (seems fine, I’ll keep it)
  • n_batch: 512 (this could potentially be increased to 1024 or 2048, but I’m not sure what’s best)
  • n-ctx: 32768 (this is the max allowed to stay within VRAM limits)
  • Cache_type: q8_0 (as the model uses Q8_0 for embedding and output weights)
  • Tensorcores: Checked (as the model is not using GGML_CUDA_FORCE_MMQ)
  • flash_attn: Should I check this, or would that be a mistake?
  • mlock: Checked (to ensure the model stays in VRAM since it's almost full)
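
For reference, here is roughly what these settings map to when the same GGUF is loaded directly with llama-cpp-python. This is only a sketch for sanity-checking the file outside the UI; the model path and prompt are example placeholders, and the KV-cache type is left at its default:

```python
from llama_cpp import Llama

# Rough equivalent of the webui settings above (the path is a placeholder).
llm = Llama(
    model_path="models/mistralai_Mistral-Small-24B-Base-2501-Q6_K_L.gguf",
    n_gpu_layers=41,   # offload all layers to the GPU
    n_ctx=32768,       # context length
    n_batch=512,       # prompt-processing batch size
    flash_attn=True,   # flash attention, if the build supports it
    use_mlock=True,    # try to keep the model resident in memory
)

# Quick smoke test with a low temperature.
out = llm("The capital of France is", max_tokens=16, temperature=0.15)
print(out["choices"][0]["text"])
```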

So let's launch the model!

Model Launch Issues
When launching the model, I encountered the following log errors:

Exception ignored on calling ctypes callback function: <function llama_log_callback at 0x000002526588A340>
Traceback (most recent call last):
  File "C:\text-generation-webui-2.4\installer_files\env\Lib\site-packages\llama_cpp_cuda_tensorcores\_logger.py", line 39, in llama_log_callback
    print(text.decode("utf-8"), end="", flush=True, file=sys.stderr)
          ^^^^^^^^^^^^^^^^^^^^
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc4 in position 128: invalid continuation byte
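
This looks like the webui's llama.cpp logger failing to decode non-UTF-8 bytes coming from the log stream. A minimal local workaround, assuming the decode call shown in the traceback is the only place those bytes are printed, would be a tolerant decode:

```python
# In llama_cpp_cuda_tensorcores\_logger.py, at the line shown in the traceback:
# replace the strict decode with a tolerant one so mis-encoded log bytes are
# substituted instead of raising UnicodeDecodeError.
print(text.decode("utf-8", errors="replace"), end="", flush=True, file=sys.stderr)
```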

=> Also, I noticed a lot of tokenizer warnings, such as:

init_tokenizer: initializing tokenizer for type 2
load: control token:    475 '<SPECIAL_475>' is not marked as EOG
...
load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect

Additionally, the last tensor was loaded onto the CPU instead of the GPU (as expected for all models):
load_tensors: tensor 'token_embd.weight' (q8_0) (and 0 others) cannot be used with preferred buffer type CPU_AARCH64, using CPU instead

Despite these, the model loaded successfully, but the responses from the assistant were suboptimal:

Image

Configuration Check
I checked all the parameters in the settings, including:

Image

  • min_p preset (looks good)
  • Grammar and text streaming: both are turned off, which seems fine.
  • Chat tab: all settings are default, using the Assistant AI.

Instruction Template

This is where I tend to struggle, as I’m not sure about the best configuration for optimal performance. Normally, the instruction template should load automatically when the model loads, but I’m not sure if I’m missing something.

Here are the settings for the Instruction Template:

Image
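
For what it's worth, one way to see which template the GGUF itself ships is to read the tokenizer.chat_template key from the model metadata. The sketch below uses llama-cpp-python and an example model path; since this is the Base model, it may not embed a chat template at all:

```python
from llama_cpp import Llama

# Load CPU-only with a tiny context just to inspect the metadata
# (the model path is an example placeholder).
llm = Llama(
    model_path="models/mistralai_Mistral-Small-24B-Base-2501-Q6_K_L.gguf",
    n_ctx=256,
    n_gpu_layers=0,
    verbose=False,
)

# The Jinja chat template embedded in the GGUF, if any.
print(llm.metadata.get("tokenizer.chat_template", "<no chat template embedded>"))
```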

Command for Chat-Instruct Mode
The command for chat-instruct mode is standard:

Image

Request for Help
Could anyone help me understand the best setup to optimize this model's performance?

I would greatly appreciate any advice or insights, particularly regarding:

  • Instruction Template Setup
  • Chat Template Configuration
  • Custom System Messages
  • Command for Chat-Instruct Mode
  • Grammar Settings
  • Character Setup for a Good Assistant or Roleplay

Many thanks in advance to anyone willing to help out!


jensgreven commented Feb 5, 2025

From the model card:

#Note 1: We recommend using a relatively low temperature, such as temperature=0.15.

Maybe that can help to get better results.
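
In the webui that is just the temperature setting under the Parameters tab. For reference, the equivalent in a plain llama-cpp-python call would look roughly like this (a sketch that assumes a model already loaded as llm; the prompt and min_p value are only examples):

```python
# Low temperature as recommended in the model card; the other sampler values
# here are example placeholders.
out = llm.create_completion(
    "Write a short poem about the sea.",
    max_tokens=200,
    temperature=0.15,
    min_p=0.05,
)
print(out["choices"][0]["text"])
```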

Personally, I found other models more likeable for RP / assistant use, e.g. Cohere For AI's Command-R. Since you mentioned you have a 24 GB VRAM GPU, the Q4_K_M or Q4_K_S GGUF should run fine.
