
Adding MistralForCausalLM architecture to convert-hf-to-gguf.py #4463

Closed · moo-aly opened this issue Dec 14, 2023 · 12 comments
Labels: enhancement (New feature or request), stale

Comments

moo-aly commented Dec 14, 2023

Hi.

I'm trying to deploy Mistral-7B-v0.1 locally, as the documentation says it is supported, but generating the GGUF model file fails.

The error is NotImplementedError: Architecture "MistralForCausalLM" not supported!

Since switching to GGUF files is a breaking change and Mistral-7B should be supported, I think adding support for the MistralForCausalLM architecture to convert-hf-to-gguf.py is essential.

I'm running the latest version as of Dec 14, 2023, which is b1637.

moo-aly added the enhancement (New feature or request) label on Dec 14, 2023
BarfingLemurs (Contributor) commented Dec 14, 2023

Per #4406, only convert.py is used for this model. However, there was an attempt to allow for conversion in #4428.

Mrkvak (Contributor) commented Dec 15, 2023

The question is about Mistral 7B; #4406 and #4428 were about Mixtral 8x7B.

bullno1 (Contributor) commented Dec 20, 2023

See: #3867 (comment)

Sliding Window Attention is not actually supported.
When the context goes beyond the window size, things will start to get weird.

If you look at all the existing quants, they are converted as Llama instead, and surprisingly it works.

DifferentialityDevelopment (Contributor) commented Jan 15, 2024

I'm trying to convert this model to GGUF:
https://huggingface.co/CallComply/openchat-3.5-0106-32k
But I'm still getting the "Architecture "MistralForCausalLM" not supported!" error.
From its config, the model appears to have SWA disabled.
If anyone has any leads or knows what changes I need to make to get it to convert correctly, it would be really appreciated.

nathanpbell commented Jan 15, 2024

@DifferentialityDevelopment For what it's worth, I can get the conversion to run by modifying the following line in convert-hf-to-gguf.py. No guarantees that the resulting GGUF file will work as expected.

--- a/convert-hf-to-gguf.py
+++ b/convert-hf-to-gguf.py
@@ -234,7 +234,7 @@ class Model:
             return gguf.MODEL_ARCH.STABLELM
         if arch == "QWenLMHeadModel":
             return gguf.MODEL_ARCH.QWEN
-        if arch == "MixtralForCausalLM":
+        if arch in ("MistralForCausalLM", "MixtralForCausalLM"):
             return gguf.MODEL_ARCH.LLAMA
         if arch == "GPT2LMHeadModel":
             return gguf.MODEL_ARCH.GPT2
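
With that patch applied, a conversion attempt would look roughly like this (a sketch only; the model path is a placeholder, and I'm assuming the script's usual --outfile/--outtype options):

python convert-hf-to-gguf.py ./openchat-3.5-0106-32k --outtype f16 --outfile openchat-3.5-0106-32k-f16.gguf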

teleprint-me (Contributor) commented Jan 16, 2024

The architectures for llama and mistral are fundamentally the same.

Mistral is based on the llama architecture, which is why it functions as expected when specified as llama.

MoE support was added so Mixtral could work.

convert.py will convert any llama/mistral/mixtral model.
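
For reference, a typical invocation looks something like this (a sketch; the model path is a placeholder and the flags may differ between llama.cpp versions):

python3 convert.py models/mistralai/Mistral-7B-Instruct-v0.2 --outtype f16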

I'll admit that I don't know how the model is executed under the hood, as this is all handled by the GGML backend once the model is converted. I'm working my way there, though.

You can always open up a PR to support the hf script if that's what you're aiming for.

Torch tensor dump:

21:16:07 | /mnt/valerie/pygptprompt
(.venv) git:(main | Δ) λ python -m pygptprompt.cli.dump.torch --all-tensor-names models/mistralai/Mistral-7B-Instruct-v0.2
[?] Select model files: 
 > [X] models/mistralai/Mistral-7B-Instruct-v0.2/pytorch_model-00001-of-00003.bin
   [ ] models/mistralai/Mistral-7B-Instruct-v0.2/pytorch_model-00003-of-00003.bin
   [ ] models/mistralai/Mistral-7B-Instruct-v0.2/pytorch_model-00002-of-00003.bin

model.embed_tokens.weight
model.layers.0.self_attn.q_proj.weight
model.layers.0.self_attn.k_proj.weight
model.layers.0.self_attn.v_proj.weight
model.layers.0.self_attn.o_proj.weight
model.layers.0.mlp.gate_proj.weight
model.layers.0.mlp.up_proj.weight
model.layers.0.mlp.down_proj.weight
model.layers.0.input_layernorm.weight
model.layers.0.post_attention_layernorm.weight
21:16:38 | /mnt/valerie/pygptprompt
(.venv) git:(main | Δ) λ python -m pygptprompt.cli.dump.torch --all-tensor-names models/mistralai/Mixtral-8x7B-Instruct-v0.1
[?] Select model files: 
   [ ] models/mistralai/Mixtral-8x7B-Instruct-v0.1/model-00006-of-00019.safetensors
   [ ] models/mistralai/Mixtral-8x7B-Instruct-v0.1/model-00007-of-00019.safetensors
   [ ] models/mistralai/Mixtral-8x7B-Instruct-v0.1/model-00004-of-00019.safetensors
 > [X] models/mistralai/Mixtral-8x7B-Instruct-v0.1/model-00001-of-00019.safetensors
   [ ] models/mistralai/Mixtral-8x7B-Instruct-v0.1/model-00009-of-00019.safetensors
   [ ] models/mistralai/Mixtral-8x7B-Instruct-v0.1/model-00019-of-00019.safetensors
   [ ] models/mistralai/Mixtral-8x7B-Instruct-v0.1/model-00012-of-00019.safetensors
   [ ] models/mistralai/Mixtral-8x7B-Instruct-v0.1/model-00014-of-00019.safetensors
   [ ] models/mistralai/Mixtral-8x7B-Instruct-v0.1/model-00017-of-00019.safetensors
   [ ] models/mistralai/Mixtral-8x7B-Instruct-v0.1/model-00015-of-00019.safetensors
   [ ] models/mistralai/Mixtral-8x7B-Instruct-v0.1/model-00011-of-00019.safetensors
   [ ] models/mistralai/Mixtral-8x7B-Instruct-v0.1/model-00018-of-00019.safetensors
   [ ] models/mistralai/Mixtral-8x7B-Instruct-v0.1/model-00003-of-00019.safetensors

model.embed_tokens.weight
model.layers.0.block_sparse_moe.experts.0.w1.weight
model.layers.0.block_sparse_moe.experts.0.w2.weight
model.layers.0.block_sparse_moe.experts.0.w3.weight
model.layers.0.block_sparse_moe.experts.1.w1.weight
model.layers.0.block_sparse_moe.experts.1.w2.weight
model.layers.0.block_sparse_moe.experts.1.w3.weight
model.layers.0.block_sparse_moe.experts.2.w1.weight
model.layers.0.block_sparse_moe.experts.2.w2.weight
model.layers.0.block_sparse_moe.experts.2.w3.weight
model.layers.0.block_sparse_moe.experts.3.w1.weight
model.layers.0.block_sparse_moe.experts.3.w2.weight
model.layers.0.block_sparse_moe.experts.3.w3.weight
model.layers.0.block_sparse_moe.experts.4.w1.weight
model.layers.0.block_sparse_moe.experts.4.w2.weight
model.layers.0.block_sparse_moe.experts.4.w3.weight
model.layers.0.block_sparse_moe.experts.5.w1.weight
model.layers.0.block_sparse_moe.experts.5.w2.weight
model.layers.0.block_sparse_moe.experts.5.w3.weight
model.layers.0.block_sparse_moe.experts.6.w1.weight
model.layers.0.block_sparse_moe.experts.6.w2.weight
model.layers.0.block_sparse_moe.experts.6.w3.weight
model.layers.0.block_sparse_moe.experts.7.w1.weight
model.layers.0.block_sparse_moe.experts.7.w2.weight
model.layers.0.block_sparse_moe.experts.7.w3.weight
model.layers.0.block_sparse_moe.gate.weight
model.layers.0.input_layernorm.weight
model.layers.0.post_attention_layernorm.weight
model.layers.0.self_attn.k_proj.weight
model.layers.0.self_attn.o_proj.weight
model.layers.0.self_attn.q_proj.weight
model.layers.0.self_attn.v_proj.weight

You can compare these results to dumping a gguf file:

21:20:07 | /mnt/valerie/pygptprompt
(.venv) git:(main | Δ) λ python -m pygptprompt.cli.dump.gguf models/mistralai/Mistral-7B-Instruct-v0.2/ggml-model-f16.gguf 
* Loading: models/mistralai/Mistral-7B-Instruct-v0.2/ggml-model-f16.gguf
* File is LITTLE endian, script is running on a LITTLE endian host.

* Dumping 25 key/value pair(s)
      1: UINT32     |        1 | GGUF.version = 3
      2: UINT64     |        1 | GGUF.tensor_count = 291
      3: UINT64     |        1 | GGUF.kv_count = 22
      4: STRING     |        1 | general.architecture = 'llama'
      5: STRING     |        1 | general.name = 'mistralai'
      6: UINT32     |        1 | llama.context_length = 32768
      7: UINT32     |        1 | llama.embedding_length = 4096
      8: UINT32     |        1 | llama.block_count = 32
      9: UINT32     |        1 | llama.feed_forward_length = 14336
     10: UINT32     |        1 | llama.rope.dimension_count = 128
     11: UINT32     |        1 | llama.attention.head_count = 32
     12: UINT32     |        1 | llama.attention.head_count_kv = 8
     13: FLOAT32    |        1 | llama.attention.layer_norm_rms_epsilon = 9.999999747378752e-06
     14: FLOAT32    |        1 | llama.rope.freq_base = 1000000.0
     15: UINT32     |        1 | general.file_type = 1
     16: STRING     |        1 | tokenizer.ggml.model = 'llama'
     17: [STRING]   |    32000 | tokenizer.ggml.tokens
     18: [FLOAT32]  |    32000 | tokenizer.ggml.scores
     19: [INT32]    |    32000 | tokenizer.ggml.token_type
     20: UINT32     |        1 | tokenizer.ggml.bos_token_id = 1
     21: UINT32     |        1 | tokenizer.ggml.eos_token_id = 2
     22: UINT32     |        1 | tokenizer.ggml.unknown_token_id = 0
     23: BOOL       |        1 | tokenizer.ggml.add_bos_token = True
     24: BOOL       |        1 | tokenizer.ggml.add_eos_token = False
     25: STRING     |        1 | tokenizer.chat_template = "{{ bos_token }}{% for message in messages %}{% if (message['"

* Dumping 291 tensor(s)
      1:  131072000 |  4096, 32000,     1,     1 | F16     | token_embd.weight
      2:   16777216 |  4096,  4096,     1,     1 | F16     | blk.0.attn_q.weight
      3:    4194304 |  4096,  1024,     1,     1 | F16     | blk.0.attn_k.weight
      4:    4194304 |  4096,  1024,     1,     1 | F16     | blk.0.attn_v.weight
      5:   16777216 |  4096,  4096,     1,     1 | F16     | blk.0.attn_output.weight
      6:   58720256 |  4096, 14336,     1,     1 | F16     | blk.0.ffn_gate.weight
      7:   58720256 |  4096, 14336,     1,     1 | F16     | blk.0.ffn_up.weight
      8:   58720256 | 14336,  4096,     1,     1 | F16     | blk.0.ffn_down.weight
      9:       4096 |  4096,     1,     1,     1 | F32     | blk.0.attn_norm.weight
     10:       4096 |  4096,     1,     1,     1 | F32     | blk.0.ffn_norm.weight
21:20:13 | /mnt/valerie/pygptprompt
(.venv) git:(main | Δ) λ python -m pygptprompt.cli.dump.gguf models/mistralai/Mixtral-8x7B-Instruct-v0.1/ggml-model-f16.gguf 
* Loading: models/mistralai/Mixtral-8x7B-Instruct-v0.1/ggml-model-f16.gguf
* File is LITTLE endian, script is running on a LITTLE endian host.

* Dumping 27 key/value pair(s)
      1: UINT32     |        1 | GGUF.version = 3
      2: UINT64     |        1 | GGUF.tensor_count = 995
      3: UINT64     |        1 | GGUF.kv_count = 24
      4: STRING     |        1 | general.architecture = 'llama'
      5: STRING     |        1 | general.name = 'mistralai'
      6: UINT32     |        1 | llama.context_length = 32768
      7: UINT32     |        1 | llama.embedding_length = 4096
      8: UINT32     |        1 | llama.block_count = 32
      9: UINT32     |        1 | llama.feed_forward_length = 14336
     10: UINT32     |        1 | llama.rope.dimension_count = 128
     11: UINT32     |        1 | llama.attention.head_count = 32
     12: UINT32     |        1 | llama.attention.head_count_kv = 8
     13: FLOAT32    |        1 | llama.attention.layer_norm_rms_epsilon = 9.999999747378752e-06
     14: UINT32     |        1 | llama.expert_count = 8
     15: UINT32     |        1 | llama.expert_used_count = 2
     16: FLOAT32    |        1 | llama.rope.freq_base = 1000000.0
     17: UINT32     |        1 | general.file_type = 1
     18: STRING     |        1 | tokenizer.ggml.model = 'llama'
     19: [STRING]   |    32000 | tokenizer.ggml.tokens
     20: [FLOAT32]  |    32000 | tokenizer.ggml.scores
     21: [INT32]    |    32000 | tokenizer.ggml.token_type
     22: UINT32     |        1 | tokenizer.ggml.bos_token_id = 1
     23: UINT32     |        1 | tokenizer.ggml.eos_token_id = 2
     24: UINT32     |        1 | tokenizer.ggml.unknown_token_id = 0
     25: BOOL       |        1 | tokenizer.ggml.add_bos_token = True
     26: BOOL       |        1 | tokenizer.ggml.add_eos_token = False
     27: STRING     |        1 | tokenizer.chat_template = "{{ bos_token }}{% for message in messages %}{% if (message['"

* Dumping 995 tensor(s)
      1:  131072000 |  4096, 32000,     1,     1 | F16     | token_embd.weight
      2:   58720256 |  4096, 14336,     1,     1 | F16     | blk.0.ffn_gate.0.weight
      3:   58720256 | 14336,  4096,     1,     1 | F16     | blk.0.ffn_down.0.weight
      4:   58720256 |  4096, 14336,     1,     1 | F16     | blk.0.ffn_up.0.weight
      5:   58720256 |  4096, 14336,     1,     1 | F16     | blk.0.ffn_gate.1.weight
      6:   58720256 | 14336,  4096,     1,     1 | F16     | blk.0.ffn_down.1.weight
      7:   58720256 |  4096, 14336,     1,     1 | F16     | blk.0.ffn_up.1.weight
      8:   58720256 |  4096, 14336,     1,     1 | F16     | blk.0.ffn_gate.2.weight
      9:   58720256 | 14336,  4096,     1,     1 | F16     | blk.0.ffn_down.2.weight
     10:   58720256 |  4096, 14336,     1,     1 | F16     | blk.0.ffn_up.2.weight
     11:   58720256 |  4096, 14336,     1,     1 | F16     | blk.0.ffn_gate.3.weight
     12:   58720256 | 14336,  4096,     1,     1 | F16     | blk.0.ffn_down.3.weight
     13:   58720256 |  4096, 14336,     1,     1 | F16     | blk.0.ffn_up.3.weight
     14:   58720256 |  4096, 14336,     1,     1 | F16     | blk.0.ffn_gate.4.weight
     15:   58720256 | 14336,  4096,     1,     1 | F16     | blk.0.ffn_down.4.weight
     16:   58720256 |  4096, 14336,     1,     1 | F16     | blk.0.ffn_up.4.weight
     17:   58720256 |  4096, 14336,     1,     1 | F16     | blk.0.ffn_gate.5.weight
     18:   58720256 | 14336,  4096,     1,     1 | F16     | blk.0.ffn_down.5.weight
     19:   58720256 |  4096, 14336,     1,     1 | F16     | blk.0.ffn_up.5.weight
     20:   58720256 |  4096, 14336,     1,     1 | F16     | blk.0.ffn_gate.6.weight
     21:   58720256 | 14336,  4096,     1,     1 | F16     | blk.0.ffn_down.6.weight
     22:   58720256 |  4096, 14336,     1,     1 | F16     | blk.0.ffn_up.6.weight
     23:   58720256 |  4096, 14336,     1,     1 | F16     | blk.0.ffn_gate.7.weight
     24:   58720256 | 14336,  4096,     1,     1 | F16     | blk.0.ffn_down.7.weight
     25:   58720256 |  4096, 14336,     1,     1 | F16     | blk.0.ffn_up.7.weight
     26:      32768 |  4096,     8,     1,     1 | F16     | blk.0.ffn_gate_inp.weight
     27:       4096 |  4096,     1,     1,     1 | F32     | blk.0.attn_norm.weight
     28:       4096 |  4096,     1,     1,     1 | F32     | blk.0.ffn_norm.weight
     29:    4194304 |  4096,  1024,     1,     1 | F16     | blk.0.attn_k.weight
     30:   16777216 |  4096,  4096,     1,     1 | F16     | blk.0.attn_output.weight
     31:   16777216 |  4096,  4096,     1,     1 | F16     | blk.0.attn_q.weight
     32:    4194304 |  4096,  1024,     1,     1 | F16     | blk.0.attn_v.weight

Entretoize commented Jan 24, 2024

The architectures for llama and mistral are fundamentally the same.

Mistral is based on the llama architecture which is why it functions as expected when specified as llama.

MoE support was added so Mixtral could work.

convert.py will convert any llama/mistral/mixtral model.

How to do that? I tried and got:

python3 convert.py MistralFunc-7b
Loading model: MistralFunc-7b
gguf: This GGUF file is for Little Endian only
Set model parameters
Set model tokenizer
gguf: Setting special token type bos to 1
gguf: Setting special token type eos to 2
gguf: Setting special token type unk to 0
gguf: Setting special token type pad to 2
gguf: Setting add_bos_token to True
gguf: Setting add_eos_token to True
gguf: Setting chat_template to {{ bos_token }}{% for message in messages %}{% if (message['role'] == 'user') != (loop.index0 % 2 == 0) %}{{ raise_exception('Conversation roles must alternate user/assistant/user/assistant/...') }}{% endif %}{% if message['role'] == 'user' %}{{ '[INST] ' + message['content'] + ' [/INST]' }}{% elif message['role'] == 'assistant' %}{{ message['content'] + eos_token}}{% else %}{{ raise_exception('Only user and assistant roles are supported!') }}{% endif %}{% endfor %}
Exporting model to 'MistralFunc-7b/ggml-model-f16.gguf'
gguf: loading model part 'model-00001-of-00002.safetensors'
token_embd.weight, n_dims = 2, torch.float32 --> float16
blk.0.attn_norm.weight, n_dims = 1, torch.float32 --> float32
blk.0.ffn_down.weight, n_dims = 1, torch.uint8 --> float32
Can not map tensor 'model.layers.0.mlp.down_proj.weight.absmax'

github-actions bot commented Mar 18, 2024

This issue is stale because it has been open for 30 days with no activity.

github-actions bot added the stale label on Mar 18, 2024
munish0838 commented

Any solution for Mistral?

github-actions bot removed the stale label on Mar 24, 2024
cognitivetech commented Mar 25, 2024

@munish0838 just use convert.py

python3 convert.py ../models/samantha-mistral-instruct-7b --outtype f16 --outfile ../models/samantha-mistral-instruct-7b.f16.bin

Then quantize as normal.
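
For example, something along these lines (a sketch; the Q4_K_M type and output filename are just illustrative):

./quantize ../models/samantha-mistral-instruct-7b.f16.bin ../models/samantha-mistral-instruct-7b.Q4_K_M.gguf Q4_K_M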

munish0838 commented

Thank you @cognitivetech

github-actions bot added the stale label on Apr 25, 2024
github-actions bot commented

This issue was closed because it has been inactive for 14 days since being marked as stale.
