Adding MistralForCausalLM architecture to convert-hf-to-gguf.py #4463
Comments
See: #3867 (comment). Sliding Window Attention is not actually supported. If you look at the existing quants, they are all converted as Llama instead and, surprisingly, it works. |
Trying to convert this model to gguf |
@DifferentialityDevelopment For what it's worth, I can get the conversion to run by modifying the following line in convert-hf-to-gguf.py. No guarantees that the GGUF file will work as expected.
|
The architectures for llama and mistral are fundamentally the same. Mistral is based on the llama architecture which is why it functions as expected when specified as llama. MoE support was added so Mixtral could work.
I'll admit that I don't know how the model is executed under the hood, as this is all handled by the GGML backend once the model is converted. I'm working my way there, though. You can always open up a PR to support the hf script if that's what you're aiming for.
Torch tensor dump:
21:16:07 | /mnt/valerie/pygptprompt
(.venv) git:(main | Δ) λ python -m pygptprompt.cli.dump.torch --all-tensor-names models/mistralai/Mistral-7B-Instruct-v0.2
[?] Select model files:
> [X] models/mistralai/Mistral-7B-Instruct-v0.2/pytorch_model-00001-of-00003.bin
[ ] models/mistralai/Mistral-7B-Instruct-v0.2/pytorch_model-00003-of-00003.bin
[ ] models/mistralai/Mistral-7B-Instruct-v0.2/pytorch_model-00002-of-00003.bin
model.embed_tokens.weight
model.layers.0.self_attn.q_proj.weight
model.layers.0.self_attn.k_proj.weight
model.layers.0.self_attn.v_proj.weight
model.layers.0.self_attn.o_proj.weight
model.layers.0.mlp.gate_proj.weight
model.layers.0.mlp.up_proj.weight
model.layers.0.mlp.down_proj.weight
model.layers.0.input_layernorm.weight
model.layers.0.post_attention_layernorm.weight
21:16:38 | /mnt/valerie/pygptprompt
(.venv) git:(main | Δ) λ python -m pygptprompt.cli.dump.torch --all-tensor-names models/mistralai/Mixtral-8x7B-Instruct-v0.1
[?] Select model files:
[ ] models/mistralai/Mixtral-8x7B-Instruct-v0.1/model-00006-of-00019.safetensors
[ ] models/mistralai/Mixtral-8x7B-Instruct-v0.1/model-00007-of-00019.safetensors
[ ] models/mistralai/Mixtral-8x7B-Instruct-v0.1/model-00004-of-00019.safetensors
> [X] models/mistralai/Mixtral-8x7B-Instruct-v0.1/model-00001-of-00019.safetensors
[ ] models/mistralai/Mixtral-8x7B-Instruct-v0.1/model-00009-of-00019.safetensors
[ ] models/mistralai/Mixtral-8x7B-Instruct-v0.1/model-00019-of-00019.safetensors
[ ] models/mistralai/Mixtral-8x7B-Instruct-v0.1/model-00012-of-00019.safetensors
[ ] models/mistralai/Mixtral-8x7B-Instruct-v0.1/model-00014-of-00019.safetensors
[ ] models/mistralai/Mixtral-8x7B-Instruct-v0.1/model-00017-of-00019.safetensors
[ ] models/mistralai/Mixtral-8x7B-Instruct-v0.1/model-00015-of-00019.safetensors
[ ] models/mistralai/Mixtral-8x7B-Instruct-v0.1/model-00011-of-00019.safetensors
[ ] models/mistralai/Mixtral-8x7B-Instruct-v0.1/model-00018-of-00019.safetensors
[ ] models/mistralai/Mixtral-8x7B-Instruct-v0.1/model-00003-of-00019.safetensors
model.embed_tokens.weight
model.layers.0.block_sparse_moe.experts.0.w1.weight
model.layers.0.block_sparse_moe.experts.0.w2.weight
model.layers.0.block_sparse_moe.experts.0.w3.weight
model.layers.0.block_sparse_moe.experts.1.w1.weight
model.layers.0.block_sparse_moe.experts.1.w2.weight
model.layers.0.block_sparse_moe.experts.1.w3.weight
model.layers.0.block_sparse_moe.experts.2.w1.weight
model.layers.0.block_sparse_moe.experts.2.w2.weight
model.layers.0.block_sparse_moe.experts.2.w3.weight
model.layers.0.block_sparse_moe.experts.3.w1.weight
model.layers.0.block_sparse_moe.experts.3.w2.weight
model.layers.0.block_sparse_moe.experts.3.w3.weight
model.layers.0.block_sparse_moe.experts.4.w1.weight
model.layers.0.block_sparse_moe.experts.4.w2.weight
model.layers.0.block_sparse_moe.experts.4.w3.weight
model.layers.0.block_sparse_moe.experts.5.w1.weight
model.layers.0.block_sparse_moe.experts.5.w2.weight
model.layers.0.block_sparse_moe.experts.5.w3.weight
model.layers.0.block_sparse_moe.experts.6.w1.weight
model.layers.0.block_sparse_moe.experts.6.w2.weight
model.layers.0.block_sparse_moe.experts.6.w3.weight
model.layers.0.block_sparse_moe.experts.7.w1.weight
model.layers.0.block_sparse_moe.experts.7.w2.weight
model.layers.0.block_sparse_moe.experts.7.w3.weight
model.layers.0.block_sparse_moe.gate.weight
model.layers.0.input_layernorm.weight
model.layers.0.post_attention_layernorm.weight
model.layers.0.self_attn.k_proj.weight
model.layers.0.self_attn.o_proj.weight
model.layers.0.self_attn.q_proj.weight
model.layers.0.self_attn.v_proj.weight
You can compare these results to dumping a gguf file:
21:20:07 | /mnt/valerie/pygptprompt
(.venv) git:(main | Δ) λ python -m pygptprompt.cli.dump.gguf models/mistralai/Mistral-7B-Instruct-v0.2/ggml-model-f16.gguf
* Loading: models/mistralai/Mistral-7B-Instruct-v0.2/ggml-model-f16.gguf
* File is LITTLE endian, script is running on a LITTLE endian host.
* Dumping 25 key/value pair(s)
1: UINT32 | 1 | GGUF.version = 3
2: UINT64 | 1 | GGUF.tensor_count = 291
3: UINT64 | 1 | GGUF.kv_count = 22
4: STRING | 1 | general.architecture = 'llama'
5: STRING | 1 | general.name = 'mistralai'
6: UINT32 | 1 | llama.context_length = 32768
7: UINT32 | 1 | llama.embedding_length = 4096
8: UINT32 | 1 | llama.block_count = 32
9: UINT32 | 1 | llama.feed_forward_length = 14336
10: UINT32 | 1 | llama.rope.dimension_count = 128
11: UINT32 | 1 | llama.attention.head_count = 32
12: UINT32 | 1 | llama.attention.head_count_kv = 8
13: FLOAT32 | 1 | llama.attention.layer_norm_rms_epsilon = 9.999999747378752e-06
14: FLOAT32 | 1 | llama.rope.freq_base = 1000000.0
15: UINT32 | 1 | general.file_type = 1
16: STRING | 1 | tokenizer.ggml.model = 'llama'
17: [STRING] | 32000 | tokenizer.ggml.tokens
18: [FLOAT32] | 32000 | tokenizer.ggml.scores
19: [INT32] | 32000 | tokenizer.ggml.token_type
20: UINT32 | 1 | tokenizer.ggml.bos_token_id = 1
21: UINT32 | 1 | tokenizer.ggml.eos_token_id = 2
22: UINT32 | 1 | tokenizer.ggml.unknown_token_id = 0
23: BOOL | 1 | tokenizer.ggml.add_bos_token = True
24: BOOL | 1 | tokenizer.ggml.add_eos_token = False
25: STRING | 1 | tokenizer.chat_template = "{{ bos_token }}{% for message in messages %}{% if (message['"
* Dumping 291 tensor(s)
1: 131072000 | 4096, 32000, 1, 1 | F16 | token_embd.weight
2: 16777216 | 4096, 4096, 1, 1 | F16 | blk.0.attn_q.weight
3: 4194304 | 4096, 1024, 1, 1 | F16 | blk.0.attn_k.weight
4: 4194304 | 4096, 1024, 1, 1 | F16 | blk.0.attn_v.weight
5: 16777216 | 4096, 4096, 1, 1 | F16 | blk.0.attn_output.weight
6: 58720256 | 4096, 14336, 1, 1 | F16 | blk.0.ffn_gate.weight
7: 58720256 | 4096, 14336, 1, 1 | F16 | blk.0.ffn_up.weight
8: 58720256 | 14336, 4096, 1, 1 | F16 | blk.0.ffn_down.weight
9: 4096 | 4096, 1, 1, 1 | F32 | blk.0.attn_norm.weight
10: 4096 | 4096, 1, 1, 1 | F32 | blk.0.ffn_norm.weight
21:20:13 | /mnt/valerie/pygptprompt
(.venv) git:(main | Δ) λ python -m pygptprompt.cli.dump.gguf models/mistralai/Mixtral-8x7B-Instruct-v0.1/ggml-model-f16.gguf
* Loading: models/mistralai/Mixtral-8x7B-Instruct-v0.1/ggml-model-f16.gguf
* File is LITTLE endian, script is running on a LITTLE endian host.
* Dumping 27 key/value pair(s)
1: UINT32 | 1 | GGUF.version = 3
2: UINT64 | 1 | GGUF.tensor_count = 995
3: UINT64 | 1 | GGUF.kv_count = 24
4: STRING | 1 | general.architecture = 'llama'
5: STRING | 1 | general.name = 'mistralai'
6: UINT32 | 1 | llama.context_length = 32768
7: UINT32 | 1 | llama.embedding_length = 4096
8: UINT32 | 1 | llama.block_count = 32
9: UINT32 | 1 | llama.feed_forward_length = 14336
10: UINT32 | 1 | llama.rope.dimension_count = 128
11: UINT32 | 1 | llama.attention.head_count = 32
12: UINT32 | 1 | llama.attention.head_count_kv = 8
13: FLOAT32 | 1 | llama.attention.layer_norm_rms_epsilon = 9.999999747378752e-06
14: UINT32 | 1 | llama.expert_count = 8
15: UINT32 | 1 | llama.expert_used_count = 2
16: FLOAT32 | 1 | llama.rope.freq_base = 1000000.0
17: UINT32 | 1 | general.file_type = 1
18: STRING | 1 | tokenizer.ggml.model = 'llama'
19: [STRING] | 32000 | tokenizer.ggml.tokens
20: [FLOAT32] | 32000 | tokenizer.ggml.scores
21: [INT32] | 32000 | tokenizer.ggml.token_type
22: UINT32 | 1 | tokenizer.ggml.bos_token_id = 1
23: UINT32 | 1 | tokenizer.ggml.eos_token_id = 2
24: UINT32 | 1 | tokenizer.ggml.unknown_token_id = 0
25: BOOL | 1 | tokenizer.ggml.add_bos_token = True
26: BOOL | 1 | tokenizer.ggml.add_eos_token = False
27: STRING | 1 | tokenizer.chat_template = "{{ bos_token }}{% for message in messages %}{% if (message['"
* Dumping 995 tensor(s)
1: 131072000 | 4096, 32000, 1, 1 | F16 | token_embd.weight
2: 58720256 | 4096, 14336, 1, 1 | F16 | blk.0.ffn_gate.0.weight
3: 58720256 | 14336, 4096, 1, 1 | F16 | blk.0.ffn_down.0.weight
4: 58720256 | 4096, 14336, 1, 1 | F16 | blk.0.ffn_up.0.weight
5: 58720256 | 4096, 14336, 1, 1 | F16 | blk.0.ffn_gate.1.weight
6: 58720256 | 14336, 4096, 1, 1 | F16 | blk.0.ffn_down.1.weight
7: 58720256 | 4096, 14336, 1, 1 | F16 | blk.0.ffn_up.1.weight
8: 58720256 | 4096, 14336, 1, 1 | F16 | blk.0.ffn_gate.2.weight
9: 58720256 | 14336, 4096, 1, 1 | F16 | blk.0.ffn_down.2.weight
10: 58720256 | 4096, 14336, 1, 1 | F16 | blk.0.ffn_up.2.weight
11: 58720256 | 4096, 14336, 1, 1 | F16 | blk.0.ffn_gate.3.weight
12: 58720256 | 14336, 4096, 1, 1 | F16 | blk.0.ffn_down.3.weight
13: 58720256 | 4096, 14336, 1, 1 | F16 | blk.0.ffn_up.3.weight
14: 58720256 | 4096, 14336, 1, 1 | F16 | blk.0.ffn_gate.4.weight
15: 58720256 | 14336, 4096, 1, 1 | F16 | blk.0.ffn_down.4.weight
16: 58720256 | 4096, 14336, 1, 1 | F16 | blk.0.ffn_up.4.weight
17: 58720256 | 4096, 14336, 1, 1 | F16 | blk.0.ffn_gate.5.weight
18: 58720256 | 14336, 4096, 1, 1 | F16 | blk.0.ffn_down.5.weight
19: 58720256 | 4096, 14336, 1, 1 | F16 | blk.0.ffn_up.5.weight
20: 58720256 | 4096, 14336, 1, 1 | F16 | blk.0.ffn_gate.6.weight
21: 58720256 | 14336, 4096, 1, 1 | F16 | blk.0.ffn_down.6.weight
22: 58720256 | 4096, 14336, 1, 1 | F16 | blk.0.ffn_up.6.weight
23: 58720256 | 4096, 14336, 1, 1 | F16 | blk.0.ffn_gate.7.weight
24: 58720256 | 14336, 4096, 1, 1 | F16 | blk.0.ffn_down.7.weight
25: 58720256 | 4096, 14336, 1, 1 | F16 | blk.0.ffn_up.7.weight
26: 32768 | 4096, 8, 1, 1 | F16 | blk.0.ffn_gate_inp.weight
27: 4096 | 4096, 1, 1, 1 | F32 | blk.0.attn_norm.weight
28: 4096 | 4096, 1, 1, 1 | F32 | blk.0.ffn_norm.weight
29: 4194304 | 4096, 1024, 1, 1 | F16 | blk.0.attn_k.weight
30: 16777216 | 4096, 4096, 1, 1 | F16 | blk.0.attn_output.weight
31: 16777216 | 4096, 4096, 1, 1 | F16 | blk.0.attn_q.weight
32: 4194304 | 4096, 1024, 1, 1 | F16 | blk.0.attn_v.weight |
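For anyone comparing the two kinds of dump by hand, the renaming can be sketched in a few lines of Python. This is only a sketch read off the listings above, not the logic of the real conversion script: the function name hf_to_gguf_name and both tables are made up for illustration, and the w1/w3 pairing with ffn_gate/ffn_up is inferred from the order of the two listings (only w2 = ffn_down is confirmed by its 14336 x 4096 shape).

```python
import re

# Per-layer renames read off the dumps (Hugging Face name -> GGUF name).
# Hypothetical tables for illustration, not the converter's own.
LAYER_SUFFIXES = {
    "self_attn.q_proj": "attn_q",
    "self_attn.k_proj": "attn_k",
    "self_attn.v_proj": "attn_v",
    "self_attn.o_proj": "attn_output",
    "mlp.gate_proj": "ffn_gate",
    "mlp.up_proj": "ffn_up",
    "mlp.down_proj": "ffn_down",
    "input_layernorm": "attn_norm",
    "post_attention_layernorm": "ffn_norm",
    "block_sparse_moe.gate": "ffn_gate_inp",
}

# Assumption: w1/w3 are matched to gate/up by listing order only.
EXPERT_SUFFIXES = {"w1": "ffn_gate", "w2": "ffn_down", "w3": "ffn_up"}

LAYER_RE = re.compile(r"^model\.layers\.(\d+)\.(.+)\.weight$")
EXPERT_RE = re.compile(r"^block_sparse_moe\.experts\.(\d+)\.(w[123])$")


def hf_to_gguf_name(name: str) -> str:
    """Translate one tensor name from the torch dump into its GGUF form."""
    if name == "model.embed_tokens.weight":
        return "token_embd.weight"

    layer = LAYER_RE.match(name)
    if layer is None:
        raise KeyError(f"no mapping known for {name!r}")
    block, suffix = layer.group(1), layer.group(2)

    # Mixtral experts carry the expert index after the GGUF tensor kind.
    expert = EXPERT_RE.match(suffix)
    if expert is not None:
        kind = EXPERT_SUFFIXES[expert.group(2)]
        return f"blk.{block}.{kind}.{expert.group(1)}.weight"

    if suffix not in LAYER_SUFFIXES:
        raise KeyError(f"no mapping known for {name!r}")
    return f"blk.{block}.{LAYER_SUFFIXES[suffix]}.weight"


if __name__ == "__main__":
    for hf_name in (
        "model.layers.0.self_attn.q_proj.weight",
        "model.layers.0.block_sparse_moe.experts.7.w2.weight",
    ):
        print(hf_name, "->", hf_to_gguf_name(hf_name))
```

Only the tensors visible in the dumps are covered; anything else (for example the final norm or output head, which are not shown above) raises KeyError rather than guessing.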
How do I do that? I tried and got:
|
This issue is stale because it has been open for 30 days with no activity. |
Any solution for mistral? |
@munish0838 just use
then quantize as normal |
Thank you @cognitivetech |
This issue was closed because it has been inactive for 14 days since being marked as stale. |
Hi.
I'm trying to deploy Mistral-7B-v0.1 locally, as the documentation mentions it's supported, but it fails to generate the GGUF model file.
The error is
NotImplementedError: Architecture "MistralForCausalLM" not supported!
As using GGUF files is a breaking change and the Mistral-7B model should be supported, I think adding support for the MistralForCausalLM architecture to convert-hf-to-gguf.py is essential.
I'm running the latest version as of Dec 14, 2023, which is b1637.
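To make the request concrete, here is a minimal sketch of the kind of lookup that would avoid the error. It is not the code in convert-hf-to-gguf.py: ARCH_ALIASES and resolve_gguf_arch are hypothetical names, and the only assumptions are that the Hugging Face config.json lists the class under "architectures" and that Mistral can be written out with general.architecture = 'llama', as the existing converted files are.

```python
# Hypothetical alias table: Hugging Face class name -> GGUF architecture string.
# Not taken from convert-hf-to-gguf.py; for illustration only.
ARCH_ALIASES = {
    "LlamaForCausalLM": "llama",
    "MistralForCausalLM": "llama",  # treated as llama, as in existing conversions
    "MixtralForCausalLM": "llama",  # MoE variant, also written as llama
}


def resolve_gguf_arch(hf_config: dict) -> str:
    """Pick the GGUF architecture for a parsed Hugging Face config.json."""
    architectures = hf_config.get("architectures") or []
    if not architectures:
        raise ValueError('config has no "architectures" entry')

    hf_arch = architectures[0]
    try:
        return ARCH_ALIASES[hf_arch]
    except KeyError:
        # Same wording as the error reported in this issue.
        raise NotImplementedError(
            f'Architecture "{hf_arch}" not supported!'
        ) from None


if __name__ == "__main__":
    print(resolve_gguf_arch({"architectures": ["MistralForCausalLM"]}))
```

With an alias like this, an unknown class still fails loudly, while Mistral checkpoints would take the same path as Llama ones.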