Add PLM GGUF Conversion & Inference Support #12457

Open · wants to merge 39 commits into master

Conversation

@Si1w Si1w commented Mar 18, 2025

This PR adds HF->GGUF conversion and inference support for the PLM model PLM-1.8B-Instruct.

The model has already been converted to GGUF, quantized, and tested: PLM-1.8B-Instruct-gguf, PLM-1.8B-Instruct-id-gguf.

The model architecture is similar to DeepSeek-V2 and MiniCPM3. The key points of the model are:

  • Sparse FFN: PLM uses a squared-ReLU activation with up and down projections only (see the sketch after this list)
  • MLA: PLM uses Multi-head Latent Attention (a KV-compression sketch follows the paper reference below)
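
A minimal sketch of the squared-ReLU FFN, assuming a PyTorch-style module; the class name and dimensions (taken from the GGUF metadata shown further down) are illustrative and not the PR's actual code:

```python
import torch
import torch.nn as nn

class PLMSquaredReLUFFN(nn.Module):
    """Illustrative only: FFN with squared-ReLU activation and no gate projection."""
    def __init__(self, n_embd: int = 2048, n_ff: int = 8192):
        super().__init__()
        self.up = nn.Linear(n_embd, n_ff, bias=False)     # up projection
        self.down = nn.Linear(n_ff, n_embd, bias=False)   # down projection

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = torch.relu(self.up(x))
        return self.down(h * h)  # squared ReLU: relu(x) ** 2
```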

The details of the model can be found in the following paper:

PLM: Efficient Peripheral Language Models Hardware-Co-Designed for Ubiquitous Computing
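
For context, here is a minimal sketch of the low-rank KV-compression idea behind MLA, assuming PyTorch; the dimensions come from the GGUF metadata shown further down (n_embd=2048, kv_lora_rank=512, 16 heads, key length 192, value length 128), and it deliberately omits the decoupled RoPE key path and the attention computation itself:

```python
import torch
import torch.nn as nn

class MLAKVCompression(nn.Module):
    """Illustrative only: K/V are reconstructed from a small cached latent."""
    def __init__(self, n_embd=2048, kv_lora_rank=512, n_head=16, k_dim=192, v_dim=128):
        super().__init__()
        self.kv_down = nn.Linear(n_embd, kv_lora_rank, bias=False)       # compress to latent
        self.k_up = nn.Linear(kv_lora_rank, n_head * k_dim, bias=False)  # expand latent to keys
        self.v_up = nn.Linear(kv_lora_rank, n_head * v_dim, bias=False)  # expand latent to values

    def forward(self, x: torch.Tensor):
        latent = self.kv_down(x)  # only this small latent needs to be kept in the KV cache
        return self.k_up(latent), self.v_up(latent)
```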


Self-reported review complexity:

  • Low
  • Medium
  • High

@github-actions github-actions bot added the python python script changes label Mar 18, 2025
@arch-btw
Contributor

Tested both the premade GGUF and converting the model to GGUF myself; both work 👍

Looks like it's using the qwen2 tokenizer with the associated ChatML prompt template:

llama_model_loader: - kv   0:                       general.architecture str              = plm
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = PLM 1.8B Instruct
llama_model_loader: - kv   3:                           general.finetune str              = Instruct
llama_model_loader: - kv   4:                           general.basename str              = PLM
llama_model_loader: - kv   5:                         general.size_label str              = 1.8B
llama_model_loader: - kv   6:                            general.license str              = mit
llama_model_loader: - kv   7:                            plm.block_count u32              = 32
llama_model_loader: - kv   8:                         plm.context_length u32              = 4096
llama_model_loader: - kv   9:                       plm.embedding_length u32              = 2048
llama_model_loader: - kv  10:                    plm.feed_forward_length u32              = 8192
llama_model_loader: - kv  11:                   plm.attention.head_count u32              = 16
llama_model_loader: - kv  12:                plm.attention.head_count_kv u32              = 16
llama_model_loader: - kv  13:                         plm.rope.freq_base f32              = 100000.000000
llama_model_loader: - kv  14:       plm.attention.layer_norm_rms_epsilon f32              = 0.000001
llama_model_loader: - kv  15:                             plm.vocab_size u32              = 151936
llama_model_loader: - kv  16:                 plm.attention.kv_lora_rank u32              = 512
llama_model_loader: - kv  17:                   plm.attention.key_length u32              = 192
llama_model_loader: - kv  18:                 plm.attention.value_length u32              = 128
llama_model_loader: - kv  19:                   plm.rope.dimension_count u32              = 64
llama_model_loader: - kv  20:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  21:                         tokenizer.ggml.pre str              = qwen2
llama_model_loader: - kv  22:                      tokenizer.ggml.tokens arr[str,151936]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  23:                  tokenizer.ggml.token_type arr[i32,151936]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  24:                      tokenizer.ggml.merges arr[str,151387]  = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv  25:                tokenizer.ggml.eos_token_id u32              = 151643
llama_model_loader: - kv  26:            tokenizer.ggml.padding_token_id u32              = 151643
llama_model_loader: - kv  27:                tokenizer.ggml.bos_token_id u32              = 151643
llama_model_loader: - kv  28:                    tokenizer.chat_template str              = {% for message in messages %}{% if lo...
llama_model_loader: - kv  29:               general.quantization_version u32              = 2
llama_model_loader: - kv  30:                          general.file_type u32              = 15
llama_model_loader: - type  f32:   97 tensors
llama_model_loader: - type q4_K:  176 tensors
llama_model_loader: - type q6_K:   17 tensors
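
For reference, the embedded chat template (kv 28) is the standard ChatML format; a minimal sketch of the prompt it produces, where the helper function is mine and the strings mirror the tokenizer.chat_template printed by the conversion output below:

```python
def format_chatml(messages, add_generation_prompt=True):
    """Illustrative helper mirroring the embedded ChatML chat template."""
    prompt = ""
    if messages and messages[0]["role"] != "system":
        prompt += "<|im_start|>system\nYou are a helpful assistant<|im_end|>\n"
    for m in messages:
        prompt += f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n"
    if add_generation_prompt:
        prompt += "<|im_start|>assistant\n"
    return prompt

print(format_chatml([{"role": "user", "content": "hello!"}]))
```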

The only small error was a brief switch in language, but that's probably not related to this PR:

> hello!
通往成功 ("the road to success")

> How are you?
I'm doing well, thank you! How about you? How can I help you today?

convert_hf_to_gguf.py output:

python convert_hf_to_gguf.py /home/test/PLM-1.8B-Instruct --outtype f32
.....
INFO:hf-to-gguf:Set meta model
INFO:hf-to-gguf:Set model parameters
INFO:hf-to-gguf:gguf: context length = 4096
INFO:hf-to-gguf:gguf: embedding length = 2048
INFO:hf-to-gguf:gguf: feed forward length = 8192
INFO:hf-to-gguf:gguf: head count = 16
INFO:hf-to-gguf:gguf: key-value head count = 16
INFO:hf-to-gguf:gguf: rope theta = 100000.0
INFO:hf-to-gguf:gguf: rms norm epsilon = 1e-06
INFO:hf-to-gguf:gguf: file type = 0
INFO:hf-to-gguf:Set model tokenizer
INFO:gguf.vocab:Adding 151387 merge(s).
INFO:gguf.vocab:Setting special token type eos to 151643
INFO:gguf.vocab:Setting special token type pad to 151643
INFO:gguf.vocab:Setting special token type bos to 151643
INFO:gguf.vocab:Setting chat_template to {% for message in messages %}{% if loop.first and messages[0]['role'] != 'system' %}{{ '<|im_start|>system
You are a helpful assistant<|im_end|>
' }}{% endif %}{{'<|im_start|>' + message['role'] + '
' + message['content'] + '<|im_end|>' + '
'}}{% endfor %}{% if add_generation_prompt %}{{ '<|im_start|>assistant
' }}{% endif %}
INFO:hf-to-gguf:Set model quantization version
INFO:gguf.gguf_writer:Writing the following files:
INFO:gguf.gguf_writer:/home/test/PLM-1.8B-Instruct/PLM-1.8B-Instruct-F32.gguf: n_tensors = 290, total_size = 7.3G
Writing: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 7.30G/7.30G [00:11<00:00, 643Mbyte/s]
INFO:hf-to-gguf:Model successfully exported to /home/test/PLM-1.8B-Instruct/PLM-1.8B-Instruct-F32.gguf
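
If you want to sanity-check the exported metadata without loading the model in llama.cpp, the gguf-py package bundled with the repo can read it back; a minimal sketch, using the output path from the log above:

```python
from gguf import GGUFReader  # gguf-py, shipped in the llama.cpp repo

reader = GGUFReader("/home/test/PLM-1.8B-Instruct/PLM-1.8B-Instruct-F32.gguf")
for name in reader.fields:                 # metadata keys, e.g. plm.attention.kv_lora_rank
    print(name)
print(f"{len(reader.tensors)} tensors")    # should report 290 for this export
```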

@ngxson ngxson requested a review from ggerganov March 21, 2025 20:13