Introduction
Model download
Run the model
Fine-tuning the model
Limitations

PhoGPT: Generative Pre-training for Vietnamese

We open-source a state-of-the-art 4B-parameter generative model series for Vietnamese, which includes the base pre-trained monolingual model PhoGPT-4B and its chat variant, PhoGPT-4B-Chat. The base model, PhoGPT-4B, with exactly 3.7B parameters, is pre-trained from scratch on a Vietnamese corpus of 102B tokens, with an 8192 context length and a vocabulary of 20K token types. The chat variant, PhoGPT-4B-Chat, is obtained by fine-tuning PhoGPT-4B on a dataset of 70K instructional prompts and their responses, along with an additional 290K conversations. We demonstrate its superior performance compared to previous open-source models.

More details about the general architecture and experimental results of PhoGPT can be found in our technical report. All output responses of PhoGPT and baselines are available HERE for readers' self-evaluation. Please CITE our technical report when PhoGPT is used to help produce published results or is incorporated into other software:

@article{PhoGPT,
title     = {{PhoGPT: Generative Pre-training for Vietnamese}},
author    = {Dat Quoc Nguyen and Linh The Nguyen and Chi Tran and Dung Ngoc Nguyen and Dinh Phung and Hung Bui},
journal   = {arXiv preprint},
volume    = {arXiv:2311.02945},
year      = {2023}
}

Model download

| Model | Type | Model size (parameters) | Context length | Vocab size | Training data size | Note |
|---|---|---|---|---|---|---|
| `vinai/PhoGPT-4B` | Base | 3.7B | 8192 | 20K | 2 training epochs on 482GB of texts | Loading "PhoGPT-4B" or "PhoGPT-4B-Chat" in float16 takes 7GB of GPU memory |
| `vinai/PhoGPT-4B-Chat` | Instruction following & Chat | 3.7B | 8192 | 20K | 70K instructional prompt and response pairs & 290K conversations | `PROMPT_TEMPLATE = "### Câu hỏi: {instruction}\n### Trả lời:"` |
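The ~7GB figure follows from simple arithmetic: float16 (and bfloat16) store each parameter in 2 bytes. The sketch below is a rough estimate of weight memory only; it ignores activations, the KV cache, framework overhead, and the scale/metadata overhead of quantized formats, so real usage will be somewhat higher.

```python
def weight_memory_gb(num_params: float, bits_per_param: float) -> float:
    """Approximate memory (GB, 1 GB = 10**9 bytes) needed to hold the weights only."""
    return num_params * bits_per_param / 8 / 1e9


PHOGPT_PARAMS = 3.7e9  # "exactly 3.7B parameters", per the model table

for name, bits in [("float16/bfloat16", 16), ("8-bit", 8), ("4-bit", 4)]:
    print(f"{name}: ~{weight_memory_gb(PHOGPT_PARAMS, bits):.2f} GB")
```

For float16 this gives about 7.4 GB (roughly 6.9 GiB), consistent with the 7GB noted above.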

Run the model

With vLLM, Text Generation Inference & llama.cpp

PhoGPT can be served with inference engines such as vLLM, Text Generation Inference and llama.cpp.
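vLLM can expose an OpenAI-compatible HTTP server. Assuming one is running locally on vLLM's default port 8000 with `vinai/PhoGPT-4B-Chat` loaded, a completion request could be built as below (Text Generation Inference uses a different request schema). The URL, port, and sampling values are assumptions mirroring the `transformers` example; the key point is that the prompt must still follow `PROMPT_TEMPLATE`.

```python
import json
import urllib.request

PROMPT_TEMPLATE = "### Câu hỏi: {instruction}\n### Trả lời:"


def build_completion_request(
    instruction: str,
    base_url: str = "http://localhost:8000",  # assumption: local vLLM server, default port
    model: str = "vinai/PhoGPT-4B-Chat",
    max_tokens: int = 1024,
) -> urllib.request.Request:
    """Build (but do not send) a POST to an OpenAI-compatible /v1/completions endpoint."""
    payload = {
        "model": model,
        "prompt": PROMPT_TEMPLATE.format(instruction=instruction),
        "max_tokens": max_tokens,
        "temperature": 1.0,
        "top_p": 0.9,
    }
    return urllib.request.Request(
        f"{base_url}/v1/completions",
        data=json.dumps(payload, ensure_ascii=False).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_completion_request("Viết bài văn nghị luận xã hội về an toàn giao thông")
# Sending it requires a running server:
# with urllib.request.urlopen(req) as resp:
#     print(json.loads(resp.read())["choices"][0]["text"])
```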

With llama.cpp

1. Compile llama.cpp.
2. Install the Python dependencies from llama.cpp:

   ```bash
   cd llama.cpp
   python3 -m pip install -r requirements.txt
   ```

3. Convert the model to GGUF FP16 format: `python3 convert-hf-to-gguf.py <path_to_PhoGPT-4B-Chat_model> --outfile ./PhoGPT-4B-Chat.gguf`
4. (Optional) Quantize the model to 4 or 8 bits:
   - `./quantize ./PhoGPT-4B-Chat.gguf ./PhoGPT-4B-Chat-Q4_K_M.gguf Q4_K_M`
   - `./quantize ./PhoGPT-4B-Chat.gguf ./PhoGPT-4B-Chat-Q8_0.gguf Q8_0`
5. Start inference on a GGUF model (the example prompt asks for a social argumentative essay on traffic safety): `./main -m ./PhoGPT-4B-Chat-Q4_K_M.gguf -n 1024 -p "### Câu hỏi: Viết bài văn nghị luận xã hội về an toàn giao thông\n### Trả lời:"`
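In POSIX shells, `\n` inside double quotes stays a literal backslash followed by `n`; whether llama.cpp turns it into a newline depends on its escape-processing behavior and version. One way to sidestep shell quoting is to build the argument lists in Python, where the prompt contains a real newline. The helpers below are hypothetical and only reproduce the flags shown in the steps above (`-m`, `-n`, `-p`).

```python
import shlex

PROMPT_TEMPLATE = "### Câu hỏi: {instruction}\n### Trả lời:"


def quantize_commands(src: str = "./PhoGPT-4B-Chat.gguf",
                      types: tuple[str, ...] = ("Q4_K_M", "Q8_0")) -> list[list[str]]:
    """One ./quantize invocation per target type, mirroring step 4."""
    stem = src.removesuffix(".gguf")
    return [["./quantize", src, f"{stem}-{t}.gguf", t] for t in types]


def main_command(model: str, instruction: str, n_predict: int = 1024) -> list[str]:
    """Argument list for ./main; the prompt holds a real newline, no shell escaping involved."""
    prompt = PROMPT_TEMPLATE.format(instruction=instruction)
    return ["./main", "-m", model, "-n", str(n_predict), "-p", prompt]


for cmd in quantize_commands():
    print(shlex.join(cmd))
cmd = main_command("./PhoGPT-4B-Chat-Q4_K_M.gguf",
                   "Viết bài văn nghị luận xã hội về an toàn giao thông")
print(shlex.join(cmd))
# subprocess.run(cmd, check=True) would launch it from the llama.cpp directory.
```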

Converted GGUF files are available at `vinai/PhoGPT-4B-Chat-gguf`. Note that `phogpt_4b_chat_preset.json` may be needed for LM Studio to work properly with these GGUF files.

With pure `transformers`

Instruction following

```python
# coding: utf8
import torch
from transformers import AutoConfig, AutoModelForCausalLM, AutoTokenizer

model_path = "vinai/PhoGPT-4B-Chat"

config = AutoConfig.from_pretrained(model_path, trust_remote_code=True)
config.init_device = "cuda"
# config.attn_config['attn_impl'] = 'flash' # If installed: this will use either Flash Attention V1 or V2 depending on what is installed

model = AutoModelForCausalLM.from_pretrained(model_path, config=config, torch_dtype=torch.bfloat16, trust_remote_code=True)
# If your GPU does not support bfloat16:
# model = AutoModelForCausalLM.from_pretrained(model_path, config=config, torch_dtype=torch.float16, trust_remote_code=True)
model.eval()

tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

# "Câu hỏi" = "Question", "Trả lời" = "Answer"
PROMPT_TEMPLATE = "### Câu hỏi: {instruction}\n### Trả lời:"

# Some instruction examples (English glosses after each):
# instruction = "Viết bài văn nghị luận xã hội về {topic}"  # Write a social argumentative essay about {topic}
# instruction = "Viết bản mô tả công việc cho vị trí {job_title}"  # Write a job description for the {job_title} position
# instruction = "Sửa lỗi chính tả:\n{sentence_or_paragraph}"  # Correct the spelling of a sentence or paragraph
# instruction = "Dựa vào văn bản sau đây:\n{text}\nHãy trả lời câu hỏi: {question}"  # Based on the text, answer the question
# instruction = "Tóm tắt văn bản:\n{text}"  # Summarize the text

instruction = "Viết bài văn nghị luận xã hội về an toàn giao thông"  # ... about traffic safety
# Spelling-correction example; "kướp" is a deliberate misspelling of "cướp":
# instruction = "Sửa lỗi chính tả:\nTriệt phá băng nhóm kướp ô tô, sử dụng \"vũ khí nóng\""

input_prompt = PROMPT_TEMPLATE.format_map({"instruction": instruction})

input_ids = tokenizer(input_prompt, return_tensors="pt")

outputs = model.generate(
    inputs=input_ids["input_ids"].to("cuda"),
    attention_mask=input_ids["attention_mask"].to("cuda"),
    do_sample=True,
    temperature=1.0,
    top_k=50,
    top_p=0.9,
    max_new_tokens=1024,
    eos_token_id=tokenizer.eos_token_id,
    pad_token_id=tokenizer.pad_token_id,
)

response = tokenizer.batch_decode(outputs, skip_special_tokens=True)[0]
# The decoded text still contains the prompt; keep only what follows the answer marker
response = response.split("### Trả lời:")[1]
```
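`response.split("### Trả lời:")[1]` raises `IndexError` if the marker is missing from the decoded text. A slightly more defensive sketch (the helper name is hypothetical) falls back to the whole text and also stops at a follow-up `### Câu hỏi:` if the model starts inventing a new turn, which is a design choice rather than something the original code does.

```python
ANSWER_MARKER = "### Trả lời:"
QUESTION_MARKER = "### Câu hỏi:"


def extract_answer(decoded: str) -> str:
    """Return the text after the first answer marker, cut at any follow-up question."""
    _, sep, tail = decoded.partition(ANSWER_MARKER)
    if not sep:
        # Marker not found: return everything rather than raising IndexError
        return decoded.strip()
    # Drop a hallucinated next turn, if the model generated one
    return tail.partition(QUESTION_MARKER)[0].strip()


print(extract_answer("### Câu hỏi: Hi\n### Trả lời: Xin chào!"))
```

Alternatively, decoding only the newly generated tokens avoids the split entirely: `tokenizer.decode(outputs[0][input_ids["input_ids"].shape[1]:], skip_special_tokens=True)`.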

Chat

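```python
from transformers import AutoTokenizer

# Turns: "Name an extreme sport" / "Bungee jumping." / "Have you ever been bungee jumping?"
messages = [
    {"role": "user", "content": "Kể tên một môn thể thao mạo hiểm"},
    {"role": "assistant", "content": "Nhảy Bungee."},
    {"role": "user", "content": "Bạn đã bao giờ đi nhảy bungee chưa"}
]

# Using apply_chat_template
tokenizer = AutoTokenizer.from_pretrained("vinai/PhoGPT-4B-Chat", trust_remote_code=True)
input_prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
# input_prompt can then be tokenized and passed to model.generate as in the instruction-following example
```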
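The `messages` list above alternates user and assistant turns and ends with a user turn, so `add_generation_prompt=True` appends the cue for the next assistant reply. A small guard, assuming (the source does not state it) that PhoGPT's chat template expects this strict alternation like many chat templates do, could catch malformed conversations before formatting; `validate_chat` is a hypothetical helper.

```python
def validate_chat(messages: list[dict]) -> None:
    """Raise ValueError unless turns alternate user/assistant, starting and ending with user."""
    if not messages:
        raise ValueError("conversation is empty")
    for i, message in enumerate(messages):
        expected = "user" if i % 2 == 0 else "assistant"
        if message.get("role") != expected:
            raise ValueError(f"message {i}: expected role {expected!r}, got {message.get('role')!r}")
        if not isinstance(message.get("content"), str):
            raise ValueError(f"message {i}: content must be a string")
    if messages[-1]["role"] != "user":
        raise ValueError("the last message must come from the user before requesting a reply")


validate_chat([
    {"role": "user", "content": "Kể tên một môn thể thao mạo hiểm"},
    {"role": "assistant", "content": "Nhảy Bungee."},
    {"role": "user", "content": "Bạn đã bao giờ đi nhảy bungee chưa"},
])  # passes silently
```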

Quantization with `bitsandbytes`

```python
import torch
from transformers import BitsAndBytesConfig, AutoConfig, AutoModelForCausalLM, AutoTokenizer

config = AutoConfig.from_pretrained("vinai/PhoGPT-4B-Chat", trust_remote_code=True)
config.init_device = "cuda"

# 8-bit quantization (PhoGPT ships custom modeling code, so trust_remote_code is required here too)
model_8bit = AutoModelForCausalLM.from_pretrained(
    "vinai/PhoGPT-4B-Chat",
    config=config,
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
    trust_remote_code=True,
)
```

Fine-tuning the model

See the llm-foundry documentation for details. To fully fine-tune PhoGPT, users can start from the example YAML configuration `fine-tuning-phogpt.yaml`. The `sample_instruction_following_dataset` folder provides an example of an instruction-following dataset.
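llm-foundry's fine-tuning data loader works with prompt/response pairs. Assuming the sample dataset follows that convention (check the actual files in `sample_instruction_following_dataset`, since the `prompt`/`response` key names and the `train.jsonl` file name below are assumptions), your own instruction data could be serialized as JSON Lines, with prompts wrapped in the same template used at inference time:

```python
import json

# Same template as the instruction-following example
PROMPT_TEMPLATE = "### Câu hỏi: {instruction}\n### Trả lời:"


def to_jsonl(pairs: list[tuple[str, str]]) -> str:
    """Serialize (instruction, response) pairs as JSON Lines with assumed prompt/response keys."""
    lines = []
    for instruction, response in pairs:
        record = {
            "prompt": PROMPT_TEMPLATE.format(instruction=instruction),
            "response": response,
        }
        # ensure_ascii=False keeps Vietnamese diacritics readable in the file
        lines.append(json.dumps(record, ensure_ascii=False))
    return "\n".join(lines) + "\n"


data = to_jsonl([("Kể tên một môn thể thao mạo hiểm", "Nhảy Bungee.")])
# Hypothetical output path; write with UTF-8 so diacritics survive:
# with open("train.jsonl", "w", encoding="utf-8") as f:
#     f.write(data)
```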

To install llm-foundry, see the "Installation" section of https://github.com/mosaicml/llm-foundry.
Then run `cd llm-foundry/scripts/train/` followed by `composer --world_size <number_of_GPUs> train.py <path_to_yaml_configuration_file>` (e.g. `composer --world_size 1 train.py fine-tuning-phogpt.yaml`).

Other fine-tuning options include the `transformers` `Trainer` (see stanford_alpaca for an example), lit-gpt and LLaMA-Factory.
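Trainer-based recipes in the style of stanford_alpaca typically compute the loss only on response tokens by setting the prompt positions in `labels` to -100, the default `ignore_index` of PyTorch's `CrossEntropyLoss`. A tokenizer-free sketch of that masking (the helper name and plain-int token ids are hypothetical; this is not necessarily how PhoGPT itself was fine-tuned):

```python
IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss


def build_example(prompt_ids: list[int], response_ids: list[int],
                  eos_id: int, max_len: int = 8192) -> tuple[list[int], list[int]]:
    """Concatenate prompt + response + EOS; mask prompt positions out of the loss."""
    input_ids = list(prompt_ids) + list(response_ids) + [eos_id]
    labels = [IGNORE_INDEX] * len(prompt_ids) + list(response_ids) + [eos_id]
    # Truncate to the model's 8192-token context length
    return input_ids[:max_len], labels[:max_len]


ids, labels = build_example([11, 12, 13], [21, 22], eos_id=2)
print(ids)
print(labels)
```

Labels stay aligned with `input_ids` because Hugging Face causal LM heads shift them internally when computing the loss.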

Limitations

PhoGPT has certain limitations. For example, it is not good at tasks involving reasoning, coding or mathematics. PhoGPT may generate harmful or biased responses, including hate speech, or answer unsafe questions. Users should be cautious when interacting with PhoGPT, as it can produce factually incorrect output.
