Pretrain a Mistral Architecture Model with SFT Trainer in 70 Lines of Python #770

irthomasthomas opened this issue Mar 16, 2024 · 1 comment
Labels
code-generation, dataset, llm, python

["Pretrain a Mistral Architecture Model with SFT Trainer in 70 Lines of Python"](https://huggingface.co/cloudyu/mistral_pretrain_demo)

Description

This is a demo of how to pretrain a Mistral-architecture model with the TRL SFTTrainer; the whole script is about 70 lines of Python.

import torch
from transformers import TrainingArguments, MistralForCausalLM, MistralModel, MistralConfig, AutoTokenizer
from datasets import load_dataset
from trl import SFTTrainer

configuration = MistralConfig(
    vocab_size=32000,
    hidden_size=2048,
    intermediate_size=7168,
    num_hidden_layers=24,
    num_attention_heads=32,
    num_key_value_heads=8,
    hidden_act="silu",
    max_position_embeddings=4096,
    pad_token_id=2,
    bos_token_id=1,
    eos_token_id=2,
)

model = MistralForCausalLM(configuration)
#model = MistralForCausalLM.from_pretrained("./6B_code_outputs/checkpoint-10000")
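# Optional sanity check (not part of the original 70-line demo): with the config
# above this is roughly a 1B-parameter model, which matches the "1B" output
# directory names used below.
print(f"Parameter count: {model.num_parameters():,}")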
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.2", local_files_only=False)
tokenizer.pad_token = tokenizer.eos_token

dataset = load_dataset('HuggingFaceTB/cosmopedia-20k', split="train")
#dataset = load_dataset('Elriggs/openwebtext-100k', split="train")
dataset = dataset.shuffle(seed=42)
print(f'Number of prompts: {len(dataset)}')
print(f'Column names are: {dataset.column_names}')

def create_prompt_formats(sample):
    """
    Formatting function for SFTTrainer: return the raw 'text' field of each
    example in the batch as its training prompt.
    :param sample: batch dictionary from the dataset
    """
    output_texts = []
    for i in range(len(sample['text'])):
        formatted_prompt = sample['text'][i]
        output_texts.append(formatted_prompt)
    #print(output_texts)
    return output_texts


trainer = SFTTrainer(
    model,
    train_dataset=dataset,
    tokenizer=tokenizer,
    max_seq_length=2048,
    formatting_func=create_prompt_formats,
    args=TrainingArguments(
        per_device_train_batch_size=2,
        gradient_accumulation_steps=1,
        warmup_steps=2,
        max_steps=10000,
        learning_rate=1e-4,
        logging_steps=1,
        output_dir="1B_outputs",
        overwrite_output_dir=True,
        save_steps=1000,
        optim="paged_adamw_32bit",
        report_to="none",
    ),
)
trainer.train()
# save_pretrained() takes no dtype argument; cast to float32 first if that is the intent
trainer.model.to(torch.float32)
trainer.model.save_pretrained("1B-final")
trainer.tokenizer.save_pretrained("1B-final")
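
After training, the saved checkpoint can be reloaded for a quick generation test. This is a minimal sketch (not part of the original demo), assuming training finished and the 1B-final directory exists:

import torch
from transformers import AutoTokenizer, MistralForCausalLM

# Reload the freshly pretrained checkpoint and its tokenizer
model = MistralForCausalLM.from_pretrained("1B-final", torch_dtype=torch.float32)
tokenizer = AutoTokenizer.from_pretrained("1B-final")

# Sample a short continuation to confirm the model produces tokens
inputs = tokenizer("Once upon a time", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=50, do_sample=True, top_p=0.9)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))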

URL

https://huggingface.co/cloudyu/mistral_pretrain_demo

Related content

#324 similarity score: 0.91
#383 similarity score: 0.9
#660 similarity score: 0.89
#762 similarity score: 0.89
#499 similarity score: 0.89
#625 similarity score: 0.89
