Script to train a BERT model and tokenizer on Bulgarian data from scratch


Scripts to train BulBERT

An LLM trained from scratch on Bulgarian data.

The model and its tokenizer are trained from scratch on Bulgarian data from the chitanka dataset.

The notebook contains the code to load the raw dataset from my Hugging Face Hub and train the tokenizer from scratch: https://huggingface.co/datasets/mor40/chitanka_raw_document

Alternatively, you can load the already tokenized dataset from here and start training right away: https://huggingface.co/datasets/mor40/tokenized_chitanka

The BERT config it uses is:

- vocab_size=50265
- max_position_embeddings=512
- num_attention_heads=12
- num_hidden_layers=6
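The config values above can be turned into a randomly initialised model with the standard Transformers API; this is a minimal sketch, not the repository's exact training script:

```python
from transformers import BertConfig, BertForMaskedLM

# The config values listed above; everything else keeps the BERT defaults
config = BertConfig(
    vocab_size=50265,
    max_position_embeddings=512,
    num_attention_heads=12,
    num_hidden_layers=6,
)

# A randomly initialised model ready for masked-language-model pretraining
model = BertForMaskedLM(config)
print(f"{model.num_parameters():,} parameters")
```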

The model is trained for 3 epochs and reaches a perplexity of 6.75 on the eval set.
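Perplexity here is simply the exponential of the mean cross-entropy loss on the eval set. The loss value below is hypothetical, back-calculated from the reported perplexity purely to illustrate the relationship:

```python
import math

# Hypothetical mean cross-entropy loss on the eval set (back-calculated
# from the reported perplexity, shown only to illustrate the formula)
eval_loss = 1.9095

perplexity = math.exp(eval_loss)
print(round(perplexity, 2))  # → 6.75
```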

The trained model is available here: https://huggingface.co/mor40/BulBERT-chitanka-model
