Script to train a BERT model and tokenizer on Bulgarian data from scratch


Scripts to train BulBERT

An LLM trained from scratch on Bulgarian data.

The model and its tokenizer are trained from scratch on Bulgarian data from the chitanka dataset.

The notebook contains the code to load the raw dataset from my Hugging Face Hub and train the tokenizer from scratch: https://huggingface.co/datasets/mor40/chitanka_raw_document

Alternatively, you can load the already tokenized dataset from here and start training right away: https://huggingface.co/datasets/mor40/tokenized_chitanka

The BERT config it uses is:

- vocab_size=50265
- max_position_embeddings=512
- num_attention_heads=12
- num_hidden_layers=6
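The config values above can be turned into a randomly initialised model with the standard Transformers API; this is a minimal sketch, not the repository's exact training script:

```python
from transformers import BertConfig, BertForMaskedLM

# The config values listed above; everything else keeps the BERT defaults
config = BertConfig(
    vocab_size=50265,
    max_position_embeddings=512,
    num_attention_heads=12,
    num_hidden_layers=6,
)

# A randomly initialised model ready for masked-language-model pretraining
model = BertForMaskedLM(config)
print(f"{model.num_parameters():,} parameters")
```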

The model is trained for 3 epochs and reaches a perplexity of 6.75 on the eval set.
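Perplexity here is simply the exponential of the mean cross-entropy loss on the eval set. The loss value below is hypothetical, back-calculated from the reported perplexity purely to illustrate the relationship:

```python
import math

# Hypothetical mean cross-entropy loss on the eval set (back-calculated
# from the reported perplexity, shown only to illustrate the formula)
eval_loss = 1.9095

perplexity = math.exp(eval_loss)
print(round(perplexity, 2))  # → 6.75
```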

The trained model is available here: https://huggingface.co/mor40/BulBERT-chitanka-model
