Releases · retarfi/language-pretraining
v2.2.1
- Able to select the SentencePiece algorithm (see the sketch after this list)
- Able to use multiprocessing in create_datasets.py
- Move ELECTRA model file into models directory
- Add DeBERTaV3 (alpha) implementation
- This implementation backpropagates through the generator and discriminator at the same time
- In my experiments, models trained with this implementation perform worse than those trained with my DeBERTaV2 implementation
- The implementation therefore needs improvement, but I don't have time to put effort into it.
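The repository's own command-line options for these features are documented in README.md; the following is only a minimal sketch of the underlying library calls, with placeholder paths and values, showing how a SentencePiece algorithm is chosen and how dataset preprocessing can be parallelized.

```python
# Minimal sketch of the underlying library calls; paths and values are placeholders,
# not the actual arguments of the repository's scripts.
import sentencepiece as spm
from datasets import load_dataset

# SentencePiece selects the segmentation algorithm via model_type
# ("unigram" or "bpe", among others).
spm.SentencePieceTrainer.train(
    input="corpus.txt",        # placeholder corpus path
    model_prefix="spm_ja",     # placeholder output prefix
    vocab_size=32000,
    model_type="unigram",      # or "bpe"
)

# Hugging Face datasets parallelizes preprocessing with num_proc,
# which is the kind of multiprocessing create_datasets.py can now use.
dataset = load_dataset("text", data_files="corpus.txt", split="train")
dataset = dataset.map(
    lambda batch: {"n_chars": [len(t) for t in batch["text"]]},
    batched=True,
    num_proc=4,
)
```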
v2.2.0
The main changes are as follows:
- jptranstokenizer is now used for tokenization (see the sketch after this list)
- It enables other word tokenizers such as Juman++, Sudachi, and spaCy LUW.
- Move from requirements.txt to pyproject.toml
- This is unstable, especially the PyTorch part, and should be changed according to your own environment.
- If you get an error in run_pretraining.py, it may be due to pydantic. Updating pydantic to the latest version may solve the problem, even though the declared version constraints do not match.
- Add Pre-mask option (see the sketch after these notes)
- To use this option, please specify --mask_style and use the --is_dataset_masked option in run_pretraining.py.
- Add DeBERTa and DeBERTaV2
- Change license from Apache 2.0 to MIT
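As a rough illustration of the tokenizer change: jptranstokenizer exposes Japanese word-level tokenizers behind a transformers-compatible interface. The snippet below is a minimal sketch; the model name is only an example, and the options for selecting word tokenizers such as Juman++ or Sudachi are described in the jptranstokenizer documentation.

```python
# Minimal sketch, assuming jptranstokenizer is installed (pip install jptranstokenizer).
# The model name is only an example; see the jptranstokenizer documentation for how
# to configure word tokenizers such as Juman++, Sudachi, or spaCy LUW.
from jptranstokenizer import JapaneseTransformerTokenizer

tokenizer = JapaneseTransformerTokenizer.from_pretrained("izumi-lab/bert-small-japanese")
print(tokenizer.tokenize("これはトークナイザのテストです。"))
```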
There are more detailed changes.
Please read README.md.
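For background on the Pre-mask option above: pre-masking means the masked-token corruption is applied once when the dataset is built, rather than on the fly during training. The snippet below is only a conceptual sketch using the transformers masking collator, not this repository's implementation; the tokenizer name and masking probability are placeholders.

```python
# Conceptual sketch of pre-masking (not this repository's implementation):
# apply MLM masking once, at dataset-creation time, and store the result,
# instead of masking dynamically inside the training loop.
from transformers import AutoTokenizer, DataCollatorForLanguageModeling

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")  # placeholder tokenizer
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True, mlm_probability=0.15)

encoding = tokenizer("A sentence to be pre-masked once and stored.", return_tensors="pt")
masked = collator([{"input_ids": encoding["input_ids"][0]}])
print(masked["input_ids"])  # ids with some positions randomly replaced by [MASK]
print(masked["labels"])     # original ids at masked positions, -100 elsewhere
```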
v2.1.0
- Add RoBERTa and DeBERTa architectures (not confirmed, only added) (8fce71d9d14f974f445f10b632d4c57dd984ee5a)
- Update dataset construction
- Reduce memory waste during pre-training (358aa61087c6712fa0c85afc35892a4d2f862a9e and 7fa51594249fe206e8aa059c94bf7eee130825f9)
- More efficient line-by-line processing (3bb47ea8f446e0e33a5e82143bd2f7393867e192)
v2.0.0
Apply Hugging Face's datasets library
https://github.com/retarfi/language-pretraining/tree/336c3699679dd59be788acc21f83188efa76b95b
New features:
- Apply datasets library
- You need to run create_datasets.py before running run_pretraining.py (see the sketch after these notes)
- Check README.md#Create Dataset for how to run create_datasets.py
- Log the losses of the ELECTRA discriminator and generator
- Additional pre-training from a checkpoint is available
- Check README.md#Additional Pre-training for detailed settings
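As a rough illustration of the two-step workflow: datasets built by create_datasets.py can be serialized once and later reloaded memory-mapped by the training script, so the whole corpus does not have to fit in RAM. The snippet below sketches only the underlying datasets-library pattern; paths are placeholders, and the actual arguments of create_datasets.py and run_pretraining.py are described in README.md.

```python
# Sketch of the datasets-library pattern behind the two-step workflow;
# paths are placeholders, not the scripts' actual arguments.
from datasets import load_dataset, load_from_disk

# Step 1 (create_datasets.py side): build and serialize the dataset once.
raw = load_dataset("text", data_files="corpus.txt", split="train")
raw.save_to_disk("dataset_dir")

# Step 2 (run_pretraining.py side): reload it memory-mapped for training,
# including when resuming additional pre-training from a checkpoint.
dataset = load_from_disk("dataset_dir")
print(dataset)
```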