FastTokenizers.py is a wrapper around Huggingface's Tokenizers library that lets its tokenizers be used with an existing version of Huggingface's Transformers. Tokenizers from the Tokenizers library are much faster than Transformers' native tokenizers.
BertTokenizerFast and DistilBertTokenizerFast are wrappers for the BERT and DistilBERT tokenizers built on the Tokenizers library. Usage is very similar to the BertTokenizer and DistilBertTokenizer classes in the Transformers library:
from FastTokenizers import DistilBertTokenizerFast, BertTokenizerFast
# Tokenizers can be initialized from a pretrained model name, without a local vocab file, as in the Transformers library.
fastDistilTokenizer = DistilBertTokenizerFast.from_pretrained('distilbert-base-uncased',
do_lower_case=True,
cache_dir=None)
fastBertTokenizer = BertTokenizerFast.from_pretrained('bert-base-uncased',
do_lower_case=True,
cache_dir=None)
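As a quick sketch of downstream usage (this assumes the wrappers expose the same tokenize/encode/decode methods as BertTokenizer and DistilBertTokenizer; the example text is arbitrary):

# Sketch only: method names assume the wrappers mirror the Transformers tokenizer API.
text = "Fast tokenization with the Tokenizers library."
tokens = fastBertTokenizer.tokenize(text)        # wordpiece tokens
input_ids = fastBertTokenizer.encode(text)       # token ids, including special tokens
decoded = fastBertTokenizer.decode(input_ids)    # ids back to a string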
LM finetuning is much faster with Tokenizers; the run_lm_finetuning.py script is updated to use FastTokenizers. Invocation and usage of the script are the same as for the original script in Huggingface's Transformers.
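For reference, a typical invocation might look like the following; the flags come from the original Transformers example script, and the file path and model name here are placeholders:

export TRAIN_FILE=/path/to/train.txt

python run_lm_finetuning.py \
    --output_dir=output \
    --model_type=bert \
    --model_name_or_path=bert-base-uncased \
    --do_train \
    --train_data_file=$TRAIN_FILE \
    --mlm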
The scripts were adapted from Huggingface's Transformers library. Inspired by Huggingface's yet-to-be-released BertTokenizerFast.