HugTokenCraft

HugTokenCraft is a user-friendly Python library for modifying the vocabulary of a PreTrainedTokenizer from HuggingFace Transformers without any additional training. So far it has been validated for BertTokenizer, which uses a WordPiece vocabulary.

Why choose HugTokenCraft?

Adding new tokens to a pre-trained tokenizer is relatively simple, but removing tokens is not straightforward. In particular, if you remove the majority of the tokens, the special token ids become inconsistent. HugTokenCraft makes both operations simple.
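To see why removal is awkward, recall that in a WordPiece vocabulary a token's id is its position in the vocabulary. The sketch below uses a toy vocabulary (not the real BERT one) and a hypothetical helper, `reindex_vocab`, to show how dropping tokens shifts the ids of everything that follows, special tokens included. It only illustrates the problem and is not HugTokenCraft's implementation.

```python
# Toy illustration: ids in a WordPiece vocabulary are positions, so removing
# tokens renumbers everything that follows. Not HugTokenCraft's actual code.

SPECIAL_TOKENS = ["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"]

def reindex_vocab(vocab, tokens_to_keep):
    """Return a new token -> id mapping with contiguous ids.

    Special tokens are always kept and the original relative order is preserved.
    `reindex_vocab` is a hypothetical helper used only for this example.
    """
    keep = set(tokens_to_keep) | set(SPECIAL_TOKENS)
    ordered = sorted((tok for tok in vocab if tok in keep), key=vocab.get)
    return {tok: new_id for new_id, tok in enumerate(ordered)}

# Toy vocabulary (hypothetical, far smaller than bert-base-uncased)
toy_vocab = {
    "[PAD]": 0, "[UNK]": 1, "the": 2, "[CLS]": 3, "[SEP]": 4,
    "cat": 5, "[MASK]": 6, "sat": 7,
}

reduced = reindex_vocab(toy_vocab, {"sat"})
print(reduced)
```

In this toy case `[CLS]` moves from id 3 to id 2, so any saved configuration that still refers to the old id would point at the wrong token.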

Useful for

  • Creating an artificial language from an existing one for language models
  • Editing existing vocabulary

Features

  • Remove tokens from a pre-trained Tokenizer
  • Add tokens to a pre-trained Tokenizer
  • Change the maximum token length
  • Works even when the majority of tokens are removed

Installation

Install from PyPI

You can install HugTokenCraft using pip:

pip install hugtokencraft

Install from source

git clone git@github.com:MDFahimAnjum/HugTokenCraft.git
cd HugTokenCraft
python setup.py install

Usage

1. Reduce vocabulary

Let's take a pre-trained BertTokenizer, which has about 30,000 tokens (30,522 for bert-base-uncased), and modify it to keep only 20 tokens

#import library
from hugtokencraft import editor
from transformers import BertTokenizer
import os

#load BertTokenizer
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

#check
initial_vocab_size=len(tokenizer)
print(f"initial vocab size: {initial_vocab_size}")

#Target vocabulary
target_vocab_size=20
selected_words=editor.get_top_tokens(tokenizer,target_vocab_size)

#parameters
current_directory = os.getcwd()
# Define the path where you want to save the tokenizer
tokenizer_path = os.path.join(current_directory,"ModifiedTokenizer")
model_max_length=128

#reduce vocabulary
modified_tokenizer=editor.reduce_vocabulary(tokenizer,selected_words)
tokenizer_path=editor.save_tokenizer(modified_tokenizer,tokenizer_path,model_max_length)
modified_tokenizer=editor.load_tokenizer(type(tokenizer),tokenizer_path)

#check
new_vocab_size=len(modified_tokenizer)
print(f"new vocab size: {new_vocab_size} words")
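After such a drastic reduction, most input words can no longer be built from the remaining word pieces and are mapped to `[UNK]`. The following is a simplified sketch of the greedy longest-match-first splitting that WordPiece tokenizers use, run on two toy vocabularies. The function `wordpiece_tokenize` and both vocabularies are assumptions made for illustration; the real BertTokenizer also applies lowercasing, punctuation splitting and other preprocessing.

```python
# Simplified WordPiece matching on a single word (illustration only).

def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first split of a single word into word pieces."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece  # continuation pieces carry the ## prefix
            if piece in vocab:
                match = piece
                break
            end -= 1
        if match is None:
            return [unk_token]  # no piece fits: the whole word becomes [UNK]
        pieces.append(match)
        start = end
    return pieces

# Toy vocabularies (hypothetical)
full_vocab = {"[UNK]", "the", "play", "##ing", "##ed"}
reduced_vocab = {"[UNK]", "the", "play"}

print(wordpiece_tokenize("playing", full_vocab))     # ['play', '##ing']
print(wordpiece_tokenize("playing", reduced_vocab))  # ['[UNK]']
```

This is worth keeping in mind when choosing which tokens to keep: dropping a continuation piece such as `##ing` affects every word that relied on it.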

2. Expand vocabulary

Let's take a pre-trained BertTokenizer and add two new tokens

#import library
from hugtokencraft import editor
from transformers import BertTokenizer
import os

#load BertTokenizer
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

#check
initial_vocab_size=len(tokenizer)
print(f"initial vocab size: {initial_vocab_size}")

#Target vocabulary
selected_words_add={'hugtoken','hugtokencraft'}

#parameters
current_directory = os.getcwd()
# Define the path where you want to save the tokenizer
tokenizer_path = os.path.join(current_directory,"ModifiedTokenizer")


#expand vocabulary
modified_tokenizer=editor.expand_vocabulary(tokenizer,selected_words_add)
tokenizer_path=editor.save_tokenizer(modified_tokenizer,tokenizer_path,model_max_length=None,isreduced=False)
modified_tokenizer=editor.load_tokenizer(type(tokenizer),tokenizer_path)

#check
new_vocab_size=len(modified_tokenizer)
print(f"new vocab size: {new_vocab_size}")

Notebook Example

You can also run the examples directly from the Jupyter notebook example_notebook.ipynb

Documentation

get_top_tokens()

Obtains the k most frequently used tokens from the tokenizer's vocabulary.

Syntax

token_set=get_top_tokens(tokenizer,k)

Parameters

  • tokenizer: BertTokenizer
    • Pre-trained Bert Tokenizer
  • k: int
    • Desired number of tokens

Returns

  • token_set: set
    • Set of k most frequent tokens
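If you would rather choose the tokens to keep from your own data, one option is to count token frequencies yourself and pass the resulting set as `tokens_to_keep` to `reduce_vocabulary()`. The sketch below is an assumption-laden alternative, not how `get_top_tokens()` works internally: `top_k_tokens` is a hypothetical helper, and `str.split` stands in for a real tokenizer's `tokenize` method.

```python
from collections import Counter

def top_k_tokens(texts, tokenize, k):
    """Return the set of the k most frequent tokens in `texts`.

    `tokenize` is any callable that maps a string to a list of tokens.
    Hypothetical helper, not part of HugTokenCraft.
    """
    counts = Counter()
    for text in texts:
        counts.update(tokenize(text))
    return {tok for tok, _ in counts.most_common(k)}

# Tiny toy corpus; whitespace splitting stands in for a real tokenizer
corpus = ["the cat sat", "the cat ran", "the dog"]
selected = top_k_tokens(corpus, str.split, k=2)
print(selected)
```

With a real tokenizer you would pass `tokenizer.tokenize` instead of `str.split`, so that the counted units are word pieces from the vocabulary.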

expand_vocabulary()

Adds a set of new tokens to the vocabulary

Syntax

modified_tokenizer=expand_vocabulary(tokenizer,tokens_to_add)

Parameters

  • tokenizer: BertTokenizer
    • Pre-trained Bert Tokenizer
  • tokens_to_add: set
    • Set of tokens to add

Returns

  • modified_tokenizer: BertTokenizer
    • Modified Bert Tokenizer

reduce_vocabulary()

Removes all tokens except the given set of tokens from the vocabulary

Syntax

modified_tokenizer=reduce_vocabulary(tokenizer,tokens_to_keep)

Parameters

  • tokenizer: BertTokenizer
    • Pre-trained Bert Tokenizer
  • tokens_to_keep: set
    • Set of tokens to keep

Returns

  • modified_tokenizer: BertTokenizer
    • Modified Bert Tokenizer

save_tokenizer()

Saves the modified tokenizer for use

Syntax

tokenizer_path=save_tokenizer(tokenizer,tokenizer_path,model_max_length=None,isreduced=True)

Parameters

  • tokenizer: BertTokenizer
    • Modified Bert Tokenizer to save
  • tokenizer_path: str
    • Location path to save the tokenizer
  • model_max_length: int
    • New value of maximum token length
    • Defaults to None which means no change
  • isreduced: bool
    • Whether the modified tokenizer was reduced
    • True if vocabulary was reduced (Default)
    • False if vocabulary was expanded

Returns

  • tokenizer_path: str
    • Path where the tokenizer was saved
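A BertTokenizer normally stores its vocabulary as a plain text file named vocab.txt, with one token per line and the zero-based line number as the token id. Assuming `save_tokenizer()` writes the standard BertTokenizer files into `tokenizer_path` (an assumption, not something stated above), a saved vocabulary can be inspected with the standard library alone. `read_vocab` is a hypothetical helper, and the example writes its own toy file so that it runs without a saved tokenizer.

```python
import os
import tempfile

def read_vocab(tokenizer_dir):
    """Read vocab.txt from a directory into a token -> id mapping.

    Hypothetical helper; assumes the one-token-per-line BERT vocab format.
    """
    vocab_file = os.path.join(tokenizer_dir, "vocab.txt")
    with open(vocab_file, encoding="utf-8") as f:
        return {line.rstrip("\n"): idx for idx, line in enumerate(f)}

# Write a toy vocab.txt to a temporary directory and read it back
with tempfile.TemporaryDirectory() as tmp_dir:
    toy_tokens = ["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]", "the"]
    with open(os.path.join(tmp_dir, "vocab.txt"), "w", encoding="utf-8") as f:
        f.write("\n".join(toy_tokens) + "\n")
    vocab = read_vocab(tmp_dir)

print(len(vocab), vocab["[UNK]"])  # 6 1
```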

load_tokenizer()

Loads a tokenizer from a given path

Syntax

tokenizer=load_tokenizer(tokenizer_class,tokenizer_path)

Parameters

  • tokenizer_class: type
    • Class type of Tokenizer
  • tokenizer_path: str
    • Location path to load the tokenizer from

Returns

  • tokenizer: tokenizer_class
    • Tokenizer

validate_tokenizer()

Simple sanity check for tokenizer

Syntax

is_pass=validate_tokenizer(tokenizer)

Parameters

  • tokenizer: BertTokenizer
    • Pre-trained Bert Tokenizer

Returns

  • is_pass: bool
    • Validation result
    • True: validation passed
    • False: validation failed
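The checks that `validate_tokenizer()` performs are not listed here. As a rough idea of what a sanity check on an edited vocabulary could look at, the sketch below verifies that the usual BERT special tokens are present and that the ids are contiguous from 0. `check_vocab` is a hypothetical helper and does not reproduce the library's own validation.

```python
REQUIRED_SPECIAL_TOKENS = ("[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]")

def check_vocab(vocab):
    """Return a list of problems found in a token -> id mapping (empty if none).

    Hypothetical helper, not HugTokenCraft's validate_tokenizer().
    """
    problems = []
    for tok in REQUIRED_SPECIAL_TOKENS:
        if tok not in vocab:
            problems.append(f"missing special token {tok}")
    # Ids should be exactly 0, 1, ..., len(vocab) - 1
    if sorted(vocab.values()) != list(range(len(vocab))):
        problems.append("ids are not contiguous from 0")
    return problems

# Toy vocabularies (hypothetical)
good = {"[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3, "[MASK]": 4, "the": 5}
bad = {"[PAD]": 0, "[UNK]": 1, "[CLS]": 3, "[SEP]": 4, "the": 7}

print(check_vocab(good))
print(check_vocab(bad))
```

For a real tokenizer, the mapping can be obtained with `tokenizer.get_vocab()`, which is part of the HuggingFace tokenizer API.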

License

This project is licensed under the MIT License - see the LICENSE file for details.

Contributing

We welcome contributions!