A SentencePiece-based tokenizer for production use with AI21's models.
- If you wish to use the tokenizers for Jamba 1.5 Mini or Jamba 1.5 Large, you will need to request access to the relevant model's Hugging Face repo:
  - https://huggingface.co/ai21labs/AI21-Jamba-1.5-Mini
  - https://huggingface.co/ai21labs/AI21-Jamba-1.5-Large
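Once access has been granted, authenticate locally so the gated tokenizer files can be downloaded. A minimal sketch using the huggingface_hub package (the HF_TOKEN environment variable name is an assumption):
import os
from huggingface_hub import login

# Authenticate so the gated Jamba tokenizer files can be downloaded.
# Assumes your access token is exported as HF_TOKEN.
login(token=os.environ["HF_TOKEN"])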
Install the package with pip:
pip install ai21-tokenizer
or with poetry:
poetry add ai21-tokenizer
To create a Jamba 1.5 Mini tokenizer:
from ai21_tokenizer import Tokenizer, PreTrainedTokenizers
tokenizer = Tokenizer.get_tokenizer(PreTrainedTokenizers.JAMBA_1_5_MINI_TOKENIZER)
# Your code here
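As a sketch of what could replace # Your code here, counting the tokens in a prompt (the prompt text is illustrative):
prompt = "apple orange banana"  # illustrative text
ids = tokenizer.encode(prompt)
print(f"The prompt is {len(ids)} tokens long")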
Another way would be to use our Jamba 1.5 Mini tokenizer directly:
from ai21_tokenizer import Jamba1_5Tokenizer
model_path = "<Path to your vocabs file>"
tokenizer = Jamba1_5Tokenizer(model_path=model_path)
# Your code here
For async usage:
from ai21_tokenizer import Tokenizer, PreTrainedTokenizers
tokenizer = await Tokenizer.get_async_tokenizer(PreTrainedTokenizers.JAMBA_1_5_MINI_TOKENIZER)
# Your code here
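get_async_tokenizer must be awaited inside a running event loop; a minimal self-contained sketch using asyncio (the main() wrapper is illustrative):
import asyncio
from ai21_tokenizer import Tokenizer, PreTrainedTokenizers

async def main():
    # Create the async tokenizer and await its coroutine methods.
    tokenizer = await Tokenizer.get_async_tokenizer(PreTrainedTokenizers.JAMBA_1_5_MINI_TOKENIZER)
    ids = await tokenizer.encode("apple orange banana")
    print(ids)

asyncio.run(main())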
To create a Jamba 1.5 Large tokenizer:
from ai21_tokenizer import Tokenizer, PreTrainedTokenizers
tokenizer = Tokenizer.get_tokenizer(PreTrainedTokenizers.JAMBA_1_5_LARGE_TOKENIZER)
# Your code here
Another way would be to use our Jamba 1.5 Large tokenizer directly:
from ai21_tokenizer import Jamba1_5Tokenizer
model_path = "<Path to your vocabs file>"
tokenizer = Jamba1_5Tokenizer(model_path=model_path)
# Your code here
For async usage:
from ai21_tokenizer import Tokenizer, PreTrainedTokenizers
tokenizer = await Tokenizer.get_async_tokenizer(PreTrainedTokenizers.JAMBA_1_5_LARGE_TOKENIZER)
# Your code here
To create a Jamba Instruct tokenizer:
from ai21_tokenizer import Tokenizer, PreTrainedTokenizers
tokenizer = Tokenizer.get_tokenizer(PreTrainedTokenizers.JAMBA_INSTRUCT_TOKENIZER)
# Your code here
Another way would be to use our Jamba tokenizer directly:
from ai21_tokenizer import JambaInstructTokenizer
model_path = "<Path to your vocabs file>"
tokenizer = JambaInstructTokenizer(model_path=model_path)
# Your code here
For async usage:
from ai21_tokenizer import Tokenizer, PreTrainedTokenizers
tokenizer = await Tokenizer.get_async_tokenizer(PreTrainedTokenizers.JAMBA_INSTRUCT_TOKENIZER)
# Your code here
Another way would be to use our async Jamba tokenizer class method create:
from ai21_tokenizer import AsyncJambaInstructTokenizer
model_path = "<Path to your vocabs file>"
tokenizer = await AsyncJambaInstructTokenizer.create(model_path=model_path)
# Your code here
To create the default Jurassic tokenizer:
from ai21_tokenizer import Tokenizer
tokenizer = Tokenizer.get_tokenizer()
# Your code here
Another way would be to use our Jurassic tokenizer directly:
from ai21_tokenizer import JurassicTokenizer
model_path = "<Path to your vocabs file. This is usually a binary file that ends with .model>"
config = {}  # dictionary object of your config.json file
tokenizer = JurassicTokenizer(model_path=model_path, config=config)
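The config dict typically mirrors the tokenizer's config.json; a sketch of loading it from disk (the file location is an assumption):
import json
from pathlib import Path

# Load tokenizer settings from a config.json next to the vocab file.
config = json.loads(Path("config.json").read_text())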
For async usage:
from ai21_tokenizer import Tokenizer
tokenizer = await Tokenizer.get_async_tokenizer()
# Your code here
Another way would be to use our async Jurassic tokenizer class method create:
from ai21_tokenizer import AsyncJurassicTokenizer
model_path = "<Path to your vocabs file. This is usually a binary file that ends with .model>"
config = {}  # dictionary object of your config.json file
tokenizer = await AsyncJurassicTokenizer.create(model_path=model_path, config=config)
# Your code here
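As with the other async helpers, create must be awaited inside a running event loop; a minimal sketch (the path and empty config are placeholders and must point at real values):
import asyncio
from ai21_tokenizer import AsyncJurassicTokenizer

async def main():
    # Placeholder path and config, as above.
    tokenizer = await AsyncJurassicTokenizer.create(model_path="<Path to your vocabs file>", config={})
    print(await tokenizer.encode("apple orange banana"))

asyncio.run(main())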
These functions allow you to encode your text to a list of token IDs and decode it back to plaintext:
text_to_encode = "apple orange banana"
encoded_text = tokenizer.encode(text_to_encode)
print(f"Encoded text: {encoded_text}")
decoded_text = tokenizer.decode(encoded_text)
print(f"Decoded text: {decoded_text}")
The async tokenizer exposes the same methods as coroutines:
# Assuming you have created an async tokenizer
text_to_encode = "apple orange banana"
encoded_text = await tokenizer.encode(text_to_encode)
print(f"Encoded text: {encoded_text}")
decoded_text = await tokenizer.decode(encoded_text)
print(f"Decoded text: {decoded_text}")
You can also convert between token IDs and their token strings:
tokens = tokenizer.convert_ids_to_tokens(encoded_text)
print(f"IDs correspond to tokens: {tokens}")
ids = tokenizer.convert_tokens_to_ids(tokens)
And the async variants:
# Assuming you have created an async tokenizer
tokens = await tokenizer.convert_ids_to_tokens(encoded_text)
print(f"IDs corresponds to Tokens: {tokens}")
ids = tokenizer.convert_tokens_to_ids(tokens)
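Since SentencePiece pieces map one-to-one to IDs, the conversions round-trip; a small self-contained sanity check using the default tokenizer:
from ai21_tokenizer import Tokenizer

tokenizer = Tokenizer.get_tokenizer()
ids = tokenizer.encode("apple orange banana")
tokens = tokenizer.convert_ids_to_tokens(ids)
# Pieces and IDs map one-to-one, so converting back recovers the original IDs.
assert tokenizer.convert_tokens_to_ids(tokens) == ids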
For more examples, please see our examples folder.