A SentencePiece-based tokenizer for production use with AI21's models.
- If you wish to use the tokenizers for Jamba 1.5 Mini or Jamba 1.5 Large, you will need to request access to the relevant model's Hugging Face repo:
  - https://huggingface.co/ai21labs/AI21-Jamba-1.5-Mini
  - https://huggingface.co/ai21labs/AI21-Jamba-1.5-Large
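Once access has been granted, authenticate locally so the gated tokenizer files can be downloaded. A minimal sketch using the huggingface_hub package (the HF_TOKEN environment variable name is an assumption):
import os
from huggingface_hub import login

# Authenticate so the gated Jamba tokenizer files can be downloaded.
# Assumes your access token is exported as HF_TOKEN.
login(token=os.environ["HF_TOKEN"])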
Install the package with pip:
pip install ai21-tokenizer
or with poetry:
poetry add ai21-tokenizer
To create a Jamba 1.5 Mini tokenizer:
from ai21_tokenizer import Tokenizer, PreTrainedTokenizers
tokenizer = Tokenizer.get_tokenizer(PreTrainedTokenizers.JAMBA_1_5_MINI_TOKENIZER)
# Your code here
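As a sketch of what could replace # Your code here, counting the tokens in a prompt (the prompt text is illustrative):
prompt = "apple orange banana"  # illustrative text
ids = tokenizer.encode(prompt)
print(f"The prompt is {len(ids)} tokens long")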
Another way would be to use our Jamba 1.5 Mini tokenizer directly:
from ai21_tokenizer import Jamba1_5Tokenizer
model_path = "<Path to your vocabs file>"
tokenizer = Jamba1_5Tokenizer(model_path=model_path)
# Your code here
For async usage:
from ai21_tokenizer import Tokenizer, PreTrainedTokenizers
tokenizer = await Tokenizer.get_async_tokenizer(PreTrainedTokenizers.JAMBA_1_5_MINI_TOKENIZER)
# Your code here
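get_async_tokenizer must be awaited inside a running event loop; a minimal self-contained sketch using asyncio (the main() wrapper is illustrative):
import asyncio
from ai21_tokenizer import Tokenizer, PreTrainedTokenizers

async def main():
    # Create the async tokenizer and await its coroutine methods.
    tokenizer = await Tokenizer.get_async_tokenizer(PreTrainedTokenizers.JAMBA_1_5_MINI_TOKENIZER)
    ids = await tokenizer.encode("apple orange banana")
    print(ids)

asyncio.run(main())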
To create a Jamba 1.5 Large tokenizer:
from ai21_tokenizer import Tokenizer, PreTrainedTokenizers
tokenizer = Tokenizer.get_tokenizer(PreTrainedTokenizers.JAMBA_1_5_LARGE_TOKENIZER)
# Your code here
Another way would be to use our Jamba 1.5 Large tokenizer directly:
from ai21_tokenizer import Jamba1_5Tokenizer
model_path = "<Path to your vocabs file>"
tokenizer = Jamba1_5Tokenizer(model_path=model_path)
# Your code here
For async usage:
from ai21_tokenizer import Tokenizer, PreTrainedTokenizers
tokenizer = await Tokenizer.get_async_tokenizer(PreTrainedTokenizers.JAMBA_1_5_LARGE_TOKENIZER)
# Your code here
To create a Jamba Instruct tokenizer:
from ai21_tokenizer import Tokenizer, PreTrainedTokenizers
tokenizer = Tokenizer.get_tokenizer(PreTrainedTokenizers.JAMBA_INSTRUCT_TOKENIZER)
# Your code here
Another way would be to use our Jamba tokenizer directly:
from ai21_tokenizer import JambaInstructTokenizer
model_path = "<Path to your vocabs file>"
tokenizer = JambaInstructTokenizer(model_path=model_path)
# Your code here
For async usage:
from ai21_tokenizer import Tokenizer, PreTrainedTokenizers
tokenizer = await Tokenizer.get_async_tokenizer(PreTrainedTokenizers.JAMBA_INSTRUCT_TOKENIZER)
# Your code here
Another way would be to use our async Jamba tokenizer class method create:
from ai21_tokenizer import AsyncJambaInstructTokenizer
model_path = "<Path to your vocabs file>"
tokenizer = await AsyncJambaInstructTokenizer.create(model_path=model_path)
# Your code here
To create the default Jurassic tokenizer:
from ai21_tokenizer import Tokenizer
tokenizer = Tokenizer.get_tokenizer()
# Your code here
Another way would be to use our Jurassic tokenizer directly:
from ai21_tokenizer import JurassicTokenizer
model_path = "<Path to your vocabs file. This is usually a binary file that ends with .model>"
config = {}  # dictionary object of your config.json file
tokenizer = JurassicTokenizer(model_path=model_path, config=config)
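The config dict typically mirrors the tokenizer's config.json; a sketch of loading it from disk (the file location is an assumption):
import json
from pathlib import Path

# Load tokenizer settings from a config.json next to the vocab file.
config = json.loads(Path("config.json").read_text())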
For async usage:
from ai21_tokenizer import Tokenizer
tokenizer = await Tokenizer.get_async_tokenizer()
# Your code here
Another way would be to use our async Jurassic tokenizer class method create:
from ai21_tokenizer import AsyncJurassicTokenizer
model_path = "<Path to your vocabs file. This is usually a binary file that ends with .model>"
config = {}  # dictionary object of your config.json file
tokenizer = await AsyncJurassicTokenizer.create(model_path=model_path, config=config)
# Your code here
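As with the other async helpers, create must be awaited inside a running event loop; a minimal sketch (the path and empty config are placeholders and must point at real values):
import asyncio
from ai21_tokenizer import AsyncJurassicTokenizer

async def main():
    # Placeholder path and config, as above.
    tokenizer = await AsyncJurassicTokenizer.create(model_path="<Path to your vocabs file>", config={})
    print(await tokenizer.encode("apple orange banana"))

asyncio.run(main())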
These functions allow you to encode your text to a list of token IDs and decode it back to plaintext:
text_to_encode = "apple orange banana"
encoded_text = tokenizer.encode(text_to_encode)
print(f"Encoded text: {encoded_text}")
decoded_text = tokenizer.decode(encoded_text)
print(f"Decoded text: {decoded_text}")
The async tokenizer exposes the same methods as coroutines:
# Assuming you have created an async tokenizer
text_to_encode = "apple orange banana"
encoded_text = await tokenizer.encode(text_to_encode)
print(f"Encoded text: {encoded_text}")
decoded_text = await tokenizer.decode(encoded_text)
print(f"Decoded text: {decoded_text}")
You can also convert between token IDs and their token strings:
tokens = tokenizer.convert_ids_to_tokens(encoded_text)
print(f"IDs correspond to tokens: {tokens}")
ids = tokenizer.convert_tokens_to_ids(tokens)
And the async variants:
# Assuming you have created an async tokenizer
tokens = await tokenizer.convert_ids_to_tokens(encoded_text)
print(f"IDs corresponds to Tokens: {tokens}")
ids = tokenizer.convert_tokens_to_ids(tokens)
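Since SentencePiece pieces map one-to-one to IDs, the conversions round-trip; a small self-contained sanity check using the default tokenizer:
from ai21_tokenizer import Tokenizer

tokenizer = Tokenizer.get_tokenizer()
ids = tokenizer.encode("apple orange banana")
tokens = tokenizer.convert_ids_to_tokens(ids)
# Pieces and IDs map one-to-one, so converting back recovers the original IDs.
assert tokenizer.convert_tokens_to_ids(tokens) == ids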
For more examples, please see our examples folder.