tiktoken-chatml

Adds ChatML chat template support to tiktoken tokenizers:

Remaps or removes OpenAI special tokens so that only the ChatML special tokens <|im_start|> and <|im_end|> are supported;
Keeps the original vocabulary size where possible;
Adds the apply_chat_template method known from HF tokenizers;
Retains the full functionality of the tiktoken tokenizer.
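
One way to read the first two goals is that the ChatML tokens take over IDs already reserved for special tokens, so no ID outside the original range is introduced. The sketch below illustrates that idea on a plain dictionary. The IDs listed are the special tokens of tiktoken's cl100k_base; the helper name remap_special_tokens and the choice of which ID each ChatML token receives are assumptions made for illustration, not necessarily the mapping this package uses.

```python
# Special tokens of tiktoken's cl100k_base encoding and their IDs.
CL100K_SPECIAL = {
    "<|endoftext|>": 100257,
    "<|fim_prefix|>": 100258,
    "<|fim_middle|>": 100259,
    "<|fim_suffix|>": 100260,
    "<|endofprompt|>": 100276,
}


def remap_special_tokens(original, new_names):
    """Hypothetical helper: assign new_names to the lowest IDs that the
    original special tokens occupied, so no new ID is created."""
    reserved_ids = sorted(original.values())
    if len(new_names) > len(reserved_ids):
        raise ValueError("not enough reserved special-token IDs to reuse")
    return dict(zip(new_names, reserved_ids))


# Assumed assignment order; the real package may pick different IDs.
chatml_special = remap_special_tokens(
    CL100K_SPECIAL, ["<|im_start|>", "<|im_end|>"]
)
print(chatml_special)
```

Whatever mapping you end up with, compare it against the original special tokens before training, as recommended below.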

Intended for training models from scratch. For the safety of your model, review all changes before use.

Installation

pip install tiktoken-chatml

Quickstart

import tiktoken_chatml

enc = tiktoken_chatml.get_encoding("cl100k_base-chatml")

output = enc.apply_chat_template(
    [
        {"role": "system", "content": "This is a system message."},
        {"role": "user", "content": "Hello!"},
    ],
    tokenize=False,
)
print(output)

Output:

<|im_start|>system
This is a system message.
<|im_end|>
<|im_start|>user
Hello!
<|im_end|>
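
To make the template layout explicit, here is a minimal pure-Python sketch that mirrors the layout of the output shown above: each message becomes an <|im_start|> line carrying the role, the content on its own line, and a closing <|im_end|> line. The function format_chatml is a hypothetical stand-in written for illustration; it is not the package's implementation, and details such as trailing newlines may differ.

```python
def format_chatml(messages):
    """Hypothetical helper: render role/content dicts in the ChatML
    layout shown in the Quickstart output."""
    blocks = []
    for message in messages:
        # One block per message: start marker with role, content, end marker.
        blocks.append(
            f"<|im_start|>{message['role']}\n{message['content']}\n<|im_end|>"
        )
    return "\n".join(blocks)


text = format_chatml(
    [
        {"role": "system", "content": "This is a system message."},
        {"role": "user", "content": "Hello!"},
    ]
)
print(text)
```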

Setting tokenize=True passes the formatted text to tiktoken's Encoding.encode() and returns token IDs instead of a string.

You can use this encoding as a drop-in replacement for a tiktoken encoding.

Supported encodings:

SUPPORTED_ENCODINGS = ["o200k_base-chatml", "cl100k_base-chatml", "gpt2-chatml"]

The eot_token is now <|im_end|>:

>>> enc.eot_token == enc.encode("<|im_end|>", allowed_special="all")[0]
True
