
Community Resource: AutoTikTokenizer - A Bridge Between TikToken and HuggingFace Tokenizers #358

Open
bhavnicksm opened this issue Nov 7, 2024 · 2 comments

@bhavnicksm

Hi TikToken team! 👋

I wanted to share a community resource that might be helpful for TikToken users who also work with HuggingFace tokenizers. I've created AutoTikTokenizer, a lightweight library that allows loading any HuggingFace tokenizer as a TikToken-compatible encoder.

What it does:

  • Enables using TikToken's fast tokenization with any HuggingFace tokenizer
  • Preserves exact encoding/decoding compatibility with original tokenizers
  • Simple drop-in usage similar to HuggingFace's AutoTokenizer

Quick example:

from autotiktokenizer import AutoTikTokenizer

# Load any HF tokenizer as a TikToken encoder
encoder = AutoTikTokenizer.from_pretrained('gpt2')
tokens = encoder.encode("Hello world!")
text = encoder.decode(tokens)

The library is available on PyPI (pip install autotiktokenizer) and is fully open source at: https://github.com/bhavnicksm/autotiktokenizer

I've tested it with several popular models including GPT-2, LLaMA, Mistral, and others. I hope this helps TikToken users who want to work with a broader range of tokenizers while keeping TikToken's performance benefits!

Feel free to check it out if you think it would be useful for the community. Happy to hear any feedback or suggestions!

[Note: This is purely a community contribution - I'm not affiliated with the TikToken team]

@idruker-cerence

idruker-cerence commented Dec 15, 2024

Dear Bhavnick Minhas!

For months I've been searching for any documentation describing the format of the "vocab" section of tokenizer.json, or any sane code showing how to interpret it. Your code is a perfect example. Where have you been all this time? I'm so thankful for your work!
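For readers wondering about the same thing, here is a minimal sketch of the structure in question. It uses only the standard library and an illustrative inline excerpt (real tokenizer.json files from HuggingFace contain many more fields, such as normalizers and pre-tokenizers); the token strings and ids below are made up for illustration.

```python
import json

# Illustrative excerpt of a BPE-style tokenizer.json.
# Real files have additional sections (normalizer, pre_tokenizer, etc.).
tokenizer_json = """
{
  "model": {
    "type": "BPE",
    "vocab": {"hello": 0, "world": 1, "!": 2},
    "merges": ["h e", "he l"]
  }
}
"""

data = json.loads(tokenizer_json)

# "model.vocab" maps each token string to its integer id.
vocab = data["model"]["vocab"]

# Inverting it gives the id -> token mapping used for decoding.
id_to_token = {token_id: token for token, token_id in vocab.items()}

print(vocab["world"])    # 1
print(id_to_token[2])    # !
```

Decoding a sequence of ids is then just a lookup through `id_to_token` (plus any byte-level or normalization handling the specific tokenizer applies on top).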

@bhavnicksm (Author)

Hey @idruker-cerence!

I'm glad to hear that~ 😊

Please let me know if you have any questions about the implementation details as well; happy to clarify and share resources.

And, always open to feedback!

Thanks! ☺️
