Python API with C extensions for faster training and encoding #85

Open
benarnav opened this issue Jun 27, 2024 · 0 comments
I've been working on a C implementation based on the GPT-2 paper: bytephase

It provides a Python API with C extensions that significantly speed up training and encoding. The project has a thorough README and complete docstrings for all class methods.
bytephase would be a good addition given the second point in the todo section: "write an even more optimized C or Rust version (think through)"
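For context, the hot loop that a C extension accelerates is the repeated pair-counting and merging at the heart of BPE training. Below is a generic pure-Python sketch of one merge step (an illustration of the algorithm in general, not bytephase's actual internals):

```python
from collections import Counter

def most_common_pair(ids):
    # Count adjacent id pairs and return the most frequent one.
    return Counter(zip(ids, ids[1:])).most_common(1)[0][0]

def merge(ids, pair, new_id):
    # Replace every non-overlapping occurrence of `pair` with `new_id`.
    out, i = [], 0
    while i < len(ids):
        if i < len(ids) - 1 and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out

# One iteration of a toy training loop over raw bytes.
ids = list("aaabdaaabac".encode("utf-8"))
pair = most_common_pair(ids)   # (97, 97), i.e. the byte pair "aa"
ids = merge(ids, pair, 256)    # first new token id after the 256 byte values
```

Training repeats this until the vocabulary reaches `vocab_size`; doing the counting and merging in C is what makes the speedup possible.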

I'd like to add bytephase to the community extensions section of the README, encouraging more developers to review and contribute to this implementation and possibly build more features into it (e.g. GPT-4 support, loading other pretrained tokenizers).

Example usage:

from bytephase import Tokenizer

# Initialize and train
tokenizer = Tokenizer()
# OR select a custom regex pattern, defaults to the GPT2 pattern
custom_pattern = r'\w+|\s+|[^\w\s]+'
tokenizer = Tokenizer(pattern=custom_pattern)

tokenizer.train("path/to/your_data.txt", vocab_size=10000)

# Encode
encoded = tokenizer.encode("Hello, world!")
# [1869, 574, 111, 44, 1560, 33]

# Decode
decoded = tokenizer.decode(encoded)
# "Hello, world!"

# Save and load
tokenizer.save("saved_tokenizer")
tokenizer.load("saved_tokenizer.bpe")
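To illustrate what the custom pattern above actually does, here is how Python's standard-library `re` module splits text with it (stdlib only, shown for intuition rather than as bytephase's code path):

```python
import re

# The custom pattern from the example: runs of word characters,
# runs of whitespace, or runs of other symbols.
custom_pattern = r'\w+|\s+|[^\w\s]+'

chunks = re.findall(custom_pattern, "Hello, world!")
# ['Hello', ',', ' ', 'world', '!']
```

BPE merges then operate within each chunk, so the pattern controls which character sequences can ever merge into a single token.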