I've been working on a C version based on the GPT-2 paper: bytephase

It provides a Python API with C extensions that significantly speed up training and encoding. The project has a thorough README and complete docstrings for all class methods. bytephase would be a good addition given the 2nd point in the todo section: "write an even more optimized C or Rust version (think through)".

I'd like to add bytephase to the community extensions section of the README, encouraging more developers to review and contribute to this implementation and possibly build more features into it (e.g. GPT-4 support, loading other pretrained tokenizers).
Example usage:
```python
from bytephase import Tokenizer

# Initialize and train
tokenizer = Tokenizer()

# OR select a custom regex pattern, defaults to the GPT2 pattern
custom_pattern = r'\w+|\s+|[^\w\s]+'
tokenizer = Tokenizer(pattern=custom_pattern)

tokenizer.train("path/to/your_data.txt", vocab_size=10000)

# Encode
encoded = tokenizer.encode("Hello, world!")
# [1869, 574, 111, 44, 1560, 33]

# Decode
decoded = tokenizer.decode(encoded)
# "Hello, world!"

# Save and load
tokenizer.save("saved_tokenizer")
tokenizer.load("saved_tokenizer.bpe")
```
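For reviewers, here is a minimal round-trip sanity check built only from the calls shown above; it assumes the bytephase Tokenizer API behaves as in that snippet (train/encode/decode on a byte-level BPE, so decoding is lossless), and the training file path is a placeholder to substitute with a real corpus.

```python
# Round-trip sanity check using only the Tokenizer calls demonstrated above.
# "path/to/your_data.txt" is a placeholder for an actual training corpus.
from bytephase import Tokenizer

tokenizer = Tokenizer()
tokenizer.train("path/to/your_data.txt", vocab_size=10000)

text = "Hello, world!"
encoded = tokenizer.encode(text)
decoded = tokenizer.decode(encoded)

# Byte-level BPE should reconstruct the input exactly
assert decoded == text
```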