Train BasicTokenizer on GPU with PyTorch, 100x speedup #38
Using an H100 and int16, it's now a 108x speedup over the original implementation on an M2 Air.
Ok, I'll step through this soon to take a look.
Thanks for the feedback! I made the diff more surgical. Now the only added files are:
And the following files are lightly modified:
The following files are added:
- `merge_torch`
- `BasicTokenizerTorch`, overrides the `train` and `encode` methods of `BasicTokenizer`
- `RegexTokenizerTorch`, overrides the `encode_ordinary` method of `RegexTokenizer`
- `GPT4TokenizerTorch`, mostly inherits from `GPT4Tokenizer`, but uses `RegexTokenizerTorch`'s `encode` method
- `BasicTokenizerTorch`
The following files are modified:
It takes 67.4 seconds on an H100 80GB SXM5 to train the `BasicTokenizerTorch` with a vocab_size of 512 on 308MB of Enron emails. The original code takes 2hrs 15min on an M2 Air with Python 3.11 to do this.

I'm not sure if `RegexTokenizerTorch` or `GPT4TokenizerTorch` can benefit much from PyTorch since there are many chunks of varying lengths, i.e. a "ragged tensor". These tokenizers are helpful for sanity checks though. For example, the `test_gpt4_tiktoken_equality` tests all pass, suggesting that `merge_torch` is correctly implemented.

I also made a new repository minbpe-pytorch in case adding PyTorch support is beyond the scope of this project.
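
For readers who want a feel for how a BPE merge step can be vectorized, here is a minimal sketch. It is not the `merge_torch` from this PR: the function names `merge_pairs_sketch` and `most_common_pair_sketch` are hypothetical, and the approach (boolean masks over adjacent positions, plus a run-parity trick for pairs like `(a, a)`) is just one way it could be done. It assumes a 1-D integer tensor of token ids and mirrors the left-to-right, non-overlapping merge behavior of a plain Python loop.

```python
from typing import Tuple

import torch


def most_common_pair_sketch(ids: torch.Tensor) -> Tuple[int, int]:
    """Return the most frequent adjacent pair in a 1-D tensor of token ids.

    Hypothetical helper. Ties may be broken differently than in a
    dict-based Python implementation.
    """
    pairs = torch.stack((ids[:-1], ids[1:]), dim=1)  # shape (n - 1, 2)
    unique, counts = torch.unique(pairs, dim=0, return_counts=True)
    best = unique[torch.argmax(counts)]
    return int(best[0]), int(best[1])


def merge_pairs_sketch(ids: torch.Tensor, pair: Tuple[int, int], idx: int) -> torch.Tensor:
    """Replace non-overlapping occurrences of `pair` with `idx`, left to right."""
    if ids.numel() < 2:
        return ids.clone()

    # matches[i] is True where (ids[i], ids[i + 1]) == pair.
    matches = (ids[:-1] == pair[0]) & (ids[1:] == pair[1])

    # Matches can only overlap when pair[0] == pair[1] (e.g. "a a a").
    # Within each run of consecutive matches, keep every second one,
    # starting from the first, which is what a left-to-right loop does.
    prev = torch.zeros_like(matches)
    prev[1:] = matches[:-1]
    run_starts = matches & ~prev
    positions = torch.arange(matches.numel(), device=ids.device)
    start_pos = torch.where(run_starts, positions, torch.zeros_like(positions))
    start_pos = torch.cummax(start_pos, dim=0).values
    keep = matches & ((positions - start_pos) % 2 == 0)

    # Overwrite the first element of each kept pair, drop the second.
    first = torch.zeros_like(ids, dtype=torch.bool)
    first[:-1] = keep
    second = torch.zeros_like(ids, dtype=torch.bool)
    second[1:] = keep
    return ids.masked_fill(first, idx)[~second]
```

A training loop would alternate the two calls, assigning a new id on each iteration. Moving `ids` to a GPU with `.to("cuda")` before the loop is where a speedup would be expected to come from, though this sketch makes no claim about matching the timings reported above.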