
Cache for Encoding - Runtime Boosted by 12% #319

Open — wants to merge 1 commit into base: main
Conversation

Majdoddin

This PR introduces a caching mechanism in _encode_ordinary_native(), which stores the tokens for each "piece" of text. When a piece of text is repeated, its tokens are retrieved from the cache instead of being tokenized again.
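The mechanism can be sketched as follows. This is an illustrative Python version, not the actual Rust code in `_encode_ordinary_native()`; `pattern` stands in for tiktoken's piece-splitting regex, and `encode_piece` for the BPE merge step applied to each piece.

```python
import re

def encode_ordinary_with_cache(text, pattern, encode_piece, cache):
    """Split text into pieces and tokenize each, reusing cached results.

    cache maps piece -> token list, so a repeated piece is tokenized once.
    """
    tokens = []
    for match in re.finditer(pattern, text):
        piece = match.group()
        cached = cache.get(piece)
        if cached is None:  # cache miss: run the (expensive) BPE merge logic
            cached = encode_piece(piece)
            cache[piece] = cached
        tokens.extend(cached)
    return tokens
```

Because source code repeats the same identifiers, keywords, and whitespace runs constantly, most pieces are cache hits and the merge logic runs only for the small set of unique pieces.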

This results in a runtime improvement of over 12% (from 20.21s to 17.96s on a single CPU core) when encoding 100MB of Linux source code as a single text.

The cache hit ratio is very high, approximately 95%. The final cache size is only 0.5% of the total number of pieces (218,450 vs. 39,769,721).

TODO:

  • Despite the 95% cache hit ratio, the expected runtime boost was not fully realized: about 80% of the loop runtime in the current code is spent splitting the text with the regex. While this PR makes the tokenization logic 65% faster, the big gain would come from optimizing the text splitting itself, possibly through multithreading.
  • Investigate declaring the cache in the struct CoreBPE so that it can be utilized across subsequent calls.
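The second TODO could look roughly like this: the cache becomes a field on the encoder object (analogous to a field on the Rust `CoreBPE` struct) so it persists across `encode` calls. Again a hypothetical Python sketch; the class name, `encode_piece` callback, and hit/miss counters are illustrative, not part of the PR.

```python
import re

class CachingEncoder:
    def __init__(self, pattern, encode_piece):
        self.pattern = re.compile(pattern)
        self.encode_piece = encode_piece  # stands in for the BPE merge step
        self.cache = {}                   # piece -> token list, kept across calls
        self.hits = 0
        self.misses = 0

    def encode(self, text):
        tokens = []
        for match in self.pattern.finditer(text):
            piece = match.group()
            cached = self.cache.get(piece)
            if cached is None:
                self.misses += 1
                cached = self.cache[piece] = self.encode_piece(piece)
            else:
                self.hits += 1
            tokens.extend(cached)
        return tokens
```

One design question a persistent cache raises (and that the real implementation would need to answer) is bounding its memory, e.g. with an LRU eviction policy, since a long-lived `CoreBPE` could otherwise accumulate pieces indefinitely.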
