
ToCount: Lightweight Token Estimator



Overview

ToCount is a lightweight and extensible Python library for estimating token counts from text inputs using both rule-based and machine learning methods. Designed for flexibility, speed, and accuracy, ToCount provides a unified interface for different estimation strategies, making it ideal for tasks like prompt analysis, token budgeting, and optimizing interactions with token-based systems.


Installation

PyPI
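
Assuming the package is published on PyPI under the same name as the Python package (tocount, matching the import used in the Usage section below), the usual install is:

  • Run pip install tocount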

Source code
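
A typical from-source install (a sketch of the standard GitHub workflow, which may differ slightly from the project's own release instructions):

  • Run git clone https://github.com/openscilab/tocount
  • Inside the cloned directory, run pip install .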

Models

Rule-Based

| Model Name |  | MAE | RMSE | MedAE |  |
| :--- | ---: | ---: | ---: | ---: | ---: |
| RULE_BASED.UNIVERSAL | 0.8175 | 106.70 | 617.78 | 18 | 0.6377 |
| RULE_BASED.GPT_3_5 | 0.7266 | 152.34 | 756.17 | 35 | 0.4828 |
| RULE_BASED.GPT_4 | 0.6878 | 161.93 | 808.04 | 40 | 0.4502 |

Tiktoken R50K

| Model Name |  | MAE | RMSE | MedAE |  |
| :--- | ---: | ---: | ---: | ---: | ---: |
| TIKTOKEN_R50K.LINEAR_ALL | 0.7334 | 152.39 | 733.40 | 28.55 | 0.4826 |
| TIKTOKEN_R50K.LINEAR_ENGLISH | 0.8703 | 62.76 | 508.20 | 8.87 | 0.7287 |

Tiktoken CL100K

| Model Name |  | MAE | RMSE | MedAE |  |
| :--- | ---: | ---: | ---: | ---: | ---: |
| TIKTOKEN_CL100K.LINEAR_ALL | 0.9127 | 64.09 | 298.02 | 15.73 | 0.6804 |
| TIKTOKEN_CL100K.LINEAR_ENGLISH | 0.9711 | 27.43 | 185.07 | 6.34 | 0.8527 |

Tiktoken O200K

| Model Name |  | MAE | RMSE | MedAE |  |
| :--- | ---: | ---: | ---: | ---: | ---: |
| TIKTOKEN_O200K.LINEAR_ALL | 0.9563 | 38.23 | 197.16 | 9.70 | 0.7818 |
| TIKTOKEN_O200K.LINEAR_ENGLISH | 0.9730 | 26.00 | 177.54 | 5.96 | 0.8581 |

Deepseek R1

| Model Name |  | MAE | RMSE | MedAE |  |
| :--- | ---: | ---: | ---: | ---: | ---: |
| DEEPSEEK_R1.LINEAR_ALL | 0.9531 | 40.66 | 212.11 | 10.71 | 0.7741 |
| DEEPSEEK_R1.LINEAR_ENGLISH | 0.9696 | 28.44 | 192.36 | 6.36 | 0.8477 |

Qwen QwQ

| Model Name |  | MAE | RMSE | MedAE |  |
| :--- | ---: | ---: | ---: | ---: | ---: |
| QWEN_QWQ.LINEAR_ALL | 0.9342 | 45.50 | 257.97 | 12.17 | 0.7542 |
| QWEN_QWQ.LINEAR_ENGLISH | 0.9570 | 29.06 | 236.10 | 6.68 | 0.8457 |

Llama 3.1

| Model Name |  | MAE | RMSE | MedAE |  |
| :--- | ---: | ---: | ---: | ---: | ---: |
| LLAMA_3_1.LINEAR_ALL | 0.9538 | 44.37 | 207.58 | 11.70 | 0.7578 |
| LLAMA_3_1.LINEAR_ENGLISH | 0.9731 | 26.59 | 177.94 | 6.24 | 0.8564 |

ℹ️ The training and testing data are taken from LMSYS-Chat-1M [1] and WildChat [2].
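
For reference, the MAE, RMSE, and MedAE columns follow the standard definitions over per-text absolute errors (|estimated - reference token count|). A minimal sketch of how these metrics are computed, using made-up counts rather than ToCount's actual evaluation data:

import statistics

# Hypothetical reference token counts (e.g., from a real tokenizer) and estimates.
y_true = [12, 250, 34, 1800, 97]
y_pred = [10, 260, 30, 1650, 101]

abs_errors = [abs(t - p) for t, p in zip(y_true, y_pred)]
mae = sum(abs_errors) / len(abs_errors)                            # mean absolute error
rmse = (sum(e ** 2 for e in abs_errors) / len(abs_errors)) ** 0.5  # root mean squared error
medae = statistics.median(abs_errors)                              # median absolute error
print(mae, rmse, medae)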

Usage

>>> from tocount import estimate_text_tokens, TextEstimator
>>> estimate_text_tokens("How are you?", estimator=TextEstimator.RULE_BASED.UNIVERSAL)
4
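
The other model names in the tables above appear to map to TextEstimator members in the same way; the attribute path below is inferred from those tables rather than taken from an API reference, and the returned value depends on the fitted model:

>>> tokens = estimate_text_tokens("How are you?", estimator=TextEstimator.TIKTOKEN_CL100K.LINEAR_ENGLISH)
>>> # tokens holds the estimated count for the cl100k vocabulary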

Issues & bug reports

Just file an issue and describe it; we'll check it ASAP! Or send an email to tocount@openscilab.com.

  • Please complete the issue template

You can also join our Discord server.


References

1- Zheng, Lianmin, et al. "LMSYS-Chat-1M: A Large-Scale Real-World LLM Conversation Dataset." International Conference on Learning Representations (ICLR) 2024 Spotlights.
2- Zhao, Wenting, et al. "WildChat: 1M ChatGPT Interaction Logs in the Wild." International Conference on Learning Representations (ICLR) 2024 Spotlights.

Show your support

Star this repo

Give a ⭐️ if this project helped you!

Donate to our project

If you like our project (and we hope you do), please consider supporting us. This project is not, and never will be, run for profit; donations simply help us keep doing what we do ;-).

ToCount Donation
