feat(cli): Add token counts for CLI outputs #39

Merged (1 commit) on Aug 8, 2024

Conversation

joshellington
Contributor

This PR adds token counting and CLI display using the tiktoken library. It reports token counts for individual files and for the entire repository, which is useful when working within context limits.

The encoder is currently fixed to cl100k_base (used for GPT-4, etc.). I could add an option for choosing the encoder, but cl100k_base felt "good enough" (pushing token count accuracy beyond 95% has very limited value IMO), and leaving the option out keeps the option scope small.

I chose the tiktoken library based on high-level benchmarks, higher npm weekly downloads, and fairly decent commit activity.

Changes

  • Added tiktoken as a dependency
  • Updated src/core/packager.ts to include token counting
  • Modified src/cli/cliOutput.ts to display token information in the summary and top files list
  • Updated src/cli/index.ts to use the new token information
  • Added a new test file tests/core/tokenCounter.test.ts for token counting functionality
  • Updated README.md
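
For readers who want a feel for the per-file and repository-wide counts described above, here is a minimal sketch. It is not the code from this PR: the names `TokenEncoder`, `summarizeTokens`, `topFilesByTokens`, and `whitespaceEncoder` are hypothetical, and the whitespace-based encoder is only a stand-in so the example runs without dependencies.

```typescript
// Minimal shape a tokenizer needs for counting: turn text into token ids.
// (Hypothetical interface for this sketch, not taken from the PR.)
interface TokenEncoder {
  encode(text: string): ArrayLike<number>;
}

interface FileTokenCount {
  path: string;
  tokens: number;
}

interface TokenSummary {
  perFile: FileTokenCount[];
  total: number;
}

// Stand-in encoder (assumption): one "token" per whitespace-separated word.
// A real BPE encoder such as cl100k_base will produce different counts.
const whitespaceEncoder: TokenEncoder = {
  encode: (text) =>
    text
      .split(/\s+/)
      .filter((word) => word.length > 0)
      .map((_, index) => index),
};

// Count tokens for each file, then sum them for the whole repository.
function summarizeTokens(
  files: Record<string, string>,
  encoder: TokenEncoder,
): TokenSummary {
  const perFile = Object.entries(files).map(([path, content]) => ({
    path,
    tokens: encoder.encode(content).length,
  }));
  const total = perFile.reduce((sum, file) => sum + file.tokens, 0);
  return { perFile, total };
}

// Return the `limit` files with the most tokens, largest first.
function topFilesByTokens(summary: TokenSummary, limit: number): FileTokenCount[] {
  return [...summary.perFile].sort((a, b) => b.tokens - a.tokens).slice(0, limit);
}

const demo = summarizeTokens(
  { "src/a.ts": "const a = 1;", "README.md": "# repopack" },
  whitespaceEncoder,
);
console.log(demo.total, topFilesByTokens(demo, 1));
```

To get real counts, the stand-in could be swapped for an encoder obtained from the tiktoken package, for example via `get_encoding("cl100k_base")`, whose `encode` method fits the interface above; that encoder should be released with `free()` once counting is done.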

@joshellington
Contributor Author

@yamadashy I think this fits fairly well within the existing structure, but I'm open to challenges and ideas on how to expand it!

@yamadashy
Owner

@joshellington
Thank you so much for this amazing contribution! This looks like a fantastic addition to repopack.
I've taken a quick look, and my impression is very positive. The token counting feature seems like it will be extremely useful for users working with AI models that have context limits.

Regarding the config options, I completely agree with your approach. If we find that additional options become necessary in the future, we can always add them then. For now, using cl100k_base as the default encoder makes sense.
I'm about to start my workday, so it might take me a bit of time to do a review. I'll make sure to take a closer look as soon as I can and provide more detailed feedback.

Thank you again for this great contribution!

@yamadashy
Owner

@joshellington
I've taken another look, and this is perfect! The implementation works great.

I'm planning to merge this PR as is. If you're okay with it, I'll merge and release it as v0.1.22.
Let me know if you have any final thoughts. Otherwise, I'll proceed with the release soon.

Thanks again for this excellent contribution!

@yamadashy merged commit 234c6cd into yamadashy:main on Aug 8, 2024
20 checks passed
@yamadashy
Owner

@joshellington
I've merged this PR and released it as v0.1.22.

You can find the release notes here:
https://github.com/yamadashy/repopack/releases/tag/v0.1.22

Thank you again for your valuable contribution to repopack!
