gotok is a lightweight command-line utility that counts tokens in text using OpenAI's tokenization schemes, making it easy to estimate input sizes for GPT-4o and earlier models.
- Supports multiple OpenAI models and encodings.
- Flexible input and output options.
- Quiet mode for streamlined token counting.
Requires Go 1.19 or higher.
Clone the repository and build the binary using Go:
git clone https://github.com/mattjoyce/gotok.git
cd gotok
go install
This compiles the gotok binary and places it in your $GOPATH/bin directory, making it available from your command line as long as that directory is on your PATH.
gotok [options] < [input]
- --model string
  Specifies the model whose tokenizer is used for counting (default: gpt-4o). If set, the model's default encoding is applied.
- --encoding string
  Sets a specific encoding manually, overriding the model's default. Must be one of the listed encodings (e.g., cl100k_base).
- --input string
  File path to the input text file. If both --input and stdin are provided, gotok concatenates their contents.
- --output string
  Where to send the output: stderr, stdout, or a file path (default: stderr).
- --passthrough
  Outputs the original input text to stdout in addition to the token count (default: false).
- --quiet
  Suppresses all output except the token count, overriding --passthrough.
- --list
  Lists the available models and encodings for easy reference.
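The options above map naturally onto Go's standard flag package. The following is a minimal sketch of that mapping, not gotok's actual source: the options struct and the parseOptions function are hypothetical names, and only the flag names, defaults, and the rule that --quiet overrides --passthrough are taken from the list above.

```go
package main

import (
	"flag"
	"fmt"
	"os"
)

// options mirrors the documented flags. The struct is a hypothetical
// illustration, not a type taken from gotok itself.
type options struct {
	Model       string
	Encoding    string
	Input       string
	Output      string
	Passthrough bool
	Quiet       bool
	List        bool
}

// parseOptions parses args (without the program name) using the
// documented flag names and defaults.
func parseOptions(args []string) (options, error) {
	var o options
	fs := flag.NewFlagSet("gotok", flag.ContinueOnError)
	fs.StringVar(&o.Model, "model", "gpt-4o", "model whose default encoding is used")
	fs.StringVar(&o.Encoding, "encoding", "", "encoding that overrides the model's default")
	fs.StringVar(&o.Input, "input", "", "path to the input text file")
	fs.StringVar(&o.Output, "output", "stderr", "stderr, stdout, or a file path")
	fs.BoolVar(&o.Passthrough, "passthrough", false, "echo the input text to stdout")
	fs.BoolVar(&o.Quiet, "quiet", false, "print only the token count")
	fs.BoolVar(&o.List, "list", false, "list available models and encodings")
	if err := fs.Parse(args); err != nil {
		return options{}, err
	}
	// Documented rule: quiet mode overrides passthrough.
	if o.Quiet {
		o.Passthrough = false
	}
	return o, nil
}

func main() {
	o, err := parseOptions(os.Args[1:])
	if err != nil {
		os.Exit(2)
	}
	fmt.Printf("%+v\n", o)
}
```

Go's flag package accepts both the single-dash and double-dash spellings, so -quiet and --quiet are parsed identically in this sketch.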
Here is a list of available models and their associated encodings that can be used with gotok:
- code-search-ada-code-001 (r50k_base)
- code-davinci-001 (p50k_base)
- text-embedding-ada-002 (cl100k_base)
- text-similarity-curie-001 (r50k_base)
- code-search-babbage-code-001 (r50k_base)
- text-ada-001 (r50k_base)
- code-davinci-002 (p50k_base)
- text-embedding-3-small (cl100k_base)
- text-davinci-002 (p50k_base)
- ada (r50k_base)
- cushman-codex (p50k_base)
- code-davinci-edit-001 (p50k_edit)
- text-search-davinci-doc-001 (r50k_base)
- text-babbage-001 (r50k_base)
- code-cushman-002 (p50k_base)
- code-cushman-001 (p50k_base)
- text-similarity-davinci-001 (r50k_base)
- text-similarity-babbage-001 (r50k_base)
- text-similarity-ada-001 (r50k_base)
- text-search-ada-doc-001 (r50k_base)
- gpt2 (gpt2)
- davinci (r50k_base)
- babbage (r50k_base)
- text-search-curie-doc-001 (r50k_base)
- curie (r50k_base)
- davinci-codex (p50k_base)
- text-davinci-edit-001 (p50k_edit)
- text-embedding-3-large (cl100k_base)
- gpt-4 (cl100k_base)
- gpt-3.5-turbo (cl100k_base)
- text-curie-001 (r50k_base)
- text-search-babbage-doc-001 (r50k_base)
- gpt-4o (o200k_base)
- text-davinci-003 (p50k_base)
- text-davinci-001 (r50k_base)
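To make the relationship between --model and --encoding concrete, here is a small sketch of how the lookup could work: an explicit encoding wins, otherwise the model's default from the list above is used. The names modelEncodings and resolveEncoding are hypothetical, the table holds only a subset of the models listed, and gotok's real implementation may differ.

```go
package main

import (
	"fmt"
)

// modelEncodings is a partial copy of the model list above
// (illustrative subset, hypothetical variable name).
var modelEncodings = map[string]string{
	"gpt-4o":                "o200k_base",
	"gpt-4":                 "cl100k_base",
	"gpt-3.5-turbo":         "cl100k_base",
	"text-davinci-003":      "p50k_base",
	"text-davinci-edit-001": "p50k_edit",
	"davinci":               "r50k_base",
	"gpt2":                  "gpt2",
}

// knownEncoding reports whether enc appears as a value in the table.
func knownEncoding(enc string) bool {
	for _, e := range modelEncodings {
		if e == enc {
			return true
		}
	}
	return false
}

// resolveEncoding returns the explicit encoding if one is given and
// known; otherwise it falls back to the model's default encoding.
func resolveEncoding(model, encoding string) (string, error) {
	if encoding != "" {
		if !knownEncoding(encoding) {
			return "", fmt.Errorf("unknown encoding %q", encoding)
		}
		return encoding, nil
	}
	enc, ok := modelEncodings[model]
	if !ok {
		return "", fmt.Errorf("unknown model %q", model)
	}
	return enc, nil
}

func main() {
	// Model default only, then an explicit encoding overriding it.
	for _, c := range [][2]string{{"gpt-4o", ""}, {"gpt-4o", "cl100k_base"}} {
		enc, err := resolveEncoding(c[0], c[1])
		fmt.Println(enc, err)
	}
}
```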
- Basic Usage
  Count tokens in input.txt, writing the count to stderr (the default):
  gotok --input input.txt
- Using Stdin Input
  Pipe input text from stdin:
  gotok < input.txt
- Specify Encoding and Model
  Specify a particular encoding and model:
  gotok --encoding "cl100k_base" --model "gpt-4" < input.txt
- Quiet Mode
  Count tokens only (suppress all other output):
  gotok --quiet --input input.txt
- List Available Models and Encodings
  gotok --list
- When both --input and stdin input are provided, the content from both sources is concatenated before processing.
- Use the --passthrough option to print the original input text to stdout while viewing token counts.
This utility simplifies text preprocessing tasks for token-based models by giving you quick insight into input size, which helps in better managing prompt limits and structuring inputs for optimal model performance.