GPT3Tokenizer for PHP

This is a PHP port of the GPT-3 tokenizer. It is based on the original Python implementation and the Node.js implementation.

GPT-2 and GPT-3 use a technique called byte pair encoding to convert text into a sequence of integers, which are then used as input for the model. When you interact with the OpenAI API, you may find it useful to calculate the number of tokens in a given text before sending it to the API.

If you want to learn more, read the Summary of the tokenizers from Hugging Face.
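
To make the idea of byte pair encoding concrete, here is a toy sketch. It is not this package's implementation: the function name `toyBpe`, the three-entry merge list, and the two-entry vocabulary are all invented for illustration, and the real tokenizer works on a much larger merges file and vocabulary. The sketch repeatedly merges the adjacent pair of symbols with the best rank, then looks up each resulting symbol in a vocabulary to get integers.

```php
<?php
// Toy byte pair encoding. Hypothetical helper, NOT the package's code.
// $merges is a ranked list of "left right" pairs; earlier entries win.
function toyBpe(string $word, array $merges): array
{
    // Start with one symbol per byte.
    $symbols = str_split($word);
    // Map each pair to its rank (lower rank = higher priority).
    $ranks = array_flip($merges);

    while (count($symbols) > 1) {
        $bestRank = null;
        $bestIndex = null;
        // Find the adjacent pair with the lowest rank.
        for ($i = 0; $i < count($symbols) - 1; $i++) {
            $pair = $symbols[$i] . ' ' . $symbols[$i + 1];
            if (isset($ranks[$pair]) && ($bestRank === null || $ranks[$pair] < $bestRank)) {
                $bestRank = $ranks[$pair];
                $bestIndex = $i;
            }
        }
        if ($bestIndex === null) {
            break; // No known pair is left to merge.
        }
        // Replace the two symbols with their concatenation.
        $merged = $symbols[$bestIndex] . $symbols[$bestIndex + 1];
        array_splice($symbols, $bestIndex, 2, [$merged]);
    }

    return $symbols;
}

// Invented merge list and vocabulary, for illustration only.
$merges = ['l o', 'lo w', 'e r'];
$vocab = ['low' => 0, 'er' => 1];

$symbols = toyBpe('lower', $merges); // ['low', 'er']
$ids = array_map(fn (string $s): int => $vocab[$s], $symbols); // [0, 1]
```

The toy version merges one occurrence at a time and uses a space as the pair separator, which is good enough for a demonstration but would be ambiguous for input that itself contains spaces.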

tl;dr 🤖

There is a Custom GPT for ChatGPT that can help you use this package in your software.

Support ⭐️

If you find my work useful, I would be thrilled if you could show your support by giving this project a star ⭐️. It only takes a second and it would mean a lot to me. Your star will not only make me feel warm and fuzzy inside, but it will also help reach more people who can benefit from this project.

Installation

Install the package from Packagist using Composer:

composer require gioni06/gpt3-tokenizer

Testing

Loading the vocabulary files consumes a lot of memory, so you might need to increase the PHPUnit memory limit. See https://stackoverflow.com/questions/46448294/phpunit-coverage-allowed-memory-size-of-536870912-bytes-exhausted for background.

./vendor/bin/phpunit -d memory_limit=-1 tests/

Use the configuration class

use Gioni06\Gpt3Tokenizer\Gpt3TokenizerConfig;

// default vocab path
// default merges path
// caching enabled
$defaultConfig = new Gpt3TokenizerConfig();

$customConfig = new Gpt3TokenizerConfig();
$customConfig
    ->vocabPath('custom_vocab.json') // path to a custom vocabulary file
    ->mergesPath('custom_merges.txt') // path to a custom merges file
    ->useCache(false); // disable caching

A note on caching

The tokenizer will try to use APCu for caching. If APCu is not available, it falls back to a plain PHP array. You will see slightly better performance for long texts when using the cache. The cache is enabled by default.
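
The fallback described above can be sketched as follows. This is only an illustration of the general pattern, not the package's actual cache code: the class name `FallbackCache` and its `remember` method are hypothetical. It uses the real APCu functions `apcu_enabled`, `apcu_fetch`, and `apcu_store` when the extension is usable, and an in-memory array otherwise.

```php
<?php
// Hypothetical helper, NOT part of the package: cache in APCu when it is
// usable, otherwise in a plain PHP array that lives for the current request.
final class FallbackCache
{
    private array $local = [];
    private bool $useApcu;

    public function __construct()
    {
        // apcu_enabled() is false when the extension is loaded but switched
        // off, which is the default on the command line.
        $this->useApcu = function_exists('apcu_enabled') && apcu_enabled();
    }

    public function remember(string $key, callable $compute)
    {
        if ($this->useApcu) {
            $value = apcu_fetch($key, $found);
            if ($found) {
                return $value;
            }
            $value = $compute();
            apcu_store($key, $value);
            return $value;
        }

        // Array fallback: compute once, then reuse the stored value.
        if (!array_key_exists($key, $this->local)) {
            $this->local[$key] = $compute();
        }
        return $this->local[$key];
    }
}
```

With either backend, the expensive computation runs only the first time a given key is requested.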

Encode a text

use Gioni06\Gpt3Tokenizer\Gpt3TokenizerConfig;
use Gioni06\Gpt3Tokenizer\Gpt3Tokenizer;

$config = new Gpt3TokenizerConfig();
$tokenizer = new Gpt3Tokenizer($config);
$text = "This is some text";
$tokens = $tokenizer->encode($text);
// [1212,318,617,2420]

Decode a text

use Gioni06\Gpt3Tokenizer\Gpt3TokenizerConfig;
use Gioni06\Gpt3Tokenizer\Gpt3Tokenizer;

$config = new Gpt3TokenizerConfig();
$tokenizer = new Gpt3Tokenizer($config);
$tokens = [1212,318,617,2420];
$text = $tokenizer->decode($tokens);
// "This is some text"

Count the number of tokens in a text

use Gioni06\Gpt3Tokenizer\Gpt3TokenizerConfig;
use Gioni06\Gpt3Tokenizer\Gpt3Tokenizer;

$config = new Gpt3TokenizerConfig();
$tokenizer = new Gpt3Tokenizer($config);
$text = "This is some text";
$numberOfTokens = $tokenizer->count($text);
// 4
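
A common reason to count tokens is to check that a prompt fits a model's limit before calling an API. The sketch below shows one way to wrap that check. The function name `fitsContext` and its parameters are hypothetical, and the limits are placeholders rather than values for any real model. The counter is passed in as a callable so the example runs without the package; with the tokenizer above you could pass `fn (string $t) => $tokenizer->count($t)` instead of the stand-in.

```php
<?php
// Hypothetical helper, NOT part of the package: true when the prompt plus
// the tokens reserved for the reply fit within the given limit.
function fitsContext(
    string $prompt,
    int $contextLimit,
    int $reservedForCompletion,
    callable $countTokens
): bool {
    return $countTokens($prompt) + $reservedForCompletion <= $contextLimit;
}

// Stand-in counter so the sketch is self-contained: it counts
// whitespace-separated words, which is NOT how real tokens are counted.
$standInCounter = fn (string $t): int =>
    count(preg_split('/\s+/', $t, -1, PREG_SPLIT_NO_EMPTY));

// Placeholder limits, chosen only for the demonstration.
$fits = fitsContext('This is some text', 100, 50, $standInCounter);
```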

Encode a given text into chunks of tokens, with each chunk containing a specified maximum number of tokens.

This method is useful when handling large texts that need to be divided into smaller chunks for further processing.

use Gioni06\Gpt3Tokenizer\Gpt3TokenizerConfig;
use Gioni06\Gpt3Tokenizer\Gpt3Tokenizer;

$config = new Gpt3TokenizerConfig();
$tokenizer = new Gpt3Tokenizer($config);
$text = "1 2 hello，world 3 4";
$chunks = $tokenizer->encodeInChunks($text, 5);
// [[16, 362, 23748], [171, 120, 234, 6894, 513], [604]]
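
Note that the first chunk in the sample output holds three tokens although the limit is five. The tokens 171, 120, and 234 correspond to the three UTF-8 bytes of the fullwidth comma "，", so cutting after exactly five tokens would split that character in the middle. That the method shortens a chunk to avoid this is an inference from the sample output, not something stated here. The snippet below only demonstrates the underlying problem with plain PHP string functions; the variable names are illustrative.

```php
<?php
// U+FF0C FULLWIDTH COMMA, written as escaped UTF-8 bytes so the example
// does not depend on the encoding of this file.
$comma = "\xEF\xBC\x8C";

// preg_match with the /u modifier fails on invalid UTF-8 input, so it can
// serve as a validity check without needing the mbstring extension.
$isValidUtf8 = fn (string $s): bool => preg_match('//u', $s) === 1;

$byteLength = strlen($comma);                      // 3
$wholeIsValid = $isValidUtf8($comma);              // true
$cutIsValid = $isValidUtf8(substr($comma, 0, 2));  // false: character cut in half
```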

Split a given text into text segments, with each segment containing a specified maximum number of tokens.

This method leverages the encodeInChunks method for encoding the text into Byte-Pair Encoded (BPE) tokens and then decodes these tokens back into text.

use Gioni06\Gpt3Tokenizer\Gpt3TokenizerConfig;
use Gioni06\Gpt3Tokenizer\Gpt3Tokenizer;

$config = new Gpt3TokenizerConfig();
$tokenizer = new Gpt3Tokenizer($config);
$text = "1 2 hello，world 3 4";
$chunks = $tokenizer->chunk($text, 5);
// ['1 2 hello', '，world 3', ' 4']

License

This project is licensed under the Apache License 2.0. See the LICENSE file for more information.
