Update method documentation and readme.md file
Gioni06 committed Apr 16, 2023
1 parent 9d2c827 commit 86f824f
Showing 2 changed files with 41 additions and 5 deletions.
31 changes: 31 additions & 0 deletions README.md
@@ -87,5 +87,36 @@ $numberOfTokens = $tokenizer->count($text);
// 4
```

## Encode a given text into chunks of tokens, with each chunk containing a specified maximum number of tokens.

This method is useful when handling large texts that need to be divided into smaller chunks for further processing.


```php
use Gioni06\Gpt3Tokenizer\Gpt3TokenizerConfig;
use Gioni06\Gpt3Tokenizer\Gpt3Tokenizer;

$config = new Gpt3TokenizerConfig();
$tokenizer = new Gpt3Tokenizer($config);
$text = "1 2 hello，world 3 4";
$tokenizer->encodeInChunks($text, 5);
// [[16, 362, 23748], [171, 120, 234, 6894, 513], [604]]
```
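The first chunk in the example above holds only three tokens even though the limit is five. That is consistent with the next three tokens (171, 120, 234) belonging to one multi-byte character that is not split across chunks. The following is a minimal sketch of that kind of greedy packing, not the library's actual implementation: `packTokenGroups` is a hypothetical helper, and the grouping of tokens into indivisible units is an assumption made for illustration.

```php
<?php

/**
 * Hypothetical helper, not part of the library: greedily packs token groups
 * into chunks of at most $maxTokensPerChunk tokens without splitting a group.
 *
 * @param int[][] $tokenGroups One inner array per indivisible unit.
 * @return int[][]
 */
function packTokenGroups(array $tokenGroups, int $maxTokensPerChunk): array
{
    $chunks = [];
    $current = [];

    foreach ($tokenGroups as $group) {
        // Start a new chunk when this group would push the current one over the limit.
        if ($current !== [] && count($current) + count($group) > $maxTokensPerChunk) {
            $chunks[] = $current;
            $current = [];
        }
        // Assumption: a group longer than the limit is kept whole rather than split.
        array_push($current, ...$group);
    }

    if ($current !== []) {
        $chunks[] = $current;
    }

    return $chunks;
}

// Assumed grouping of the token IDs from the example above.
$groups = [[16], [362], [23748], [171, 120, 234], [6894], [513], [604]];
$chunks = packTokenGroups($groups, 5);
// [[16, 362, 23748], [171, 120, 234, 6894, 513], [604]]
```

Keeping a group whole means a chunk can end before the limit is reached, which is why chunk sizes vary.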

## Takes a given text and chunks it into encoded segments, with each segment containing a specified maximum number of tokens.

This method leverages the encodeInChunks method for encoding the text into Byte-Pair Encoded (BPE) tokens and then decodes these tokens back into text.

```php
use Gioni06\Gpt3Tokenizer\Gpt3TokenizerConfig;
use Gioni06\Gpt3Tokenizer\Gpt3Tokenizer;

$config = new Gpt3TokenizerConfig();
$tokenizer = new Gpt3Tokenizer($config);
$text = "1 2 hello，world 3 4";
$tokenizer->chunk($text, 5);
// ['1 2 hello', '，world 3', ' 4']
```
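The encode-then-decode flow described for `chunk` can be sketched without the library. In this illustration `chunkViaTokens` and both closures are hypothetical; the toy tokenizer treats each byte as one token and only stands in for the real BPE encoder and decoder.

```php
<?php

/**
 * Hypothetical sketch: split text into text chunks by encoding it into
 * token chunks and decoding each token chunk back into a string.
 *
 * @return string[]
 */
function chunkViaTokens(
    string $text,
    int $maxTokensPerChunk,
    callable $encodeInChunks,
    callable $decode
): array {
    return array_map($decode, $encodeInChunks($text, $maxTokensPerChunk));
}

// Toy stand-ins (one token per byte), NOT the library's BPE tokenizer.
$toyEncodeInChunks = fn (string $text, int $max): array =>
    array_chunk(array_map('ord', str_split($text)), $max);
$toyDecode = fn (array $tokens): string => implode('', array_map('chr', $tokens));

$parts = chunkViaTokens('hello world', 5, $toyEncodeInChunks, $toyDecode);
// ['hello', ' worl', 'd']
```

With a real BPE tokenizer the chunk boundaries fall on token boundaries rather than byte boundaries, so the resulting strings differ from this toy output.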

## License
This project uses the Apache License 2.0 license. See the [LICENSE](LICENSE) file for more information.
15 changes: 10 additions & 5 deletions src/Gpt3Tokenizer.php
@@ -452,10 +452,11 @@ public function encode(string $text): array
}

/**
- * Split $text into chunks of up to $maxTokenPerChunk tokens.
- * Unicode characters that map to multiple tokens are kept in the same chunk.
- *
- * @return array<array<int>>
+ * Encodes a given text into chunks of Byte-Pair Encoded (BPE) tokens, with each chunk containing a specified
+ * maximum number of tokens.
+ * @param string $text The input text to be encoded.
+ * @param int $maxTokenPerChunk The maximum number of tokens allowed per chunk.
+ * @return int[][] An array of arrays containing BPE token chunks.
*/
public function encodeInChunks(string $text, int $maxTokenPerChunk): array
{
@@ -491,7 +492,11 @@ public function encodeInChunks(string $text, int $maxTokenPerChunk): array
}

/**
- * @return array<string>
+ * Takes a given text and chunks it into encoded segments, with each segment containing a specified maximum
+ * number of tokens.
+ * @param string $text The input text to be encoded.
+ * @param int $maxTokenPerChunk The maximum number of tokens allowed per chunk.
+ * @return string[] An array of strings containing the encoded text.
*/
public function chunk(string $text, int $maxTokenPerChunk): array
{