Batch encoding decoding #22

ezzaldeeen · 2024-02-20T13:21:31Z

Batch encoding and decoding:

from minbpe import BasicTokenizer
tokenizer = BasicTokenizer()
tokenizer.train(very_long_training_string, vocab_size=4096)
tokenizer.encode_batch(["hello world", "bye world"]) # list[str] -> list[list[int]]
tokenizer.decode_batch([[1000, 2000, 3000], [1000, 2000, 3000]]) # list[list[int]] -> list[str]

from minbpe import RegexTokenizer
tokenizer = RegexTokenizer()
tokenizer.train(very_long_training_string, vocab_size=32768)
tokenizer.encode_batch(["hello world", "bye world"]) # list[str] -> list[list[int]]
tokenizer.decode_batch([[1000, 2000, 3000], [1000, 2000, 3000]]) # list[list[int]] -> list[str]

karpathy · 2024-02-20T15:33:31Z

Is this part of some official tokenizer API somewhere that you're trying to match? Otherwise, if it's just a 2-line wrapper, it's best done outside, manually in the calling code.
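
For reference, the "2-line wrapper" kept outside the library might look like the sketch below. It assumes only that the tokenizer exposes encode(text) and decode(ids), as the minbpe tokenizers do. ByteTokenizer is a hypothetical stand-in (raw UTF-8 bytes, no merges) so the snippet runs on its own, and the free functions encode_batch and decode_batch are illustrative names, not part of minbpe.

```python
class ByteTokenizer:
    """Hypothetical stand-in tokenizer: one token per UTF-8 byte, no merges."""

    def encode(self, text: str) -> list[int]:
        return list(text.encode("utf-8"))

    def decode(self, ids: list[int]) -> str:
        return bytes(ids).decode("utf-8", errors="replace")


def encode_batch(tokenizer, texts: list[str]) -> list[list[int]]:
    # One encode() call per input string; order is preserved.
    return [tokenizer.encode(text) for text in texts]


def decode_batch(tokenizer, batch: list[list[int]]) -> list[str]:
    # One decode() call per token list; order is preserved.
    return [tokenizer.decode(ids) for ids in batch]


if __name__ == "__main__":
    tok = ByteTokenizer()
    ids = encode_batch(tok, ["hello world", "bye world"])
    print(decode_batch(tok, ids))
```

Any object with the same encode/decode methods, such as a trained BasicTokenizer or RegexTokenizer, could be passed in place of the stand-in.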

ezzaldeeen · 2024-02-20T15:40:20Z

Agreed. I did it just to match the behavior of tiktoken.

ezzaldeeen added 3 commits February 20, 2024 15:04

add batch encoding and decoding to the base class

5c39e7a

add batch apis to the readme examples

87e8f80

update regex tokenizer example

c838441

ezzaldeeen marked this pull request as draft February 20, 2024 14:42

different logic per tokenizers

b2b7256

ezzaldeeen marked this pull request as ready for review February 20, 2024 15:40

Merge branch 'master' into batch_encoding_decoding

f70e2a3
