Make convert-pt-to-ggml.py backwards compatible with older vocab.json tokenizer files #1001

akashmjn · 2023-06-11T04:59:07Z

Patches the script to detect which type of tokenizer files are present and convert accordingly.
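A minimal sketch of what such detection could look like, not the code from this PR. It assumes the newer tokenizer ships as `<name>.tiktoken` (lines of base64 token plus rank) and the older one as `<name>/vocab.json` (GPT-2 byte-encoded strings mapped to ids), with `<name>` being `multilingual` or `gpt2`. The function names `find_tokenizer` and `load_tokens` and the directory layout are illustrative assumptions.

```python
import base64
import json
from pathlib import Path


def bytes_to_unicode():
    """GPT-2 style reversible mapping from byte values to printable characters."""
    bs = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(ord("¡"), ord("¬") + 1))
        + list(range(ord("®"), ord("ÿ") + 1))
    )
    cs = bs[:]
    n = 0
    for b in range(256):
        if b not in bs:
            # Non-printable bytes are shifted to code points starting at 256.
            bs.append(b)
            cs.append(256 + n)
            n += 1
    return dict(zip(bs, (chr(c) for c in cs)))


def find_tokenizer(assets_dir, multilingual):
    """Return (path, kind), preferring the tiktoken file over vocab.json.

    The file layout checked here is an assumption made for this sketch.
    """
    name = "multilingual" if multilingual else "gpt2"
    assets_dir = Path(assets_dir)
    candidates = [
        (assets_dir / f"{name}.tiktoken", "tiktoken"),
        (assets_dir / name / "vocab.json", "hf_transformers"),
    ]
    for path, kind in candidates:
        if path.is_file():
            return path, kind
    raise FileNotFoundError(f"no tokenizer file for '{name}' under {assets_dir}")


def load_tokens(path, kind):
    """Load a {token_bytes: id} mapping from either tokenizer format."""
    if kind == "tiktoken":
        tokens = {}
        with open(path, "rb") as f:
            for line in f.read().splitlines():
                if line:
                    token, rank = line.split()
                    tokens[base64.b64decode(token)] = int(rank)
        return tokens
    if kind == "hf_transformers":
        # Invert the byte-to-unicode map to recover the raw token bytes.
        byte_decoder = {c: b for b, c in bytes_to_unicode().items()}
        with open(path, "r", encoding="utf8") as f:
            raw = json.load(f)
        return {bytes(byte_decoder[c] for c in tok): int(idx) for tok, idx in raw.items()}
    raise ValueError(f"unknown tokenizer kind: {kind}")
```

Both branches return the same `{bytes: int}` shape, so the rest of a conversion step would not need to know which file type was found.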

Mostly borrows from #725 as a reference. For the same reason given in that PR, converted multilingual checkpoints still match exactly, while .en checkpoints differ by 17 bytes from the ggml files downloaded by whisper.cpp.

The script will also now produce exactly the same files regardless of which type of tokenizer files were used for conversion.

-rw-r--r--  1 Akash  staff  487614184 Jun 10 21:30 ggml-small.en.hf.bin
-rw-r--r--  1 Akash  staff  487614184 Jun 10 21:28 ggml-small.en.tiktoken.bin
-rw-r--r--  1 Akash  staff  487601967 Jun 10 21:32 ggml-small.hf.bin
-rw-r--r--  1 Akash  staff  487601967 Jun 10 21:35 ggml-small.tiktoken.bin
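The listing above shows matching sizes only. One way to confirm that two converted files are byte-for-byte identical is to compare digests, sketched here with Python's standard `hashlib`; the helper names `sha256_of` and `same_file_contents` are hypothetical and not part of the conversion script.

```python
import hashlib


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large model files are not read into memory at once."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def same_file_contents(path_a, path_b):
    """True when both files have identical SHA-256 digests."""
    return sha256_of(path_a) == sha256_of(path_b)
```

For example, `same_file_contents("ggml-small.hf.bin", "ggml-small.tiktoken.bin")` would check the multilingual pair from the listing.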


ggerganov · 2023-06-25T10:50:34Z

Thank you!


akashmjn added 2 commits June 10, 2023 21:36

patch checkpoint convert script to keep compatibility with older hf_transformers whisper tokenizer

e979dbb

typo fix

06022a1

ggerganov approved these changes Jun 25, 2023


ggerganov merged commit 3ec7bff into ggerganov:master Jun 25, 2023

akashmjn mentioned this pull request Jun 27, 2023

whisper : support speaker segmentation (local diarization) of mono audio via tinydiarize #1058

Merged

7 tasks

akashmjn mentioned this pull request Aug 16, 2023

tdrz and coreml support? #1088

Open
