
convert : fix handling of added tokens #3405

Closed

Conversation

cebtenzzre
Collaborator

Some models, like MPT and GPT-NeoX, have tokens containing spaces in added_tokens: see mpt-7b's tokenizer.json and gpt-neox-20b's tokenizer.json.

I have changed the byte decoding to not decode added tokens. I'm not sure what the original KeyError handler was meant to do, since it references multibyte characters, but I have not seen those in practice, only spaces.

See also: huggingface/transformers#1133
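The issue above can be illustrated with a minimal sketch. This is not the PR's actual code; the function names (`bytes_to_unicode`, `token_bytes`) and the `is_added` flag are my own illustration, though `bytes_to_unicode` follows the standard GPT-2 byte-to-unicode construction. Regular tokens in a GPT-2-style vocab are stored byte-encoded (e.g. a leading space becomes "Ġ"), so conversion must map each character back through a byte decoder; added tokens, however, may contain literal spaces that were never byte-encoded, and looking them up in the decoder raises KeyError.

```python
# GPT-2's reversible byte<->unicode mapping (standard construction):
# printable bytes map to themselves, the rest to code points >= 256.
def bytes_to_unicode():
    bs = (list(range(ord("!"), ord("~") + 1))
          + list(range(ord("\xa1"), ord("\xac") + 1))
          + list(range(ord("\xae"), ord("\xff") + 1)))
    cs = bs[:]
    n = 0
    for b in range(256):
        if b not in bs:
            bs.append(b)
            cs.append(256 + n)
            n += 1
    return dict(zip(bs, (chr(c) for c in cs)))

# Reverse mapping: encoded character -> original byte value.
byte_decoder = {v: k for k, v in bytes_to_unicode().items()}

def token_bytes(text: str, is_added: bool) -> bytes:
    """Recover the raw bytes of a vocab entry (hypothetical sketch)."""
    if is_added:
        # Added tokens such as "  " contain literal spaces that were never
        # byte-encoded; decoding them via byte_decoder would raise KeyError,
        # so store them verbatim instead.
        return text.encode("utf-8")
    # Regular tokens: undo the byte encoding character by character.
    return bytes(byte_decoder[c] for c in text)
```

For example, the byte-encoded token "Ġhello" decodes to b" hello", while an added token of two literal spaces passes through unchanged; note that a plain space character is not a key of `byte_decoder` at all, which is where the KeyError came from.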

@goerch
Collaborator

goerch commented Sep 30, 2023

AFAUI these changes conflict with the proposed fixes for Falcon and Aquila, which use GPT2-based tokenizers.

@cebtenzzre
Collaborator Author

> AFAUI these changes conflict with the proposed fixes for Falcon and Aquila, which use GPT2-based tokenizers.

I haven't read through your PR in depth, so I would appreciate it if you could point me to the part of the C++ code that should be changed to accomplish something similar there. Then I can make a PR on your repo.

@goerch
Collaborator

goerch commented Sep 30, 2023

> I haven't read through your PR in depth, so I would appreciate if you could point to the part of the C++ code that should be changed to accomplish something similar there. Then I can make a PR on your repo.

I think conversion should look similar to this (and yes, byte_encoder/byte_decoder are probably obsolete here). The C++ code of #3252 should be compatible with such conversions.
