
convert : fix handling of added tokens #3405

Closed

Conversation

cebtenzzre
Collaborator

Some models, like MPT and GPT-NeoX, have tokens containing spaces in added_tokens: see mpt-7b's tokenizer.json and gpt-neox-20b's tokenizer.json.

I have changed the byte decoding to not decode added tokens. I'm not sure what the original KeyError handler was meant to do, since it references multibyte characters, but I have not seen those in practice, only spaces.

See also: huggingface/transformers#1133
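The issue above can be illustrated with a minimal sketch. This is not the PR's actual code; the function names (`bytes_to_unicode`, `token_bytes`) and the `is_added` flag are my own illustration, though `bytes_to_unicode` follows the standard GPT-2 byte-to-unicode construction. Regular tokens in a GPT-2-style vocab are stored byte-encoded (e.g. a leading space becomes "Ġ"), so conversion must map each character back through a byte decoder; added tokens, however, may contain literal spaces that were never byte-encoded, and looking them up in the decoder raises KeyError.

```python
# GPT-2's reversible byte<->unicode mapping (standard construction):
# printable bytes map to themselves, the rest to code points >= 256.
def bytes_to_unicode():
    bs = (list(range(ord("!"), ord("~") + 1))
          + list(range(ord("\xa1"), ord("\xac") + 1))
          + list(range(ord("\xae"), ord("\xff") + 1)))
    cs = bs[:]
    n = 0
    for b in range(256):
        if b not in bs:
            bs.append(b)
            cs.append(256 + n)
            n += 1
    return dict(zip(bs, (chr(c) for c in cs)))

# Reverse mapping: encoded character -> original byte value.
byte_decoder = {v: k for k, v in bytes_to_unicode().items()}

def token_bytes(text: str, is_added: bool) -> bytes:
    """Recover the raw bytes of a vocab entry (hypothetical sketch)."""
    if is_added:
        # Added tokens such as "  " contain literal spaces that were never
        # byte-encoded; decoding them via byte_decoder would raise KeyError,
        # so store them verbatim instead.
        return text.encode("utf-8")
    # Regular tokens: undo the byte encoding character by character.
    return bytes(byte_decoder[c] for c in text)
```

For example, the byte-encoded token "Ġhello" decodes to b" hello", while an added token of two literal spaces passes through unchanged; note that a plain space character is not a key of `byte_decoder` at all, which is where the KeyError came from.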

@goerch
Collaborator

goerch commented Sep 30, 2023

AFAUI these changes conflict with the proposed fixes for Falcon and Aquila, which use GPT2-based tokenizers.

@cebtenzzre
Collaborator Author

> AFAUI these changes conflict with the proposed fixes for Falcon and Aquila, which use GPT2-based tokenizers.

I haven't read through your PR in depth, so I would appreciate it if you could point me to the part of the C++ code that should be changed to accomplish something similar there. Then I can make a PR on your repo.

@goerch
Collaborator

goerch commented Sep 30, 2023

> I haven't read through your PR in depth, so I would appreciate if you could point to the part of the C++ code that should be changed to accomplish something similar there. Then I can make a PR on your repo.

I think conversion should look similar to this (and yes, byte_encoder/byte_decoder are probably obsolete here). The C++ code of #3252 should be compatible with such conversions.
