`encode` and `decode` of the GPT2Tokenizer are not mutually inverse #11

comydream · 2022-03-12T03:54:32Z

I ran into a problem: `encode` and `decode` of the GPT2Tokenizer are not mutually inverse.

For example, Alice's ["\u0120compar": 4616, "isons": 9886] may be decoded by Bob as ["\u0120comparisons": 17909], so Bob cannot recover the message correctly.

(The mapping table can be viewed from https://huggingface.co/gpt2/raw/main/vocab.json.)
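The mismatch can be reproduced without the real tokenizer. The following is a minimal sketch, assuming a toy vocabulary that contains only the three entries quoted above and a greedy longest-match `encode` as a stand-in for GPT-2's actual BPE merge procedure (the helper names are hypothetical and not part of `pytorch_transformers`):

```python
# Toy vocabulary: only the three entries quoted above ("\u0120" marks a leading space).
VOCAB = {"\u0120compar": 4616, "isons": 9886, "\u0120comparisons": 17909}
ID_TO_TOKEN = {token_id: token for token, token_id in VOCAB.items()}


def decode(ids):
    """Concatenate the token strings for a list of ids."""
    return "".join(ID_TO_TOKEN[i] for i in ids)


def encode(text):
    """Greedy longest-match tokenization (an assumption; real GPT-2 uses BPE merge ranks)."""
    ids, pos = [], 0
    while pos < len(text):
        candidates = [t for t in VOCAB if text.startswith(t, pos)]
        if not candidates:
            raise ValueError(f"no token matches at position {pos}")
        best = max(candidates, key=len)
        ids.append(VOCAB[best])
        pos += len(best)
    return ids


alice_ids = [4616, 9886]   # what Alice sends
text = decode(alice_ids)   # the string both sides agree on
bob_ids = encode(text)     # what Bob gets when he re-tokenizes the string

print(alice_ids, bob_ids)  # the two id sequences differ although the text is identical
```

In this sketch the text round-trips, but the token ids do not, which is the failure described above.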

There are more examples:

"\u0120.": 764
".": 13

",\"": 553
"\u0120,\"": 42911

"\u0120te": 573
"ction": 596
"\u0120t": 256
"ection": 3213

"\u0120INT": 17828
"eq": 27363
"\u0120IN": 3268
"Te": 6767
"q": 80

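The last two groups of examples appear to show a second facet of the same issue: different token sequences can spell the same string, so `decode` cannot be injective. A small sketch, assuming only the ids quoted above and a hypothetical `decode` helper that concatenates token strings, makes this concrete:

```python
# Entries quoted in the examples above ("\u0120" marks a leading space).
VOCAB = {
    "\u0120te": 573, "ction": 596, "\u0120t": 256, "ection": 3213,
    "\u0120INT": 17828, "eq": 27363, "\u0120IN": 3268, "Te": 6767, "q": 80,
}
ID_TO_TOKEN = {token_id: token for token, token_id in VOCAB.items()}


def decode(ids):
    """Concatenate token strings and turn the "\u0120" marker back into a space."""
    return "".join(ID_TO_TOKEN[i] for i in ids).replace("\u0120", " ")


# Two different id sequences per string.
tection_a = decode([573, 596])          # "\u0120te" + "ction"
tection_b = decode([256, 3213])         # "\u0120t" + "ection"
inteq_a = decode([17828, 27363])        # "\u0120INT" + "eq"
inteq_b = decode([3268, 6767, 80])      # "\u0120IN" + "Te" + "q"

print(repr(tection_a), repr(tection_b))
print(repr(inteq_a), repr(inteq_b))
```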
In my environment, the package versions match the ones you specified:

pytorch_transformers==1.1.0
torch==1.0.1
bitarray==1.0.1

How can I solve this problem?

I also noticed that you override `GPT2Tokenizer.decode` and `GPT2Tokenizer._convert_token_to_id`. Is that related to this problem?

Thank you!
