pad_token_id of -1 is not understood by convert.py #2981

Closed
cebtenzzre opened this issue Sep 3, 2023 · 14 comments · Fixed by #2984 or #4489

@cebtenzzre
Collaborator

I tried to convert Chronoboros-33B to GGUF with convert.py (commit 3358c38), and I got this error:

gguf: Setting special token type bos to 0
gguf: Setting special token type eos to 1
gguf: Setting special token type pad to -1
Traceback (most recent call last):
  File "/home/cebtenzzre/src/forks/llama.cpp/./convert.py", line 1201, in <module>
    main()
  File "/home/cebtenzzre/src/forks/llama.cpp/./convert.py", line 1196, in main
    OutputFile.write_all(outfile, ftype, params, model, vocab, special_vocab, concurrency = args.concurrency)
  File "/home/cebtenzzre/src/forks/llama.cpp/./convert.py", line 922, in write_all
    of.add_meta_special_vocab(svocab)
  File "/home/cebtenzzre/src/forks/llama.cpp/./convert.py", line 866, in add_meta_special_vocab
    svocab.add_to_gguf(self.gguf)
  File "/home/cebtenzzre/src/forks/llama.cpp/gguf-py/gguf/gguf.py", line 831, in add_to_gguf
    handler(tokid)
  File "/home/cebtenzzre/src/forks/llama.cpp/gguf-py/gguf/gguf.py", line 756, in add_pad_token_id
    self.add_uint32(KEY_TOKENIZER_PAD_ID, id)
  File "/home/cebtenzzre/src/forks/llama.cpp/gguf-py/gguf/gguf.py", line 484, in add_uint32
    self.add_val(val, GGUFValueType.UINT32)
  File "/home/cebtenzzre/src/forks/llama.cpp/gguf-py/gguf/gguf.py", line 546, in add_val
    self.kv_data += struct.pack(pack_fmt, val)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^
struct.error: argument out of range
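
The failure is easy to reproduce in isolation: GGUF stores token IDs as unsigned 32-bit integers, and struct.pack rejects the -1 sentinel from config.json. A minimal sketch, assuming the little-endian '<I' format that gguf.py uses for UINT32 values:

import struct

# Token IDs are written as uint32; the -1 sentinel from config.json
# fails with the same error as the traceback above.
struct.pack("<I", -1)  # struct.error: argument out of range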

I didn't have this issue when I previously converted the model to GGML.

This only happens because tokenizer.json is not available. For instance, if you apply a LoRA to this model with TheBloke's script, you end up with a model folder that does contain tokenizer.json. Converting that folder displays these special tokens instead and does not fail:

gguf: Setting special token type bos to 1
gguf: Setting special token type eos to 2
gguf: Setting special token type unk to 0
@cebtenzzre
Collaborator Author

@henk717 Your model has an unusual config.json; you may want to look into that.

@KerfuffleV2
Collaborator

I assume pad_token_id: -1 just means "no pad token". Would be a pretty easy fix.

To make it work right now you should be able to just delete that line from the JSON file.
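
A minimal sketch of that fix, assuming a hypothetical wrapper around the add_*_token_id writer methods seen in the traceback (names are illustrative, not the actual gguf.py code):

def set_special_token_id(writer, name: str, token_id) -> None:
    # Treat a missing or negative ID (e.g. pad_token_id: -1) as
    # "token not set" instead of packing it as uint32 and crashing.
    if token_id is None or token_id < 0:
        print(f"gguf: Skipping special token type {name}: unset")
        return
    # Dispatches to e.g. writer.add_pad_token_id(token_id)
    getattr(writer, f"add_{name}_token_id")(token_id)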

@henk717

henk717 commented Sep 3, 2023

@cebtenzzre my model's config file is a copy of https://huggingface.co/jondurbin/airoboros-33b-gpt4-1.4/blob/main/config.json

@KerfuffleV2
Collaborator

@henk717 This issue should be fixed if you update to master.

@cebtenzzre
Collaborator Author

cebtenzzre commented Oct 11, 2023

I converted a copy of Airochronos (a similar merge by henk) from GGML to GGUF using the current code and the --metadata-dir option. With the 'main' example, it seems to render the BOS token as '<unk>'. With oobabooga's UI, it generates no output regardless of prompt, likely indicating some confusion about the EOS token.

I think the values in config.json are completely wrong, and should actually not be silently accepted.

cebtenzzre reopened this Oct 11, 2023
@KerfuffleV2
Collaborator

There's no special metadata-handling code in the GGML to GGUF converter; it just delegates that part to convert.py. So you should get the same issue whether you're converting a GGML file with metadata specified or converting from scratch with convert.py.

Or are you saying this only happens with GGML to GGUF but not converting the model from the original files?

@cebtenzzre
Collaborator Author

cebtenzzre commented Oct 11, 2023

you should get the same issue whether you're converting a GGML file with metadata specified or just converting from scratch with convert.py.

I assume it happens in both cases. I don't think I ever tested the fix with convert.py and Chronoboros because I ended up changing the config.json to have the correct values. I forgot to do that this time, and ran into this issue again.

This is how llama.cpp prints the special tokens for this broken version of Airochronos:

llm_load_print_meta: BOS token = 0 '<unk>'
llm_load_print_meta: EOS token = 1 '<s>'
llm_load_print_meta: UNK token = 0 '<unk>'

@KerfuffleV2
Collaborator

Do you have a link for the model so I can look at the exact metadata? (I'm definitely not an expert on this, so it may or may not help. Most of my interaction with this stuff has just been trying to clean up existing code, which didn't really require understanding why things were set up the way they were.)

@cebtenzzre
Collaborator Author

Do you have a link for the model so I can look at the exact metadata?

https://huggingface.co/Henk717/airochronos-33B

@monatis
Collaborator

monatis commented Oct 12, 2023

It's also set to null in tokenizer_config.json.

In this case pad should be set to eos, I guess.

But I doubt this is something we're supposed to handle in convert.py; model converters should handle such cases themselves.
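
For reference, a sketch of that pad-to-eos fallback if convert.py were to handle it anyway (purely illustrative; config here is the parsed config.json):

# Hypothetical fallback: substitute EOS when pad is unset or -1.
pad_id = config.get("pad_token_id")
if pad_id is None or pad_id < 0:
    pad_id = config.get("eos_token_id")  # common HF convention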

@KerfuffleV2
Collaborator

Not a full answer, but I remember at some point changing the code to just treat id -1 like the field wasn't set. (Originally it would just crash on that.)

@cebtenzzre
Collaborator Author

cebtenzzre commented Oct 12, 2023

This model is only compatible with HF transformers because transformers reads the EOS token string from special_tokens_map.json and then looks the token up in the vocabulary to get its ID. Should we do that, too? Otherwise, I could just revert #2984 and blame the models.
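
A sketch of that lookup, assuming a token-string-to-ID vocabulary is already in hand (hypothetical helper, not existing convert.py code):

import json
from pathlib import Path
from typing import Optional

def eos_id_from_special_tokens_map(model_dir: Path, vocab: dict) -> Optional[int]:
    # Mimic HF transformers: read the EOS token *string* from
    # special_tokens_map.json, then resolve it through the vocabulary
    # instead of trusting eos_token_id in config.json.
    path = model_dir / "special_tokens_map.json"
    if not path.is_file():
        return None
    eos = json.loads(path.read_text(encoding="utf-8")).get("eos_token")
    if isinstance(eos, dict):  # may be an AddedToken-style dict
        eos = eos.get("content")
    return vocab.get(eos) if isinstance(eos, str) else None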

@KerfuffleV2
Collaborator

Otherwise, I could just revert #2984 and blame the models.

You probably don't want to simply revert it. The original behavior was to crash when trying to convert the model. If you want to refuse to convert models like that, friendlier handling with a reasonable error message would be better.

@cebtenzzre
Collaborator Author

cebtenzzre commented Oct 22, 2023

AFAIK eos_token_id from config.json is not even used outside of warn_if_padding_and_no_attention_mask, so we probably shouldn't use it at all. tokenizer_config.json appears to be the new place for it:
https://github.com/huggingface/transformers/blob/093848d3ccf3884caf048718b6bae833da0edb94/src/transformers/tokenization_utils_base.py#L1934-L1941
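
Along those lines, a sketch that reads the EOS token from tokenizer_config.json instead (hypothetical, for illustration):

import json

# Prefer tokenizer_config.json over config.json for special tokens;
# the entry may be a plain string or an AddedToken-style dict.
with open("tokenizer_config.json", encoding="utf-8") as f:
    eos = json.load(f).get("eos_token")
if isinstance(eos, dict):
    eos = eos.get("content")
print("eos_token:", eos)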
