pad_token_id of -1 is not understood by convert.py #2981

Closed
cebtenzzre opened this issue Sep 3, 2023 · 14 comments · Fixed by #2984 or #4489

@cebtenzzre
Collaborator

I tried to convert Chronoboros-33B to GGUF with convert.py (commit 3358c38), and I got this error:

gguf: Setting special token type bos to 0
gguf: Setting special token type eos to 1
gguf: Setting special token type pad to -1
Traceback (most recent call last):
  File "/home/cebtenzzre/src/forks/llama.cpp/./convert.py", line 1201, in <module>
    main()
  File "/home/cebtenzzre/src/forks/llama.cpp/./convert.py", line 1196, in main
    OutputFile.write_all(outfile, ftype, params, model, vocab, special_vocab, concurrency = args.concurrency)
  File "/home/cebtenzzre/src/forks/llama.cpp/./convert.py", line 922, in write_all
    of.add_meta_special_vocab(svocab)
  File "/home/cebtenzzre/src/forks/llama.cpp/./convert.py", line 866, in add_meta_special_vocab
    svocab.add_to_gguf(self.gguf)
  File "/home/cebtenzzre/src/forks/llama.cpp/gguf-py/gguf/gguf.py", line 831, in add_to_gguf
    handler(tokid)
  File "/home/cebtenzzre/src/forks/llama.cpp/gguf-py/gguf/gguf.py", line 756, in add_pad_token_id
    self.add_uint32(KEY_TOKENIZER_PAD_ID, id)
  File "/home/cebtenzzre/src/forks/llama.cpp/gguf-py/gguf/gguf.py", line 484, in add_uint32
    self.add_val(val, GGUFValueType.UINT32)
  File "/home/cebtenzzre/src/forks/llama.cpp/gguf-py/gguf/gguf.py", line 546, in add_val
    self.kv_data += struct.pack(pack_fmt, val)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^
struct.error: argument out of range
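
The failure is easy to reproduce in isolation: GGUF stores token IDs as unsigned 32-bit integers, and struct.pack rejects the -1 sentinel from config.json. A minimal sketch, assuming the little-endian '<I' format that gguf.py uses for UINT32 values:

import struct

# Token IDs are written as uint32; the -1 sentinel from config.json
# fails with the same error as the traceback above.
struct.pack("<I", -1)  # struct.error: argument out of range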

I didn't have this issue when I previously converted the model to GGML.

This only happens because tokenizer.json is not available. For instance, if you apply a LoRA to this model with TheBloke's script, you end up with a model folder that does contain tokenizer.json. Converting that folder displays these special tokens instead and does not fail:

gguf: Setting special token type bos to 1
gguf: Setting special token type eos to 2
gguf: Setting special token type unk to 0
@cebtenzzre
Collaborator Author

@henk717 Your model has an unusual config.json; you may want to look into that.

@KerfuffleV2
Collaborator

I assume pad_token_id: -1 just means "no pad token". Would be a pretty easy fix.

To make it work right now you should be able to just delete that line from the JSON file.
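
A minimal sketch of that fix, assuming a hypothetical wrapper around the add_*_token_id writer methods seen in the traceback (names are illustrative, not the actual gguf.py code):

def set_special_token_id(writer, name: str, token_id) -> None:
    # Treat a missing or negative ID (e.g. pad_token_id: -1) as
    # "token not set" instead of packing it as uint32 and crashing.
    if token_id is None or token_id < 0:
        print(f"gguf: Skipping special token type {name}: unset")
        return
    # Dispatches to e.g. writer.add_pad_token_id(token_id)
    getattr(writer, f"add_{name}_token_id")(token_id)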

@henk717

henk717 commented Sep 3, 2023

@cebtenzzre my model's config file is a copy of https://huggingface.co/jondurbin/airoboros-33b-gpt4-1.4/blob/main/config.json

@KerfuffleV2
Collaborator

@henk717 This issue should be fixed if you update to master.

@cebtenzzre
Collaborator Author

cebtenzzre commented Oct 11, 2023

I converted a copy of Airochronos (a similar merge by henk) from GGML to GGUF using the current code and the --metadata-dir option. With the 'main' example, it seems to render the BOS token as '<unk>'. With oobabooga's UI, it generates no output regardless of prompt, likely indicating some confusion about the EOS token.

I think the values in config.json are completely wrong, and should actually not be silently accepted.

cebtenzzre reopened this Oct 11, 2023
@KerfuffleV2
Collaborator

There's no special metadata-handling code in the GGML to GGUF converter; it just delegates that part to convert.py. So you should get the same issue whether you're converting a GGML file with metadata specified or converting from scratch with convert.py.

Or are you saying this only happens with GGML to GGUF but not converting the model from the original files?

@cebtenzzre
Collaborator Author

cebtenzzre commented Oct 11, 2023

you should get the same issue whether you're converting a GGML file with metadata specified or just converting from scratch with convert.py.

I assume it happens in both cases. I don't think I ever tested the fix with convert.py and Chronoboros because I ended up changing the config.json to have the correct values. I forgot to do that this time, and ran into this issue again.

This is how llama.cpp prints the special tokens for this broken version of Airochronos:

llm_load_print_meta: BOS token = 0 '<unk>'
llm_load_print_meta: EOS token = 1 '<s>'
llm_load_print_meta: UNK token = 0 '<unk>'

@KerfuffleV2
Collaborator

Do you have a link for the model so I can look at the exact metadata? (I'm definitely not an expert on this, so it may or may not help. Most of my interaction with this stuff has just been trying to clean up existing code, which didn't really require understanding why things were set up the way they were.)

@cebtenzzre
Collaborator Author

Do you have a link for the model so I can look at the exact metadata?

https://huggingface.co/Henk717/airochronos-33B

@monatis
Collaborator

monatis commented Oct 12, 2023

It's also set to null in tokenizer_config.json.

In this case pad should be set to eos, I guess.

But I doubt this is something we're supposed to handle in convert.py; model converters should handle such cases themselves.
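
For reference, a sketch of that pad-to-eos fallback if convert.py were to handle it anyway (purely illustrative; config here is the parsed config.json):

# Hypothetical fallback: substitute EOS when pad is unset or -1.
pad_id = config.get("pad_token_id")
if pad_id is None or pad_id < 0:
    pad_id = config.get("eos_token_id")  # common HF convention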

@KerfuffleV2
Collaborator

Not a full answer, but I remember at some point changing the code to just treat id -1 like the field wasn't set. (Originally it would just crash on that.)

@cebtenzzre
Collaborator Author

cebtenzzre commented Oct 12, 2023

This model is only compatible with HF transformers because transformers reads the EOS token string from special_tokens_map.json and then looks the token up in the vocabulary to get its ID. Should we do that, too? Otherwise, I could just revert #2984 and blame the models.
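
A sketch of that lookup, assuming a token-string-to-ID vocabulary is already in hand (hypothetical helper, not existing convert.py code):

import json
from pathlib import Path
from typing import Optional

def eos_id_from_special_tokens_map(model_dir: Path, vocab: dict) -> Optional[int]:
    # Mimic HF transformers: read the EOS token *string* from
    # special_tokens_map.json, then resolve it through the vocabulary
    # instead of trusting eos_token_id in config.json.
    path = model_dir / "special_tokens_map.json"
    if not path.is_file():
        return None
    eos = json.loads(path.read_text(encoding="utf-8")).get("eos_token")
    if isinstance(eos, dict):  # may be an AddedToken-style dict
        eos = eos.get("content")
    return vocab.get(eos) if isinstance(eos, str) else None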

@KerfuffleV2
Collaborator

Otherwise, I could just revert #2984 and blame the models.

You probably don't want to simply revert it. The original behavior was to crash when trying to convert the model. If you want to refuse to convert models like that, friendlier handling with a reasonable error message would be better.

@cebtenzzre
Collaborator Author

cebtenzzre commented Oct 22, 2023

AFAIK eos_token_id from config.json is not even used outside of warn_if_padding_and_no_attention_mask, so we probably shouldn't use it at all. tokenizer_config.json appears to be the new place for it:
https://github.com/huggingface/transformers/blob/093848d3ccf3884caf048718b6bae833da0edb94/src/transformers/tokenization_utils_base.py#L1934-L1941
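
Along those lines, a sketch that reads the EOS token from tokenizer_config.json instead (hypothetical, for illustration):

import json

# Prefer tokenizer_config.json over config.json for special tokens;
# the entry may be a plain string or an AddedToken-style dict.
with open("tokenizer_config.json", encoding="utf-8") as f:
    eos = json.load(f).get("eos_token")
if isinstance(eos, dict):
    eos = eos.get("content")
print("eos_token:", eos)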
