Issues loading tokenizer/Support loading tokenizer.model? #239

Closed · bianchidotdev opened this issue Sep 8, 2023 · 5 comments
@bianchidotdev

I'm having issues loading certain models on Hugging Face that might largely be an issue with those repos rather than Bumblebee.

What I'm seeing:

{:ok, tokenizer} = Bumblebee.load_tokenizer({:hf, "openlm-research/open_llama_3b_v2"})

** (MatchError) no match of right hand side value: {:error, "file not found"}

It looks like it's failing while searching for a tokenizer.json. Unfortunately, the Hugging Face repo ships only a tokenizer.model and related config files, not a tokenizer.json, and it appears quite a few models on Hugging Face follow suit.

I'm not sure what the effort would be to support loading the model directly or if there are other ways around this.

@jonatanklosko
Member

Usually we look for a different repository that uses the same tokenizer and has tokenizer.json. In this case you can try yhyhy3/open_llama_7b_v2_med_instruct, which is a fine-tuned version of the original repo and likely uses the same tokenizer.
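
For example (a quick sketch, assuming that repo's tokenizer really does match the original):

{:ok, tokenizer} = Bumblebee.load_tokenizer({:hf, "yhyhy3/open_llama_7b_v2_med_instruct"})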

According to this paragraph, the "fast tokenizer" (dumped/loaded from tokenizer.json) used to give wrong results, but this seems to have been resolved in huggingface/transformers#24233.

We can send a PR to the hf repo with the tokenizer file, as we did for a couple of repos in the past, so I will keep this open :)

@bianchidotdev
Author

Thanks a lot! I was looking pretty hard for a 1B or 3B model to test with on my laptop, since I don't really have the memory needed to run a 7B+ model, but that makes sense.

For my own reference and usage, is generating a tokenizer.json as simple as the following?

# python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("<model>")
tokenizer.save_pretrained("<dir_to_save_to>")

Do you have any sense of how much work it would be to handle the model file natively in Elixir?

@jonatanklosko
Member

@bianchidotdev this is precisely it! When you call AutoTokenizer.from_pretrained, it fetches the vocab/config/merges files and creates a "slow tokenizer", then transformers attempts to convert it to a fast tokenizer if possible. If the conversion works, the result is a "fast tokenizer", and save_pretrained dumps it into tokenizer.json, which is the file we rely on.
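
For reference, here's a minimal sketch of that flow (the repo id and output directory are just examples; is_fast reports whether the conversion to a fast tokenizer succeeded):

# python
from transformers import AutoTokenizer

# Fetches the vocab/config files, builds a slow tokenizer, and converts it
# to a fast tokenizer when possible.
tokenizer = AutoTokenizer.from_pretrained("openlm-research/open_llama_3b_v2")

# True if the slow -> fast conversion succeeded.
print(tokenizer.is_fast)

# For a fast tokenizer this writes tokenizer.json (among other files),
# which is the file Bumblebee relies on.
tokenizer.save_pretrained("open_llama_3b_v2_tokenizer")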

If you want to open a PR on the HF repos, here's an example; just make sure you have the latest transformers installed locally before doing the conversion. No pressure though, I can also do it later :)

@jonatanklosko
Member

@bianchidotdev I opened a PR while testing a new conversion tool and I noticed you opened one already, thanks!

FTR, you don't have to wait for the PR to be merged; you can reference the PR commit directly:

{:ok, tokenizer} =
  Bumblebee.load_tokenizer(
    {:hf, "openlm-research/open_llama_3b_v2",
     revision: "52944fc4e35e6ca00e733b95df79498728016e1d"}
  )

@jonatanklosko
Member

Also, I improved the error messages in #256, so it will be clear why the tokenizer cannot be loaded. And we have a new section in the README with actions the user may take :)
