Issues loading tokenizer/Support loading tokenizer.model? #239

Closed · bianchidotdev opened this issue Sep 8, 2023 · 5 comments
@bianchidotdev

I'm having issues loading certain models on Hugging Face that might largely be an issue with those repos rather than Bumblebee.

What I'm seeing:

{:ok, tokenizer} = Bumblebee.load_tokenizer({:hf, "openlm-research/open_llama_3b_v2"})

** (MatchError) no match of right hand side value: {:error, "file not found"}

It looks like it's failing while searching for a tokenizer.json. Unfortunately, the Hugging Face repo ships only a tokenizer.model and related config files, not a tokenizer.json, and it appears quite a few models on Hugging Face follow suit.

I'm not sure what the effort would be to support loading the model directly or if there are other ways around this.

@jonatanklosko
Member

Usually we look for a different repository that uses the same tokenizer and has tokenizer.json. In this case you can try yhyhy3/open_llama_7b_v2_med_instruct, which is a fine-tuned version of the original repo and likely uses the same tokenizer.
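
For example (a quick sketch, assuming that repo's tokenizer really does match the original):

{:ok, tokenizer} = Bumblebee.load_tokenizer({:hf, "yhyhy3/open_llama_7b_v2_med_instruct"})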

According to this paragraph, the "fast tokenizer" (dumped/loaded from tokenizer.json) used to give wrong results, but this seems to have been resolved in huggingface/transformers#24233.

We can send a PR to the hf repo with the tokenizer file, as we did for a couple of repos in the past, so I will keep this open :)

@bianchidotdev
Author

Thanks a lot! I was looking pretty hard for a 1B or 3B model to test with on my laptop, since I don't really have the memory needed to run a 7B+ model, but that makes sense.

For my own reference and usage, is generating a tokenizer.json as simple as the following?

# python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("<model>")
tokenizer.save_pretrained("<dir_to_save_to>")

Do you have any sense of how much work it would be to handle the model file natively in Elixir?

@jonatanklosko
Member

@bianchidotdev this is precisely it! When you call AutoTokenizer.from_pretrained, it fetches the vocab/config/merges files and creates a "slow tokenizer", then transformers attempts to convert it to a fast tokenizer if possible. If the conversion works, the result is a "fast tokenizer", and save_pretrained dumps it into tokenizer.json, which is the file we rely on.
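
For reference, here's a minimal sketch of that flow (the repo id and output directory are just examples; is_fast reports whether the conversion to a fast tokenizer succeeded):

# python
from transformers import AutoTokenizer

# Fetches the vocab/config files, builds a slow tokenizer, and converts it
# to a fast tokenizer when possible.
tokenizer = AutoTokenizer.from_pretrained("openlm-research/open_llama_3b_v2")

# True if the slow -> fast conversion succeeded.
print(tokenizer.is_fast)

# For a fast tokenizer this writes tokenizer.json (among other files),
# which is the file Bumblebee relies on.
tokenizer.save_pretrained("open_llama_3b_v2_tokenizer")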

If you want to open a PR on the HF repos, here's an example; just make sure you have the latest transformers installed locally before doing the conversion. No pressure though, I can also do it later :)

@jonatanklosko
Member

@bianchidotdev I opened a PR while testing a new conversion tool and I noticed you opened one already, thanks!

FTR, you don't have to wait for the PR to be merged; you can reference the PR commit directly:

{:ok, tokenizer} =
  Bumblebee.load_tokenizer(
    {:hf, "openlm-research/open_llama_3b_v2",
     revision: "52944fc4e35e6ca00e733b95df79498728016e1d"}
  )

@jonatanklosko
Member

Also, I improved the error messages in #256, so it will be clear why the tokenizer cannot be loaded. And we have a new section in the README with actions the user may take :)
