convert: add support for Japanese Bert model #13830

Open. Wants to merge 5 commits into master.

Conversation

huydt84
Contributor

@huydt84 huydt84 commented May 27, 2025

This PR adds more support for Japanese-based models (especially BertJapanese) by:

  • Auto-installing fugashi[unidic-lite] when the model's tokenization method relies on MeCab
  • Only printing the "pre_tokenizer" content from tokenizer.json if the file exists
  • Letting the download_model function continue with the remaining files when one file doesn't exist (many BertJapanese models don't have tokenizer.json, which can disrupt the download process)
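The auto-install step described above could be sketched roughly as follows (a hypothetical helper; the function name and exact flow are illustrative, not the actual PR code):

```python
import importlib.util
import subprocess
import sys


def ensure_mecab_deps(tokenizer_config: dict) -> None:
    """Install fugashi[unidic-lite] on demand if the tokenizer relies on MeCab.

    `tokenizer_config` is assumed to be the parsed tokenizer_config.json,
    where BertJapanese models set word_tokenizer_type to "mecab".
    """
    if tokenizer_config.get("word_tokenizer_type") != "mecab":
        return  # nothing to do for non-MeCab tokenizers
    if importlib.util.find_spec("fugashi") is None:
        # install into the current interpreter's environment
        subprocess.check_call(
            [sys.executable, "-m", "pip", "install", "fugashi[unidic-lite]"]
        )
```

Checking `importlib.util.find_spec` first avoids invoking pip at all when the dependency is already present.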

@github-actions github-actions bot added the python python script changes label May 27, 2025
@huydt84 huydt84 requested a review from ngxson May 27, 2025 22:58
Collaborator

@ngxson ngxson left a comment

Keep in mind that other models also use the same script; try not to introduce destructive changes that may affect other models.

Comment on lines 153 to 161
except requests.HTTPError as e:
    if e.response.status_code == 404:
        logger.warning(f"URL not found: {url}")
    else:
        logger.error(f"HTTP error occurred when downloading {url}: {e}")
except requests.ConnectionError:
    logger.error(f"Connection error occurred when downloading {url}")
except Exception as e:
    logger.error(f"Unexpected error occurred when downloading {url}: {e}")
Collaborator

  1. This whole multiple-except chain can be a single except Exception as e. No need to over-engineer the error handling if you are only interested in logging it.
  2. The old code doesn't have this handling, so it simply terminates the script when there is an error. With this change, errors will be ignored, which I think is not the expected behavior.
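A collapsed version along the lines the reviewer suggests could look like this (a sketch only; `download_file` and its signature are illustrative, not the actual function in the script):

```python
import logging

import requests

logger = logging.getLogger(__name__)


def download_file(url: str, dest: str) -> bool:
    """Download url to dest; log and report failure instead of crashing."""
    try:
        resp = requests.get(url, timeout=30)
        resp.raise_for_status()  # raises HTTPError on 4xx/5xx
        with open(dest, "wb") as f:
            f.write(resp.content)
        return True
    except Exception as e:  # one catch-all is enough when we only log
        logger.error(f"Error downloading {url}: {e}")
        return False
```

Returning a success flag instead of silently swallowing the error lets the caller decide whether to skip the file or abort the whole script, which also addresses the second point.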

Contributor Author

Actually, I don't have access to many of the models in the list, so the script terminates every time I run it (unless I comment out the other models). The instructions at the beginning of the file state "Add a new model to the models list", which may confuse users.

What is your suggestion about this?

Collaborator

I usually just temporarily comment out all the other models and then run the script. But yes, having the ability to update only the newly added model would be a better approach. I will add it in another PR.
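A minimal way to filter the model list, assuming the plain list of {"name", "tokt", "repo"} dicts the script already uses (a hypothetical helper; #13847 implements its own mechanism):

```python
from __future__ import annotations


def select_models(models: list[dict], only: list[str] | None) -> list[dict]:
    """Return only the requested models by name, or all when no filter is given."""
    if not only:
        return models
    wanted = set(only)
    return [m for m in models if m["name"] in wanted]
```

Wired up to argparse positional arguments, this would let a contributor run the update script against just the model they added, instead of commenting out the rest of the list.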

Collaborator

For now, let's simply remove this change from this PR

Contributor Author

I removed this change.

I will add it in another PR

Thank you in advance!

if "ignore_merges" in cfg["model"]:
    logger.info("ignore_merges: " + json.dumps(cfg["model"]["ignore_merges"], indent=4))
# print the "pre_tokenizer" content from the tokenizer.json, if it exists
if os.path.isfile(f"models/tokenizers/{name}/tokenizer.json"):
Collaborator

This will alter the behavior of other models.

Instead, check for cfg["word_tokenizer_type"] == "mecab" and only skip this for that particular model.
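The suggested guard might look roughly like this (a sketch reusing identifiers from the snippet above; `log_pre_tokenizer` is a hypothetical wrapper, not the script's actual structure):

```python
import json
import logging
import os

logger = logging.getLogger(__name__)


def log_pre_tokenizer(name: str, cfg: dict) -> None:
    """Log the pre_tokenizer section of a model's tokenizer.json.

    Skip only MeCab-based tokenizers (e.g. BertJapanese), which often ship
    without a tokenizer.json; all other models keep the original behavior
    of reading the file unconditionally.
    """
    if cfg.get("word_tokenizer_type") == "mecab":
        return
    path = f"models/tokenizers/{name}/tokenizer.json"
    with open(path, encoding="utf-8") as f:
        tok = json.load(f)
    logger.info("pre_tokenizer: " + json.dumps(tok.get("pre_tokenizer"), indent=4))
```

Gating on the tokenizer config rather than on file existence means a genuinely missing tokenizer.json for any other model still fails loudly instead of being silently skipped.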

Contributor Author

I'm sorry. I just fixed that.

Comment on lines 25 to 26
import subprocess
import importlib.util
Collaborator

This can be removed.

Contributor Author

I removed it

@@ -117,17 +118,47 @@ class TOKENIZER_TYPE(IntEnum):
{"name": "glm4", "tokt": TOKENIZER_TYPE.BPE, "repo": "https://huggingface.co/THUDM/glm-4-9b-hf", },
{"name": "pixtral", "tokt": TOKENIZER_TYPE.BPE, "repo": "https://huggingface.co/mistral-community/pixtral-12b", },
{"name": "seed-coder", "tokt": TOKENIZER_TYPE.BPE, "repo": "https://huggingface.co/ByteDance-Seed/Seed-Coder-8B-Base", },
{"name": "ruri-large", "tokt": TOKENIZER_TYPE.WPM, "repo": "https://huggingface.co/cl-nagoya/ruri-large", },
Collaborator

If you add it here, you must also run the script so it updates convert_hf_to_gguf, and include that change in this PR.

Collaborator

And by the way, do we even have the C++ code to handle this? Is this already tested?

Contributor Author

I tested that model and similar models (ruri-*) locally for the embedding task, and it worked.

> If you add it here, you must also run the script so it updates convert_hf_to_gguf and include the change in this PR

I'm sorry. As I said before, I don't have access to many of the models in the list, so it's hard to run all of them to update convert_hf_to_gguf. Can you do that for me? If not, how do you think we should handle this (e.g. leave a comment noting that some Japanese models require vocab.txt)?

Collaborator

When #13847 is merged, you can run the script again, and this time it will only process the newly added model.
