convert: add support for Japanese Bert model #13830

Open. Wants to merge 5 commits into master.

Conversation

huydt84
Contributor

@huydt84 huydt84 commented May 27, 2025

This PR adds more support for Japanese-based models (especially BertJapanese) by:

  • Auto-installing fugashi[unidic-lite] when the model's tokenization method relies on MeCab
  • Only printing the "pre_tokenizer" content from tokenizer.json if the file exists
  • Letting the download_model function continue with the remaining files when one file doesn't exist (many BertJapanese models don't have tokenizer.json, which can disrupt the download process)
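The auto-install step described above could be sketched roughly as follows (a hypothetical helper; the function name and exact flow are illustrative, not the actual PR code):

```python
import importlib.util
import subprocess
import sys


def ensure_mecab_deps(tokenizer_config: dict) -> None:
    """Install fugashi[unidic-lite] on demand if the tokenizer relies on MeCab.

    `tokenizer_config` is assumed to be the parsed tokenizer_config.json,
    where BertJapanese models set word_tokenizer_type to "mecab".
    """
    if tokenizer_config.get("word_tokenizer_type") != "mecab":
        return  # nothing to do for non-MeCab tokenizers
    if importlib.util.find_spec("fugashi") is None:
        # install into the current interpreter's environment
        subprocess.check_call(
            [sys.executable, "-m", "pip", "install", "fugashi[unidic-lite]"]
        )
```

Checking `importlib.util.find_spec` first avoids invoking pip at all when the dependency is already present.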

@github-actions github-actions bot added the python python script changes label May 27, 2025
@huydt84 huydt84 requested a review from ngxson May 27, 2025 22:58
Collaborator

@ngxson ngxson left a comment

Keep in mind that other models also use the same script; try not to introduce destructive changes that may affect other models.

Comment on lines 153 to 161
except requests.HTTPError as e:
    if e.response.status_code == 404:
        logger.warning(f"URL not found: {url}")
    else:
        logger.error(f"HTTP error occurred when downloading {url}: {e}")
except requests.ConnectionError:
    logger.error(f"Connection error occurred when downloading {url}")
except Exception as e:
    logger.error(f"Unexpected error occurred when downloading {url}: {e}")
Collaborator

  1. This whole multiple-except chain can be a single except Exception as e. No need to over-engineer the error handling if you are only interested in logging it.
  2. The old code doesn't have this handling, so it simply terminates the script when there is an error. With this change, errors will be ignored, which I think is not the expected behavior.
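A collapsed version along the lines the reviewer suggests could look like this (a sketch only; `download_file` and its signature are illustrative, not the actual function in the script):

```python
import logging

import requests

logger = logging.getLogger(__name__)


def download_file(url: str, dest: str) -> bool:
    """Download url to dest; log and report failure instead of crashing."""
    try:
        resp = requests.get(url, timeout=30)
        resp.raise_for_status()  # raises HTTPError on 4xx/5xx
        with open(dest, "wb") as f:
            f.write(resp.content)
        return True
    except Exception as e:  # one catch-all is enough when we only log
        logger.error(f"Error downloading {url}: {e}")
        return False
```

Returning a success flag instead of silently swallowing the error lets the caller decide whether to skip the file or abort the whole script, which also addresses the second point.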

Contributor Author

Actually, I don't have access to many of the models in the list, so the script terminates every time I run it (unless I comment out the other models). The instructions at the beginning of the file state "Add a new model to the models list", which may confuse users.

What is your suggestion about this?

Collaborator

I usually just temporarily comment out all the other models and then run the script. But yes, having the ability to update only the newly added model would be a better approach. I will add it in another PR.
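A minimal way to filter the model list, assuming the plain list of {"name", "tokt", "repo"} dicts the script already uses (a hypothetical helper; #13847 implements its own mechanism):

```python
from __future__ import annotations


def select_models(models: list[dict], only: list[str] | None) -> list[dict]:
    """Return only the requested models by name, or all when no filter is given."""
    if not only:
        return models
    wanted = set(only)
    return [m for m in models if m["name"] in wanted]
```

Wired up to argparse positional arguments, this would let a contributor run the update script against just the model they added, instead of commenting out the rest of the list.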

Collaborator

For now, let's simply remove this change from this PR

Contributor Author

I removed this change.

I will add it in another PR

Thank you in advance!

if "ignore_merges" in cfg["model"]:
    logger.info("ignore_merges: " + json.dumps(cfg["model"]["ignore_merges"], indent=4))
# print the "pre_tokenizer" content from the tokenizer.json, if it exists
if os.path.isfile(f"models/tokenizers/{name}/tokenizer.json"):
Collaborator

This will alter the behavior of other models.

Instead, check for cfg["word_tokenizer_type"] == "mecab" and only skip this for that particular model.
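The suggested guard might look roughly like this (a sketch reusing identifiers from the snippet above; `log_pre_tokenizer` is a hypothetical wrapper, not the script's actual structure):

```python
import json
import logging
import os

logger = logging.getLogger(__name__)


def log_pre_tokenizer(name: str, cfg: dict) -> None:
    """Log the pre_tokenizer section of a model's tokenizer.json.

    Skip only MeCab-based tokenizers (e.g. BertJapanese), which often ship
    without a tokenizer.json; all other models keep the original behavior
    of reading the file unconditionally.
    """
    if cfg.get("word_tokenizer_type") == "mecab":
        return
    path = f"models/tokenizers/{name}/tokenizer.json"
    with open(path, encoding="utf-8") as f:
        tok = json.load(f)
    logger.info("pre_tokenizer: " + json.dumps(tok.get("pre_tokenizer"), indent=4))
```

Gating on the tokenizer config rather than on file existence means a genuinely missing tokenizer.json for any other model still fails loudly instead of being silently skipped.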

Contributor Author

I'm sorry. I just fixed that.

Comment on lines 25 to 26
import subprocess
import importlib.util
Collaborator

This can be removed.

Contributor Author

I removed it

@@ -117,17 +118,47 @@ class TOKENIZER_TYPE(IntEnum):
{"name": "glm4", "tokt": TOKENIZER_TYPE.BPE, "repo": "https://huggingface.co/THUDM/glm-4-9b-hf", },
{"name": "pixtral", "tokt": TOKENIZER_TYPE.BPE, "repo": "https://huggingface.co/mistral-community/pixtral-12b", },
{"name": "seed-coder", "tokt": TOKENIZER_TYPE.BPE, "repo": "https://huggingface.co/ByteDance-Seed/Seed-Coder-8B-Base", },
{"name": "ruri-large", "tokt": TOKENIZER_TYPE.WPM, "repo": "https://huggingface.co/cl-nagoya/ruri-large", },
Collaborator

If you add it here, you must also run the script so it updates convert_hf_to_gguf, and include that change in this PR.

Collaborator

And by the way, do we even have the C++ code to handle this? Is this already tested?

Contributor Author

I tested that model and similar models (ruri-*) locally for the embedding task, and it worked.

> If you add it here, you must also run the script so it updates convert_hf_to_gguf and include the change in this PR

I'm sorry. As I said before, I don't have access to many of the models in the list, so it's hard to run all of them to update convert_hf_to_gguf. Can you do that for me? If not, how do you think we should handle this (e.g. leave a comment noting that some Japanese models require vocab.txt)?

Collaborator

When #13847 is merged, you can run the script again, and this time it will only process the newly added model.
