convert: add support for Japanese Bert model #13830
base: master
Conversation
Keep in mind that other models also use the same script, try not to introduce destructive changes that may affect other models.
convert_hf_to_gguf_update.py
Outdated
except requests.HTTPError as e:
    if e.response.status_code == 404:
        logger.warning(f"URL not found: {url}")
    else:
        logger.error(f"HTTP error occurred when downloading {url}: {e}")
except requests.ConnectionError:
    logger.error(f"Connection error occurred when downloading {url}")
except Exception as e:
    logger.error(f"Unexpected error occurred when downloading {url}: {e}")
- This whole series of multiple except clauses can be just one single except Exception as e. No need to over-engineer the error handling if you are only interested in logging it.
- The old code doesn't have this handling, so it simply terminates the script if there is an error. Now with this change, errors will be ignored. I think this is not the expected behavior.
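The consolidation suggested above could look like the sketch below. It uses the stdlib urllib instead of requests so the snippet is self-contained, and the function name and signature are illustrative, not the script's actual API. Re-raising after logging preserves the old terminate-on-error behavior the reviewer wants to keep.

```python
import logging
import urllib.request

logger = logging.getLogger(__name__)


def download_file(url: str, dest: str) -> None:
    # One handler is enough when the only goal is logging; re-raising
    # keeps the script's original behavior of stopping on a failed download
    # instead of silently ignoring the error.
    try:
        with urllib.request.urlopen(url) as resp, open(dest, "wb") as f:
            f.write(resp.read())
    except Exception as e:
        logger.error(f"Error downloading {url}: {e}")
        raise
```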
Actually I don't have access to many models in the list, so the script terminates every time I run it (unless I comment out the other models). The instructions at the beginning of the file state "Add a new model to the "models" list", which may confuse users.
What is your suggestion about this?
I usually just temporarily comment out all the other models and then run the script. But yes, having the ability to update only the newly added model would be a better approach. I will add it in another PR.
For now, let's simply remove this change from this PR
I removed this change.
I will add it in another PR
Thank you in advance!
convert_hf_to_gguf_update.py
Outdated
if "ignore_merges" in cfg["model"]: | ||
logger.info("ignore_merges: " + json.dumps(cfg["model"]["ignore_merges"], indent=4)) | ||
# print the "pre_tokenizer" content from the tokenizer.json, if exists | ||
if os.path.isfile(f"models/tokenizers/{name}/tokenizer.json"): |
This will alter the behavior of other models. Instead, check for cfg["word_tokenizer_type"] == "mecab" and only skip this for that particular model.
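The guard suggested above could be sketched as follows. The helper name is hypothetical, and the assumption (from the discussion) is that BertJapanese-style tokenizer configs carry "word_tokenizer_type": "mecab" while other models' configs do not.

```python
def should_print_pre_tokenizer(cfg: dict) -> bool:
    # MeCab-based tokenizers (e.g. BertJapanese) typically ship no
    # tokenizer.json, so skip the pre_tokenizer dump only for them,
    # leaving every other model's behavior unchanged.
    return cfg.get("word_tokenizer_type") != "mecab"
```

Using .get() keeps the check safe for configs that omit the key entirely, which is the common case for non-Japanese models.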
I'm sorry. I just fixed that.
convert_hf_to_gguf_update.py
Outdated
import subprocess
import importlib.util
this can be removed
I removed it
@@ -117,17 +118,47 @@ class TOKENIZER_TYPE(IntEnum):
     {"name": "glm4", "tokt": TOKENIZER_TYPE.BPE, "repo": "https://huggingface.co/THUDM/glm-4-9b-hf", },
     {"name": "pixtral", "tokt": TOKENIZER_TYPE.BPE, "repo": "https://huggingface.co/mistral-community/pixtral-12b", },
     {"name": "seed-coder", "tokt": TOKENIZER_TYPE.BPE, "repo": "https://huggingface.co/ByteDance-Seed/Seed-Coder-8B-Base", },
+    {"name": "ruri-large", "tokt": TOKENIZER_TYPE.WPM, "repo": "https://huggingface.co/cl-nagoya/ruri-large", },
If you add it here, you must also run the script so that it updates convert_hf_to_gguf, and include that change in this PR.
And btw, do we even have the C++ code to handle this? Is this already tested?
I tested that model and similar models (ruri-*) locally for the embedding task, and it worked.

If you add it here, you must also run the script so it updates convert_hf_to_gguf and include the change in this PR

I'm sorry. About this, like I said before, I don't have access to many models in the list, so it's hard to run all the listed models to update convert_hf_to_gguf. Can you do that for me? If not, how do you think we should handle this (e.g. leave a comment noting that some Japanese models require vocab.txt)?
When #13847 is merged, you can run the script again and this time it will only process the newly added model
This PR adds more support for Japanese-based models (especially BertJapanese) via:
- installing fugashi[unidic-lite] when the model's tokenization method relies on MeCab
- letting the download_model function fall back to other files when one file doesn't exist (many BertJapanese models don't have tokenizer.json, which can disrupt the download process)
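The fallback behavior described in the second point could be sketched like this. The function name, candidate file list, and URL layout are illustrative assumptions, not the PR's actual code; the point is that a missing file (HTTP 404) is skipped instead of aborting the whole download.

```python
import logging
import os
import urllib.error
import urllib.request

logger = logging.getLogger(__name__)

# Illustrative candidates: BertJapanese models often ship vocab.txt
# but no tokenizer.json.
CANDIDATE_FILES = ["tokenizer.json", "vocab.txt", "tokenizer_config.json"]


def download_tokenizer_files(repo_url: str, out_dir: str) -> list:
    """Download whichever candidate files exist; skip missing ones."""
    os.makedirs(out_dir, exist_ok=True)
    saved = []
    for name in CANDIDATE_FILES:
        url = f"{repo_url}/resolve/main/{name}"
        try:
            with urllib.request.urlopen(url) as resp:
                data = resp.read()
        except urllib.error.HTTPError as e:
            if e.code == 404:
                # File absent in this repo: continue with the next
                # candidate instead of terminating the download.
                logger.warning(f"not found, skipping: {url}")
                continue
            raise
        path = os.path.join(out_dir, name)
        with open(path, "wb") as f:
            f.write(data)
        saved.append(path)
    return saved
```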