Multilingual Issue #49
Comments
Hi, you can use the multilingual model as indicated in the readme with the commands:
tokenizer = BertTokenizer.from_pretrained('bert-base-multilingual')
model = BertModel.from_pretrained('bert-base-multilingual')
This will load the multilingual vocabulary (which should contain Korean) that your command was not loading.
Dear authors,
I have two questions.
First, how can I use the multilingual pre-trained BERT in PyTorch?
Is it enough to download the model to $BERT_BASE_DIR?
Second, there is a tokenization issue.
For Chinese and Japanese the tokenizer seems to work, but for Korean it gives a different result from what I expected:
['ᄋ', '##ᅡ', '##ᆫ', '##ᄂ', '##ᅧ', '##ᆼ', '##ᄒ', '##ᅡ', '##ᄉ', '##ᅦ', '##ᄋ', '##ᅭ']
The result is split not by 'character' but by 'byte-based character'.
Maybe this comes from a Unicode issue. (I expected ['안녕', '##하세요'].)
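The split above has the same shape as a Unicode NFD decomposition of the input, which breaks each precomposed Hangul syllable into its conjoining jamo. The sketch below uses only Python's standard `unicodedata` module to reproduce that shape. It assumes the input was "안녕하세요" (inferred from the expected tokens). Whether NFD normalization is the actual cause here is an assumption and is not confirmed anywhere in this thread.

```python
import unicodedata

# Assumed input, inferred from the expected tokens ['안녕', '##하세요'].
text = "안녕하세요"

# NFD splits each precomposed Hangul syllable into its conjoining jamo.
decomposed = unicodedata.normalize("NFD", text)
print(len(text), len(decomposed))  # number of code points before and after
print(list(decomposed))            # one jamo per list element

# NFC recomposes the jamo back into syllable blocks.
recomposed = unicodedata.normalize("NFC", decomposed)
print(recomposed == text)
```

If the decomposed list matches the tokens above once the `##` prefixes are dropped, that would point to the text being decomposed somewhere before the vocabulary lookup rather than to a byte-level problem.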