
Multilingual Issue #49

Closed
hahmyg opened this issue Nov 21, 2018 · 1 comment
hahmyg commented Nov 21, 2018

Dear authors,
I have two questions.

First, how can I use the multilingual pre-trained BERT model in PyTorch? Do I just download the model files to $BERT_BASE_DIR?

Second, a tokenization issue: for Chinese and Japanese the tokenizer seems to work, but for Korean it gives a different result than I expected:

import torch
from pytorch_pretrained_bert import BertTokenizer, BertModel, BertForMaskedLM
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
text = "안녕하세요"
tokenized_text = tokenizer.tokenize(text)
print(tokenized_text)

['ᄋ', '##ᅡ', '##ᆫ', '##ᄂ', '##ᅧ', '##ᆼ', '##ᄒ', '##ᅡ', '##ᄉ', '##ᅦ', '##ᄋ', '##ᅭ']

The output is decomposed into Hangul Jamo (sub-character components) rather than whole syllable characters.
Maybe this comes from a Unicode issue. (I expected ['안녕', '##하세요'].)
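
A minimal check of this Unicode hypothesis (plain Python, no BERT involved; unicodedata is in the standard library):

import unicodedata

text = "안녕하세요"
# NFD normalization splits each precomposed Hangul syllable into its
# conjoining Jamo (leading consonant, vowel, trailing consonant).
decomposed = unicodedata.normalize("NFD", text)
print(list(decomposed))
# -> ['ᄋ', 'ᅡ', 'ᆫ', 'ᄂ', 'ᅧ', 'ᆼ', 'ᄒ', 'ᅡ', 'ᄉ', 'ᅦ', 'ᄋ', 'ᅭ']

These are exactly the pieces in the output above (minus the ## continuation markers), so the tokenizer presumably applies NFD somewhere, e.g. as part of accent stripping.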

thomwolf (Member) commented:
Hi, you can use the multilingual model as indicated in the readme with the commands:

tokenizer = BertTokenizer.from_pretrained('bert-base-multilingual')
model = BertModel.from_pretrained('bert-base-multilingual')

This will load the multilingual vocabulary (which should contain Korean), which your command was not loading.
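
Putting the fix together, a minimal sketch (the checkpoint name follows the comment above; later releases publish it as 'bert-base-multilingual-cased' / 'bert-base-multilingual-uncased' instead, so treat the exact name as version-dependent):

import torch
from pytorch_pretrained_bert import BertTokenizer, BertModel

# Load the multilingual vocabulary and weights instead of 'bert-base-uncased';
# the name below is the one given in the reply and may differ by library version.
tokenizer = BertTokenizer.from_pretrained('bert-base-multilingual')
model = BertModel.from_pretrained('bert-base-multilingual')

text = "안녕하세요"
tokens = tokenizer.tokenize(text)              # subword pieces from the multilingual vocab
ids = tokenizer.convert_tokens_to_ids(tokens)  # map pieces to vocabulary indices
print(tokens, ids)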
