Can not find vocabulary file for Chinese model #34

zlinao · 2018-11-18T14:33:58Z

After converting the TF model to a PyTorch model, I ran a classification task on a new Chinese dataset, but got this:

CUDA_VISIBLE_DEVICES=3 python run_classifier.py --task_name weibo --do_eval --do_train --bert_model chinese_L-12_H-768_A-12 --max_seq_length 128 --train_batch_size 32 --learning_rate 2e-5 --num_train_epochs 3.0 --output_dir bert_result

11/18/2018 21:56:59 - INFO - __main__ - device cuda n_gpu 1 distributed training False
11/18/2018 21:56:59 - INFO - pytorch_pretrained_bert.tokenization - loading vocabulary file chinese_L-12_H-768_A-12
Traceback (most recent call last):
  File "run_classifier.py", line 661, in <module>
    main()
  File "run_classifier.py", line 508, in main
    tokenizer = BertTokenizer.from_pretrained(args.bert_model)
  File "/home/lin/jpmorgan/pytorch-pretrained-BERT/pytorch_pretrained_bert/tokenization.py", line 141, in from_pretrained
    tokenizer = cls(resolved_vocab_file, do_lower_case)
  File "/home/lin/jpmorgan/pytorch-pretrained-BERT/pytorch_pretrained_bert/tokenization.py", line 94, in __init__
    "model use tokenizer = BertTokenizer.from_pretrained(PRETRAINED_MODEL_NAME)".format(vocab_file))
ValueError: Can't find a vocabulary file at path 'chinese_L-12_H-768_A-12'. To load the vocabulary from a Google pretrained model use tokenizer = BertTokenizer.from_pretrained(PRETRAINED_MODEL_NAME)

zlinao · 2018-11-19T03:20:09Z

Solved: the path to vocab.txt itself (not just the model directory) has to be passed in:
tokenizer = BertTokenizer.from_pretrained(args.bert_model)
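
A minimal sketch of that workaround, assuming the converted model directory contains the vocab.txt file shipped with Google's checkpoint. The helper names resolve_vocab_path and build_tokenizer are hypothetical and not part of pytorch_pretrained_bert; the import path is the one shown in the traceback above.

```python
import os

VOCAB_FILENAME = "vocab.txt"  # assumed file name, as referred to in this thread


def resolve_vocab_path(bert_model):
    """Return the path to vocab.txt when `bert_model` is a local directory.

    Anything else (a shortcut name such as 'bert-base-chinese', or a path
    that already points at a file) is returned unchanged.
    """
    if os.path.isdir(bert_model):
        return os.path.join(bert_model, VOCAB_FILENAME)
    return bert_model


def build_tokenizer(bert_model):
    # Imported lazily so resolve_vocab_path can be used without the package installed.
    from pytorch_pretrained_bert.tokenization import BertTokenizer

    return BertTokenizer.from_pretrained(resolve_vocab_path(bert_model))
```

With this, build_tokenizer("chinese_L-12_H-768_A-12") would hand the tokenizer chinese_L-12_H-768_A-12/vocab.txt instead of the bare directory name.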

coddinglxf · 2018-11-19T07:42:11Z

@zlinao, I tried to load the vocab using the following code:
tokenizer = BertTokenizer.from_pretrained("bert-base-chinese//vocab.txt")

However, I get this error:
11/19/2018 15:33:13 - INFO - pytorch_pretrained_bert.tokenization - loading vocabulary file bert-base-chinese//vocab.txt
Traceback (most recent call last):
  File "E:/PythonWorkSpace/PytorchBert/BertTest/torchTest.py", line 6, in <module>
    tokenizer = BertTokenizer.from_pretrained("bert-base-chinese//vocab.txt")
  File "C:\anaconda\lib\site-packages\pytorch_pretrained_bert-0.1.2-py3.6.egg\pytorch_pretrained_bert\tokenization.py", line 141, in from_pretrained
  File "C:\anaconda\lib\site-packages\pytorch_pretrained_bert-0.1.2-py3.6.egg\pytorch_pretrained_bert\tokenization.py", line 95, in __init__
  File "C:\anaconda\lib\site-packages\pytorch_pretrained_bert-0.1.2-py3.6.egg\pytorch_pretrained_bert\tokenization.py", line 70, in load_vocab
UnicodeDecodeError: 'gbk' codec can't decode byte 0x81 in position 1564: illegal multibyte sequence

Do you have the same problem?

thomwolf · 2018-11-19T08:39:20Z

Hi,
Why don't you guys just do tokenizer = BertTokenizer.from_pretrained('bert-base-chinese') as indicated in the readme and the run_classifier.py example?

zlinao · 2018-11-19T11:11:53Z

> Hi,
> Why don't you guys just do tokenizer = BertTokenizer.from_pretrained('bert-base-chinese') as indicated in the readme and the run_classifier.py example?

Yes, it is easier to use the shortcut name. Thanks for your great work.

zlinao · 2018-11-19T11:13:14Z

> @zlinao, I tried to load the vocab using the following code:
> tokenizer = BertTokenizer.from_pretrained("bert-base-chinese//vocab.txt")
>
> However, I get this error:
> UnicodeDecodeError: 'gbk' codec can't decode byte 0x81 in position 1564: illegal multibyte sequence
>
> Do you have the same problem?

You can change the encoding to 'utf-8' when you load vocab.txt.
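
To illustrate the encoding suggestion: the traceback shows the failure inside load_vocab, where the file is apparently opened with the platform default codec (GBK on that Windows machine). The following is a rough sketch of a vocabulary loader with the encoding made explicit. It assumes vocab.txt is UTF-8 with one token per line, and it is not the library's actual implementation.

```python
import collections


def load_vocab(vocab_file):
    """Read one token per line into an ordered token -> index mapping."""
    vocab = collections.OrderedDict()
    # Explicit encoding: without it, open() falls back to the platform default
    # (for example GBK on a Chinese-locale Windows install), which can fail
    # on UTF-8 text with a UnicodeDecodeError.
    with open(vocab_file, "r", encoding="utf-8") as reader:
        for index, line in enumerate(reader):
            vocab[line.strip()] = index
    return vocab
```

Passing encoding="utf-8" makes the result independent of the system locale, so the same vocab.txt loads identically on Linux and Windows.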

zlinao closed this as completed Nov 19, 2018
