
The tokenizer throws an error #79

Open
YangFW opened this issue Aug 19, 2022 · 13 comments

@YangFW commented Aug 19, 2022

A problem came up yesterday: loading the word-embedding file failed with the error in the screenshot below.
[screenshot of the error]

It looks like a network problem, but I can access https://huggingface.co/models normally from here.
I then downloaded the embeddings nghuyong/ernie-1.0-base-zh from that address and pointed the code at the absolute path, and the program ran fine. However, my previously trained models are now completely inaccurate. Was the embedding file changed, or is something else going on?

@xiangking (Owner)

Hi. The problem is that the ERNIE maintainers renamed the 1.0 checkpoint to nghuyong/ernie-1.0-base-zh and deleted the old one. From a quick look the files don't seem to have changed, but I haven't verified that experimentally yet.

@YangFW (Author) commented Aug 19, 2022

> Hi. The problem is that the ERNIE maintainers renamed the 1.0 checkpoint to nghuyong/ernie-1.0-base-zh and deleted the old one. From a quick look the files don't seem to have changed, but I haven't verified that experimentally yet.

Right. After switching to nghuyong/ernie-1.0-base-zh it runs, but the model is no longer accurate: models trained earlier with nghuyong/ernie-1.0 now give completely wrong predictions.

@xiangking (Owner)

I tried loading the config and tokenizer from nghuyong/ernie-1.0-base-zh while using nghuyong/ernie-1.0 as the pretrained model weights, and it seems fine; the two should be compatible. You could check whether values such as cat2id are correct.
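One way a cat2id mismatch sneaks in is rebuilding the mapping from a dataset whose category order changed between training and inference. A minimal sketch (the category names and file name are hypothetical, not ark-nlp's API) of persisting the mapping next to the checkpoint and loading it back instead of rebuilding it:

```python
import json

# Label mapping built at training time (hypothetical categories for a
# binary sentence-similarity task).
cat2id = {"not_similar": 0, "similar": 1}

# Save it alongside the model checkpoint...
with open("cat2id.json", "w", encoding="utf-8") as f:
    json.dump(cat2id, f, ensure_ascii=False)

# ...and at inference time load it back instead of rebuilding it from
# the (possibly re-ordered) dataset.
with open("cat2id.json", encoding="utf-8") as f:
    loaded_cat2id = json.load(f)

# The ids must line up with the trained classifier head.
assert loaded_cat2id == cat2id
```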

@YangFW (Author) commented Aug 19, 2022

> I tried loading the config and tokenizer from nghuyong/ernie-1.0-base-zh while using nghuyong/ernie-1.0 as the pretrained model weights, and it seems fine; the two should be compatible. You could check whether values such as cat2id are correct.

cat2id, config, and tokenizer are all fine; it's just that the previously trained model is now completely unusable. Never mind, I'll download the embeddings locally and retrain.

@xiangking (Owner) commented Aug 19, 2022 via email

@YangFW (Author) commented Aug 22, 2022

> I'll also follow up on my side to see whether anything is wrong. Thanks for raising this issue.

Hi. After retraining, the model's accuracy and F1 were both good, but after saving the model parameters and loading them back for testing, the results were poor again. Have you run into anything like this? It used to work fine.

@xiangking (Owner) commented Aug 22, 2022 via email

@YangFW (Author) commented Aug 22, 2022

> Hi. So previously, saving the model and re-loading it worked fine, and now the reloaded model's accuracy is very low — is that right?

That's right. Earlier, when switching from ernie-1.0 to ernie-1.0-base-zh, the model's accuracy was very low and every sample was classified as 0; I assumed it was an embedding problem.
Now it seems that may not be the cause: during training, the test metrics, F1, and confusion matrix all look normal, but after saving the model parameters and loading them back for testing, every prediction is 0.
The task is comparing the similarity of two sentences, i.e. binary 0/1 classification.
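Symptoms like this (good metrics in the training session, constant predictions after reload) often come from the checkpoint round-trip rather than the embeddings — e.g. loading with `strict=False` so mismatched keys are silently dropped, or skipping `model.eval()`. A minimal round-trip sketch with a toy torch model (the file name and architecture are illustrative, not ark-nlp's API):

```python
import torch
import torch.nn as nn

# Toy stand-in for the trained text-matching classifier.
model = nn.Linear(4, 2)
torch.save(model.state_dict(), "tm_model.pt")

# Reload into a freshly built model with the SAME architecture; strict=True
# surfaces any key mismatch instead of silently keeping random weights.
reloaded = nn.Linear(4, 2)
result = reloaded.load_state_dict(torch.load("tm_model.pt"), strict=True)
assert not result.missing_keys and not result.unexpected_keys

# Switch to eval mode before inference (matters once dropout/batchnorm exist).
reloaded.eval()

# The reloaded model must reproduce the original outputs exactly.
x = torch.randn(1, 4)
with torch.no_grad():
    assert torch.allclose(model(x), reloaded(x))
```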

@xiangking (Owner) commented Aug 22, 2022 via email

@YangFW (Author) commented Aug 22, 2022

> Are you using the TM model? Could the loading or saving be the problem? If it's convenient, you can also add me on WeChat and I'll take a look.

Yes, I'm using the TM model.
Sure. My WeChat ID is 18758835758. Thanks a lot!

@dandrane

May I ask: after switching to 'nghuyong/ernie-1.0-base-zh', I keep getting this error:

Traceback (most recent call last):
  File "textsim.py", line 26, in <module>
    bert_vocab = transformers.AutoTokenizer.from_pretrained("nghuyong/ernie-1.0-base-zh")
  File "/home/labo/anaconda3/lib/python3.8/site-packages/transformers/models/auto/tokenization_auto.py", line 532, in from_pretrained
    config = AutoConfig.from_pretrained(pretrained_model_name_or_path, **kwargs)
  File "/home/labo/anaconda3/lib/python3.8/site-packages/transformers/models/auto/configuration_auto.py", line 452, in from_pretrained
    config_class = CONFIG_MAPPING[config_dict["model_type"]]
KeyError: 'ernie'

The vocab and config imports are unchanged:

from ark_nlp.model.tm.bert import Bert
from ark_nlp.model.tm.bert import BertConfig
from ark_nlp.model.tm.bert import Dataset
from ark_nlp.model.tm.bert import Task
from ark_nlp.model.tm.bert import get_default_model_optimizer
from ark_nlp.model.tm.bert import Tokenizer

model_name = 'nghuyong/ernie-1.0-base-zh'
tokenizer = Tokenizer(vocab=model_name, max_seq_len=30)
bert_config = BertConfig.from_pretrained(model_name,
                                         num_labels=len(tm_train_dataset.cat2id))

Is some change needed?

@jimme0001

> May I ask: after switching to 'nghuyong/ernie-1.0-base-zh', I keep getting this error: … KeyError: 'ernie' … Is some change needed?

This is a versioning issue with the open-source ERNIE checkpoints. There are two fixes:

1. Set model_name = 'freedomking/ernie-1.0' directly; that is the old ERNIE checkpoint we re-uploaded.
2. Upgrade transformers so that transformers > 4.22.0; see ERNIE-Pytorch.
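The two fixes above can be folded into a small guard that picks a loadable checkpoint name from the installed transformers version (the helper name is mine; the cutoff assumes native 'ernie' support landed around transformers 4.22):

```python
from importlib.metadata import PackageNotFoundError, version


def pick_ernie_checkpoint():
    """Pick an ERNIE checkpoint the installed transformers can load.

    Older transformers has no 'ernie' entry in CONFIG_MAPPING, so the
    renamed checkpoint raises KeyError: 'ernie' there.
    """
    try:
        major, minor = (int(p) for p in version("transformers").split(".")[:2])
    except (PackageNotFoundError, ValueError):
        major, minor = 0, 0  # not installed / unparseable: be conservative
    if (major, minor) >= (4, 22):
        return "nghuyong/ernie-1.0-base-zh"  # new name, needs ernie support
    return "freedomking/ernie-1.0"           # re-upload of the old checkpoint


model_name = pick_ernie_checkpoint()
```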

@dandrane

> This is a versioning issue with the open-source ERNIE checkpoints. There are two fixes:
> 1. Set model_name = 'freedomking/ernie-1.0' directly; that is the old ERNIE checkpoint we re-uploaded.
> 2. Upgrade transformers so that transformers > 4.22.0; see ERNIE-Pytorch.

Yes, that's it. In addition, replacing bert_vocab = transformers.AutoTokenizer.from_pretrained("nghuyong/ernie-1.0-base-zh") with bert_vocab = transformers.BertTokenizer.from_pretrained("nghuyong/ernie-1.0-base-zh") also works. Thank you!
