I ran an experiment using Google's chinese-bert model and zhuiyi's wobert-plus model to predict candidate tokens for the [MASK] position in a sentence, and found that the candidates returned by wobert-plus are all stop words. Could you tell me where I went wrong?

The predict function:
```python
import numpy as np

# `tokenizer` and `model` are defined below, when each checkpoint is loaded.
def TopCandidates(token_ids, i, topn=64):
    """Use the language model to return the topn candidate tokens for position i."""
    token_ids_i = token_ids[i]
    token_ids[i] = tokenizer._token_mask_id
    token_ids = np.array([token_ids])
    probas = model.predict(token_ids)[0, i]
    ids = list(probas.argsort()[::-1][:topn])
    if token_ids_i in ids:
        ids.remove(token_ids_i)
    else:
        ids = ids[:-1]
    # Put the original input token first, so the caller can find it easily
    return_token_ids = [token_ids_i] + ids
    return_probas = [probas[_i] for _i in return_token_ids]
    return return_token_ids, return_probas
```
This is how I load Google's chinese-bert model:

```python
from bert4keras.models import build_transformer_model
from bert4keras.tokenizers import Tokenizer

# config_path / checkpoint_path / dict_path point at the chinese-bert files
tokenizer = Tokenizer(
    dict_path,
    do_lower_case=True,
)  # build the tokenizer
model = build_transformer_model(
    config_path,
    checkpoint_path,
    segment_vocab_size=0,  # drop the segment_ids input
    with_mlm=True,
)

sent = '习近平总书记是一位有着47年党龄的共产党员。'
token_ids = tokenizer.encode(sent)[0]
print(token_ids)
print(len(token_ids))
words = tokenizer.ids_to_tokens(token_ids)
print(words)

return_token_ids, return_probas = TopCandidates(token_ids, i=4, topn=8)
for tid, tp in zip(return_token_ids, return_probas):
    print(tid, tokenizer.id_to_token(tid), tp)
```
output:
```
[101, 739, 6818, 2398, 2600, 741, 6381, 3221, 671, 855, 3300, 4708, 8264, 2399, 1054, 7977, 4638, 1066, 772, 1054, 1447, 511, 102]
23
['[CLS]', '习', '近', '平', '总', '书', '记', '是', '一', '位', '有', '着', '47', '年', '党', '龄', '的', '共', '产', '党', '员', '。', '[SEP]']
2600 总 0.99927753
5439 老 0.00027872992
4638 的 0.00024593325
1398 同 4.721549e-05
5244 總 3.3572527e-05
1199 副 1.1924949e-05
2218 就 8.368754e-06
3295 曾 7.822215e-06
```
This is how I load zhuiyi's wobert-plus model:

```python
import jieba

# config_path / checkpoint_path / dict_path point at the wobert-plus files
tokenizer = Tokenizer(
    dict_path,
    do_lower_case=True,
    pre_tokenize=lambda s: jieba.cut(s, HMM=False),
)  # build the word-level tokenizer
model = build_transformer_model(
    config_path,
    checkpoint_path,
    segment_vocab_size=0,  # drop the segment_ids input
    with_mlm=True,
)

sent = '习近平总书记是一位有着47年党龄的共产党员。'
token_ids = tokenizer.encode(sent)[0]
print(token_ids)
print(len(token_ids))
words = tokenizer.ids_to_tokens(token_ids)
print(words)

return_token_ids, return_probas = TopCandidates(token_ids, i=4, topn=8)
for tid, tp in zip(return_token_ids, return_probas):
    print(tid, tokenizer.id_to_token(tid), tp)
```
output:

```
[101, 36572, 39076, 2274, 6243, 21309, 5735, 1625, 513, 5651, 3399, 44374, 179, 102]
14
['[CLS]', '习近平', '总书记', '是', '一位', '有着', '47', '年', '党', '龄', '的', '共产党员', '。', '[SEP]']
6243 一位 2.1206936e-09   # very low probability
101 [CLS] 0.8942671
102 [SEP] 0.10569866
179 。 2.877889e-06
5661 , 1.2525259e-06
3399 的 1.0122681e-06
178 、 7.5024326e-07
5663 : 5.766404e-07
```
You can solve this by following the example provided in bert4keras; here is the link to the example: https://github.com/bojone/bert4keras/blob/master/examples/basic_masked_language_model.py
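For reference, a minimal sketch along the lines of that example, using the standard bert4keras API that your snippets already rely on (the paths are placeholders for your own checkpoint files, and the exact script in the repo may differ slightly):

```python
import numpy as np
from bert4keras.models import build_transformer_model
from bert4keras.tokenizers import Tokenizer

# Placeholder paths: point these at your local checkpoint files.
config_path = 'bert_config.json'
checkpoint_path = 'bert_model.ckpt'
dict_path = 'vocab.txt'

tokenizer = Tokenizer(dict_path, do_lower_case=True)  # build the tokenizer
model = build_transformer_model(
    config_path, checkpoint_path, with_mlm=True
)  # output is MLM probabilities over the vocabulary

token_ids, segment_ids = tokenizer.encode(u'科学技术是第一生产力')

# Mask the two characters of "技术" (positions 3 and 4, after [CLS]).
token_ids[3] = token_ids[4] = tokenizer._token_mask_id

# Predict the masked positions and decode the most likely tokens.
probas = model.predict([np.array([token_ids]), np.array([segment_ids])])[0]
print(tokenizer.decode(probas[3:5].argmax(axis=1)))  # expected output: 技术
```

For wobert-plus you would additionally pass `pre_tokenize=lambda s: jieba.cut(s, HMM=False)` to the `Tokenizer`, as in your snippet above, so that the masked position corresponds to a whole word in WoBERT's vocabulary.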