
Question about predicting candidate words for [MASK] in a sentence with MLM #11

Open
ouwenjie03 opened this issue May 28, 2021 · 1 comment

@ouwenjie03

I tried using Google's chinese-bert model and zhuiyi's wobert-plus model to predict candidate words for [MASK] in a sentence, and found that the results from wobert-plus are all stop words. Could you point out what I'm doing wrong?

The predict function:

import numpy as np

def TopCandidates(token_ids, i, topn=64):
    """Use the language model to propose the topn candidate tokens for position i.
    """
    token_ids_i = token_ids[i]
    token_ids[i] = tokenizer._token_mask_id  # mask position i (note: this mutates the caller's list)
    token_ids = np.array([token_ids])
    probas = model.predict(token_ids)[0, i]  # MLM distribution over the vocab at the masked position
    ids = list(probas.argsort()[::-1][:topn])  # token ids sorted by descending probability
    if token_ids_i in ids:
        ids.remove(token_ids_i)
    else:
        ids = ids[:-1]  # drop the last candidate so the list stays at topn entries
    return_token_ids = [token_ids_i] + ids
    return_probas = [probas[_i] for _i in return_token_ids]
    return return_token_ids, return_probas  # the original input token comes first, for convenient lookup

This loads Google's chinese-bert model:

from bert4keras.models import build_transformer_model
from bert4keras.tokenizers import Tokenizer

tokenizer = Tokenizer(
    dict_path,
    do_lower_case=True,
)  # build the tokenizer

model = build_transformer_model(
    config_path,
    checkpoint_path,
    segment_vocab_size=0,  # drop the segment_ids input
    with_mlm=True,
)

sent = '习近平总书记是一位有着47年党龄的共产党员。'
token_ids = tokenizer.encode(sent)[0]
print(token_ids)
print(len(token_ids))
words = tokenizer.ids_to_tokens(token_ids)
print(words)

return_token_ids, return_probas = TopCandidates(token_ids, i=4, topn=8)
for tid, tp in zip(return_token_ids, return_probas):
    print(tid, tokenizer.id_to_token(tid), tp)

output:

[101, 739, 6818, 2398, 2600, 741, 6381, 3221, 671, 855, 3300, 4708, 8264, 2399, 1054, 7977, 4638, 1066, 772, 1054, 1447, 511, 102]
23
['[CLS]', '习', '近', '平', '总', '书', '记', '是', '一', '位', '有', '着', '47', '年', '党', '龄', '的', '共', '产', '党', '员', '。', '[SEP]']
2600 总 0.99927753
5439 老 0.00027872992
4638 的 0.00024593325
1398 同 4.721549e-05
5244 總 3.3572527e-05
1199 副 1.1924949e-05
2218 就 8.368754e-06
3295 曾 7.822215e-06

This loads zhuiyi's wobert-plus:

import jieba

tokenizer = Tokenizer(
    dict_path,
    do_lower_case=True,
    pre_tokenize=lambda s: jieba.cut(s, HMM=False),  # pre-tokenize into words with jieba
)  # build the tokenizer

model = build_transformer_model(
    config_path,
    checkpoint_path,
    segment_vocab_size=0,  # drop the segment_ids input
    with_mlm=True,
)

sent = '习近平总书记是一位有着47年党龄的共产党员。'
token_ids = tokenizer.encode(sent)[0]
print(token_ids)
print(len(token_ids))
words = tokenizer.ids_to_tokens(token_ids)
print(words)

return_token_ids, return_probas = TopCandidates(token_ids, i=4, topn=8)
for tid, tp in zip(return_token_ids, return_probas):
    print(tid, tokenizer.id_to_token(tid), tp)

output:

[101, 36572, 39076, 2274, 6243, 21309, 5735, 1625, 513, 5651, 3399, 44374, 179, 102]
14
['[CLS]', '习近平', '总书记', '是', '一位', '有着', '47', '年', '党', '龄', '的', '共产党员', '。', '[SEP]']
6243 一位 2.1206936e-09 # very low probability
101 [CLS] 0.8942671
102 [SEP] 0.10569866
179 。 2.877889e-06
5661 , 1.2525259e-06
3399 的 1.0122681e-06
178 、 7.5024326e-07
5663 : 5.766404e-07

@ouhongxu

You can solve this by following the example provided in bert4keras. Here is the link to that example: https://github.com/bojone/bert4keras/blob/master/examples/basic_masked_language_model.py
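
For reference, a condensed sketch of what that example does (the paths are placeholders for your own checkpoint files, and the calls follow the standard bert4keras API, so treat this as a sketch rather than a verbatim copy of the linked script). One visible difference from the snippets above is that it keeps the segment_ids input instead of passing segment_vocab_size=0:

import numpy as np
from bert4keras.models import build_transformer_model
from bert4keras.tokenizers import Tokenizer

config_path = 'bert_config.json'     # placeholder: your model config
checkpoint_path = 'bert_model.ckpt'  # placeholder: your checkpoint
dict_path = 'vocab.txt'              # placeholder: your vocab file

tokenizer = Tokenizer(dict_path, do_lower_case=True)
model = build_transformer_model(config_path, checkpoint_path, with_mlm=True)

token_ids, segment_ids = tokenizer.encode(u'科学技术是第一生产力')
token_ids[3] = token_ids[4] = tokenizer._token_mask_id  # mask the two tokens of '技术'
probas = model.predict([np.array([token_ids]), np.array([segment_ids])])[0]
print(tokenizer.decode(probas[3:5].argmax(axis=1)))  # should recover '技术'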
