Customizing vocab.txt #2649

Closed
NatLee opened this issue Jun 27, 2022 · 9 comments

NatLee commented Jun 27, 2022

Hello everyone,

I would like to ask whether the vocab.txt of the pretrained ernie-1.0 can be extended by users.

For example, with the following tokenizer:

import paddlenlp as ppnlp
tokenizer = ppnlp.transformers.ErnieTokenizer.from_pretrained("ernie-1.0")

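Can we add new tokens of our own?

I searched the issue list and found that someone had asked the same thing about ErnieGramTokenizer:

#2022

However, I am not sure whether ErnieForSequenceClassification likewise cannot be extended.

Thanks!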

ZHUI (Collaborator) commented Jun 27, 2022

The ernie-1.0 vocabulary contains some unused tokens. If you are not adding many new tokens, you can try replacing the unused ones.
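As a rough illustration of the suggestion above, here is a minimal sketch that swaps placeholder entries for new tokens in a vocabulary held as a list of lines. It assumes the placeholders are named like `[unused1]`, `[unused2]`, and so on, which should be checked against the actual vocab.txt; the function name `replace_unused_tokens` is hypothetical and not part of PaddleNLP.

```python
import re

# Assumption: placeholder entries look like "[unused1]", "[unused2]", ...
UNUSED_PATTERN = re.compile(r"^\[unused\d+\]$")


def replace_unused_tokens(vocab, new_tokens):
    """Return a copy of `vocab` with placeholder entries replaced by `new_tokens`.

    Token ids (list positions) are unchanged, so the pretrained embedding
    matrix keeps its shape.
    """
    existing = set(vocab)
    # Skip tokens already in the vocabulary and drop duplicates, keeping order.
    pending = []
    for token in new_tokens:
        if token not in existing and token not in pending:
            pending.append(token)

    # Positions of the placeholder entries that can be overwritten.
    slots = [i for i, entry in enumerate(vocab) if UNUSED_PATTERN.match(entry)]
    if len(pending) > len(slots):
        raise ValueError(
            f"{len(pending)} new tokens but only {len(slots)} unused slots")

    updated = list(vocab)
    for slot, token in zip(slots, pending):
        updated[slot] = token
    return updated


vocab = ["[PAD]", "[CLS]", "[unused1]", "[unused2]", "的"]
print(replace_unused_tokens(vocab, ["台積電", "的"]))
```

Because only existing rows are reused, the embeddings at those positions start from whatever values the unused tokens had, so the new tokens still need fine-tuning to become meaningful.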

ZHUI self-assigned this Jun 27, 2022

NatLee (Author) commented Jun 27, 2022

@ZHUI Thanks for the reply!

But there are only ninety-some unused tokens. If I need more than that, does it mean no further tokens can be added?

ZHUI (Collaborator) commented Jun 27, 2022

If needed, you can resize the vocab yourself. There is a resize_position_embeddings example here: #2513

NatLee (Author) commented Jun 27, 2022

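@ZHUI Thanks for the reply!

That example appears to deal with the parameters of resize_position_embeddings.

I see that in the default configuration, vocab_size under init_args is only 18000.

So my question is: if I load the pretrained model, is there a way for me to change this setting?

Thanks!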

ZHUI (Collaborator) commented Jun 27, 2022

That is not a problem. Here the embedding is simply reassigned:

# Allocate a new position embedding with the updated size.
self.embeddings.position_embeddings = nn.Embedding(
    self.config["max_position_embeddings"], self.config["hidden_size"])
with paddle.no_grad():
    if num_position_embeds_diff > 0:
        # Grown: copy the old weights into the leading rows.
        self.embeddings.position_embeddings.weight[
            :-num_position_embeds_diff] = old_position_embeddings_weight
    else:
        # Shrunk: keep only the leading rows of the old weights.
        self.embeddings.position_embeddings.weight = old_position_embeddings_weight[
            :num_position_embeds_diff]
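The snippet above allocates a new embedding and copies the old weights into it. The same idea, which carries over to token embeddings, can be sketched without any framework using plain lists. The helper name `resize_embedding_rows`, the Gaussian initialisation, and the standard deviation of 0.02 are assumptions made for illustration and are not taken from PaddleNLP.

```python
import random


def resize_embedding_rows(old_rows, new_num_rows, init_std=0.02, seed=0):
    """Return a table with `new_num_rows` rows, reusing rows from `old_rows`.

    Growing keeps every old row and appends randomly initialised rows;
    shrinking keeps only the first `new_num_rows` rows.
    """
    hidden_size = len(old_rows[0])
    rng = random.Random(seed)
    # Copy the rows that survive the resize.
    kept = [list(row) for row in old_rows[:new_num_rows]]
    # Number of rows that must be freshly initialised (0 when shrinking).
    num_new = new_num_rows - len(kept)
    fresh = [[rng.gauss(0.0, init_std) for _ in range(hidden_size)]
             for _ in range(num_new)]
    return kept + fresh


old = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
bigger = resize_embedding_rows(old, 5)
smaller = resize_embedding_rows(old, 2)
print(len(bigger), bigger[:3] == old)
print(smaller)
```

The pretrained rows are preserved either way, so only the appended rows have to be learned from scratch during fine-tuning.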

NatLee (Author) commented Jun 27, 2022

@ZHUI Roughly when will this feature be merged into the main branch?

ZHUI (Collaborator) commented Jun 27, 2022

Sorry, you can try this one instead: https://github.com/PaddlePaddle/PaddleNLP/pull/2423/files
resize_token_embeddings
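When more tokens are needed than there are unused slots, the vocabulary itself has to grow before the token embeddings are resized. A minimal sketch of that bookkeeping step, assuming the vocabulary is held as a list of tokens in id order; `extend_vocab` is a hypothetical helper, and the exact signature of resize_token_embeddings is not shown in this thread.

```python
def extend_vocab(vocab, new_tokens):
    """Append tokens missing from `vocab`; return (extended list, number added)."""
    seen = set(vocab)
    extended = list(vocab)
    for token in new_tokens:
        if token not in seen:
            seen.add(token)
            extended.append(token)
    return extended, len(extended) - len(vocab)


vocab = ["[PAD]", "[CLS]", "[SEP]", "的"]
extended, added = extend_vocab(vocab, ["台積電", "的", "台積電"])
# The model's word-embedding table would then need len(extended) rows,
# for example via the resize_token_embeddings method linked above.
print(len(extended), added)
```

Appending at the end keeps every existing token id unchanged, so the pretrained embedding rows still line up with their tokens.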

NatLee (Author) commented Jun 28, 2022

@ZHUI This is on the develop branch. Is it expected to be included in an upcoming release?

ZHUI (Collaborator) commented Jun 28, 2022

There should be a release within this week.

NatLee closed this as completed Aug 17, 2022