Questions about implementing WWM in transformers to fine-tune BERT #153

wlhgtc · 2020-10-24T12:35:54Z

Hello, and thank you very much for your work.
I recently implemented WWM fine-tuning in transformers, and the PR has been merged into the master branch. See here for details.

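The implementation follows the description in #13: starting from BERT's tokenization output, it uses LTP's word segmentation to prefix the tokens at the relevant positions with ##, after which Google's masking code can be reused as-is.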
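The marking step described above can be sketched roughly as follows. This is a minimal illustration, not the code from the merged PR: the function names (`mark_word_continuations`, `whole_word_groups`, `mask_whole_words`) are hypothetical, the word segmentation is assumed to come from an external tool such as LTP, and the masking is simplified to always substitute `[MASK]`, whereas Google's code mixes in random and unchanged tokens.

```python
import random
from typing import List


def is_cjk(token: str) -> bool:
    """True for a single character in the basic CJK Unified Ideographs block."""
    return len(token) == 1 and 0x4E00 <= ord(token) <= 0x9FFF


def mark_word_continuations(tokens: List[str], words: List[str]) -> List[str]:
    """Prefix '##' to every non-initial character of a segmented multi-character word."""
    word_set = {w for w in words if len(w) > 1 and all(is_cjk(c) for c in w)}
    max_len = max((len(w) for w in word_set), default=1)
    out = list(tokens)
    i, n = 0, len(tokens)
    while i < n:
        matched = False
        if is_cjk(tokens[i]):
            # Greedy longest match of consecutive tokens against the segmented words.
            for length in range(min(max_len, n - i), 1, -1):
                if "".join(tokens[i:i + length]) in word_set:
                    for j in range(i + 1, i + length):
                        out[j] = "##" + out[j]
                    i += length
                    matched = True
                    break
        if not matched:
            i += 1
    return out


def whole_word_groups(tokens: List[str]) -> List[List[int]]:
    """Group token indices so that a '##' token joins the word started before it."""
    groups: List[List[int]] = []
    for idx, tok in enumerate(tokens):
        if tok in ("[CLS]", "[SEP]"):
            continue
        if groups and tok.startswith("##"):
            groups[-1].append(idx)
        else:
            groups.append([idx])
    return groups


def mask_whole_words(tokens: List[str], groups: List[List[int]],
                     rate: float, rng: random.Random) -> List[str]:
    """Mask whole groups until the token budget is used (simplified: always [MASK])."""
    budget = max(1, round(sum(len(g) for g in groups) * rate))
    order = list(groups)
    rng.shuffle(order)
    out, used = list(tokens), 0
    for group in order:
        if used >= budget:
            break
        if used + len(group) > budget:
            continue  # skip words that would exceed the budget
        for idx in group:
            out[idx] = "[MASK]"
        used += len(group)
    return out


tokens = ["[CLS]", "使", "用", "语", "言", "模", "型", "[SEP]"]
words = ["使用", "语言", "模型"]  # assumed output of an external segmenter
marked = mark_word_continuations(tokens, words)
groups = whole_word_groups(marked)
masked = mask_whole_words(marked, groups, 0.34, random.Random(0))
```

Because the `##` prefix only marks word boundaries for the masking step, it would presumably need to be stripped from the Chinese characters again before they are converted to vocabulary ids.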

I have two small questions about the details and would appreciate your help:

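Are LTP's segmentation results dynamic? I ask because the segmenter outputs hidden-layer vectors rather than relying on a dictionary as Jieba does.
If I want to fine-tune, how thoroughly should the data be cleaned? Could you point me to any references?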

ymcui · 2020-10-25T05:21:20Z

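For a given text sequence, the segmentation result is unique.
I take it that by fine-tuning you mean continued pre-training? Processing differs slightly between data sources, but it mostly comes down to: 1) removing tags; 2) removing special and invisible characters; 3) when a document contains a lot of non-natural text, preferably not using that document.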

You can refer to cc_net from FAIR, which may give you some ideas for data processing.
https://github.com/facebookresearch/cc_net/blob/master/README.md
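The three cleaning steps listed above could look roughly like this. It is only a sketch under stated assumptions, not the pipeline used for the released models and not cc_net's method: "tags" is read as HTML-style markup, "special and invisible characters" as Unicode control and format characters, and the `min_ratio` threshold of 0.8 is an arbitrary illustrative value. All function names are hypothetical.

```python
import re
import unicodedata
from typing import Optional

# Naive HTML-style tag pattern (assumption); it would also strip text such as "a < b and c > d".
TAG_RE = re.compile(r"<[^>]+>")


def strip_tags(text: str) -> str:
    """Step 1: remove markup tags."""
    return TAG_RE.sub("", text)


def strip_invisible(text: str) -> str:
    """Step 2: drop control/format/unassigned characters, keeping newlines and tabs."""
    return "".join(
        ch for ch in text
        if ch in "\n\t" or not unicodedata.category(ch).startswith("C")
    )


def natural_text_ratio(text: str) -> float:
    """Share of non-space characters that are letters, digits or punctuation."""
    chars = [ch for ch in text if not ch.isspace()]
    if not chars:
        return 0.0
    natural = sum(1 for ch in chars if unicodedata.category(ch)[0] in ("L", "N", "P"))
    return natural / len(chars)


def clean_document(text: str, min_ratio: float = 0.8) -> Optional[str]:
    """Step 3: return the cleaned text, or None if it looks mostly non-natural."""
    cleaned = strip_invisible(strip_tags(text))
    if natural_text_ratio(cleaned) < min_ratio:
        return None
    return cleaned.strip()


kept = clean_document("<p>你好，\u200b世界。</p>")
dropped = clean_document("<div>=== ||| +++ ^^^ $$$</div>")
```

The ratio check is deliberately coarse; in practice the threshold and the definition of "natural" text would need tuning per data source.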

wlhgtc · 2020-10-25T05:31:17Z

Thanks for your reply~

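On segmentation: when I first implemented this, I considered storing LTP's vocabulary in a trie and loading that trie directly during BERT tokenization. After seeing that LTP outputs model vectors, I dropped the idea. The only option is to run the text through LTP beforehand and then mark the positions. I am not sure how fast LTP segmentation is; can the preprocessing be done directly while reading the data? (Processing one sentence at a time feels somewhat slow.)
Yes, I meant continued pre-training. As for text preprocessing, I will read the recommended material first and follow up with questions afterwards!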

ymcui · 2020-10-25T07:51:38Z

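The original TF code generates the tfrecord files in advance (data that can be fed directly into the network), so no online tokenization happens during training. As for speed, you can pick whichever Chinese word segmentation tool suits your situation.
For text preprocessing, I suggest not applying too much normalization to the raw data, so as not to destroy the text's original characteristics.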
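The same offline idea can be applied outside of TF by segmenting the corpus once and caching the result. The sketch below is an assumption-laden illustration, not the original tfrecord pipeline: it uses a JSON Lines file instead of tfrecord, and `segment` is a hypothetical callable that wraps whichever segmenter is chosen (LTP or otherwise).

```python
import json
from typing import Callable, Dict, Iterable, Iterator, List


def build_segmentation_cache(lines: Iterable[str],
                             segment: Callable[[str], List[str]],
                             path: str) -> int:
    """Segment every non-empty line once and store text plus words as JSON Lines."""
    count = 0
    with open(path, "w", encoding="utf-8") as f:
        for line in lines:
            text = line.strip()
            if not text:
                continue
            record = {"text": text, "words": segment(text)}
            f.write(json.dumps(record, ensure_ascii=False) + "\n")
            count += 1
    return count


def read_segmentation_cache(path: str) -> Iterator[Dict]:
    """Stream cached records at training time; no segmenter is needed here."""
    with open(path, encoding="utf-8") as f:
        for row in f:
            yield json.loads(row)
```

With this split, the slow segmenter runs once per corpus, and the training loop only reads the cached word boundaries.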

stale · 2020-10-29T08:04:16Z

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale · 2020-11-02T08:30:07Z

Closing the issue, since no updates observed. Feel free to re-open if you need any further assistance.

KenHung · 2020-12-31T05:03:03Z

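@ymcui Hello, I have a question about one detail of using LTP for continued pre-training: which version of LTP should be used?
The latest LTP 4 was released after the Chinese models here, so should I use LTP 3.4?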

ymcui · 2020-12-31T05:13:49Z

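@KenHung Yes, version 3.4 was used at the time. That said, now that a newer version is available, you can also try it; it may bring a further performance improvement.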
