[PaddlePaddle Hackathon] Task 51 #1115

Merged: 31 commits, Oct 28, 2021
Changes from 11 commits

Commits
5dc08f9
add bert japanese
iverxin Oct 4, 2021
4e43acd
Merge branch 'develop' into t51
yingyibiao Oct 18, 2021
ff5db0c
fix model-weight files position
iverxin Oct 19, 2021
6855b34
Merge remote-tracking branch 'origin/t51' into t51
iverxin Oct 19, 2021
5392945
add weights files url
iverxin Oct 19, 2021
3e647af
create package: bert_japanese
iverxin Oct 20, 2021
f299d94
Merge branch 'develop' into t51
iverxin Oct 20, 2021
2ca28b4
update weights readme
iverxin Oct 20, 2021
0621950
Merge remote-tracking branch 'origin/t51' into t51
iverxin Oct 20, 2021
dff2c5c
update weights files
iverxin Oct 20, 2021
22bd7c4
update config pretrain weights https
iverxin Oct 20, 2021
de0f7bf
Fix weights config files
iverxin Oct 21, 2021
c74338a
retest CI
iverxin Oct 22, 2021
109db2f
Merge branch 'develop' into t51
yingyibiao Oct 24, 2021
919fcb6
update
iverxin Oct 24, 2021
c3942ce
update
iverxin Oct 24, 2021
6177a03
fix docstring
iverxin Oct 25, 2021
ed9c933
update
iverxin Oct 25, 2021
b494f71
Update pretrained weights
iverxin Oct 25, 2021
b096712
update weights readme
iverxin Oct 26, 2021
f85adb5
remove weights url in codes
iverxin Oct 26, 2021
1fb1e1e
update...
iverxin Oct 26, 2021
76b9094
Merge branch 'develop' into t51
iverxin Oct 26, 2021
bc55f1e
update...
iverxin Oct 26, 2021
eec73dd
update weights readme
iverxin Oct 27, 2021
2858749
update
iverxin Oct 27, 2021
7abd183
Merge branch 'develop' into t51
iverxin Oct 27, 2021
00b6b1d
update
iverxin Oct 27, 2021
c47c473
update docstring
iverxin Oct 27, 2021
5734afd
Clean up redundant code
iverxin Oct 27, 2021
d943f52
Merge branch 'develop' into t51
yingyibiao Oct 28, 2021
5 changes: 5 additions & 0 deletions community/iverxin/bert-base-japanese-char-whole-word-masking/README.md
@@ -0,0 +1,5 @@
## bert-base-japanese
12 repeating layers, 768-hidden, 12-heads.

This version of the model processes input texts with word-level tokenization based on the IPA dictionary, followed by character-level tokenization. Additionally, the model is trained with the whole word masking enabled for the masked language modeling (MLM) objective.
[reference](https://huggingface.co/cl-tohoku/bert-base-japanese-char-whole-word-masking)
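A minimal usage sketch for these community weights (not part of the PR). It assumes PaddleNLP resolves the community name `iverxin/bert-base-japanese-char-whole-word-masking` via the `files.json` entries below, that the tokenizer class added by this PR is exported as `BertJapaneseTokenizer`, and that a MeCab backend is installed; all three are assumptions, not confirmed by this diff.

```python
# Hedged sketch, not from this PR. Assumed: the "iverxin/<model>" community name format,
# the class name BertJapaneseTokenizer, and a local MeCab backend (e.g. pip install fugashi ipadic).
from paddlenlp.transformers import BertModel, BertJapaneseTokenizer

name = "iverxin/bert-base-japanese-char-whole-word-masking"
tokenizer = BertJapaneseTokenizer.from_pretrained(name)
model = BertModel.from_pretrained(name)

# Word-level segmentation with MeCab first, then character-level tokens, as described above.
print(tokenizer.tokenize("こんにちは、世界。"))
```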
3 changes: 3 additions & 0 deletions community/iverxin/bert-base-japanese-char-whole-word-masking/files.json
@@ -0,0 +1,3 @@
{
"bert-base-japanese-char-whole-word-masking": "https://paddlenlp.bj.bcebos.com/models/transformers/community/iverxin/bert-base-japanese-char-whole-word-masking/bert-base-japanese-char-whole-word-masking.pdparams"
}
5 changes: 5 additions & 0 deletions community/iverxin/bert-base-japanese-char/README.md
@@ -0,0 +1,5 @@
## bert-base-japanese
12 repeating layers, 768-hidden, 12-heads.

This version of the model processes input texts with word-level tokenization based on the IPA dictionary, followed by character-level tokenization.
[reference](https://huggingface.co/cl-tohoku/bert-base-japanese-char)
3 changes: 3 additions & 0 deletions community/iverxin/bert-base-japanese-char/files.json
@@ -0,0 +1,3 @@
{
"bert-base-japanese-char": "https://paddlenlp.bj.bcebos.com/models/transformers/community/iverxin/bert-base-japanese-char/bert-base-japanese-char.pdparams"
}
5 changes: 5 additions & 0 deletions community/iverxin/bert-base-japanese-whole-word-masking/README.md
@@ -0,0 +1,5 @@
## bert-base-japanese
12 repeating layers, 768-hidden, 12-heads.

This version of the model processes input texts with word-level tokenization based on the IPA dictionary, followed by the WordPiece subword tokenization. Additionally, the model is trained with the whole word masking enabled for the masked language modeling (MLM) objective.
[reference](https://huggingface.co/cl-tohoku/bert-base-japanese-whole-word-masking)
3 changes: 3 additions & 0 deletions community/iverxin/bert-base-japanese-whole-word-masking/files.json
@@ -0,0 +1,3 @@
{
"bert-base-japanese-whole-word-masking": "https://paddlenlp.bj.bcebos.com/models/transformers/community/iverxin/bert-base-japanese-whole-word-masking/bert-base-japanese-whole-word-masking.pdparams"
}
6 changes: 6 additions & 0 deletions community/iverxin/bert-base-japanese/README.md
@@ -0,0 +1,6 @@
## bert-base-japanese
12 repeating layers, 768-hidden, 12-heads.

This version of the model processes input texts with word-level tokenization based on the IPA dictionary, followed by the WordPiece subword tokenization.
[reference](https://huggingface.co/cl-tohoku/bert-base-japanese)

3 changes: 3 additions & 0 deletions community/iverxin/bert-base-japanese/files.json
@@ -0,0 +1,3 @@
{
"bert-base-japanese": "https://paddlenlp.bj.bcebos.com/models/transformers/community/iverxin/bert-base-japanese/bert-base-japanese.pdparams"
}
1 change: 1 addition & 0 deletions paddlenlp/transformers/__init__.py
@@ -18,6 +18,7 @@

from .bert.modeling import *
from .bert.tokenizer import *
from .bert_japanese.tokenizer import *
from .ernie.modeling import *
from .ernie.tokenizer import *
from .gpt.modeling import *
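With this wildcard export, the new module becomes importable from the top-level `paddlenlp.transformers` namespace. A one-line sketch; the class name `BertJapaneseTokenizer` is an assumption, since the diff only shows the re-export, not the module contents.

```python
# Assumed class name; the diff above only adds "from .bert_japanese.tokenizer import *".
from paddlenlp.transformers import BertJapaneseTokenizer
```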
64 changes: 64 additions & 0 deletions paddlenlp/transformers/bert/modeling.py
@@ -270,6 +270,62 @@ class BertPretrainedModel(PretrainedModel):
"initializer_range": 0.02,
"pad_token_id": 0,
},
"bert-base-japanese": {
"vocab_size": 32000,
"hidden_size": 768,
"num_hidden_layers": 12,
"num_attention_heads": 12,
"intermediate_size": 3072,
"hidden_act": "gelu",
"hidden_dropout_prob": 0.1,
"attention_probs_dropout_prob": 0.1,
"max_position_embeddings": 512,
"type_vocab_size": 2,
"initializer_range": 0.02,
"pad_token_id": 0,
},
"bert-base-japanese-whole-word-masking": {
"vocab_size": 30522,
"hidden_size": 768,
"num_hidden_layers": 12,
"num_attention_heads": 12,
"intermediate_size": 3072,
"hidden_act": "gelu",
"hidden_dropout_prob": 0.1,
"attention_probs_dropout_prob": 0.1,
"max_position_embeddings": 512,
"type_vocab_size": 2,
"initializer_range": 0.02,
"pad_token_id": 0,
},
"bert-base-japanese-char ": {
"vocab_size": 4000,
"hidden_size": 768,
"num_hidden_layers": 12,
"num_attention_heads": 12,
"intermediate_size": 3072,
"hidden_act": "gelu",
"hidden_dropout_prob": 0.1,
"attention_probs_dropout_prob": 0.1,
"max_position_embeddings": 512,
"type_vocab_size": 2,
"initializer_range": 0.02,
"pad_token_id": 0,
},
"bert-base-japanese-char-whole-word-masking": {
"vocab_size": 4000,
"hidden_size": 768,
"num_hidden_layers": 12,
"num_attention_heads": 12,
"intermediate_size": 3072,
"hidden_act": "gelu",
"hidden_dropout_prob": 0.1,
"attention_probs_dropout_prob": 0.1,
"max_position_embeddings": 512,
"type_vocab_size": 2,
"initializer_range": 0.02,
"pad_token_id": 0,
}
}
resource_files_names = {"model_state": "model_state.pdparams"}
pretrained_resource_files_map = {
@@ -298,6 +354,14 @@ class BertPretrainedModel(PretrainedModel):
"https://paddlenlp.bj.bcebos.com/models/transformers/macbert/macbert-large-chinese.pdparams",
"simbert-base-chinese":
"https://paddlenlp.bj.bcebos.com/models/transformers/simbert/simbert-base-chinese-v1.pdparams",
"bert-base-japanese":
"https://paddlenlp.bj.bcebos.com/models/transformers/community/iverxin/bert-base-japanese/bert-base-japanese.pdparams",
"bert-base-japanese-whole-word-masking":
"https://paddlenlp.bj.bcebos.com/models/transformers/community/iverxin/bert-base-japanese-whole-word-masking/bert-base-japanese-whole-word-masking.pdparams",
"bert-base-japanese-char":
"https://paddlenlp.bj.bcebos.com/models/transformers/community/iverxin/bert-base-japanese-char/bert-base-japanese-char.pdparams",
"bert-base-japanese-char-whole-word-masking":
"https://paddlenlp.bj.bcebos.com/models/transformers/community/iverxin/bert-base-japanese-char-whole-word-masking/bert-base-japanese-char-whole-word-masking.pdparams",
}
}
base_model_prefix = "bert"
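With these configurations and weight URLs registered, the new names should load like any other built-in BERT checkpoint. A minimal sketch, assuming the weight files are reachable at the URLs above and that `BertModel` returns the usual `(sequence_output, pooled_output)` tuple:

```python
import paddle
from paddlenlp.transformers import BertModel

# "bert-base-japanese" is now listed in pretrained_init_configuration and
# pretrained_resource_files_map, so from_pretrained can resolve it by name.
model = BertModel.from_pretrained("bert-base-japanese")
model.eval()

# Shape check with arbitrary token ids below vocab_size (32000 for this config).
input_ids = paddle.to_tensor([[2, 10, 20, 3]])
sequence_output, pooled_output = model(input_ids)
print(sequence_output.shape)  # expected: [1, 4, 768]
```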
42 changes: 35 additions & 7 deletions paddlenlp/transformers/bert/tokenizer.py
@@ -14,16 +14,16 @@
# limitations under the License.

import copy
import io
import json
import os
import six
import unicodedata
import collections

from .. import PretrainedTokenizer
from ..tokenizer_utils import convert_to_unicode, whitespace_tokenize, _is_whitespace, _is_control, _is_punctuation

__all__ = ['BasicTokenizer', 'BertTokenizer', 'WordpieceTokenizer']
__all__ = [
'BasicTokenizer', 'BertTokenizer', 'WordpieceTokenizer',
]


class BasicTokenizer(object):
@@ -296,7 +296,7 @@ class BertTokenizer(PretrainedTokenizer):
print(inputs)

'''
{'input_ids': [101, 2002, 2001, 1037, 13997, 11510, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0]}
['he', 'was', 'a', 'puppet', '##eer']
'''

"""
@@ -327,6 +327,14 @@ class BertTokenizer(PretrainedTokenizer):
"https://paddle-hapi.bj.bcebos.com/models/bert/bert-base-chinese-vocab.txt",
"simbert-base-chinese":
"https://paddlenlp.bj.bcebos.com/models/transformers/simbert/vocab.txt",
"bert-base-japanese":
"https://huggingface.co/cl-tohoku/bert-base-japanese/resolve/main/vocab.txt",
"bert-base-japanese-whole-word-masking":
"https://huggingface.co/cl-tohoku/bert-base-japanese-whole-word-masking/resolve/main/vocab.txt",
"bert-base-japanese-char":
"https://huggingface.co/cl-tohoku/bert-base-japanese-char/resolve/main/vocab.txt",
"bert-base-japanese-char-whole-word-masking":
"https://huggingface.co/cl-tohoku/bert-base-japanese-char-whole-word-masking/resolve/main/vocab.txt"
}
}
pretrained_init_configuration = {
@@ -366,6 +374,26 @@ class BertTokenizer(PretrainedTokenizer):
"simbert-base-chinese": {
"do_lower_case": True
},
"bert-base-japanese": {
"do_lower_case": False,
"word_tokenizer_type": "mecab",
"subword_tokenizer_type": "wordpiece",
},
"bert-base-japanese-whole-word-masking": {
"do_lower_case": False,
"word_tokenizer_type": "mecab",
"subword_tokenizer_type": "wordpiece",
},
"bert-base-japanese-char": {
"do_lower_case": False,
"word_tokenizer_type": "mecab",
"subword_tokenizer_type": "character",
},
"bert-base-japanese-char-whole-word-masking": {
"do_lower_case": False,
"word_tokenizer_type": "mecab",
"subword_tokenizer_type": "character",
},
}
padding_side = 'right'

@@ -554,7 +582,7 @@ def create_token_type_ids_from_sequences(self,
0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1
| first sequence | second sequence |

If :obj:`token_ids_1` is :obj:`None`, this method only returns the first portion of the mask (0s).
If `token_ids_1` is `None`, this method only returns the first portion of the mask (0s).

Args:
token_ids_0 (List[int]):
@@ -605,4 +633,4 @@ def get_special_tokens_mask(self,
if token_ids_1 is not None:
return [1] + ([0] * len(token_ids_0)) + [1] + (
[0] * len(token_ids_1)) + [1]
return [1] + ([0] * len(token_ids_0)) + [1]
return [1] + ([0] * len(token_ids_0)) + [1]
Empty file.
@@ -0,0 +1,69 @@
import paddle
import torch
import numpy as np
from paddle.utils.download import get_path_from_url

model_names = [
"bert-base-japanese", "bert-base-japanese-whole-word-masking",
"bert-base-japanese-char", "bert-base-japanese-char-whole-word-masking"
]

for model_name in model_names:
torch_model_url = "https://huggingface.co/cl-tohoku/%s/resolve/main/pytorch_model.bin" % model_name
# Download into a per-model subdirectory: every checkpoint shares the filename
# pytorch_model.bin, and get_path_from_url reuses an existing file when no md5 is given,
# so a shared directory would silently reuse the first model's weights.
torch_model_path = get_path_from_url(torch_model_url, '../bert/%s' % model_name)
torch_state_dict = torch.load(torch_model_path)

paddle_model_path = "%s.pdparams" % model_name
paddle_state_dict = {}

# State_dict's keys mapping: from torch to paddle
keys_dict = {
# about embeddings
"embeddings.LayerNorm.gamma": "embeddings.layer_norm.weight",
"embeddings.LayerNorm.beta": "embeddings.layer_norm.bias",

# about encoder layer
'encoder.layer': 'encoder.layers',
'attention.self.query': 'self_attn.q_proj',
'attention.self.key': 'self_attn.k_proj',
'attention.self.value': 'self_attn.v_proj',
'attention.output.dense': 'self_attn.out_proj',
'attention.output.LayerNorm.gamma': 'norm1.weight',
'attention.output.LayerNorm.beta': 'norm1.bias',
'intermediate.dense': 'linear1',
'output.dense': 'linear2',
'output.LayerNorm.gamma': 'norm2.weight',
'output.LayerNorm.beta': 'norm2.bias',

# about cls predictions
'cls.predictions.transform.dense': 'cls.predictions.transform',
'cls.predictions.decoder.weight': 'cls.predictions.decoder_weight',
'cls.predictions.transform.LayerNorm.gamma':
'cls.predictions.layer_norm.weight',
'cls.predictions.transform.LayerNorm.beta':
'cls.predictions.layer_norm.bias',
'cls.predictions.bias': 'cls.predictions.decoder_bias'
}

for torch_key in torch_state_dict:
paddle_key = torch_key
for k in keys_dict:
if k in paddle_key:
paddle_key = paddle_key.replace(k, keys_dict[k])

if ('linear' in paddle_key) or ('proj' in paddle_key) or (
'vocab' in paddle_key and 'weight' in paddle_key) or (
"dense.weight" in paddle_key) or (
'transform.weight' in paddle_key) or (
'seq_relationship.weight' in paddle_key):
paddle_state_dict[paddle_key] = paddle.to_tensor(torch_state_dict[
torch_key].cpu().numpy().transpose())
else:
paddle_state_dict[paddle_key] = paddle.to_tensor(torch_state_dict[
torch_key].cpu().numpy())

print("torch: ", torch_key, "\t", torch_state_dict[torch_key].shape)
print("paddle: ", paddle_key, "\t", paddle_state_dict[paddle_key].shape,
"\n")

paddle.save(paddle_state_dict, paddle_model_path)
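A small follow-up check, not part of the PR, to reload a saved `.pdparams` file and confirm the tensor count matches the torch checkpoint. Paths follow the per-model download directory used above.

```python
# Hedged sanity check, not part of this PR; run after the conversion loop finishes.
import paddle
import torch

model_name = "bert-base-japanese"
torch_sd = torch.load("../bert/%s/pytorch_model.bin" % model_name, map_location="cpu")
paddle_sd = paddle.load("%s.pdparams" % model_name)

# The conversion maps each torch tensor to exactly one paddle tensor, so counts should match.
assert len(torch_sd) == len(paddle_sd), (len(torch_sd), len(paddle_sd))
print("converted %d tensors for %s" % (len(paddle_sd), model_name))
```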