[PaddlePaddle Hackathon] Task 51 #1115

Merged
merged 31 commits on Oct 28, 2021

Commits (31)
5dc08f9
add bert japanese
iverxin Oct 4, 2021
4e43acd
Merge branch 'develop' into t51
yingyibiao Oct 18, 2021
ff5db0c
fix model-weight files position
iverxin Oct 19, 2021
6855b34
Merge remote-tracking branch 'origin/t51' into t51
iverxin Oct 19, 2021
5392945
add weights files url
iverxin Oct 19, 2021
3e647af
create package: bert_japanese
iverxin Oct 20, 2021
f299d94
Merge branch 'develop' into t51
iverxin Oct 20, 2021
2ca28b4
update weights readme
iverxin Oct 20, 2021
0621950
Merge remote-tracking branch 'origin/t51' into t51
iverxin Oct 20, 2021
dff2c5c
update weights files
iverxin Oct 20, 2021
22bd7c4
update config pretrain weights https
iverxin Oct 20, 2021
de0f7bf
fix weight config files
iverxin Oct 21, 2021
c74338a
retest CI
iverxin Oct 22, 2021
109db2f
Merge branch 'develop' into t51
yingyibiao Oct 24, 2021
919fcb6
update
iverxin Oct 24, 2021
c3942ce
update
iverxin Oct 24, 2021
6177a03
fix docstring
iverxin Oct 25, 2021
ed9c933
update
iverxin Oct 25, 2021
b494f71
update pretrained weights
iverxin Oct 25, 2021
b096712
update weights readme
iverxin Oct 26, 2021
f85adb5
remove weights url in codes
iverxin Oct 26, 2021
1fb1e1e
update...
iverxin Oct 26, 2021
76b9094
Merge branch 'develop' into t51
iverxin Oct 26, 2021
bc55f1e
update...
iverxin Oct 26, 2021
eec73dd
update weights readme
iverxin Oct 27, 2021
2858749
update
iverxin Oct 27, 2021
7abd183
Merge branch 'develop' into t51
iverxin Oct 27, 2021
00b6b1d
update
iverxin Oct 27, 2021
c47c473
update docstring
iverxin Oct 27, 2021
5734afd
clean up redundant code
iverxin Oct 27, 2021
d943f52
Merge branch 'develop' into t51
yingyibiao Oct 28, 2021
64 changes: 64 additions & 0 deletions community/iverxin/bert-base-japanese-char-whole-word-masking/README.md
@@ -0,0 +1,64 @@


# BERT base Japanese (character tokenization, whole word masking enabled)

This is a [BERT](https://github.com/google-research/bert) model pretrained on texts in the Japanese language.

This version of the model processes input texts with word-level tokenization based on the IPA dictionary, followed by character-level tokenization.

Additionally, the model is trained with the whole word masking enabled for the masked language modeling (MLM) objective.

The code for pretraining is available at [cl-tohoku/bert-japanese](https://github.com/cl-tohoku/bert-japanese/tree/v1.0).

## Model architecture

The model architecture is the same as the original BERT base model: 12 layers, a hidden size of 768, and 12 attention heads.

## Training Data

The model is trained on Japanese Wikipedia as of September 1, 2019.

To generate the training corpus, [WikiExtractor](https://github.com/attardi/wikiextractor) is used to extract plain texts from a dump file of Wikipedia articles.

The text files used for the training are 2.6GB in size, consisting of approximately 17M sentences.

## Tokenization

The texts are first tokenized by the [MeCab](https://taku910.github.io/mecab/) morphological parser with the IPA dictionary and then split into characters.

The vocabulary size is 4000.
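
For a concrete picture of the two stages, the sketch below (an illustration added here, not part of the original model card) first runs the MeCab word splitter on a sentence and then shows the character-level pieces the model actually consumes. It assumes the `MecabTokenizer` exported by this PR follows the cl-tohoku interface and that the MeCab dependencies are installed; the printed outputs are indicative.

```python
from paddlenlp.transformers import BertJapaneseTokenizer, MecabTokenizer

# Stage 1: word-level segmentation with MeCab (IPA dictionary).
word_tokenizer = MecabTokenizer()
print(word_tokenizer.tokenize("お寿司が食べたい。"))
# e.g. ['お', '寿司', 'が', '食べ', 'たい', '。']

# Stage 2: each word is further split into single characters (4000-token vocabulary).
tokenizer = BertJapaneseTokenizer.from_pretrained(
    "iverxin/bert-base-japanese-char-whole-word-masking/")
print(tokenizer.tokenize("お寿司が食べたい。"))
# e.g. ['お', '寿', '司', 'が', '食', 'べ', 'た', 'い', '。']
```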

## Training

The model is trained with the same configuration as the original BERT: 512 tokens per instance, 256 instances per batch, and 1M training steps.

For the masked language modeling (MLM) objective, we introduced **whole word masking**, in which all of the subword tokens corresponding to a single word (as tokenized by MeCab) are masked at once.
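
The following toy sketch (illustrative only; it is not the pretraining code referenced above) shows what whole word masking means at the character level: when a MeCab word is selected for masking, every character token it spans is replaced by `[MASK]` together.

```python
import random

words = ["お", "寿司", "が", "食べ", "たい", "。"]  # MeCab word segmentation
masked_tokens = []
for word in words:
    chars = list(word)                    # character-level subwords of this word
    if random.random() < 0.15:            # BERT's usual 15% masking rate
        masked_tokens.extend(["[MASK]"] * len(chars))  # mask the whole word at once
    else:
        masked_tokens.extend(chars)
print(masked_tokens)
```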

## Licenses

The pretrained models are distributed under the terms of the [Creative Commons Attribution-ShareAlike 3.0](https://creativecommons.org/licenses/by-sa/3.0/) license.

## Acknowledgments

For training the models, we used Cloud TPUs provided by the [TensorFlow Research Cloud](https://www.tensorflow.org/tfrc/) program.

## Usage
```python
import paddle
from paddlenlp.transformers import BertJapaneseTokenizer, BertForMaskedLM

path = "iverxin/bert-base-japanese-char-whole-word-masking/"
tokenizer = BertJapaneseTokenizer.from_pretrained(path)
model = BertForMaskedLM.from_pretrained(path)
text1 = "こんにちは"

model.eval()
inputs = tokenizer(text1)
# Add a batch dimension and convert the tokenizer output to Paddle tensors.
inputs = {k: paddle.to_tensor([v]) for (k, v) in inputs.items()}
output = model(**inputs)
print(output.shape)
# [1, sequence_length, 4000]: MLM logits over the 4000-token character vocabulary
```
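
Continuing the snippet above, the logits can be mapped back to characters with an arg-max over the vocabulary; this follow-up is an added illustration rather than part of the original model card.

```python
pred_ids = paddle.argmax(output, axis=-1)[0].numpy().tolist()
print(tokenizer.convert_ids_to_tokens(pred_ids))
# one predicted character per input position, including the [CLS]/[SEP] slots
```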

## Weights source
https://huggingface.co/cl-tohoku/bert-base-japanese-char-whole-word-masking
6 changes: 6 additions & 0 deletions community/iverxin/bert-base-japanese-char-whole-word-masking/files.json
@@ -0,0 +1,6 @@
{
"model_config_file":"https://paddlenlp.bj.bcebos.com/models/transformers/community/iverxin/bert-base-japanese-char-whole-word-masking/model_config.json",
"model_state": "https://paddlenlp.bj.bcebos.com/models/transformers/community/iverxin/bert-base-japanese-char-whole-word-masking/model_state.pdparams",
"tokenizer_config_file": "https://paddlenlp.bj.bcebos.com/models/transformers/community/iverxin/bert-base-japanese-char-whole-word-masking/tokenizer_config.pdparams",
"vocab_file": "https://paddlenlp.bj.bcebos.com/models/transformers/community/iverxin/bert-base-japanese-char-whole-word-masking/vocab.txt"
}
60 changes: 60 additions & 0 deletions community/iverxin/bert-base-japanese-char/README.md
@@ -0,0 +1,60 @@


# BERT base Japanese (character tokenization)

This is a [BERT](https://github.com/google-research/bert) model pretrained on texts in the Japanese language.

This version of the model processes input texts with word-level tokenization based on the IPA dictionary, followed by character-level tokenization.

The code for pretraining is available at [cl-tohoku/bert-japanese](https://github.com/cl-tohoku/bert-japanese/tree/v1.0).

## Model architecture

The model architecture is the same as the original BERT base model: 12 layers, a hidden size of 768, and 12 attention heads.

## Training Data

The model is trained on Japanese Wikipedia as of September 1, 2019.

To generate the training corpus, [WikiExtractor](https://github.com/attardi/wikiextractor) is used to extract plain texts from a dump file of Wikipedia articles.

The text files used for the training are 2.6GB in size, consisting of approximately 17M sentences.

## Tokenization

The texts are first tokenized by the [MeCab](https://taku910.github.io/mecab/) morphological parser with the IPA dictionary and then split into characters.

The vocabulary size is 4000.

## Training

The model is trained with the same configuration as the original BERT: 512 tokens per instance, 256 instances per batch, and 1M training steps.

## Licenses

The pretrained models are distributed under the terms of the [Creative Commons Attribution-ShareAlike 3.0](https://creativecommons.org/licenses/by-sa/3.0/) license.

## Acknowledgments

For training the models, we used Cloud TPUs provided by the [TensorFlow Research Cloud](https://www.tensorflow.org/tfrc/) program.

## Usage
```python
import paddle
from paddlenlp.transformers import BertJapaneseTokenizer, BertForMaskedLM

path = "iverxin/bert-base-japanese-char/"
tokenizer = BertJapaneseTokenizer.from_pretrained(path)
model = BertForMaskedLM.from_pretrained(path)
text1 = "こんにちは"

model.eval()
inputs = tokenizer(text1)
# Add a batch dimension and convert the tokenizer output to Paddle tensors.
inputs = {k: paddle.to_tensor([v]) for (k, v) in inputs.items()}
output = model(**inputs)
print(output.shape)
# [1, sequence_length, 4000]: MLM logits over the 4000-token character vocabulary
```
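
As a qualitative check of the MLM head, the sketch below (added for illustration; the sentence and masked position are arbitrary) masks one character and prints the model's top prediction for it. It assumes the tokenizer exposes the usual `mask_token`, `cls_token_id`, and `sep_token_id` attributes, as PaddleNLP BERT tokenizers do.

```python
import paddle
from paddlenlp.transformers import BertJapaneseTokenizer, BertForMaskedLM

path = "iverxin/bert-base-japanese-char/"
tokenizer = BertJapaneseTokenizer.from_pretrained(path)
model = BertForMaskedLM.from_pretrained(path)
model.eval()

tokens = tokenizer.tokenize("東京は日本の首都です。")  # character-level tokens
tokens[0] = tokenizer.mask_token                       # mask the first character ("東")
input_ids = [tokenizer.cls_token_id] \
    + tokenizer.convert_tokens_to_ids(tokens) \
    + [tokenizer.sep_token_id]
input_ids = paddle.to_tensor([input_ids])              # add a batch dimension

with paddle.no_grad():
    logits = model(input_ids)                          # [1, seq_len, vocab_size]
top_id = paddle.argmax(logits[0, 1]).numpy().item()    # position 1 is the masked slot
print(tokenizer.convert_ids_to_tokens([top_id]))
```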

## Weights source
https://huggingface.co/cl-tohoku/bert-base-japanese-char
6 changes: 6 additions & 0 deletions community/iverxin/bert-base-japanese-char/files.json
@@ -0,0 +1,6 @@
{
"model_config_file":"https://paddlenlp.bj.bcebos.com/models/transformers/community/iverxin/bert-base-japanese-char/model_config.json",
"model_state": "https://paddlenlp.bj.bcebos.com/models/transformers/community/iverxin/bert-base-japanese-char/model_state.pdparams",
"tokenizer_config_file": "https://paddlenlp.bj.bcebos.com/models/transformers/community/iverxin/bert-base-japanese-char/tokenizer_config.pdparams",
"vocab_file": "https://paddlenlp.bj.bcebos.com/models/transformers/community/iverxin/bert-base-japanese-char/vocab.txt"
}
63 changes: 63 additions & 0 deletions community/iverxin/bert-base-japanese-whole-word-masking/README.md
@@ -0,0 +1,63 @@


# BERT base Japanese (IPA dictionary, whole word masking enabled)

This is a [BERT](https://github.com/google-research/bert) model pretrained on texts in the Japanese language.

This version of the model processes input texts with word-level tokenization based on the IPA dictionary, followed by the WordPiece subword tokenization.

Additionally, the model is trained with the whole word masking enabled for the masked language modeling (MLM) objective.

The code for pretraining is available at [cl-tohoku/bert-japanese](https://github.com/cl-tohoku/bert-japanese/tree/v1.0).

## Model architecture

The model architecture is the same as the original BERT base model: 12 layers, a hidden size of 768, and 12 attention heads.

## Training Data

The model is trained on Japanese Wikipedia as of September 1, 2019.

To generate the training corpus, [WikiExtractor](https://github.com/attardi/wikiextractor) is used to extract plain texts from a dump file of Wikipedia articles.

The text files used for the training are 2.6GB in size, consisting of approximately 17M sentences.

## Tokenization

The texts are first tokenized by the [MeCab](https://taku910.github.io/mecab/) morphological parser with the IPA dictionary and then split into subwords by the WordPiece algorithm.

The vocabulary size is 32000.
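
As a small illustration of the two stages (added here; the outputs are indicative and assume the tokenizer in this PR mirrors the cl-tohoku interface), the same sentence is first split into MeCab words and then into WordPiece subwords, where a `##` prefix marks a continuation piece of a word:

```python
from paddlenlp.transformers import BertJapaneseTokenizer

tokenizer = BertJapaneseTokenizer.from_pretrained(
    "iverxin/bert-base-japanese-whole-word-masking/")
print(tokenizer.tokenize("自然言語処理を勉強しています。"))
# e.g. ['自然', '言語', '処理', 'を', '勉強', 'し', 'て', 'い', 'ます', '。'];
# rarer words fall back to '##'-prefixed WordPiece pieces.
```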

## Training

The model is trained with the same configuration as the original BERT: 512 tokens per instance, 256 instances per batch, and 1M training steps.

For the masked language modeling (MLM) objective, we introduced **whole word masking**, in which all of the subword tokens corresponding to a single word (as tokenized by MeCab) are masked at once.

## Licenses

The pretrained models are distributed under the terms of the [Creative Commons Attribution-ShareAlike 3.0](https://creativecommons.org/licenses/by-sa/3.0/) license.

## Acknowledgments

For training the models, we used Cloud TPUs provided by the [TensorFlow Research Cloud](https://www.tensorflow.org/tfrc/) program.

## Usage
```python
import paddle
from paddlenlp.transformers import BertJapaneseTokenizer, BertForMaskedLM

path = "iverxin/bert-base-japanese-whole-word-masking/"
tokenizer = BertJapaneseTokenizer.from_pretrained(path)
model = BertForMaskedLM.from_pretrained(path)
text1 = "こんにちは"

model.eval()
inputs = tokenizer(text1)
inputs = {k: paddle.to_tensor([v]) for (k, v) in inputs.items()}
output = model(**inputs)
print(output.shape)
```

## Weights source
https://huggingface.co/cl-tohoku/bert-base-japanese-whole-word-masking
6 changes: 6 additions & 0 deletions community/iverxin/bert-base-japanese-whole-word-masking/files.json
@@ -0,0 +1,6 @@
{
"model_config_file":"https://paddlenlp.bj.bcebos.com/models/transformers/community/iverxin/bert-base-japanese-whole-word-masking/model_config.json",
"model_state": "https://paddlenlp.bj.bcebos.com/models/transformers/community/iverxin/bert-base-japanese-whole-word-masking/model_state.pdparams",
"tokenizer_config_file": "https://paddlenlp.bj.bcebos.com/models/transformers/community/iverxin/bert-base-japanese-whole-word-masking/tokenizer_config.pdparams",
"vocab_file": "https://paddlenlp.bj.bcebos.com/models/transformers/community/iverxin/bert-base-japanese-whole-word-masking/vocab.txt"
}
59 changes: 59 additions & 0 deletions community/iverxin/bert-base-japanese/README.md
@@ -0,0 +1,59 @@
# BERT base Japanese (IPA dictionary)

This is a [BERT](https://github.com/google-research/bert) model pretrained on texts in the Japanese language.

This version of the model processes input texts with word-level tokenization based on the IPA dictionary, followed by the WordPiece subword tokenization.

The code for pretraining is available at [cl-tohoku/bert-japanese](https://github.com/cl-tohoku/bert-japanese/tree/v1.0).

## Model architecture

The model architecture is the same as the original BERT base model: 12 layers, a hidden size of 768, and 12 attention heads.

## Training Data

The model is trained on Japanese Wikipedia as of September 1, 2019.

To generate the training corpus, [WikiExtractor](https://github.com/attardi/wikiextractor) is used to extract plain texts from a dump file of Wikipedia articles.

The text files used for the training are 2.6GB in size, consisting of approximately 17M sentences.

## Tokenization

The texts are first tokenized by the [MeCab](https://taku910.github.io/mecab/) morphological parser with the IPA dictionary and then split into subwords by the WordPiece algorithm.

The vocabulary size is 32000.

## Training

The model is trained with the same configuration as the original BERT: 512 tokens per instance, 256 instances per batch, and 1M training steps.

## Licenses

The pretrained models are distributed under the terms of the [Creative Commons Attribution-ShareAlike 3.0](https://creativecommons.org/licenses/by-sa/3.0/) license.

## Acknowledgments

For training the models, we used Cloud TPUs provided by the [TensorFlow Research Cloud](https://www.tensorflow.org/tfrc/) program.


## Usage
```python
import paddle
from paddlenlp.transformers import BertJapaneseTokenizer, BertForMaskedLM

path = "iverxin/bert-base-japanese/"
tokenizer = BertJapaneseTokenizer.from_pretrained(path)
model = BertForMaskedLM.from_pretrained(path)
text1 = "こんにちは"

model.eval()
inputs = tokenizer(text1)
inputs = {k: paddle.to_tensor([v]) for (k, v) in inputs.items()}
output = model(**inputs)
print(output.shape)
```
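
The same checkpoint can also be used for plain feature extraction through `BertModel`; this is an added sketch, assuming PaddleNLP's usual base-model prefix handling lets the community weights load into the backbone (the MLM-head parameters are simply ignored with a warning).

```python
import paddle
from paddlenlp.transformers import BertJapaneseTokenizer, BertModel

path = "iverxin/bert-base-japanese/"
tokenizer = BertJapaneseTokenizer.from_pretrained(path)
model = BertModel.from_pretrained(path)
model.eval()

inputs = tokenizer("こんにちは")
inputs = {k: paddle.to_tensor([v]) for (k, v) in inputs.items()}
with paddle.no_grad():
    sequence_output, pooled_output = model(**inputs)
print(sequence_output.shape)  # [1, seq_len, 768]
print(pooled_output.shape)    # [1, 768]
```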


## Weights source
https://huggingface.co/cl-tohoku/bert-base-japanese
6 changes: 6 additions & 0 deletions community/iverxin/bert-base-japanese/files.json
@@ -0,0 +1,6 @@
{
"model_config_file":"https://paddlenlp.bj.bcebos.com/models/transformers/community/iverxin/bert-base-japanese/model_config.json",
"model_state": "https://paddlenlp.bj.bcebos.com/models/transformers/community/iverxin/bert-base-japanese/model_state.pdparams",
"tokenizer_config_file": "https://paddlenlp.bj.bcebos.com/models/transformers/community/iverxin/bert-base-japanese/tokenizer_config.pdparams",
"vocab_file": "https://paddlenlp.bj.bcebos.com/models/transformers/community/iverxin/bert-base-japanese/vocab.txt"
}
1 change: 1 addition & 0 deletions paddlenlp/transformers/__init__.py
@@ -18,6 +18,7 @@

from .bert.modeling import *
from .bert.tokenizer import *
from .bert_japanese.tokenizer import *
from .ernie.modeling import *
from .ernie.tokenizer import *
from .gpt.modeling import *
15 changes: 8 additions & 7 deletions paddlenlp/transformers/bert/tokenizer.py
@@ -14,16 +14,17 @@
# limitations under the License.

import copy
import io
import json
import os
import six
import unicodedata

from .. import PretrainedTokenizer
from ..tokenizer_utils import convert_to_unicode, whitespace_tokenize, _is_whitespace, _is_control, _is_punctuation

__all__ = ['BasicTokenizer', 'BertTokenizer', 'WordpieceTokenizer']
__all__ = [
'BasicTokenizer',
'BertTokenizer',
'WordpieceTokenizer',
]


class BasicTokenizer(object):
@@ -290,9 +291,9 @@ class BertTokenizer(PretrainedTokenizer):
.. code-block::

from paddlenlp.transformers import BertTokenizer
berttokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

inputs = berttokenizer.tokenize('He was a puppeteer')
inputs = tokenizer('He was a puppeteer')
print(inputs)

'''
@@ -554,7 +555,7 @@ def create_token_type_ids_from_sequences(self,
0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1
| first sequence | second sequence |

If :obj:`token_ids_1` is :obj:`None`, this method only returns the first portion of the mask (0s).
If `token_ids_1` is `None`, this method only returns the first portion of the mask (0s).

Args:
token_ids_0 (List[int]):
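
For context on the 0/1 pattern shown in the docstring above, a short added example (not part of the diff) of encoding a sentence pair with the tokenizer:

```python
from paddlenlp.transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
encoded = tokenizer('He was a puppeteer', 'He made the puppets')
print(encoded['token_type_ids'])
# 0s cover "[CLS] ... [SEP]" of the first sentence, 1s cover the second sentence and its [SEP]
```
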
Empty file.
@@ -0,0 +1,69 @@
# Convert the cl-tohoku PyTorch BERT checkpoints to PaddlePaddle .pdparams files.
import paddle
import torch
from paddle.utils.download import get_path_from_url

model_names = [
    "bert-base-japanese", "bert-base-japanese-whole-word-masking",
    "bert-base-japanese-char", "bert-base-japanese-char-whole-word-masking"
]

for model_name in model_names:
    torch_model_url = "https://huggingface.co/cl-tohoku/%s/resolve/main/pytorch_model.bin" % model_name
    torch_model_path = get_path_from_url(torch_model_url, '../bert')
    torch_state_dict = torch.load(torch_model_path, map_location="cpu")

    paddle_model_path = "%s.pdparams" % model_name
    paddle_state_dict = {}

    # State_dict's keys mapping: from torch to paddle
    keys_dict = {
        # about embeddings
        "embeddings.LayerNorm.gamma": "embeddings.layer_norm.weight",
        "embeddings.LayerNorm.beta": "embeddings.layer_norm.bias",

        # about encoder layer
        'encoder.layer': 'encoder.layers',
        'attention.self.query': 'self_attn.q_proj',
        'attention.self.key': 'self_attn.k_proj',
        'attention.self.value': 'self_attn.v_proj',
        'attention.output.dense': 'self_attn.out_proj',
        'attention.output.LayerNorm.gamma': 'norm1.weight',
        'attention.output.LayerNorm.beta': 'norm1.bias',
        'intermediate.dense': 'linear1',
        'output.dense': 'linear2',
        'output.LayerNorm.gamma': 'norm2.weight',
        'output.LayerNorm.beta': 'norm2.bias',

        # about cls predictions
        'cls.predictions.transform.dense': 'cls.predictions.transform',
        'cls.predictions.decoder.weight': 'cls.predictions.decoder_weight',
        'cls.predictions.transform.LayerNorm.gamma':
        'cls.predictions.layer_norm.weight',
        'cls.predictions.transform.LayerNorm.beta':
        'cls.predictions.layer_norm.bias',
        'cls.predictions.bias': 'cls.predictions.decoder_bias'
    }

    for torch_key in torch_state_dict:
        # Rename the key according to the mapping above.
        paddle_key = torch_key
        for k in keys_dict:
            if k in paddle_key:
                paddle_key = paddle_key.replace(k, keys_dict[k])

        # paddle.nn.Linear stores weights as [in_features, out_features], the
        # transpose of torch.nn.Linear, so linear/projection weights are transposed.
        if ('linear' in paddle_key) or ('proj' in paddle_key) or (
                'vocab' in paddle_key and 'weight' in paddle_key) or (
                    "dense.weight" in paddle_key) or (
                        'transform.weight' in paddle_key) or (
                            'seq_relationship.weight' in paddle_key):
            paddle_state_dict[paddle_key] = paddle.to_tensor(
                torch_state_dict[torch_key].cpu().numpy().transpose())
        else:
            paddle_state_dict[paddle_key] = paddle.to_tensor(
                torch_state_dict[torch_key].cpu().numpy())

        print("torch: ", torch_key, "\t", torch_state_dict[torch_key].shape)
        print("paddle: ", paddle_key, "\t", paddle_state_dict[paddle_key].shape,
              "\n")

    paddle.save(paddle_state_dict, paddle_model_path)
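
A quick follow-up sanity check (a sketch; the file name below assumes the script has just been run from the same directory) is to reload one converted file and confirm that linear weights appear transposed relative to the PyTorch checkpoint:

```python
import paddle

state = paddle.load("bert-base-japanese.pdparams")
for key in list(state.keys())[:8]:
    print(key, state[key].shape)
# e.g. encoder.layers.0.linear1.weight should be [768, 3072],
# the transpose of torch's intermediate.dense.weight [3072, 768].
```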