
Add a special tokenizer for CPM model #11068

Merged (15 commits) on Apr 9, 2021
2 changes: 1 addition & 1 deletion .circleci/config.yml
@@ -277,7 +277,7 @@ jobs:
- v0.4-custom_tokenizers-{{ checksum "setup.py" }}
- v0.4-{{ checksum "setup.py" }}
- run: pip install --upgrade pip
- run: pip install .[ja,testing,sentencepiece]
- run: pip install .[ja,testing,sentencepiece,jieba]
- run: python -m unidic download
- save_cache:
key: v0.4-custom_tokenizers-{{ checksum "setup.py" }}
1 change: 1 addition & 0 deletions docs/source/index.rst
@@ -429,6 +429,7 @@ TensorFlow and/or Flax.
model_doc/bort
model_doc/camembert
model_doc/convbert
model_doc/cpm
model_doc/ctrl
model_doc/deberta
model_doc/deberta_v2
44 changes: 44 additions & 0 deletions docs/source/model_doc/cpm.rst
@@ -0,0 +1,44 @@
..
Copyright 2020 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.

CPM
-----------------------------------------------------------------------------------------------------------------------

Overview
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Member: Could you start the overview by mentioning the paper name & the authors of that paper?

Member: And please mention the abstract below.

Member: Like what's done in BERT, for example.


The CPM model was proposed in `CPM: A Large-scale Generative Chinese Pre-trained Language Model
<https://arxiv.org/abs/2012.00413>`__ by Zhengyan Zhang, Xu Han, Hao Zhou, Pei Ke, Yuxian Gu, Deming Ye, Yujia Qin,
Yusheng Su, Haozhe Ji, Jian Guan, Fanchao Qi, Xiaozhi Wang, Yanan Zheng, Guoyang Zeng, Huanqi Cao, Shengqi Chen,
Daixuan Li, Zhenbo Sun, Zhiyuan Liu, Minlie Huang, Wentao Han, Jie Tang, Juanzi Li, Xiaoyan Zhu, Maosong Sun.

The abstract from the paper is the following:

*Pre-trained Language Models (PLMs) have proven to be beneficial for various downstream NLP tasks. Recently, GPT-3,
with 175 billion parameters and 570GB training data, drew a lot of attention due to the capacity of few-shot (even
zero-shot) learning. However, applying GPT-3 to address Chinese NLP tasks is still challenging, as the training corpus
of GPT-3 is primarily English, and the parameters are not publicly available. In this technical report, we release the
Chinese Pre-trained Language Model (CPM) with generative pre-training on large-scale Chinese training data. To the best
of our knowledge, CPM, with 2.6 billion parameters and 100GB Chinese training data, is the largest Chinese pre-trained
language model, which could facilitate several downstream Chinese NLP tasks, such as conversation, essay generation,
cloze test, and language understanding. Extensive experiments demonstrate that CPM achieves strong performance on many
NLP tasks in the settings of few-shot (even zero-shot) learning.*

The original implementation can be found here: https://github.com/TsinghuaAI/CPM-Generate

Note: We only have a tokenizer here, since the model architecture is the same as GPT-2.

CpmTokenizer
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.CpmTokenizer
:members:
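
Since the documentation above only covers the tokenizer, here is a minimal usage sketch (not part of this diff). It assumes jieba and sentencepiece are installed and that the TsinghuaAI/CPM-Generate checkpoint referenced in this PR is reachable on the Hub; the text mirrors the test added at the end of this diff.

# Usage sketch: tokenize and decode with the new CpmTokenizer.
from transformers import CpmTokenizer

tokenizer = CpmTokenizer.from_pretrained("TsinghuaAI/CPM-Generate")

text = "Hugging Face大法好,谁用谁知道。"
tokens = tokenizer.tokenize(text)              # jieba segmentation, then SentencePiece
ids = tokenizer.convert_tokens_to_ids(tokens)

# decode() reverses the whitespace escaping done during pre-tokenization
print(tokenizer.decode(ids))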
1 change: 1 addition & 0 deletions setup.py
@@ -101,6 +101,7 @@
"isort>=5.5.4",
"jax>=0.2.8",
"jaxlib>=0.1.59",
"jieba",
"keras2onnx",
"numpy>=1.17",
"onnxconverter-common",
2 changes: 2 additions & 0 deletions src/transformers/__init__.py
@@ -159,6 +159,7 @@
],
"models.camembert": ["CAMEMBERT_PRETRAINED_CONFIG_ARCHIVE_MAP", "CamembertConfig"],
"models.convbert": ["CONVBERT_PRETRAINED_CONFIG_ARCHIVE_MAP", "ConvBertConfig", "ConvBertTokenizer"],
"models.cpm": ["CpmTokenizer"],
"models.ctrl": ["CTRL_PRETRAINED_CONFIG_ARCHIVE_MAP", "CTRLConfig", "CTRLTokenizer"],
"models.deberta": ["DEBERTA_PRETRAINED_CONFIG_ARCHIVE_MAP", "DebertaConfig", "DebertaTokenizer"],
"models.deberta_v2": ["DEBERTA_V2_PRETRAINED_CONFIG_ARCHIVE_MAP", "DebertaV2Config"],
@@ -1454,6 +1455,7 @@
)
from .models.camembert import CAMEMBERT_PRETRAINED_CONFIG_ARCHIVE_MAP, CamembertConfig
from .models.convbert import CONVBERT_PRETRAINED_CONFIG_ARCHIVE_MAP, ConvBertConfig, ConvBertTokenizer
from .models.cpm import CpmTokenizer
from .models.ctrl import CTRL_PRETRAINED_CONFIG_ARCHIVE_MAP, CTRLConfig, CTRLTokenizer
from .models.deberta import DEBERTA_PRETRAINED_CONFIG_ARCHIVE_MAP, DebertaConfig, DebertaTokenizer
from .models.deberta_v2 import DEBERTA_V2_PRETRAINED_CONFIG_ARCHIVE_MAP, DebertaV2Config
1 change: 1 addition & 0 deletions src/transformers/dependency_versions_table.py
@@ -18,6 +18,7 @@
"isort": "isort>=5.5.4",
"jax": "jax>=0.2.8",
"jaxlib": "jaxlib>=0.1.59",
"jieba": "jieba",
"keras2onnx": "keras2onnx",
"numpy": "numpy>=1.17",
"onnxconverter-common": "onnxconverter-common",
1 change: 1 addition & 0 deletions src/transformers/models/__init__.py
@@ -30,6 +30,7 @@
blenderbot_small,
camembert,
convbert,
cpm,
ctrl,
deberta,
dialogpt,
3 changes: 3 additions & 0 deletions src/transformers/models/auto/tokenization_auto.py
@@ -115,6 +115,7 @@
from ..bert_generation.tokenization_bert_generation import BertGenerationTokenizer
from ..big_bird.tokenization_big_bird import BigBirdTokenizer
from ..camembert.tokenization_camembert import CamembertTokenizer
from ..cpm.tokenization_cpm import CpmTokenizer
from ..deberta_v2.tokenization_deberta_v2 import DebertaV2Tokenizer
from ..m2m_100 import M2M100Tokenizer
from ..marian.tokenization_marian import MarianTokenizer
@@ -134,6 +135,7 @@
BertGenerationTokenizer = None
BigBirdTokenizer = None
CamembertTokenizer = None
CpmTokenizer = None
DebertaV2Tokenizer = None
MarianTokenizer = None
MBartTokenizer = None
@@ -273,6 +275,7 @@
NO_CONFIG_TOKENIZER = [
BertJapaneseTokenizer,
BertweetTokenizer,
CpmTokenizer,
HerbertTokenizer,
HerbertTokenizerFast,
PhobertTokenizer,
48 changes: 48 additions & 0 deletions src/transformers/models/cpm/__init__.py
@@ -0,0 +1,48 @@
# flake8: noqa
# There's no way to ignore "F401 '...' imported but unused" warnings in this
# module, but to preserve other warnings. So, don't check this module at all.

# Copyright 2020 The HuggingFace Team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

from typing import TYPE_CHECKING

from ...file_utils import _BaseLazyModule


_import_structure = {
"tokenization_cpm": ["CpmTokenizer"],
}


if TYPE_CHECKING:
from .tokenization_cpm import CpmTokenizer

else:
import importlib
import os
import sys

class _LazyModule(_BaseLazyModule):
"""
Module class that surfaces all objects but only performs associated imports when the objects are requested.
"""

__file__ = globals()["__file__"]
__path__ = [os.path.dirname(__file__)]

def _get_module(self, module_name: str):
return importlib.import_module("." + module_name, self.__name__)

sys.modules[__name__] = _LazyModule(__name__, _import_structure)
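
As a side note, a small sketch of what the lazy-module indirection above buys. The deferred-import behaviour is an assumption about how _BaseLazyModule resolves attributes through _get_module; it is not exercised by this PR's tests.

# Sketch: importing the package is cheap; tokenization_cpm (and the sentencepiece-backed
# XLNet tokenizer it subclasses) is only imported when an attribute is first requested.
# jieba itself is imported even later, inside CpmTokenizer.__init__.
import transformers.models.cpm as cpm

tok_cls = cpm.CpmTokenizer        # attribute access triggers the real import
print(tok_cls.__module__)         # expected: transformers.models.cpm.tokenization_cpm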
109 changes: 109 additions & 0 deletions src/transformers/models/cpm/tokenization_cpm.py
@@ -0,0 +1,109 @@
# coding=utf-8
# Copyright 2018 The Google AI Language Team Authors and The HuggingFace Inc. team.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Tokenization classes."""
from ...utils import logging
from ..xlnet.tokenization_xlnet import XLNetTokenizer


logger = logging.get_logger(__name__)

VOCAB_FILES_NAMES = {"vocab_file": "spiece.model"}

PRETRAINED_VOCAB_FILES_MAP = {
"vocab_file": {
"TsinghuaAI/CPM-Generate": "https://huggingface.co/TsinghuaAI/CPM-Generate/resolve/main/spiece.model",
}
}


class CpmTokenizer(XLNetTokenizer):
"""Runs pre-tokenization with Jieba segmentation tool. It is used in CPM models."""

def __init__(self, *args, **kwargs):
"""
Construct a CPM tokenizer. Based on `Jieba <https://pypi.org/project/jieba/>`__ and `SentencePiece
<https://github.com/google/sentencepiece>`__.

This tokenizer inherits from :class:`~transformers.PreTrainedTokenizer` which contains most of the main
methods. Users should refer to this superclass for more information regarding those methods.

Args:
vocab_file (:obj:`str`):
`SentencePiece <https://github.com/google/sentencepiece>`__ file (generally has a .spm extension) that
contains the vocabulary necessary to instantiate a tokenizer.
do_lower_case (:obj:`bool`, `optional`, defaults to :obj:`True`):
Whether to lowercase the input when tokenizing.
remove_space (:obj:`bool`, `optional`, defaults to :obj:`True`):
Whether to strip the text when tokenizing (removing excess spaces before and after the string).
keep_accents (:obj:`bool`, `optional`, defaults to :obj:`False`):
Whether to keep accents when tokenizing.
bos_token (:obj:`str`, `optional`, defaults to :obj:`"<s>"`):
The beginning of sequence token that was used during pretraining. Can be used as a sequence classifier
token.

.. note::

When building a sequence using special tokens, this is not the token that is used for the beginning
of sequence. The token used is the :obj:`cls_token`.
eos_token (:obj:`str`, `optional`, defaults to :obj:`"</s>"`):
The end of sequence token.

.. note::

When building a sequence using special tokens, this is not the token that is used for the end of
sequence. The token used is the :obj:`sep_token`.
unk_token (:obj:`str`, `optional`, defaults to :obj:`"<unk>"`):
The unknown token. A token that is not in the vocabulary cannot be converted to an ID and is set to be
this token instead.
sep_token (:obj:`str`, `optional`, defaults to :obj:`"<sep>"`):
The separator token, which is used when building a sequence from multiple sequences, e.g. two sequences
for sequence classification or for a text and a question for question answering. It is also used as the
last token of a sequence built with special tokens.
pad_token (:obj:`str`, `optional`, defaults to :obj:`"<pad>"`):
The token used for padding, for example when batching sequences of different lengths.
cls_token (:obj:`str`, `optional`, defaults to :obj:`"<cls>"`):
The classifier token which is used when doing sequence classification (classification of the whole
sequence instead of per-token classification). It is the first token of the sequence when built with
special tokens.
mask_token (:obj:`str`, `optional`, defaults to :obj:`"<mask>"`):
The token used for masking values. This is the token used when training this model with masked language
modeling. This is the token which the model will try to predict.
additional_special_tokens (:obj:`List[str]`, `optional`, defaults to :obj:`["<eop>", "<eod>"]`):
Additional special tokens used by the tokenizer.

Attributes:
sp_model (:obj:`SentencePieceProcessor`):
The `SentencePiece` processor that is used for every conversion (string, tokens and IDs).
"""
super().__init__(*args, **kwargs)
try:
import jieba
except ModuleNotFoundError as error:
raise error.__class__(
"You need to install jieba to use CpmTokenizer."
"See https://pypi.org/project/jieba/ for installation."
)
self.jieba = jieba
self.translator = str.maketrans(" \n", "\u2582\u2583")

def _tokenize(self, text, *args, **kwargs):
text = [x.translate(self.translator) for x in self.jieba.cut(text, cut_all=False)]
text = " ".join(text)
return super()._tokenize(text, *args, **kwargs)

def _decode(self, *args, **kwargs):
text = super()._decode(*args, **kwargs)
text = text.replace(" ", "").replace("\u2582", " ").replace("\u2583", "\n")
return text
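
To make the whitespace handling in _tokenize / _decode concrete, here is a self-contained sketch of the same escaping round trip (pure Python, no jieba or SentencePiece required; the sample string is illustrative):

# Spaces and newlines are protected before SentencePiece sees the text by mapping
# them to U+2582 (▂) and U+2583 (▃); decoding drops the spaces introduced by the
# jieba join and then restores the original whitespace.
encode_table = str.maketrans(" \n", "\u2582\u2583")

raw = "Hugging Face 大法好\n谁用谁知道。"
protected = raw.translate(encode_table)
print(protected)                       # Hugging▂Face▂大法好▃谁用谁知道。

restored = protected.replace(" ", "").replace("\u2582", " ").replace("\u2583", "\n")
print(restored == raw)                 # True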
39 changes: 39 additions & 0 deletions tests/test_tokenization_cpm.py
@@ -0,0 +1,39 @@
# coding=utf-8
# Copyright 2018 HuggingFace Inc. team.
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.


from transformers.models.cpm.tokenization_cpm import CpmTokenizer
from transformers.testing_utils import custom_tokenizers

from .test_modeling_xlnet import XLNetModelTest


@custom_tokenizers
class CpmTokenizationTest(XLNetModelTest):
def test_pre_tokenization(self):
tokenizer = CpmTokenizer.from_pretrained("TsinghuaAI/CPM-Generate")
text = "Hugging Face大法好,谁用谁知道。"
normalized_text = "Hugging Face大法好,谁用谁知道。<unk>"
bpe_tokens = "▁Hu gg ing ▁ ▂ ▁F ace ▁大法 ▁好 ▁ , ▁谁 ▁用 ▁谁 ▁知 道 ▁ 。".split()

tokens = tokenizer.tokenize(text)
self.assertListEqual(tokens, bpe_tokens)

input_tokens = tokens + [tokenizer.unk_token]

input_bpe_tokens = [13789, 13283, 1421, 8, 10, 1164, 13608, 16528, 63, 8, 9, 440, 108, 440, 121, 90, 8, 12, 0]
self.assertListEqual(tokenizer.convert_tokens_to_ids(input_tokens), input_bpe_tokens)

reconstructed_text = tokenizer.decode(input_bpe_tokens)
self.assertEqual(reconstructed_text, normalized_text)