Add a special tokenizer for CPM model #11068
Merged
Changes from 12 commits

Commits (15):
- 9e99fa7 Add a special tokenizer for CPM model (JetRunner)
- 7f8eb83 make style (JetRunner)
- 1b3924b fix (JetRunner)
- 60d7633 Add docs (JetRunner)
- 6b78cfb styles (JetRunner)
- 58014eb cpm doc (JetRunner)
- 18bd9ac fix ci (JetRunner)
- 715d9fd fix the overview (JetRunner)
- 0ebec18 add test (JetRunner)
- df81742 make style (JetRunner)
- aba540f typo (JetRunner)
- 5ba15e4 Custom tokenizer flag (LysandreJik)
- e5621df Add REAMDE.md (JetRunner)
- 47bd6ef Merge branch 'cpm_tokenizer' of github.com:huggingface/transformers i… (JetRunner)
- 1d47bc5 Merge branch 'master' into cpm_tokenizer (JetRunner)
New file: docs/source/model_doc/cpm.rst
@@ -0,0 +1,44 @@
..
    Copyright 2020 The HuggingFace Team. All rights reserved.

    Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
    the License. You may obtain a copy of the License at

        http://www.apache.org/licenses/LICENSE-2.0

    Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
    an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
    specific language governing permissions and limitations under the License.

CPM
-----------------------------------------------------------------------------------------------------------------------

Overview
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The CPM model was proposed in `CPM: A Large-scale Generative Chinese Pre-trained Language Model
<https://arxiv.org/abs/2012.00413>`__ by Zhengyan Zhang, Xu Han, Hao Zhou, Pei Ke, Yuxian Gu, Deming Ye, Yujia Qin,
Yusheng Su, Haozhe Ji, Jian Guan, Fanchao Qi, Xiaozhi Wang, Yanan Zheng, Guoyang Zeng, Huanqi Cao, Shengqi Chen,
Daixuan Li, Zhenbo Sun, Zhiyuan Liu, Minlie Huang, Wentao Han, Jie Tang, Juanzi Li, Xiaoyan Zhu, Maosong Sun.

The abstract from the paper is the following:

*Pre-trained Language Models (PLMs) have proven to be beneficial for various downstream NLP tasks. Recently, GPT-3,
with 175 billion parameters and 570GB training data, drew a lot of attention due to the capacity of few-shot (even
zero-shot) learning. However, applying GPT-3 to address Chinese NLP tasks is still challenging, as the training corpus
of GPT-3 is primarily English, and the parameters are not publicly available. In this technical report, we release the
Chinese Pre-trained Language Model (CPM) with generative pre-training on large-scale Chinese training data. To the best
of our knowledge, CPM, with 2.6 billion parameters and 100GB Chinese training data, is the largest Chinese pre-trained
language model, which could facilitate several downstream Chinese NLP tasks, such as conversation, essay generation,
cloze test, and language understanding. Extensive experiments demonstrate that CPM achieves strong performance on many
NLP tasks in the settings of few-shot (even zero-shot) learning.*

The original implementation can be found here: https://github.com/TsinghuaAI/CPM-Generate

Note: We only have a tokenizer here, since the model architecture is the same as GPT-2.

CpmTokenizer
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.CpmTokenizer
    :members:
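For orientation, here is a minimal usage sketch of the new tokenizer (not part of the diff). It assumes the top-level CpmTokenizer export referenced by the autoclass directive above, the TsinghuaAI/CPM-Generate checkpoint, and that jieba is installed; it mirrors the behaviour exercised by the test added at the end of this PR.

# Minimal usage sketch (not part of the diff); assumes the TsinghuaAI/CPM-Generate
# checkpoint is reachable and that jieba is installed.
from transformers import CpmTokenizer

tokenizer = CpmTokenizer.from_pretrained("TsinghuaAI/CPM-Generate")

tokens = tokenizer.tokenize("Hugging Face大法好,谁用谁知道。")
ids = tokenizer.convert_tokens_to_ids(tokens)
# Decoding strips the SentencePiece spaces and maps the ▂/▃ placeholders back to
# " " and "\n", recovering the original string (see the test added in this PR).
print(tokenizer.decode(ids))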
src/transformers/models/__init__.py
@@ -30,6 +30,7 @@
     blenderbot_small,
     camembert,
     convbert,
+    cpm,
     ctrl,
     deberta,
     dialogpt,
New file: src/transformers/models/cpm/__init__.py
@@ -0,0 +1,48 @@
# flake8: noqa
# There's no way to ignore "F401 '...' imported but unused" warnings in this
# module, but to preserve other warnings. So, don't check this module at all.

# Copyright 2020 The HuggingFace Team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

from typing import TYPE_CHECKING

from ...file_utils import _BaseLazyModule


_import_structure = {
    "tokenization_cpm": ["CpmTokenizer"],
}


if TYPE_CHECKING:
    from .tokenization_cpm import CpmTokenizer

else:
    import importlib
    import os
    import sys

    class _LazyModule(_BaseLazyModule):
        """
        Module class that surfaces all objects but only performs associated imports when the objects are requested.
        """

        __file__ = globals()["__file__"]
        __path__ = [os.path.dirname(__file__)]

        def _get_module(self, module_name: str):
            return importlib.import_module("." + module_name, self.__name__)

    sys.modules[__name__] = _LazyModule(__name__, _import_structure)
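The module follows the repo's lazy-import pattern: tokenization_cpm (and with it the SentencePiece and jieba dependencies) is only imported once CpmTokenizer is actually accessed. As a rough, standalone illustration of the idea, here is a simplified sketch; it is hypothetical and is not the actual _BaseLazyModule implementation from transformers.

# Simplified, self-contained sketch of a lazy module (illustration only, not the real
# _BaseLazyModule): submodules are imported the first time one of their names is used.
import importlib
import types


class LazyModule(types.ModuleType):
    """Imports a submodule only when one of its exported names is first accessed."""

    def __init__(self, name, import_structure):
        super().__init__(name)
        # Map every exported attribute to the submodule that defines it.
        self._attr_to_module = {
            attr: module for module, attrs in import_structure.items() for attr in attrs
        }

    def __getattr__(self, attr):
        if attr not in self._attr_to_module:
            raise AttributeError(f"module {self.__name__!r} has no attribute {attr!r}")
        # Import the owning submodule lazily and cache the resolved attribute.
        submodule = importlib.import_module("." + self._attr_to_module[attr], self.__name__)
        value = getattr(submodule, attr)
        setattr(self, attr, value)
        return value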
New file: src/transformers/models/cpm/tokenization_cpm.py
@@ -0,0 +1,109 @@
# coding=utf-8
# Copyright 2018 The Google AI Language Team Authors and The HuggingFace Inc. team.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Tokenization classes."""
from ...utils import logging
from ..xlnet.tokenization_xlnet import XLNetTokenizer


logger = logging.get_logger(__name__)

VOCAB_FILES_NAMES = {"vocab_file": "spiece.model"}

PRETRAINED_VOCAB_FILES_MAP = {
    "vocab_file": {
        "TsinghuaAI/CPM-Generate": "https://huggingface.co/TsinghuaAI/CPM-Generate/resolve/main/spiece.model",
    }
}


class CpmTokenizer(XLNetTokenizer):
    """Runs pre-tokenization with the Jieba segmentation tool. It is used in CPM models."""

    def __init__(self, *args, **kwargs):
        """
        Construct a CPM tokenizer. Based on `Jieba <https://pypi.org/project/jieba/>`__ and `SentencePiece
        <https://github.com/google/sentencepiece>`__.

        This tokenizer inherits from :class:`~transformers.PreTrainedTokenizer` which contains most of the main
        methods. Users should refer to this superclass for more information regarding those methods.

        Args:
            vocab_file (:obj:`str`):
                `SentencePiece <https://github.com/google/sentencepiece>`__ file (generally has a .spm extension) that
                contains the vocabulary necessary to instantiate a tokenizer.
            do_lower_case (:obj:`bool`, `optional`, defaults to :obj:`True`):
                Whether to lowercase the input when tokenizing.
            remove_space (:obj:`bool`, `optional`, defaults to :obj:`True`):
                Whether to strip the text when tokenizing (removing excess spaces before and after the string).
            keep_accents (:obj:`bool`, `optional`, defaults to :obj:`False`):
                Whether to keep accents when tokenizing.
            bos_token (:obj:`str`, `optional`, defaults to :obj:`"<s>"`):
                The beginning of sequence token that was used during pretraining. Can be used as a sequence
                classifier token.

                .. note::

                    When building a sequence using special tokens, this is not the token that is used for the
                    beginning of sequence. The token used is the :obj:`cls_token`.
            eos_token (:obj:`str`, `optional`, defaults to :obj:`"</s>"`):
                The end of sequence token.

                .. note::

                    When building a sequence using special tokens, this is not the token that is used for the end of
                    sequence. The token used is the :obj:`sep_token`.
            unk_token (:obj:`str`, `optional`, defaults to :obj:`"<unk>"`):
                The unknown token. A token that is not in the vocabulary cannot be converted to an ID and is set to be
                this token instead.
            sep_token (:obj:`str`, `optional`, defaults to :obj:`"<sep>"`):
                The separator token, which is used when building a sequence from multiple sequences, e.g. two sequences
                for sequence classification or for a text and a question for question answering. It is also used as the
                last token of a sequence built with special tokens.
            pad_token (:obj:`str`, `optional`, defaults to :obj:`"<pad>"`):
                The token used for padding, for example when batching sequences of different lengths.
            cls_token (:obj:`str`, `optional`, defaults to :obj:`"<cls>"`):
                The classifier token which is used when doing sequence classification (classification of the whole
                sequence instead of per-token classification). It is the first token of the sequence when built with
                special tokens.
            mask_token (:obj:`str`, `optional`, defaults to :obj:`"<mask>"`):
                The token used for masking values. This is the token used when training this model with masked language
                modeling. This is the token which the model will try to predict.
            additional_special_tokens (:obj:`List[str]`, `optional`, defaults to :obj:`["<eop>", "<eod>"]`):
                Additional special tokens used by the tokenizer.

        Attributes:
            sp_model (:obj:`SentencePieceProcessor`):
                The `SentencePiece` processor that is used for every conversion (string, tokens and IDs).
        """
        super().__init__(*args, **kwargs)
        try:
            import jieba
        except ModuleNotFoundError as error:
            raise error.__class__(
                "You need to install jieba to use CpmTokenizer. "
                "See https://pypi.org/project/jieba/ for installation."
            )
        self.jieba = jieba
        self.translator = str.maketrans(" \n", "\u2582\u2583")

    def _tokenize(self, text, *args, **kwargs):
        text = [x.translate(self.translator) for x in self.jieba.cut(text, cut_all=False)]
        text = " ".join(text)
        return super()._tokenize(text, *args, **kwargs)

    def _decode(self, *args, **kwargs):
        text = super()._decode(*args, **kwargs)
        text = text.replace(" ", "").replace("\u2582", " ").replace("\u2583", "\n")
        return text
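The key piece is the pre-tokenization: jieba first segments the text into words, and literal spaces and newlines in the original text are mapped to the placeholder characters ▂ (U+2582) and ▃ (U+2583) so they survive the space-joined string handed to the SentencePiece-based XLNetTokenizer._tokenize; _decode reverses the mapping. Here is a sketch of just that step, runnable with only jieba installed (no checkpoint download needed).

# Sketch of the pre-tokenization step only; mirrors CpmTokenizer._tokenize up to the
# point where the string is handed to the SentencePiece tokenizer.
import jieba

translator = str.maketrans(" \n", "\u2582\u2583")  # space -> ▂, newline -> ▃

text = "Hugging Face大法好,谁用谁知道。"
segments = [seg.translate(translator) for seg in jieba.cut(text, cut_all=False)]
pre_tokenized = " ".join(segments)
print(pre_tokenized)
# The literal space inside "Hugging Face" is preserved as ▂ (it appears as a standalone
# token in the test below), while the spaces introduced by the join are ordinary word
# separators that SentencePiece encodes as ▁; _decode later removes those spaces and
# maps ▂/▃ back to " " and "\n".

This placeholder scheme is what lets decode() reconstruct the original spacing even though SentencePiece treats every plain space as a word boundary.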
New file: tests/test_tokenization_cpm.py
@@ -0,0 +1,39 @@
# coding=utf-8
# Copyright 2018 HuggingFace Inc. team.
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.


from transformers.models.cpm.tokenization_cpm import CpmTokenizer
from transformers.testing_utils import custom_tokenizers

from .test_modeling_xlnet import XLNetModelTest


@custom_tokenizers
class CpmTokenizationTest(XLNetModelTest):
    def test_pre_tokenization(self):
        tokenizer = CpmTokenizer.from_pretrained("TsinghuaAI/CPM-Generate")
        text = "Hugging Face大法好,谁用谁知道。"
        normalized_text = "Hugging Face大法好,谁用谁知道。<unk>"
        bpe_tokens = "▁Hu gg ing ▁ ▂ ▁F ace ▁大法 ▁好 ▁ , ▁谁 ▁用 ▁谁 ▁知 道 ▁ 。".split()

        tokens = tokenizer.tokenize(text)
        self.assertListEqual(tokens, bpe_tokens)

        input_tokens = tokens + [tokenizer.unk_token]

        input_bpe_tokens = [13789, 13283, 1421, 8, 10, 1164, 13608, 16528, 63, 8, 9, 440, 108, 440, 121, 90, 8, 12, 0]
        self.assertListEqual(tokenizer.convert_tokens_to_ids(input_tokens), input_bpe_tokens)

        reconstructed_text = tokenizer.decode(input_bpe_tokens)
        self.assertEqual(reconstructed_text, normalized_text)
Review comments:

- Could you start the overview by mentioning the paper name & the authors of that paper?
- And please mention the abstract below.
- Like what's done in BERT for example.