
Commit 131a315

Author: tianxin

add JiebaTokenizer demo (#4747)

1 parent 365fe58, commit 131a315

10 files changed: +129 / -13 lines

PaddleNLP/similarity_net/README.md

Lines changed: 6 additions & 2 deletions
@@ -12,7 +12,8 @@
 | Model | Baidu Zhidao | ECOM | QQSIM | UNICOM |
 |:-----------:|:-------------:|:-------------:|:-------------:|:-------------:|
 | | AUC | AUC | AUC | pos/neg order ratio |
-|BOW_Pairwise|0.6767|0.7329|0.7650|1.5630|
+|BOW_Pairwise(WordSeg)|0.6767|0.7329|0.7650|1.5630|
+|BOW_Pairwise(Jieba)|0.6658|0.7351|0.8431|1.5331|
 #### Test Set Description
 | Dataset | Source | Vertical |
 |:-----------:|:-------------:|:-------------:|
@@ -51,7 +52,10 @@ python download.py model
 ```
 
 #### Evaluation
-We have released our self-built test sets, covering four datasets: Baidu Zhidao, ECOM, QQSIM, and UNICOM. With the pretrained model above, users can enter the evaluate directory and run the commands below in turn to get evaluation results on each test set.
+We have released our self-built test sets, covering four datasets: Baidu Zhidao, ECOM, QQSIM, and UNICOM. With the pretrained model above, users can enter the evaluate directory and run the commands below in turn to get evaluation results on each test set.
+
+The evaluation scripts below use Jieba segmentation as an example. To use a custom segmentation module, implement your own tokenizer class in [`tokenization.py`](tokenization.py), following `JiebaTokenizer`, and set the environment variable `TOKENIZER=${YOUR_TOKENIZER_NAME}` in the `evaluate_*.sh` scripts (a sketch of such a class follows this diff). If `TOKENIZER` is empty, the input data is assumed to be pre-segmented (the sample data is segmented with Baidu's WordSeg tool).
+
 ```shell
 sh evaluate_ecom.sh
 sh evaluate_qqsim.sh

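The README change above asks users to model a custom segmenter on `JiebaTokenizer` and select it through the `TOKENIZER` environment variable. Here is a minimal sketch of what such a class could look like; the class name `CharTokenizer` and its per-character behavior are hypothetical, not part of this commit. The only contract the diff implies is a no-argument constructor and a `tokenize(text)` method returning space-separated tokens.

```python
# Hypothetical addition to tokenization.py -- NOT part of this commit.
# reader.py instantiates the class named by --tokenizer via
# getattr(tokenization, args.tokenizer)(), so a custom tokenizer needs
# only a no-arg constructor and tokenize(text) -> space-joined tokens.

class CharTokenizer(object):
    """Illustrative segmenter that splits text into single characters."""

    def __init__(self):
        pass

    def tokenize(self, text):
        # Drop existing whitespace, then emit one token per character.
        return " ".join(ch for ch in text if not ch.isspace())
```

Selecting it would then be a one-line change in an `evaluate_*.sh` script: `TOKENIZER="CharTokenizer"`.
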
PaddleNLP/similarity_net/download.py

Lines changed: 2 additions & 2 deletions
@@ -96,8 +96,8 @@ def download(url, filename, md5sum):
 
 def download_dataset(dir_path):
     BASE_URL = "https://baidu-nlp.bj.bcebos.com/"
-    DATASET_NAME = "simnet_dataset-1.0.0.tar.gz"
-    DATASET_MD5 = "ec65b313bc237150ef536a8d26f3c73b"
+    DATASET_NAME = "simnet_dataset-1.0.1.tar.gz"
+    DATASET_MD5 = "4a381770178721b539e7cf0f91a8777d"
     file_path = os.path.join(dir_path, DATASET_NAME)
     url = BASE_URL + DATASET_NAME

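The dataset bump above changes both the archive name and its checksum; these constants matter because `download(url, filename, md5sum)` (visible in the hunk header) verifies the file after download. A standalone sketch of that kind of MD5 check, assuming nothing about the repo's implementation beyond the constants in the diff:

```python
import hashlib

def md5file(path, chunk_size=8192):
    """Compute the MD5 hex digest of a file, reading in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

# Verify a downloaded archive against the DATASET_MD5 value from the diff.
expected = "4a381770178721b539e7cf0f91a8777d"
if md5file("simnet_dataset-1.0.1.tar.gz") != expected:
    raise ValueError("checksum mismatch; re-download simnet_dataset-1.0.1.tar.gz")
```
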

PaddleNLP/similarity_net/download_data.sh

Lines changed: 3 additions & 4 deletions
@@ -1,5 +1,4 @@
 #get data
-wget --no-check-certificate https://baidu-nlp.bj.bcebos.com/simnet_dataset-1.0.0.tar.gz
-tar xzf simnet_dataset-1.0.0.tar.gz
-rm simnet_dataset-1.0.0.tar.gz
-
+wget --no-check-certificate https://baidu-nlp.bj.bcebos.com/simnet_dataset-1.0.1.tar.gz
+tar xzf simnet_dataset-1.0.1.tar.gz
+rm simnet_dataset-1.0.1.tar.gz

PaddleNLP/similarity_net/evaluate/evaluate_ecom.sh

Lines changed: 10 additions & 1 deletion
@@ -4,13 +4,21 @@ export FLAGS_sync_nccl_allreduce=1
 export CUDA_VISIBLE_DEVICES=3
 export FLAGS_fraction_of_gpu_memory_to_use=0.95
 TASK_NAME='simnet'
-TEST_DATA_PATH=./data/ecom
 VOCAB_PATH=./data/term2id.dict
 CKPT_PATH=./model_files
 TEST_RESULT_PATH=./evaluate/ecom_test_result
 TASK_MODE='pairwise'
 CONFIG_PATH=./config/bow_pairwise.json
 INIT_CHECKPOINT=./model_files/simnet_bow_pairwise_pretrained_model/
+
+# use JiebaTokenizer to evaluate
+TOKENIZER="JiebaTokenizer"
+TEST_DATA_PATH=./data/ecom_raw
+
+# use data tokenized by WordSeg to evaluate
+#TOKENIZER=""
+#TEST_DATA_PATH=./data/ecom
+
 cd ..
 
 python ./run_classifier.py \
@@ -23,5 +31,6 @@ python ./run_classifier.py \
 --test_result_path ${TEST_RESULT_PATH} \
 --config_path ${CONFIG_PATH} \
 --vocab_path ${VOCAB_PATH} \
+--tokenizer ${TOKENIZER:-""} \
 --task_mode ${TASK_MODE} \
 --init_checkpoint ${INIT_CHECKPOINT}

PaddleNLP/similarity_net/evaluate/evaluate_qqsim.sh

Lines changed: 10 additions & 1 deletion
@@ -4,13 +4,21 @@ export FLAGS_sync_nccl_allreduce=1
 export CUDA_VISIBLE_DEVICES=3
 export FLAGS_fraction_of_gpu_memory_to_use=0.95
 TASK_NAME='simnet'
-TEST_DATA_PATH=./data/qqsim
 VOCAB_PATH=./data/term2id.dict
 CKPT_PATH=./model_files
 TEST_RESULT_PATH=./evaluate/qqsim_test_result
 TASK_MODE='pairwise'
 CONFIG_PATH=./config/bow_pairwise.json
 INIT_CHECKPOINT=./model_files/simnet_bow_pairwise_pretrained_model/
+
+# use JiebaTokenizer to evaluate
+TOKENIZER="JiebaTokenizer"
+TEST_DATA_PATH=./data/qqsim_raw
+
+# use data tokenized by WordSeg to evaluate
+#TOKENIZER=""
+#TEST_DATA_PATH=./data/qqsim
+
 cd ..
 
 python ./run_classifier.py \
@@ -23,5 +31,6 @@ python ./run_classifier.py \
 --test_result_path ${TEST_RESULT_PATH} \
 --config_path ${CONFIG_PATH} \
 --vocab_path ${VOCAB_PATH} \
+--tokenizer ${TOKENIZER:-""} \
 --task_mode ${TASK_MODE} \
 --init_checkpoint ${INIT_CHECKPOINT}

PaddleNLP/similarity_net/evaluate/evaluate_unicom.sh

Lines changed: 9 additions & 2 deletions
@@ -4,14 +4,21 @@ export FLAGS_sync_nccl_allreduce=1
 export CUDA_VISIBLE_DEVICES=3
 export FLAGS_fraction_of_gpu_memory_to_use=0.95
 TASK_NAME='simnet'
-INFER_DATA_PATH=./evaluate/unicom_infer
 VOCAB_PATH=./data/term2id.dict
 CKPT_PATH=./model_files
 INFER_RESULT_PATH=./evaluate/unicom_infer_result
 TASK_MODE='pairwise'
 CONFIG_PATH=./config/bow_pairwise.json
 INIT_CHECKPOINT=./model_files/simnet_bow_pairwise_pretrained_model/
 
+# use JiebaTokenizer to evaluate
+TOKENIZER="JiebaTokenizer"
+INFER_DATA_PATH=./data/unicom_infer_raw
+
+# use data tokenized by WordSeg to evaluate
+#TOKENIZER=""
+#INFER_DATA_PATH=./evaluate/unicom_infer
+
 python unicom_split.py
 cd ..
 python ./run_classifier.py \
@@ -23,8 +30,8 @@ python ./run_classifier.py \
 --infer_result_path ${INFER_RESULT_PATH} \
 --config_path ${CONFIG_PATH} \
 --vocab_path ${VOCAB_PATH} \
+--tokenizer ${TOKENIZER:-""} \
 --task_mode ${TASK_MODE} \
 --init_checkpoint ${INIT_CHECKPOINT}
 cd evaluate
 python unicom_compute_pos_neg.py
-

PaddleNLP/similarity_net/evaluate/evaluate_zhidao.sh

Lines changed: 10 additions & 1 deletion
@@ -4,13 +4,21 @@ export FLAGS_sync_nccl_allreduce=1
 export CUDA_VISIBLE_DEVICES=3
 export FLAGS_fraction_of_gpu_memory_to_use=0.95
 TASK_NAME='simnet'
-TEST_DATA_PATH=./data/zhidao
 VOCAB_PATH=./data/term2id.dict
 CKPT_PATH=./model_files
 TEST_RESULT_PATH=./evaluate/zhidao_test_result
 TASK_MODE='pairwise'
 CONFIG_PATH=./config/bow_pairwise.json
 INIT_CHECKPOINT=./model_files/simnet_bow_pairwise_pretrained_model/
+
+# use JiebaTokenizer to evaluate
+TOKENIZER="JiebaTokenizer"
+TEST_DATA_PATH=./data/zhidao_raw
+
+# use data tokenized by WordSeg to evaluate
+#TOKENIZER=""
+#TEST_DATA_PATH=./data/zhidao
+
 cd ..
 
 python ./run_classifier.py \
@@ -23,5 +31,6 @@ python ./run_classifier.py \
 --test_result_path ${TEST_RESULT_PATH} \
 --config_path ${CONFIG_PATH} \
 --vocab_path ${VOCAB_PATH} \
+--tokenizer ${TOKENIZER:-""} \
 --task_mode ${TASK_MODE} \
 --init_checkpoint ${INIT_CHECKPOINT}

PaddleNLP/similarity_net/reader.py

Lines changed: 45 additions & 0 deletions
@@ -19,6 +19,7 @@
 import numpy as np
 import io
 
+import tokenization
 
 class SimNetProcessor(object):
     def __init__(self, args, vocab):
@@ -27,6 +28,10 @@ def __init__(self, args, vocab):
         self.vocab = vocab
         self.valid_label = np.array([])
         self.test_label = np.array([])
+        if args.tokenizer:
+            self.tokenizer = getattr(tokenization, args.tokenizer)()
+        else:
+            self.tokenizer = None
 
     def get_reader(self, mode, epoch=0):
         """
@@ -48,6 +53,12 @@ def reader_with_pairwise():
                     logging.warning(
                         "line not match format in test file")
                     continue
+
+                # tokenize
+                if self.tokenizer:
+                    query = self.tokenizer.tokenize(query)
+                    title = self.tokenizer.tokenize(title)
+
                 query = [
                     self.vocab[word] for word in query.split(" ")
                     if word in self.vocab
@@ -71,6 +82,12 @@ def reader_with_pairwise():
                     logging.warning(
                         "line not match format in test file")
                     continue
+
+                # tokenize
+                if self.tokenizer:
+                    query = self.tokenizer.tokenize(query)
+                    title = self.tokenizer.tokenize(title)
+
                 query = [
                     self.vocab[word] for word in query.split(" ")
                     if word in self.vocab
@@ -95,6 +112,12 @@ def reader_with_pairwise():
                     logging.warning(
                         "line not match format in test file")
                     continue
+                # tokenize
+                if self.tokenizer:
+                    query = self.tokenizer.tokenize(query)
+                    pos_title = self.tokenizer.tokenize(pos_title)
+                    neg_title = self.tokenizer.tokenize(neg_title)
+
                 query = [
                     self.vocab[word] for word in query.split(" ")
                     if word in self.vocab
@@ -130,6 +153,12 @@ def reader_with_pointwise():
                     logging.warning(
                         "line not match format in test file")
                     continue
+
+                # tokenize
+                if self.tokenizer:
+                    query = self.tokenizer.tokenize(query)
+                    title = self.tokenizer.tokenize(title)
+
                 query = [
                     self.vocab[word] for word in query.split(" ")
                     if word in self.vocab
@@ -153,6 +182,12 @@ def reader_with_pointwise():
                     logging.warning(
                         "line not match format in test file")
                     continue
+
+                # tokenize
+                if self.tokenizer:
+                    query = self.tokenizer.tokenize(query)
+                    title = self.tokenizer.tokenize(title)
+
                 query = [
                     self.vocab[word] for word in query.split(" ")
                     if word in self.vocab
@@ -178,6 +213,12 @@ def reader_with_pointwise():
                     logging.warning(
                         "line not match format in test file")
                     continue
+
+                # tokenize
+                if self.tokenizer:
+                    query = self.tokenizer.tokenize(query)
+                    title = self.tokenizer.tokenize(title)
+
                 query = [
                     self.vocab[word] for word in query.split(" ")
                     if word in self.vocab
@@ -208,6 +249,10 @@ def get_infer_reader(self):
                 if len(query) == 0 or len(title) == 0:
                     logging.warning("line not match format in test file")
                     continue
+                # tokenize
+                if self.tokenizer:
+                    query = self.tokenizer.tokenize(query)
+                    title = self.tokenizer.tokenize(title)
                 query = [
                     self.vocab[word] for word in query.split(" ")
                     if word in self.vocab
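Every reader hunk follows the same pattern: resolve the tokenizer class by name with `getattr`, and when one is configured, re-segment each text field before the vocabulary lookup. Below is a self-contained sketch of that dispatch; the stub module, `WhitespaceTokenizer`, and the toy vocab are illustrative stand-ins, not the repo's code.

```python
import types

# Stand-in for the repo's tokenization module, with one class registered.
tokenization = types.ModuleType("tokenization")

class WhitespaceTokenizer(object):
    """Illustrative tokenizer: collapses whitespace runs to single spaces."""
    def tokenize(self, text):
        return " ".join(text.split())

tokenization.WhitespaceTokenizer = WhitespaceTokenizer

def build_tokenizer(name):
    # Mirrors reader.py: an empty name means the input is already tokenized.
    return getattr(tokenization, name)() if name else None

tokenizer = build_tokenizer("WhitespaceTokenizer")
vocab = {"hello": 0, "world": 1}

query = "hello   world"
if tokenizer:
    query = tokenizer.tokenize(query)
ids = [vocab[word] for word in query.split(" ") if word in vocab]
print(ids)  # -> [0, 1]
```

Because lookup happens by attribute name, a misspelled `--tokenizer` value fails fast with an `AttributeError` rather than silently skipping segmentation.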

PaddleNLP/similarity_net/tokenization.py

Lines changed: 33 additions & 0 deletions
@@ -0,0 +1,33 @@
+# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+"""Tokenization classes."""
+
+from __future__ import absolute_import
+from __future__ import division
+from __future__ import print_function
+
+import jieba
+
+class JiebaTokenizer(object):
+    """Runs end-to-end tokenization."""
+
+    def __init__(self):
+        # Todo:
+        pass
+
+    def tokenize(self, text):
+        split_tokens = jieba.cut(text)
+        split_tokens = " ".join([word for word in split_tokens])
+        return split_tokens
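
A quick usage sketch for the new class (requires `pip install jieba`; the sample sentence is illustrative, and the exact segmentation depends on jieba's dictionary):

```python
# -*- coding: utf-8 -*-
from tokenization import JiebaTokenizer

tokenizer = JiebaTokenizer()
print(tokenizer.tokenize(u"百度知道是一个问答社区"))
# Space-separated tokens, e.g. "百度 知道 是 一个 问答 社区"
```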

PaddleNLP/similarity_net/utils.py

Lines changed: 1 addition & 0 deletions
@@ -214,6 +214,7 @@ def __init__(self):
         data_g.add_arg("infer_data_dir", str, None,
                        "Directory path to infer data.")
         data_g.add_arg("vocab_path", str, None, "Vocabulary path.")
+        data_g.add_arg("tokenizer", str, None, "Name of the user-defined tokenizer class; leave empty for pre-tokenized input.")
         data_g.add_arg("batch_size", int, 32,
                        "Total examples' number in batch for training.")
