Merge pull request #573 from qingqing01/quick_start
Update the data of quick start.
luotao1 authored Nov 28, 2016
2 parents 922ee9b + 0561dd0 commit 2f60248
Showing 8 changed files with 52 additions and 32 deletions.
9 changes: 9 additions & 0 deletions demo/quick_start/data/README.md
@@ -0,0 +1,9 @@
+This dataset consists of electronics product reviews associated with
+binary labels (positive/negative) for sentiment classification.
+
+The preprocessed data can be downloaded with the script `get_data.sh`.
+The data was derived from reviews_Electronics_5.json.gz at
+
+http://snap.stanford.edu/data/amazon/productGraph/categoryFiles/reviews_Electronics_5.json.gz
+
+If you want to process the raw data, you can use the script `proc_from_raw_data/get_data.sh`.
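If shell isn't convenient, the same download can be done in a few lines of Python. This is a hedged sketch, not part of the commit: the URL is taken from `get_data.sh` below, and it assumes the Python 3 standard library only.

```python
import tarfile
import urllib.request

# URL taken from demo/quick_start/data/get_data.sh in this commit
URL = ("http://paddlepaddle.bj.bcebos.com/demo/"
       "quick_start_preprocessed_data/preprocessed_data.tar.gz")

urllib.request.urlretrieve(URL, "preprocessed_data.tar.gz")
with tarfile.open("preprocessed_data.tar.gz", "r:gz") as tar:
    tar.extractall()  # mirrors `tar zxvf preprocessed_data.tar.gz`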
15 changes: 6 additions & 9 deletions demo/quick_start/data/get_data.sh
@@ -17,14 +17,11 @@ set -e
 DIR="$( cd "$(dirname "$0")" ; pwd -P )"
 cd $DIR
 
-echo "Downloading Amazon Electronics reviews data..."
-# http://jmcauley.ucsd.edu/data/amazon/
-wget http://snap.stanford.edu/data/amazon/productGraph/categoryFiles/reviews_Electronics_5.json.gz
+# Download the preprocessed data
+wget http://paddlepaddle.bj.bcebos.com/demo/quick_start_preprocessed_data/preprocessed_data.tar.gz
 
-echo "Downloading mosesdecoder..."
-#https://github.com/moses-smt/mosesdecoder
-wget https://github.com/moses-smt/mosesdecoder/archive/master.zip
+# Extract package
+tar zxvf preprocessed_data.tar.gz
 
-unzip master.zip
-rm master.zip
-echo "Done."
+# Remove compressed package
+rm preprocessed_data.tar.gz
1 change: 0 additions & 1 deletion demo/quick_start/data/pred.list

This file was deleted.

2 changes: 0 additions & 2 deletions demo/quick_start/data/pred.txt

This file was deleted.

demo/quick_start/data/proc_from_raw_data/get_data.sh
@@ -16,10 +16,26 @@
 # 1. size of pos : neg = 1:1.
 # 2. size of testing set = min(25k, len(all_data) * 0.1); the rest is the training set.
 # 3. distinct train set and test set.
 # 4. build dict
 
 set -e
 
+DIR="$( cd "$(dirname "$0")" ; pwd -P )"
+cd $DIR
+
+# Download data
+echo "Downloading Amazon Electronics reviews data..."
+# http://jmcauley.ucsd.edu/data/amazon/
+wget http://snap.stanford.edu/data/amazon/productGraph/categoryFiles/reviews_Electronics_5.json.gz
+echo "Downloading mosesdecoder..."
+# https://github.com/moses-smt/mosesdecoder
+wget https://github.com/moses-smt/mosesdecoder/archive/master.zip
+
+unzip master.zip
+rm master.zip
+
+##################
+# Preprocess data
+echo "Preprocess data..."
 export LC_ALL=C
 UNAME_STR=`uname`
 
@@ -29,11 +45,11 @@ else
 SHUF_PROG='gshuf'
 fi
 
-mkdir -p data/tmp
-python preprocess.py -i data/reviews_Electronics_5.json.gz
+mkdir -p tmp
+python preprocess.py -i reviews_Electronics_5.json.gz
 # uniq and shuffle
-cd data/tmp
-echo 'uniq and shuffle...'
+cd tmp
+echo 'Uniq and shuffle...'
 cat pos_*|sort|uniq|${SHUF_PROG}> pos.shuffed
 cat neg_*|sort|uniq|${SHUF_PROG}> neg.shuffed
 
@@ -53,11 +69,11 @@ cat train.pos train.neg | ${SHUF_PROG} >../train.txt
 cat test.pos test.neg | ${SHUF_PROG} >../test.txt
 
 cd -
-echo 'data/train.txt' > data/train.list
-echo 'data/test.txt' > data/test.list
+echo 'train.txt' > train.list
+echo 'test.txt' > test.list
 
 # use 30k dict
-rm -rf data/tmp
-mv data/dict.txt data/dict_all.txt
-cat data/dict_all.txt | head -n 30001 > data/dict.txt
-echo 'preprocess finished'
+rm -rf tmp
+mv dict.txt dict_all.txt
+cat dict_all.txt | head -n 30001 > dict.txt
+echo 'Done.'
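The sizing rule in the script's header comments is easy to sanity-check. Below is a minimal Python sketch of that arithmetic; the function name is illustrative and not part of the repo (the `head -n 30001` step above similarly caps the dictionary at roughly 30k entries).

```python
# Illustrative only: restates the split rule from the script's header comments.
# Test set = min(25k, 10% of all samples); everything else is training data.
def split_counts(num_samples, cap=25000, test_fraction=0.1):
    num_test = min(cap, int(num_samples * test_fraction))
    return num_test, num_samples - num_test

print(split_counts(100000))   # (10000, 90000)  -- 10% is under the 25k cap
print(split_counts(1000000))  # (25000, 975000) -- capped at 25k
```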
demo/quick_start/data/proc_from_raw_data/preprocess.py
@@ -14,7 +14,7 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.
 """
-1. (remove HTML before or not)tokensizing
+1. Tokenize the words and punctuation
 2. pos sample : rating score 5; neg sample: rating score 1-2.
 Usage:
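For readers skimming the diff, the docstring's sampling rule amounts to a small mapping. A hedged sketch follows; the function name is hypothetical and does not appear in preprocess.py.

```python
# Hypothetical restatement of the docstring's rule: rating 5 is a positive
# sample, ratings 1-2 are negative samples, and ratings 3-4 are not used.
def label_for_rating(rating):
    if rating == 5:
        return "pos"
    if rating in (1, 2):
        return "neg"
    return None  # ratings 3-4 belong to neither class

assert label_for_rating(5) == "pos"
assert label_for_rating(1) == "neg"
assert label_for_rating(4) is None
```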
@@ -76,7 +76,11 @@ def tokenize(sentences):
     sentences : a list of input sentences.
     return: a list of processed text.
     """
-    dir = './data/mosesdecoder-master/scripts/tokenizer/tokenizer.perl'
+    dir = './mosesdecoder-master/scripts/tokenizer/tokenizer.perl'
+    if not os.path.exists(dir):
+        sys.exit(
+            "The ./mosesdecoder-master/scripts/tokenizer/tokenizer.perl does not exist."
+        )
     tokenizer_cmd = [dir, '-l', 'en', '-q', '-']
     assert isinstance(sentences, list)
     text = "\n".join(sentences)
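The hunk ends before the subprocess call, but the `tokenizer_cmd` list implies the usual pipe-through-Moses pattern. Here is a self-contained sketch of that pattern, assuming Python 3 and an already-unpacked tokenizer.perl; the real preprocess.py may differ in its details.

```python
import subprocess

def tokenize_sketch(sentences,
                    tokenizer="./mosesdecoder-master/scripts/tokenizer/tokenizer.perl"):
    # Same flags as tokenizer_cmd above: English, quiet, read from stdin.
    cmd = [tokenizer, "-l", "en", "-q", "-"]
    proc = subprocess.Popen(cmd, stdin=subprocess.PIPE, stdout=subprocess.PIPE)
    out, _ = proc.communicate("\n".join(sentences).encode("utf-8"))
    return out.decode("utf-8").splitlines()
```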
@@ -104,7 +108,7 @@ def tokenize_batch(id):
         num_batch, instance, pre_fix = parse_queue.get()
         if num_batch == -1:  ### parse_queue finished
             tokenize_queue.put((-1, None, None))
-            sys.stderr.write("tokenize theread %s finish\n" % (id))
+            sys.stderr.write("Thread %s finished\n" % (id))
             break
         tokenize_instance = tokenize(instance)
         tokenize_queue.put((num_batch, tokenize_instance, pre_fix))
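`tokenize_batch` is a standard sentinel-terminated worker: it pulls batches from `parse_queue`, stops on a `-1` marker, and forwards the marker so downstream consumers also stop. A minimal runnable sketch of that pattern follows (Python 3; the worker body is simplified, and the lowercasing stands in for the real tokenize call).

```python
import queue
import threading

parse_queue = queue.Queue()
tokenize_queue = queue.Queue()

def worker(worker_id):
    while True:
        num_batch, instance, prefix = parse_queue.get()
        if num_batch == -1:                       # sentinel: producer is done
            tokenize_queue.put((-1, None, None))  # pass the sentinel downstream
            break
        # Stand-in for tokenize(instance) in preprocess.py
        tokenize_queue.put((num_batch, [s.lower() for s in instance], prefix))

t = threading.Thread(target=worker, args=(0,))
t.start()
parse_queue.put((0, ["Great product!"], "pos"))
parse_queue.put((-1, None, None))
t.join()
print(tokenize_queue.get())  # (0, ['great product!'], 'pos')
```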
3 changes: 1 addition & 2 deletions doc/demo/quick_start/index_en.md
@@ -59,12 +59,11 @@ To build your text classification system, your code will need to perform five steps
 ## Preprocess data into standardized format
 In this example, you are going to use the [Amazon electronic product review dataset](http://jmcauley.ucsd.edu/data/amazon/) to build a bunch of deep neural network models for text classification. Each text in this dataset is a product review. This dataset has two categories: “positive” and “negative”. Positive means the reviewer likes the product, while negative means the reviewer does not like the product.
 
-`demo/quick_start` in the [source code](https://github.com/baidu/Paddle) provides scripts for downloading data and preprocessing data as shown below. The data process takes several minutes (about 3 minutes in our machine).
+`demo/quick_start` in the [source code](https://github.com/PaddlePaddle/Paddle) provides a script for downloading the preprocessed data, as shown below. (If you want to process the raw data, you can use the script `demo/quick_start/data/proc_from_raw_data/get_data.sh`.)
 
 ```bash
 cd demo/quick_start
 ./data/get_data.sh
-./preprocess.sh
 ```
 
 ## Transfer Data to Model
6 changes: 2 additions & 4 deletions doc_cn/demo/quick_start/index.md
@@ -32,13 +32,11 @@
 
 ## Data Preparation
 In this problem, we use the [Amazon electronics product review data](http://jmcauley.ucsd.edu/data/amazon/),
-classifying reviews into positive samples (favorable reviews) and negative samples (unfavorable reviews). The `demo/quick_start` directory in the [source code](https://github.com/baidu/Paddle) provides a data download script
-and a preprocessing script.
+classifying reviews into positive samples (favorable reviews) and negative samples (unfavorable reviews). The `demo/quick_start` directory in the [source code](https://github.com/PaddlePaddle/Paddle) provides a script that downloads the already-preprocessed data (to process the raw data from scratch, use the script `./demo/quick_start/data/proc_from_raw_data/get_data.sh`).
 
 ```bash
 cd demo/quick_start
 ./data/get_data.sh
-./preprocess.sh
 ```
 
 ## Transfer Data to Model
@@ -143,7 +141,7 @@ PyDataProvider2</a>.
 
 We start with a basic logistic regression network and gradually demonstrate more advanced functionality. For more detailed network configuration
 links, refer to the <a href = "../../../doc/layer.html">Layer documentation</a>.
-All configurations are under `demo/quick_start` in the [source code](https://github.com/baidu/Paddle); we list the logistic regression network first.
+All configurations are under `demo/quick_start` in the [source code](https://github.com/PaddlePaddle/Paddle); we list the logistic regression network first.
 
 ### Logistic Regression
 
