Feature/data api #2

Merged: 3 commits, Jan 15, 2017

@reyoung commented Jan 14, 2017

No description provided.

@@ -62,9 +72,36 @@ class Categories(object):
__md5__[Automotive] = '757fdb1ab2c5e2fc0934047721082011'
__md5__[Baby] = '7698a4179a1d8385e946ed9083490d22'
__md5__[Beauty] = '5d2ccdcd86641efcfbae344317c10829'
__md5__[Books] = 'bc1e2aa650fe51f978e9d3a7a4834bc6'

Filled in the MD5 values for all of the data sets, and verified the MD5 of the downloaded files.
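As a rough illustration of the check described in this comment, the sketch below computes a file's MD5 with the standard `hashlib` module and compares it to an expected digest. The names `file_md5` and `is_download_intact` and the `chunk_size` parameter are hypothetical, not taken from this PR; the actual helper in the diff may differ.

```python
import hashlib


def file_md5(path, chunk_size=4096):
    """Return the hex MD5 digest of the file at `path`, reading it in chunks."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        # iter() with a sentinel stops once read() returns an empty bytes object.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def is_download_intact(path, expected_md5):
    """Hypothetical helper: True if the file's MD5 matches the expected digest."""
    return file_md5(path) == expected_md5
```

Reading in chunks keeps memory use flat even for large archives, which is presumably why a digest object is updated incrementally rather than hashing the whole file at once.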

    return h.hexdigest()


def preprocess(category=None, directory=None):
    """
    Download and preprocess amazon reviews data set. Save the preprocessed

Preprocessing

Tokenize the review data, build the dictionary, map each token to its word ID, and store all of it, ratings included, in HDF5.

        return self.__sentence__[
            idx], self.__label__ >= self.__positive_threshold__

    def train_data(self):

train_data returns a generator that yields one training sample at a time.
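A minimal sketch of that generator pattern, assuming each sample is a (word IDs, boolean label) pair where the label is positive when the rating reaches a threshold, as the `>= self.__positive_threshold__` comparison in the diff suggests. The function name `iter_samples` and the default threshold of 5.0 are hypothetical.

```python
def iter_samples(sentences, ratings, positive_threshold=5.0):
    """Yield one (word_ids, is_positive) sample at a time.

    `positive_threshold` is an assumed default, not a value from the PR.
    """
    for word_ids, rating in zip(sentences, ratings):
        yield word_ids, rating >= positive_threshold


# Usage: samples are produced lazily, so only one is held in memory at a time.
for word_ids, is_positive in iter_samples([[0, 3, 0, 2], [0, 1]], [5.0, 2.0]):
    print(word_ids, is_positive)
```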



class AmazonReviewsTest(unittest.TestCase):
    def test_read_data(self):

Unit test.

@beckett1124 beckett1124 merged commit 60b6ef5 into beckett1124:develop Jan 15, 2017