Feature/data api #2
@@ -62,9 +72,36 @@ class Categories(object):
__md5__[Automotive] = '757fdb1ab2c5e2fc0934047721082011'
__md5__[Baby] = '7698a4179a1d8385e946ed9083490d22'
__md5__[Beauty] = '5d2ccdcd86641efcfbae344317c10829'
__md5__[Books] = 'bc1e2aa650fe51f978e9d3a7a4834bc6'
Filled in the md5 values for all of the data sets, and added a check of the md5 of each downloaded file.
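The md5 check described in this comment could look roughly like the sketch below. It is not the code from this PR: `md5_of_file`, `verify_download`, and the `chunk_size` default are hypothetical names chosen for illustration, and only the standard-library `hashlib` is used.

```python
import hashlib


def md5_of_file(path, chunk_size=4096):
    """Return the hex md5 digest of the file at `path` (hypothetical helper)."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        # Read in fixed-size chunks so large downloads need not fit in memory.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def verify_download(path, expected_md5):
    """Return True when the file's md5 matches the expected hex digest."""
    return md5_of_file(path) == expected_md5
```

A caller would presumably compare against the per-category entry, for example the value stored under `__md5__[Books]`, and re-download when the check fails.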
    return h.hexdigest()

def preprocess(category=None, directory=None):
    """
    Download and preprocess amazon reviews data set. Save the preprocessed
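Preprocessing.
Tokenizes the review text, builds the dictionary, maps each word to its id, and stores the result together with all of the ratings in HDF5.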
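As a rough illustration of the tokenize, dictionary, and id-mapping steps named in this comment, here is a minimal sketch. It is an assumption-laden stand-in, not the PR's `preprocess`: the regex tokenizer, the frequency-ordered ids, and the names `tokenize`, `build_dictionary`, and `encode` are all hypothetical, and the HDF5 write is left out so the example stays standard-library only.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens (assumed tokenizer)."""
    return re.findall(r"[a-z0-9']+", text.lower())


def build_dictionary(token_lists):
    """Map each word to an integer id, most frequent word first."""
    counts = Counter(tok for toks in token_lists for tok in toks)
    # Ties are broken alphabetically so the ids are deterministic.
    words = sorted(counts, key=lambda w: (-counts[w], w))
    return {w: i for i, w in enumerate(words)}


def encode(reviews):
    """Turn a list of (text, rating) pairs into word-id lists and ratings."""
    token_lists = [tokenize(text) for text, _ in reviews]
    word_to_id = build_dictionary(token_lists)
    sentences = [[word_to_id[t] for t in toks] for toks in token_lists]
    ratings = [float(rating) for _, rating in reviews]
    return word_to_id, sentences, ratings
```

The `sentences` and `ratings` lists are the two arrays that would then be written to the HDF5 file; how the PR lays them out on disk is not visible in this excerpt.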
        return self.__sentence__[
            idx], self.__label__ >= self.__positive_threshold__

    def train_data(self):
train_data returns a generator that yields one training sample at a time.
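The generator behaviour described here can be sketched as follows. This is a standalone illustration rather than the method from the PR: `iter_samples` and the default `positive_threshold` of 4.0 are assumptions, and the label is derived per sample as `rating >= positive_threshold`.

```python
def iter_samples(sentences, ratings, positive_threshold=4.0):
    """Yield (word_ids, is_positive) pairs one sample at a time."""
    for sentence, rating in zip(sentences, ratings):
        # A review counts as positive when its rating reaches the threshold.
        yield sentence, rating >= positive_threshold
```

Because the samples are produced lazily, a training loop can consume them with a plain `for` loop without loading every label into memory first.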


class AmazonReviewsTest(unittest.TestCase):
    def test_read_data(self):
Unit test.
No description provided.