data reader for mnist #1325

Closed
wants to merge 2 commits into from


@qingqing01 qingqing01 commented Feb 13, 2017

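  1. The reader is iterable: each call returns one batch of data, and the data is shuffled at the start of every pass/epoch.
  2. The last batch of each pass/epoch may contain fewer than batch_size samples. This matches how Paddle has worked so far and is necessary for testing (although on many other platforms, such as tf, torch, and caffe, every batch has exactly the same number of samples).
  3. This is written for MNIST only; all of the data is loaded into memory up front.

Large datasets may not fit in memory, so a different caching mechanism may need to be designed for them.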
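
The three points above can be sketched as a small in-memory reader. This is only an illustration under stated assumptions, not the code from this PR: the class name `BatchReader` and the `seed` argument are hypothetical, and `data` and `labels` are assumed to be NumPy arrays of equal length.

```python
import numpy as np


class BatchReader(object):
    """Hypothetical in-memory batch reader; not the class from this PR."""

    def __init__(self, data, labels, batch_size, is_shuffle=False, seed=None):
        assert len(data) == len(labels)
        self.data = data
        self.labels = labels
        self.batch_size = batch_size
        self.is_shuffle = is_shuffle
        # `seed` is an assumption added here so the example is reproducible.
        self.rng = np.random.RandomState(seed)

    def __iter__(self):
        n = len(self.data)
        # A new permutation is drawn at the start of every pass.
        order = self.rng.permutation(n) if self.is_shuffle else np.arange(n)
        for start in range(0, n, self.batch_size):
            idx = order[start:start + self.batch_size]
            # The final batch holds the leftover samples and may be short.
            yield self.data[idx], self.labels[idx]


# Ten samples with batch_size=4 give batches of 4, 4 and 2 samples.
data = np.arange(10, dtype='float32').reshape(10, 1)
labels = np.arange(10)
reader = BatchReader(data, labels, batch_size=4, is_shuffle=True, seed=0)
batch_sizes = [len(batch_labels) for _, batch_labels in reader]
```

Iterating over `reader` again starts a new pass with a fresh shuffle, and every sample still appears exactly once per pass.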



class DataReader(object):
    def __init__(self, data, labels, batch_size, is_shuffle=False):
Contributor

Can this interface be reused when there are multiple data inputs, or when labels is empty?

Contributor Author

This is written only for MNIST and is not general purpose; other tasks will need their own reader.

@qingqing01 qingqing01 mentioned this pull request Feb 13, 2017
num_magic, n, num_row, num_col = struct.unpack(">IIII", f.read(16))
images = np.fromfile(f, 'ubyte', count=n * num_row * num_col).\
reshape(n, num_row * num_col).astype('float32')
images = images / 255.0 * 2.0 - 1.0
Contributor

@helinwang helinwang Feb 14, 2017


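Out of curiosity: roughly how much better is images = images / 255.0 * 2.0 - 1.0, which pulls the mean closer to 0.0, than images = images / 255.0? (For example, 98.55% -> 98.57%, or 98.5% -> 98.9%.) A very rough estimate is fine.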

Contributor Author

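images = images / 255.0 * 2.0 - 1.0 -> normalizes to [-1, 1]
images = images / 255.0 -> normalizes to [0, 1]. The two would have to be compared experimentally; my guess is that the difference is not large.

This keeps the preprocessing used by the original MNIST demo.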

Contributor

The two have different value ranges, though: one is [-1, 1] and the other is [0, 1].
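
To make the two ranges in this thread concrete, here is a minimal sketch that applies both normalizations to a few sample pixel values. The variable names are illustrative and not taken from the PR, and the example says nothing about the effect on accuracy, which, as noted above, would need an experiment.

```python
import numpy as np

# Three representative 8-bit pixel values: darkest, mid-gray, brightest.
pixels = np.array([0, 128, 255], dtype='ubyte').astype('float32')

# Scaling by 255 maps the pixels into [0, 1].
unit_range = pixels / 255.0

# The expression used in this PR maps the same pixels into [-1, 1].
symmetric_range = pixels / 255.0 * 2.0 - 1.0

print(unit_range.min(), unit_range.max())
print(symmetric_range.min(), symmetric_range.max())
```

The second form is an affine rescaling of the first (`2 * x - 1`), so the two carry the same information and differ only in scale and offset.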


def create_datasets(dir='./data/raw_data/'):
'''
Downloading and loading the data can be simplified based on https://github.com/PaddlePaddle/Paddle/pull/872
Contributor

I don't see a download function. It would be more convenient for users if the data were downloaded automatically.

Contributor Author

Once PR #872 is merged, the data will be downloaded automatically, so that is not implemented here.

@qingqing01 qingqing01 closed this Feb 24, 2017
@qingqing01 qingqing01 deleted the api_reader branch July 7, 2017 13:35

4 participants