demo/quick_start/preprocess.sh takes hours on a VM with 1 CPU #111

Closed
wangkuiyi opened this issue Sep 23, 2016 · 4 comments

@wangkuiyi
Collaborator

I checked with Yi Yang, and on his powerful desktop it takes only a few minutes. Still, as a quick start, shouldn't it be quick? :-)

@reyoung
Collaborator

reyoung commented Nov 21, 2016

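One possible solution is to pick a smaller dataset and drop some of the preprocessing.

1. A smaller dataset. The quick start uses the Amazon electronics product data, but the same web page also offers some smaller datasets.

2. Drop some of the preprocessing. Remove the English tokenization and punctuation handling (this part is done in Perl). We could replace some basic punctuation, do the tokenization, and so on directly in Python. It does not need to be that sophisticated.

The problem with this solution is:

With a small dataset, a complex model may not necessarily perform better than a simple one.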
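As a rough illustration of point 2 above, here is a minimal sketch of what punctuation handling and tokenization done directly in Python might look like. The function name `simple_tokenize` is hypothetical and is not part of the demo; it assumes that lowercasing, separating ASCII punctuation, and splitting on whitespace is good enough for a quick start, and it is not equivalent to the Perl tokenizer used by `preprocess.sh`.

```python
import re
import string

# Hypothetical helper (not part of the Paddle demo): a lightweight stand-in
# for the Perl-based tokenization step in preprocess.sh.
# The pattern matches any single ASCII punctuation character.
_PUNCT_RE = re.compile("([" + re.escape(string.punctuation) + "])")


def simple_tokenize(text):
    """Lowercase the text, pad ASCII punctuation with spaces, split on whitespace."""
    spaced = _PUNCT_RE.sub(r" \1 ", text.lower())
    return spaced.split()


if __name__ == "__main__":
    print(simple_tokenize("Great product, works well!"))
    # ['great', 'product', ',', 'works', 'well', '!']
```

This treats every punctuation character as its own token, so contractions such as "don't" are split into three tokens. That is cruder than a real tokenizer, but it avoids the external Perl dependency.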

@backyes
Contributor

backyes commented Nov 25, 2016

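@wangkuiyi @reyoung The preprocessing problem raised in this issue looks likely to get a solution soon (apparently preprocessed data will be provided).

@qingqing01 @luotao1 Whoever knows the status, please post an update. For example, if there is a PR, link it here.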

@reyoung
Collaborator

reyoung commented Nov 29, 2016

@qingqing01 Should this issue be closed or not? Is it resolved?

@qingqing01
Contributor

qingqing01 commented Dec 5, 2016

PR #573 provides preprocessed data; you only need to download and unpack it. This bypasses the preprocessing script and saves the preprocessing time.
