Sparse training fails after pass 0 in cluster mode #660

Closed
CDDB opened this issue Nov 29, 2016 · 10 comments · Fixed by #891

CDDB commented Nov 29, 2016

This is a cluster configuration problem; I am moving it to the internal network.

CDDB changed the title from "Several problems with sparse training" to "Sparse training fails in the cluster version; the local run logs anomalies" on Nov 30, 2016

backyes commented Nov 30, 2016

@tianbingsz FYI. Since the sparse-related models were refactored, users have reported several problems that need deeper analysis.

CDDB commented Nov 30, 2016

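I am currently using the icode version. Its most recent check-in is from October 8 (* master 7c60b90 Merge "remove PserverForPython.h which is not used"). I don't know whether that already includes the refactoring.

In addition, we tried sparse training with sparse-binary-vec in the cluster version and ran into the same p-server startup failure.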

backyes commented Nov 30, 2016

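The icode version is no longer maintained, and the GitHub master branch has since received several bug fixes for sparse training, so please update to the new code. A new version of the receiver is available internally (details through internal channels); you only need to switch the receiver configuration to enable the latest version.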

backyes commented Nov 30, 2016

@CDDB Please see deeplearning.baidu.com, which hosts the usage documentation for Baidu colleagues, for information about the cluster.

CDDB commented Nov 30, 2016

Got it. Can you confirm that the problem I ran into is a known issue and has already been fixed?

backyes commented Nov 30, 2016

We can't confirm that yet.

CDDB commented Nov 30, 2016

OK, I'll ask on the internal network and delete this issue.

CDDB changed the title from "Sparse training fails in the cluster version; the local run logs anomalies" to "Sparse training fails after pass 0 in cluster mode" on Dec 1, 2016

CDDB commented Dec 1, 2016

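One pass completed successfully, but the job crashed before the Eval results appeared. There seem to be several related problems?
This run used the company's latest receiver.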
I1201 14:22:15.860833 24546 ThreadLocal.cpp:37] thread use undeterministic rand seed:24547
I1201 14:43:17.116703 21909 TrainerInternal.cpp:182] Pass=0 Batch=711 samples=35526 AvgCost=0.245079 Eval:
F1201 14:43:17.771586 21981 SparseRowMatrix.h:63] Check failed: globalIndices_[row] != kUnusedId_ (4294967295 vs. 4294967295)
F1201 14:43:17.771664 21983 SparseRowMatrix.h:63] Check failed: globalIndices_[row] != kUnusedId_ (4294967295 vs. 4294967295)
F1201 14:43::1717..771920 21980 SparseRowMatrix.hSparseRowMatrix.h::63] Check failed: globalIndices_[row] != kUnusedId_ (4294967295 vs. 4294967295)
'''
More error logs:
'''
Thu Dec 1 14:43:17 2016[1,3]:F1201 14:43:17.139876 11101 SparseRowMatrix.h:63] Check failed: globalIndices_[row] != kUnusedId_ (4294967295 vs. 4294967295)
Thu Dec 1 14:43:17 2016[1,3]:*** Check failure stack trace: ***
Thu Dec 1 14:43:17 2016[1,3]:F1201 14:43:17.139876 11103 S0:63] 12Check failed: globalIndices_[row] != kUnusedId_ (4294967295 vs. 4294967295)
Thu Dec 1 14:43:17 2016[1,3]:1 14:43:17.139876 11104 SparseRowMatrix.h:63] Check failed: globalIndices_[row] != kUnusedId_ (4294967295 vs. 4294967295) F1201 14:43:17.140319 11102 SparseRowMatrix.h:63] Check failed: globalIndices_[row] != kUnusedId_ (4294967295 vs. 4294967295)
Thu Dec 1 14:43:17 2016[1,3]:*** Check failure stack trace: ***
Thu Dec 1 14:43:17 2016[1,3]:F1201 14:43:17.139876 11103 S0:63] 12Check failed: globalIndices_[row] != kUnusedId_ (4294967295 vs. 4294967295)
Thu Dec 1 14:43:17 2016[1,3]:1 14:43:17.139876 11104 SparseRowMatrix.h:63] Check failed: globalIndices_[row] != kUnusedId_ (4294967295 vs. 4294967295) F1201 14:43:17.140319 11102 SparseRowMatrix.h:63] Check failed: globalIndices_[row] != kUnusedId_ (4294967295 vs. 4294967295)
Thu Dec 1 14:43:17 2016[1,3]:*** Check failure stack trace: ***
Thu Dec 1 14:43:17 2016[1,3]:F1201 14:43:17.139876 11103 S0:63] 12Check failed: globalIndices_[row] != kUnusedId_ (4294967295 vs. 4294967295)
Thu Dec 1 14:43:17 2016[1,3]:1 14:43:17.139876 11104 SparseRowMatrix.h:63] Check failed: globalIndices_[row] != kUnusedId_ (4294967295 vs. 4294967295) F1201 14:43:17.140319 11102 SparseRowMatrix.h:63] Check failed: globalIndices_[row] != kUnusedId_ (4294967295 vs. 4294967295)
Thu Dec 1 14:43:17 2016[1,3]:*** Check failure stack trace: ***
Thu Dec 1 14:43:17 2016[1,3]: @ 0x13aeb38 google::LogMessage::Fail()
Thu Dec 1 14:43:17 2016[1,3]: @ 0x13aeb38 google::LogMessage::Fail()
Thu Dec 1 14:43:17 2016[1,3]: @ 0x13aeb38 google::LogMessage::Fail()
Thu Dec 1 14:43:17 2016[1,3]: @ 0x13aeb38 google::LogMessage::Fail()
Thu Dec 1 14:43:17 2016[1,3]: @ 0x13aea90 google::LogMessage::SendToLog()
Thu Dec 1 14:43:17 2016[1,3]: @ 0x13aea90 google::LogMessage::SendToLog()
Thu Dec 1 14:43:17 2016[1,3]: @ 0x13aea90 google::LogMessage::SendToLog()
Thu Dec 1 14:43:17 2016[1,3]: @ 0x13aea90 google::LogMessage::SendToLog()
Thu Dec 1 14:43:17 2016[1,3]: @ 0x13ae525 google::LogMessage::Flush()
Thu Dec 1 14:43:17 2016[1,3]: @ 0x13ae525 google::LogMessage::Flush()
Thu Dec 1 14:43:17 2016[1,3]: @ 0x13ae525 google::LogMessage::Flush()
Thu Dec 1 14:43:17 2016[1,3]: @ 0x13ae525 google::LogMessage::Flush()
Thu Dec 1 14:43:17 2016[1,3]: @ 0x13b12e6 google::LogMessageFatal::~LogMessageFatal()
Thu Dec 1 14:43:17 2016[1,3]: @ 0x13b12e6 google::LogMessageFatal::~LogMessageFatal()
Thu Dec 1 14:43:17 2016[1,3]: @ 0x13b12e6 google::LogMessageFatal::~LogMessageFatal()
Thu Dec 1 14:43:17 2016[1,3]: @ 0x13b12e6 google::LogMessageFatal::~LogMessageFatal()
Thu Dec 1 14:43:17 2016[1,3]: @ 0x7e5bec paddle::CpuMatrix::mul<>()
Thu Dec 1 14:43:17 2016[1,3]: @ 0x7e5bec paddle::CpuMatrix::mul<>()
Thu Dec 1 14:43:17 2016[1,3]: @ 0x7e5bec paddle::CpuMatrix::mul<>()
Thu Dec 1 14:43:17 2016[1,3]: @ 0x7e5bec paddle::CpuMatrix::mul<>()
Thu Dec 1 14:43:17 2016[1,3]: @ 0x7d5b77 paddle::CpuMatrix::mul()
Thu Dec 1 14:43:17 2016[1,3]: @ 0x7d5b77 paddle::CpuMatrix::mul()
Thu Dec 1 14:43:17 2016[1,3]: @ 0x7d5b77 paddle::CpuMatrix::mul()
Thu Dec 1 14:43:17 2016[1,3]: @ 0x7d5b77 paddle::CpuMatrix::mul()
Thu Dec 1 14:43:17 2016[1,3]: @ 0x675f1e paddle::FullyConnectedLayer::forward()
Thu Dec 1 14:43:17 2016[1,3]: @ 0x675f1e paddle::FullyConnectedLayer::forward()
Thu Dec 1 14:43:17 2016[1,3]: @ 0x675f1e paddle::FullyConnectedLayer::forward()
Thu Dec 1 14:43:17 2016[1,3]: @ 0x675f1e paddle::FullyConnectedLayer::forward()
Thu Dec 1 14:43:17 2016[1,3]: @ 0x6d36a4 paddle::NeuralNetwork::forward()
Thu Dec 1 14:43:17 2016[1,3]: @ 0x6d36a4 paddle::NeuralNetwork::forward()
Thu Dec 1 14:43:17 2016[1,3]: @ 0x6d36a4 paddle::NeuralNetwork::forward()
Thu Dec 1 14:43:17 2016[1,3]: @ 0x6d36a4 paddle::NeuralNetwork::forward()
Thu Dec 1 14:43:17 2016[1,3]: @ 0x6c9ae6 paddle::TrainerThread::forward()
Thu Dec 1 14:43:17 2016[1,3]: @ 0x6c9ae6 paddle::TrainerThread::forward()
Thu Dec 1 14:43:17 2016[1,3]: @ 0x6c9ae6 paddle::TrainerThread::forward()
Thu Dec 1 14:43:17 2016[1,3]: @ 0x6c9ae6 paddle::TrainerThread::forward()
Thu Dec 1 14:43:17 2016[1,3]: @ 0x6cac28 paddle::TrainerThread::computeThread()
Thu Dec 1 14:43:17 2016[1,3]: @ 0x6cac28 paddle::TrainerThread::computeThread()
Thu Dec 1 14:43:17 2016[1,3]: @ 0x6cac28 paddle::TrainerThread::computeThread()
Thu Dec 1 14:43:17 2016[1,3]: @ 0x6cac28 paddle::TrainerThread::computeThread()
Thu Dec 1 14:43:17 2016[1,3]: @ 0x7f36b565f8a0 execute_native_thread_routine
Thu Dec 1 14:43:17 2016[1,3]: @ 0x7f36b565f8a0 execute_native_thread_routine
Thu Dec 1 14:43:17 2016[1,3]: @ 0x7f36b565f8a0 execute_native_thread_routine
Thu Dec 1 14:43:17 2016[1,3]: @ 0x7f36b565f8a0 execute_native_thread_routine
Thu Dec 1 14:43:17 2016[1,3]: @ 0x7f36b5edd1c3 start_thread
Thu Dec 1 14:43:17 2016[1,3]: @ 0x7f36b5edd1c3 start_thread
Thu Dec 1 14:43:17 2016[1,3]: @ 0x7f36b5edd1c3 start_thread
Thu Dec 1 14:43:17 2016[1,3]: @ 0x7f36b5edd1c3 start_thread
Thu Dec 1 14:43:17 2016[1,3]: @ 0x7f36b4dd012d __clone
Thu Dec 1 14:43:17 2016[1,3]: @ 0x7f36b4dd012d __clone
Thu Dec 1 14:43:17 2016[1,3]: @ 0x7f36b4dd012d __clone
Thu Dec 1 14:43:17 2016[1,3]: @ 0x7f36b4dd012d __clone
'''
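Several threads failed at the same moment, so each stack frame in the log above shows up once per thread. A minimal sketch of collapsing such a trace into its distinct frames is given below. The helper name `unique_frames` and the regular expression are illustrative assumptions, and the sample is a short excerpt of the frame lines quoted above, not the full log.

```python
import re
from collections import OrderedDict

# Matches the frame part of a glog stack-trace line, for example
# "... @ 0x7e5bec paddle::CpuMatrix::mul<>()".
FRAME_RE = re.compile(r"@\s+(0x[0-9a-f]+)\s+(.+)$")


def unique_frames(log_text):
    """Return (address, symbol, count) per distinct frame, in first-seen order."""
    counts = OrderedDict()
    for line in log_text.splitlines():
        match = FRAME_RE.search(line)
        if match:
            key = (match.group(1), match.group(2).strip())
            counts[key] = counts.get(key, 0) + 1
    return [(addr, sym, n) for (addr, sym), n in counts.items()]


# Excerpt of the frame lines from the log above (hypothetical helper input).
sample = """\
Thu Dec 1 14:43:17 2016[1,3]: @ 0x7e5bec paddle::CpuMatrix::mul<>()
Thu Dec 1 14:43:17 2016[1,3]: @ 0x7e5bec paddle::CpuMatrix::mul<>()
Thu Dec 1 14:43:17 2016[1,3]: @ 0x7d5b77 paddle::CpuMatrix::mul()
Thu Dec 1 14:43:17 2016[1,3]: @ 0x675f1e paddle::FullyConnectedLayer::forward()
"""

frames = unique_frames(sample)
for addr, sym, n in frames:
    print(f"{addr} {sym} (x{n})")
```

Read from bottom to top, the collapsed trace says the failing check is reached from `TrainerThread::computeThread()` through `FullyConnectedLayer::forward()` and `CpuMatrix::mul()`.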

CDDB commented Dec 1, 2016

@backyes

CDDB commented Dec 1, 2016

Confirmed: with Test disabled, the job runs through.
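For readers of the log: the value 4294967295 on both sides of the failed check is 2^32 - 1, the largest 32-bit unsigned integer, which is a common choice for an "unused slot" marker. The sketch below only illustrates how a check of this kind can fire, assuming a table that stores only the rows fetched ahead of time. `SparseRowTable`, `prefetch` and `get_row` are hypothetical names, not Paddle's API, and this thread does not establish whether the test phase actually skips such a step.

```python
UNUSED_ID = 2**32 - 1  # 4294967295, the value printed in the failed check


class SparseRowTable:
    """Hypothetical table that materializes only the rows that were prefetched."""

    def __init__(self, num_rows):
        # Every row starts out unmapped, marked with the sentinel.
        self.local_index = [UNUSED_ID] * num_rows
        self.rows = []

    def prefetch(self, row_ids, width):
        """Allocate local storage for the given global row ids."""
        for row in row_ids:
            if self.local_index[row] == UNUSED_ID:
                self.local_index[row] = len(self.rows)
                self.rows.append([0.0] * width)

    def get_row(self, row):
        """Return a prefetched row; fail loudly if the row was never fetched."""
        if self.local_index[row] == UNUSED_ID:
            raise LookupError(f"row {row} was never prefetched")
        return self.rows[self.local_index[row]]


table = SparseRowTable(num_rows=5)
table.prefetch([1, 3], width=2)
print(table.get_row(3))      # a prefetched row is available

try:
    table.get_row(2)         # row 2 was not prefetched, so the lookup fails
except LookupError as err:
    print("lookup failed:", err)
```

Under that assumption, any forward pass that touches a row outside the prefetched set would hit the sentinel, which would be consistent with the crash appearing only when Test is enabled.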

zhhsplendid pushed a commit to zhhsplendid/Paddle that referenced this issue Sep 25, 2019
* fix_windows

* Final update 1.3 (PaddlePaddle#653)

* thorough clean

* delete_DS_Store

* update_1.3
Meiyim pushed a commit to Meiyim/Paddle that referenced this issue May 21, 2021