paddle large model: paddle::GradientMachine::randParameters() Segmentation fault #3199
Could you paste your network configuration? It looks like the parameters are not set up correctly.
Judging from the code above, parameters_ is most likely empty.
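One way to verify this locally is to parse the trainer config and count the parameters it produces. A minimal sketch, assuming the Paddle v1 Python API (parse_config and the model_config.parameters proto field are assumptions; the config path is illustrative):

# Hedged sketch: parse the v1 trainer config and count its parameters.
# parse_config and model_config.parameters are assumed from the Paddle v1
# Python API; the config path below is illustrative.
from paddle.trainer.config_parser import parse_config

conf = parse_config('conf/trainer_config.conf', '')
print('number of parameters:', len(conf.model_config.parameters))
# If this prints 0, GradientMachine::randParameters() would have no
# parameters to initialize, matching the diagnosis above.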
@yinyunfeng Does this configuration run successfully on a local machine?
Inputs("feature", "label")
Layer(name = "feature", type = "data", size = feature_size)
Layer(name = "label", type = "data", size = 1)
Layer(
    name = "embedding_fea",
    type = "mixed",
    active_type = '',
    size = dnn_layer_dims[0],
    bias = False,
    inputs = TableProjection(
        "feature",
        parameter_name = "_emb_",
        decay_rate = 1e-4,
        initial_std = 0.02,
        learning_rate = 1,
        sparse_remote_update = True),
    #inputs = TableProjection("feature", parameter_name = "_emb_", decay_rate = 1e-4, initial_std = 0.02, learning_rate = 1),
)
Layer(
    inputs = [Input("embedding_fea")],
    name = "emb_avg",
    active_type = "tanh",
    type = "average",
    average_strategy = "sum",
    trans_type = "non-seq"
)
prev_layer = "emb_avg"
for i, dim in enumerate(dnn_layer_dims[1:]):
    Layer(
        inputs = [Input(prev_layer)],
        name = "hidden-%d" % (i + 1),
        active_type = "relu",
        type = "fc",
        size = dim)
    prev_layer = "hidden-%d" % (i + 1)
Layer(
    inputs = [Input(prev_layer)],
    name = "hidden-last",
    active_type = "softmax",
    type = "fc",
    size = 2)
Layer(inputs = ["hidden-last", "label"], name = "cost", type = "multi-class-cross-entropy")
#Layer(inputs = ["hidden-last", 'label'], name = "cost", type = "multi_binary_label_cross_entropy")
Evaluator(inputs = ["hidden-last", "label"], name = "auc", type = "last-column-auc")
It runs locally without any problem.
@yinyunfeng Could you check which optimization algorithm is configured? Adam and AdaDelta do not support sparse updates.
sgd @typhoonzero
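For reference, a minimal sketch of how plain SGD could be declared in the same v1 raw config format. The Settings() block and its field names are an assumption based on the v1 API, and the values are placeholders, not the reporter's actual settings:

# Hedged sketch of an optimizer block in the Paddle v1 raw config format.
# Settings() fields are assumed from the v1 API; values are placeholders.
Settings(
    algorithm = 'sgd',        # plain SGD; Adam and AdaDelta do not support sparse updates
    learning_rate = 1e-3,
    batch_size = 1000,
)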
The error info:
Thu Aug 3 00:42:46 2017[1,2]:+ ./paddle_trainer --num_gradient_servers=6 --trainer_id=2 --pservers=10.73.218.49,10.73.218.48,10.73.218.45,10.73.226.18,10.73.226.16,10.73.226.21 --rdma_tcp=tcp --nics=xgbe0 --port=7164 --ports_num=1 --dot_period=100 --use_old_updater=1 --test_all_data_in_one_period=true --pserver_num_threads=5 --loadsave_parameters_in_pserver=1 --log_period=100 --trainer_count=16 --ports_num_for_sparse=1 --num_passes=50 --saving_period=1 --local=0 --config=conf/trainer_config.conf --save_dir=./output --use_gpu=0
Thu Aug 3 00:43:28 2017[1,2]:*** Aborted at 1501692208 (unix time) try "date -d @1501692208" if you are using GNU date ***
Thu Aug 3 00:43:28 2017[1,2]:PC: @ 0x6ec623 paddle::GradientMachine::randParameters()
Thu Aug 3 00:43:28 2017[1,2]:*** SIGSEGV (@0x30) received by PID 3852 (TID 0x7f1f79428780) from PID 48; stack trace: ***
Thu Aug 3 00:43:28 2017[1,2]: @ 0x7f1f79002160 (unknown)
Thu Aug 3 00:43:28 2017[1,2]: @ 0x6ec623 paddle::GradientMachine::randParameters()
Thu Aug 3 00:43:28 2017[1,2]: @ 0x72a73c paddle::Trainer::init()
Thu Aug 3 00:43:28 2017[1,2]: @ 0x575ed4 main
Thu Aug 3 00:43:28 2017[1,2]: @ 0x7f1f77c0fbd5 __libc_start_main
Thu Aug 3 00:43:28 2017[1,2]: @ 0x584cc1 (unknown)
Thu Aug 3 00:43:32 2017[1,2]:./train.sh: line 207: 3852 Segmentation fault PYTHONPATH=./paddle:$PYTHONPATH GLOG_logtostderr=0 GLOG_log_dir="./log" ./paddle_trainer --num_gradient_servers=${OMPI_COMM_WORLD_SIZE} --trainer_id=${OMPI_COMM_WORLD_RANK} --pservers=$ipstring --rdma_tcp=${rdma_tcp} --nics=${nics} ${train_arg} --config=conf/trainer_config.conf --save_dir=./${save_dir} ${extern_arg}
Thu Aug 3 00:43:32 2017[1,2]:+ '[' 139 -ne 0 ']'
Thu Aug 3 00:43:32 2017[1,2]:+ kill_pserver2_exit
Thu Aug 3 00:43:32 2017[1,2]:+ ps aux
Thu Aug 3 00:43:32 2017[1,2]:+ grep paddle_pserver2
Thu Aug 3 00:43:32 2017[1,2]:+ grep paddle_cluster_job
Thu Aug 3 00:43:32 2017[1,2]:+ grep -v grep
Thu Aug 3 00:43:32 2017[1,2]:+ cut -c10-14
Thu Aug 3 00:43:32 2017[1,2]:+ xargs kill -9
Thu Aug 3 00:43:32 2017[1,2]:+ log_fatal 'paddle_trainer failed kill paddle_pserver2 and exit'
Thu Aug 3 00:43:32 2017[1,2]:+ echo '[./common.sh : 399] [kill_pserver2_exit]'
Thu Aug 3 00:43:32 2017[1,2]:[./common.sh : 399] [kill_pserver2_exit]
The job link is:
http://10.73.218.49:8920/fileview.html?path=/home/normandy/maybach/280706/workspace/log