Paddle large model: paddle::GradientMachine::randParameters() Segmentation fault #3199

Closed
yinyunfeng opened this issue Aug 3, 2017 · 6 comments · Fixed by #3518
yinyunfeng commented Aug 3, 2017

The error info:

Thu Aug 3 00:42:46 2017[1,2]:+ ./paddle_trainer --num_gradient_servers=6 --trainer_id=2 --pservers=10.73.218.49,10.73.218.48,10.73.218.45,10.73.226.18,10.73.226.16,10.73.226.21 --rdma_tcp=tcp --nics=xgbe0 --port=7164 --ports_num=1 --dot_period=100 --use_old_updater=1 --test_all_data_in_one_period=true --pserver_num_threads=5 --loadsave_parameters_in_pserver=1 --log_period=100 --trainer_count=16 --ports_num_for_sparse=1 --num_passes=50 --saving_period=1 --local=0 --config=conf/trainer_config.conf --save_dir=./output --use_gpu=0
Thu Aug 3 00:43:28 2017[1,2]:*** Aborted at 1501692208 (unix time) try "date -d @1501692208" if you are using GNU date ***
Thu Aug 3 00:43:28 2017[1,2]:PC: @ 0x6ec623 paddle::GradientMachine::randParameters()
Thu Aug 3 00:43:28 2017[1,2]:*** SIGSEGV (@0x30) received by PID 3852 (TID 0x7f1f79428780) from PID 48; stack trace: ***
Thu Aug 3 00:43:28 2017[1,2]: @ 0x7f1f79002160 (unknown)
Thu Aug 3 00:43:28 2017[1,2]: @ 0x6ec623 paddle::GradientMachine::randParameters()
Thu Aug 3 00:43:28 2017[1,2]: @ 0x72a73c paddle::Trainer::init()
Thu Aug 3 00:43:28 2017[1,2]: @ 0x575ed4 main
Thu Aug 3 00:43:28 2017[1,2]: @ 0x7f1f77c0fbd5 __libc_start_main
Thu Aug 3 00:43:28 2017[1,2]: @ 0x584cc1 (unknown)
Thu Aug 3 00:43:32 2017[1,2]:./train.sh: line 207: 3852 Segmentation fault PYTHONPATH=./paddle:$PYTHONPATH GLOG_logtostderr=0 GLOG_log_dir="./log" ./paddle_trainer --num_gradient_servers=${OMPI_COMM_WORLD_SIZE} --trainer_id=${OMPI_COMM_WORLD_RANK} --pservers=$ipstring --rdma_tcp=${rdma_tcp} --nics=${nics} ${train_arg} --config=conf/trainer_config.conf --save_dir=./${save_dir} ${extern_arg}
Thu Aug 3 00:43:32 2017[1,2]:+ '[' 139 -ne 0 ']'
Thu Aug 3 00:43:32 2017[1,2]:+ kill_pserver2_exit
Thu Aug 3 00:43:32 2017[1,2]:+ ps aux
Thu Aug 3 00:43:32 2017[1,2]:+ grep paddle_pserver2
Thu Aug 3 00:43:32 2017[1,2]:+ grep paddle_cluster_job
Thu Aug 3 00:43:32 2017[1,2]:+ grep -v grep
Thu Aug 3 00:43:32 2017[1,2]:+ cut -c10-14
Thu Aug 3 00:43:32 2017[1,2]:+ xargs kill -9
Thu Aug 3 00:43:32 2017[1,2]:+ log_fatal 'paddle_trainer failed kill paddle_pserver2 and exit'
Thu Aug 3 00:43:32 2017[1,2]:+ echo '[./common.sh : 399] [kill_pserver2_exit]'
Thu Aug 3 00:43:32 2017[1,2]:[./common.sh : 399] [kill_pserver2_exit]

The job link is:
http://10.73.218.49:8920/fileview.html?path=/home/normandy/maybach/280706/workspace/log

wanghaoshuang commented Aug 3, 2017

Could you paste the network configuration? It looks like the parameters are not configured correctly.

void GradientMachine::randParameters() {
  LOG(INFO) << "Initing parameters..";

  for (auto& para : parameters_) {
    if (para->isFullSize()) {
      para->randomize();
    }
  }
  LOG(INFO) << "Init parameters done.";
}

From the code above, parameters_ is probably empty.

wanghaoshuang commented

@yinyunfeng Does this configuration run successfully on a local machine?
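
One way to verify this locally is to parse the config the same way paddle_trainer does and list the parameters it defines. A minimal sketch, assuming Paddle v1's paddle.trainer.config_parser module is importable and that the proto field names below match your version:

# Sketch: parse the trainer config and print the parameters it would create.
# Assumes Paddle v1's config_parser is on PYTHONPATH; field names may
# differ slightly between versions.
from paddle.trainer.config_parser import parse_config

conf = parse_config('conf/trainer_config.conf', 'use_gpu=0')
for param in conf.model_config.parameters:
    print(param.name, param.size)

If this prints no parameters, that would match the empty parameters_ diagnosis above.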

yinyunfeng commented Aug 3, 2017

Inputs("feature", "label")

Layer(name = "feature", type = "data", size = feature_size)
Layer(name = "label", type = "data",  size = 1)

Layer(
        name = "embedding_fea",
        type = "mixed",
        active_type='',
        size = dnn_layer_dims[0],
        bias = False,
        inputs = TableProjection("feature", parameter_name = "_emb_", decay_rate=1e-4, initial_std=0.02,learning_rate=1, sparse_remote_update = True),
        #inputs = TableProjection("feature", parameter_name = "_emb_", decay_rate=1e-4, initial_std=0.02,learning_rate=1),
        )

Layer(
        inputs = [Input("embedding_fea")],
        name = "emb_avg",
        active_type = "tanh",
        type = "average",
        average_strategy="sum",
        trans_type = "non-seq"
)

prev_layer = "emb_avg"
for i, dim in enumerate(dnn_layer_dims[1:]):
    Layer(
            inputs = [Input(prev_layer)],
            name = "hidden-%d" % (i + 1),
            active_type = "relu",
            type = "fc",
            size = dim)
    prev_layer = "hidden-%d" % (i + 1)

Layer(
        inputs = [Input(prev_layer)],
        name = "hidden-last",
        active_type = "softmax",
        type = "fc",
        size = 2)

Layer(inputs = ["hidden-last", 'label'], name = "cost", type = "multi-class-cross-entropy")
#Layer(inputs = ["hidden-last", 'label'], name = "cost", type = "multi_binary_label_cross_entropy")

Evaluator(inputs = ["hidden-last", "label"], name = "auc", type = "last-column-auc")

yinyunfeng commented

It runs locally without any problem.

typhoonzero commented

@yinyunfeng Could you check which optimization algorithm is configured? Adam and AdaDelta do not support sparse updates.
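
For reference, in this raw config style the optimizer is chosen via Settings. A minimal sketch of what to look for, assuming the old config_parser API; the values below are placeholders, not taken from the original config:

# Sketch of the optimizer block in the raw config style. Settings() is
# part of Paddle v1's config_parser API; all values here are placeholders.
Settings(
    algorithm = 'sgd',             # sparse_remote_update needs an SGD-family setup
    learning_method = 'momentum',  # 'adam' or 'adadelta' here would not support sparse
    learning_rate = 0.01,
    batch_size = 128,
)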

yinyunfeng commented

sgd @typhoonzero
