Paddle large model: paddle::GradientMachine::randParameters() Segmentation fault #3199

Closed
yinyunfeng opened this issue Aug 3, 2017 · 6 comments · Fixed by #3518
yinyunfeng commented Aug 3, 2017

The error info:

Thu Aug 3 00:42:46 2017[1,2]:+ ./paddle_trainer --num_gradient_servers=6 --trainer_id=2 --pservers=10.73.218.49,10.73.218.48,10.73.218.45,10.73.226.18,10.73.226.16,10.73.226.21 --rdma_tcp=tcp --nics=xgbe0 --port=7164 --ports_num=1 --dot_period=100 --use_old_updater=1 --test_all_data_in_one_period=true --pserver_num_threads=5 --loadsave_parameters_in_pserver=1 --log_period=100 --trainer_count=16 --ports_num_for_sparse=1 --num_passes=50 --saving_period=1 --local=0 --config=conf/trainer_config.conf --save_dir=./output --use_gpu=0
Thu Aug 3 00:43:28 2017[1,2]:*** Aborted at 1501692208 (unix time) try "date -d @1501692208" if you are using GNU date ***
Thu Aug 3 00:43:28 2017[1,2]:PC: @ 0x6ec623 paddle::GradientMachine::randParameters()
Thu Aug 3 00:43:28 2017[1,2]:*** SIGSEGV (@0x30) received by PID 3852 (TID 0x7f1f79428780) from PID 48; stack trace: ***
Thu Aug 3 00:43:28 2017[1,2]: @ 0x7f1f79002160 (unknown)
Thu Aug 3 00:43:28 2017[1,2]: @ 0x6ec623 paddle::GradientMachine::randParameters()
Thu Aug 3 00:43:28 2017[1,2]: @ 0x72a73c paddle::Trainer::init()
Thu Aug 3 00:43:28 2017[1,2]: @ 0x575ed4 main
Thu Aug 3 00:43:28 2017[1,2]: @ 0x7f1f77c0fbd5 __libc_start_main
Thu Aug 3 00:43:28 2017[1,2]: @ 0x584cc1 (unknown)
Thu Aug 3 00:43:32 2017[1,2]:./train.sh: line 207: 3852 Segmentation fault PYTHONPATH=./paddle:$PYTHONPATH GLOG_logtostderr=0 GLOG_log_dir="./log" ./paddle_trainer --num_gradient_servers=${OMPI_COMM_WORLD_SIZE} --trainer_id=${OMPI_COMM_WORLD_RANK} --pservers=$ipstring --rdma_tcp=${rdma_tcp} --nics=${nics} ${train_arg} --config=conf/trainer_config.conf --save_dir=./${save_dir} ${extern_arg}
Thu Aug 3 00:43:32 2017[1,2]:+ '[' 139 -ne 0 ']'
Thu Aug 3 00:43:32 2017[1,2]:+ kill_pserver2_exit
Thu Aug 3 00:43:32 2017[1,2]:+ ps aux
Thu Aug 3 00:43:32 2017[1,2]:+ grep paddle_pserver2
Thu Aug 3 00:43:32 2017[1,2]:+ grep paddle_cluster_job
Thu Aug 3 00:43:32 2017[1,2]:+ grep -v grep
Thu Aug 3 00:43:32 2017[1,2]:+ cut -c10-14
Thu Aug 3 00:43:32 2017[1,2]:+ xargs kill -9
Thu Aug 3 00:43:32 2017[1,2]:+ log_fatal 'paddle_trainer failed kill paddle_pserver2 and exit'
Thu Aug 3 00:43:32 2017[1,2]:+ echo '[./common.sh : 399] [kill_pserver2_exit]'
Thu Aug 3 00:43:32 2017[1,2]:[./common.sh : 399] [kill_pserver2_exit]

The job link is:
http://10.73.218.49:8920/fileview.html?path=/home/normandy/maybach/280706/workspace/log

wanghaoshuang commented Aug 3, 2017

Could you paste the network configuration? It looks like the parameters are not configured correctly.

void GradientMachine::randParameters() {
  LOG(INFO) << "Initing parameters..";

  for (auto& para : parameters_) {
    if (para->isFullSize()) {
      para->randomize();
    }
  }
  LOG(INFO) << "Init parameters done.";
}

From the code above, parameters_ is probably empty.

wanghaoshuang commented

@yinyunfeng Does this configuration run successfully on a local machine?
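
One way to verify this locally is to parse the config the same way paddle_trainer does and list the parameters it defines. A minimal sketch, assuming Paddle v1's paddle.trainer.config_parser module is importable and that the proto field names below match your version:

# Sketch: parse the trainer config and print the parameters it would create.
# Assumes Paddle v1's config_parser is on PYTHONPATH; field names may
# differ slightly between versions.
from paddle.trainer.config_parser import parse_config

conf = parse_config('conf/trainer_config.conf', 'use_gpu=0')
for param in conf.model_config.parameters:
    print(param.name, param.size)

If this prints no parameters, that would match the empty parameters_ diagnosis above.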

yinyunfeng commented Aug 3, 2017

Inputs("feature", "label")

Layer(name = "feature", type = "data", size = feature_size)
Layer(name = "label", type = "data",  size = 1)

Layer(
        name = "embedding_fea",
        type = "mixed",
        active_type='',
        size = dnn_layer_dims[0],
        bias = False,
        inputs = TableProjection("feature", parameter_name = "_emb_", decay_rate=1e-4, initial_std=0.02,learning_rate=1, sparse_remote_update = True),
        #inputs = TableProjection("feature", parameter_name = "_emb_", decay_rate=1e-4, initial_std=0.02,learning_rate=1),
        )

Layer(
        inputs = [Input("embedding_fea")],
        name = "emb_avg",
        active_type = "tanh",
        type = "average",
        average_strategy="sum",
        trans_type = "non-seq"
)

prev_layer = "emb_avg"
for i, dim in enumerate(dnn_layer_dims[1:]):
    Layer(
            inputs = [Input(prev_layer)],
            name = "hidden-%d" % (i + 1),
            active_type = "relu",
            type = "fc",
            size = dim)
    prev_layer = "hidden-%d" % (i + 1)

Layer(
        inputs = [Input(prev_layer)],
        name = "hidden-last",
        active_type = "softmax",
        type = "fc",
        size = 2)

Layer(inputs = ["hidden-last", 'label'], name = "cost", type = "multi-class-cross-entropy")
#Layer(inputs = ["hidden-last", 'label'], name = "cost", type = "multi_binary_label_cross_entropy")

Evaluator(inputs = ["hidden-last", "label"], name = "auc", type = "last-column-auc")

yinyunfeng commented

It runs locally without any problem.

typhoonzero commented

@yinyunfeng Could you check which optimization algorithm is configured? Adam and AdaDelta do not support sparse updates.
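
For reference, in this raw config style the optimizer is chosen via Settings. A minimal sketch of what to look for, assuming the old config_parser API; the values below are placeholders, not taken from the original config:

# Sketch of the optimizer block in the raw config style. Settings() is
# part of Paddle v1's config_parser API; all values here are placeholders.
Settings(
    algorithm = 'sgd',             # sparse_remote_update needs an SGD-family setup
    learning_method = 'momentum',  # 'adam' or 'adadelta' here would not support sparse
    learning_rate = 0.01,
    batch_size = 128,
)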

yinyunfeng commented

sgd @typhoonzero
