An error occurs when I set 'is_local=False' #650

Open
wkr114 opened this issue Mar 29, 2018 · 1 comment

wkr114 commented Mar 29, 2018

My program works well locally, but an error occurs when I set 'is_local=False'.

I submitted the job this way:

paddlecloud submit -jobname my-paddlecloud-job \
  -cpu 2 \
  -gpu 0 \
  -memory 4Gi \
  -parallelism 4 \
  -pscpu 1 \
  -pservers 2 \
  -psmemory 1Gi \
  -passes 1 \
  -entry "python trainer_config.py" \
  /pfs/[datacenter_name]/home/[username]/ctr_demo_package

Here is the error information:

==========================dpt-l1-sync-test-trainer-v0t8n==========================
label selector: paddle-job-pserver=dpt-l1-sync-test, desired: 2
current cnt: 1 sleep for 5 seconds...
label selector: paddle-job=dpt-l1-sync-test, desired: 2
Starting training job:  /pfs/mulan/home/wangkairui@baidu.com/jobs/dpt-l1-sync-test, num_gradient_servers: 2, trainer_id:  1, version:  v2
[INFO 2018-03-29 09:25:02,441 train.py:55] class number is : 28.
[INFO 2018-03-29 09:25:02,460 train.py:75] length of word dictionary is : 40201.
I0329 09:25:03.139626   131 Util.cpp:166] commandline:  --num_gradient_servers=2 --ports_num_for_sparse=1 --use_gpu=False --trainer_id=1 --pservers=192.168.170.133,192.168.32.36 --trainer_count=1 --num_passes=1 --ports_num=1 --port=7164 
I0329 09:25:03.203243   131 GradientMachine.cpp:94] Initing parameters..
I0329 09:25:03.592674   131 GradientMachine.cpp:101] Init parameters done.
I0329 09:25:03.593145   131 ParameterClient2.cpp:113] pserver 0 192.168.170.133:7164
I0329 09:25:03.593410   131 ParameterClient2.cpp:113] pserver 1 192.168.32.36:7164
processing  /pfs/mulan/home/wangkairui@baidu.com/ques_sync_test/train_data/train-00001
[INFO 2018-03-29 09:25:09,039 train.py:110] Pass 0, trainer 1, Batch 0, Cost 3.327363, {'__auc_evaluator_0__': 0.0, 'classification_error_evaluator': 0.875}

F0329 09:25:40.119244   171 SocketChannel.cpp:54] Check failed: len >= 0  peer=192.168.170.133
*** Check failure stack trace: ***
    @     0x7f2081f506fd  google::LogMessage::Fail()
    @     0x7f2081f541ac  google::LogMessage::SendToLog()
    @     0x7f2081f50223  google::LogMessage::Flush()
    @     0x7f2081f556be  google::LogMessageFatal::~LogMessageFatal()
    @     0x7f2081db51c4  paddle::SocketChannel::read()
    @     0x7f2081db56b0  paddle::SocketChannel::readMessage()
    @     0x7f2081db64e6  paddle::ProtoClient::recv()
    @     0x7f20824b93d4  paddle::ParameterClient2::sendParallel()
    @     0x7f2081ebff7c  _ZNSt6thread5_ImplISt12_Bind_simpleIFZN6paddle14SyncThreadPool5startEvEUliE_mEEE6_M_runEv
    @     0x7f20ae8c4c80  (unknown)
    @     0x7f20b97696ba  start_thread
    @     0x7f20b949f3dd  clone
    @              (nil)  (unknown)
Aborted
job returned 134...setting pod return message...
===============================
termination log wroted...
Yancey1989 (Collaborator) commented Mar 29, 2018

It seems the parameter server failed. Please try increasing the PServer memory with the -psmemory argument, for example -psmemory 5Gi.
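As a rough sketch, the resubmission could look like the following. It assumes every other flag stays as in the original command and only -psmemory changes. The 5Gi value is just the example figure above, not a measured requirement. The PSMEMORY variable and the `command -v` guard are illustrative additions, not part of the paddlecloud client.

```shell
# Example value from the suggestion above; adjust to what the pservers actually need.
PSMEMORY=5Gi

# Illustrative guard: only submit when the paddlecloud client is installed.
if command -v paddlecloud >/dev/null 2>&1; then
  # Same flags as the original submission; only -psmemory differs.
  paddlecloud submit -jobname my-paddlecloud-job \
    -cpu 2 \
    -gpu 0 \
    -memory 4Gi \
    -parallelism 4 \
    -pscpu 1 \
    -pservers 2 \
    -psmemory "${PSMEMORY}" \
    -passes 1 \
    -entry "python trainer_config.py" \
    "/pfs/[datacenter_name]/home/[username]/ctr_demo_package"
else
  echo "paddlecloud not found; would submit with -psmemory ${PSMEMORY}"
fi
```

If the trainer still aborts in SocketChannel.cpp after the change, the pserver pods are the next place to look, since the failing peer in the trace (192.168.170.133) is pserver 0.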
