My program works well locally, but when I set `is_local=False`, an error occurs.
I submitted the job this way:
paddlecloud submit -jobname my-paddlecloud-job \
  -cpu 2 \
  -gpu 0 \
  -memory 4Gi \
  -parallelism 4 \
  -pscpu 1 \
  -pservers 2 \
  -psmemory 1Gi \
  -passes 1 \
  -entry "python trainer_config.py" \
  /pfs/[datacenter_name]/home/[username]/ctr_demo_package
Here is the error information:
==========================dpt-l1-sync-test-trainer-v0t8n==========================
label selector: paddle-job-pserver=dpt-l1-sync-test, desired: 2
current cnt: 1
sleep for 5 seconds...
label selector: paddle-job=dpt-l1-sync-test, desired: 2
Starting training job: /pfs/mulan/home/wangkairui@baidu.com/jobs/dpt-l1-sync-test, num_gradient_servers: 2, trainer_id: 1, version: v2
[INFO 2018-03-29 09:25:02,441 train.py:55] class number is : 28.
[INFO 2018-03-29 09:25:02,460 train.py:75] length of word dictionary is : 40201.
I0329 09:25:03.139626 131 Util.cpp:166] commandline: --num_gradient_servers=2 --ports_num_for_sparse=1 --use_gpu=False --trainer_id=1 --pservers=192.168.170.133,192.168.32.36 --trainer_count=1 --num_passes=1 --ports_num=1 --port=7164
I0329 09:25:03.203243 131 GradientMachine.cpp:94] Initing parameters..
I0329 09:25:03.592674 131 GradientMachine.cpp:101] Init parameters done.
I0329 09:25:03.593145 131 ParameterClient2.cpp:113] pserver 0 192.168.170.133:7164
I0329 09:25:03.593410 131 ParameterClient2.cpp:113] pserver 1 192.168.32.36:7164
processing /pfs/mulan/home/wangkairui@baidu.com/ques_sync_test/train_data/train-00001
[INFO 2018-03-29 09:25:09,039 train.py:110] Pass 0, trainer 1, Batch 0, Cost 3.327363, {'__auc_evaluator_0__': 0.0, 'classification_error_evaluator': 0.875}
F0329 09:25:40.119244 171 SocketChannel.cpp:54] Check failed: len >= 0 peer=192.168.170.133
*** Check failure stack trace: ***
    @ 0x7f2081f506fd google::LogMessage::Fail()
    @ 0x7f2081f541ac google::LogMessage::SendToLog()
    @ 0x7f2081f50223 google::LogMessage::Flush()
    @ 0x7f2081f556be google::LogMessageFatal::~LogMessageFatal()
    @ 0x7f2081db51c4 paddle::SocketChannel::read()
    @ 0x7f2081db56b0 paddle::SocketChannel::readMessage()
    @ 0x7f2081db64e6 paddle::ProtoClient::recv()
    @ 0x7f20824b93d4 paddle::ParameterClient2::sendParallel()
    @ 0x7f2081ebff7c _ZNSt6thread5_ImplISt12_Bind_simpleIFZN6paddle14SyncThreadPool5startEvEUliE_mEEE6_M_runEv
    @ 0x7f20ae8c4c80 (unknown)
    @ 0x7f20b97696ba start_thread
    @ 0x7f20b949f3dd clone
    @ (nil) (unknown)
Aborted
job returned 134...setting pod return message...
===============================
termination log wroted...
It seems the Parameter Server failed. Please try increasing the PServer memory by passing the `-psmemory` argument, for example `-psmemory 5Gi`.
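As a sketch of what that resubmission might look like, here is the command from the question with only `-psmemory` changed. The `5Gi` value is just the illustrative figure from this suggestion, not a measured requirement; the right size depends on the model and dictionary size, and the bracketed path segments remain placeholders.

```shell
# Same job as in the question; only the PServer memory is raised (1Gi -> 5Gi).
# 5Gi is an illustrative value, adjust it to your model's parameter size.
paddlecloud submit -jobname my-paddlecloud-job \
  -cpu 2 \
  -gpu 0 \
  -memory 4Gi \
  -parallelism 4 \
  -pscpu 1 \
  -pservers 2 \
  -psmemory 5Gi \
  -passes 1 \
  -entry "python trainer_config.py" \
  /pfs/[datacenter_name]/home/[username]/ctr_demo_package
```

If the trainer still aborts with `Check failed: len >= 0` in `SocketChannel.cpp`, checking the logs of the pserver at the reported peer address would be the next step, since that check fires when the trainer's read from the pserver connection fails.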