Debugging large model sparse training issue #5063

Closed
typhoonzero opened this issue Oct 25, 2017 · 1 comment
Comments

typhoonzero (Contributor) commented Oct 25, 2017

  • This issue records my work debugging the training of a large CTR model with distributed sparse remote parameter updating.

Background

In CTR model training, the LR (wide) part of the model can use a very large feature dimension, so the model is too large to store on a single trainer even in the "sparse row" format. We therefore need to shard this part of the model evenly across the pservers, so that trainers fetch only the rows they need during prefetch.

Refer to here for some details. This feature should be rewritten in the refactored code.
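To illustrate the idea (a hypothetical sketch, not the actual Paddle implementation): rows of the huge sparse parameter are assigned to pservers, and a trainer groups the row ids touched by the current batch by owning server so prefetch issues one request per server. The modulo placement below is an assumption for illustration; the real code shards by parameter blocks.

```python
def server_for_row(row_id: int, num_pservers: int) -> int:
    # Assumption: simple round-robin (modulo) placement of rows on pservers.
    return row_id % num_pservers

def prefetch_plan(row_ids, num_pservers):
    # Group the rows needed by this batch by owning pserver, so one
    # request per server fetches all of that server's rows.
    plan = {}
    for r in sorted(set(row_ids)):
        plan.setdefault(server_for_row(r, num_pservers), []).append(r)
    return plan

# With 10 pservers, rows 3 and 13 both live on server 3:
# prefetch_plan([3, 13, 7, 3], 10) -> {3: [3, 13], 7: [7]}
```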

Records

Using the V1 CTR model config (wide part):

def widectr_net():
    signs = data_layer("feasigns", int(1e2))
    lr = fc_layer(input=signs, size=128, act=SigmoidActivation(),
                  param_attr=ParamAttr(sparse_update=True))
    return lr

Start 10 pservers and 20 trainers. Trainer command args:
/usr/local/bin/paddle_trainer --port=7164 --nics=eth0 --ports_num=1 --ports_num_for_sparse=1 --num_passes=1 --trainer_count=1 --saving_period=1 --log_period=20 --local=0 --rdma_tcp=tcp --config=train.py --use_gpu=0 --trainer_id=8 --save_dir= --pservers=...... --num_gradient_servers=20 --loadsave_parameters_in_pserver=1 --use_old_updater=1 -v 100

The trainer then gets stuck calling "add gradient" (the prefetch itself succeeds), and eventually fails with a "timeout". Some logs:

Tip: update_mode 3 is PSERVER_UPDATE_MODE_ADD_GRADIENT, 6 is PSERVER_UPDATE_MODE_GET_PARAM_SPARSE.
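A small hypothetical helper I used for triaging logs like the ones below: count ParameterClient2 request lines per update mode. Only the two mode numbers from the tip above are named; any other mode value is left as a raw number.

```python
import re
from collections import Counter

# Mode numbers taken from the tip above; other values are not assumed.
MODE_NAMES = {
    3: "PSERVER_UPDATE_MODE_ADD_GRADIENT",
    6: "PSERVER_UPDATE_MODE_GET_PARAM_SPARSE",
}

def count_update_modes(log_lines):
    """Count request log lines per update mode name."""
    counts = Counter()
    for line in log_lines:
        m = re.search(r"update_mode(\d+)", line)
        if m:
            mode = int(m.group(1))
            counts[MODE_NAMES.get(mode, "mode_%d" % mode)] += 1
    return counts
```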

I1025 01:58:37.992717    58 ParameterClient2.cpp:228] request: trainer_id: 8 update_mode6 send_back_parameter: 1 send_back_parameter_type: 0 num_samples: 0 cost: 0 batch_status: 3
I1025 01:58:37.992750    58 ParameterClient2.cpp:228] request: trainer_id: 8 update_mode6 send_back_parameter: 1 send_back_parameter_type: 0 num_samples: 0 cost: 0 batch_status: 3
I1025 01:58:37.992755    58 ParameterClient2.cpp:228] request: trainer_id: 8 update_mode6 send_back_parameter: 1 send_back_parameter_type: 0 num_samples: 0 cost: 0 batch_status: 3
I1025 01:58:37.992758    58 ParameterClient2.cpp:228] request: trainer_id: 8 update_mode6 send_back_parameter: 1 send_back_parameter_type: 0 num_samples: 0 cost: 0 batch_status: 3
I1025 01:58:37.992763    58 ParameterClient2.cpp:228] request: trainer_id: 8 update_mode6 send_back_parameter: 1 send_back_parameter_type: 0 num_samples: 0 cost: 0 batch_status: 3
I1025 01:58:37.992766    58 ParameterClient2.cpp:228] request: trainer_id: 8 update_mode6 send_back_parameter: 1 send_back_parameter_type: 0 num_samples: 0 cost: 0 batch_status: 3
I1025 01:58:37.992769    58 ParameterClient2.cpp:228] request: trainer_id: 8 update_mode6 send_back_parameter: 1 send_back_parameter_type: 0 num_samples: 0 cost: 0 batch_status: 3
I1025 01:58:37.992772    58 ParameterClient2.cpp:228] request: trainer_id: 8 update_mode6 send_back_parameter: 1 send_back_parameter_type: 0 num_samples: 0 cost: 0 batch_status: 3
I1025 01:58:37.992776    58 ParameterClient2.cpp:228] request: trainer_id: 8 update_mode6 send_back_parameter: 1 send_back_parameter_type: 0 num_samples: 0 cost: 0 batch_status: 3
I1025 01:58:37.992780    58 ParameterClient2.cpp:228] request: trainer_id: 8 update_mode6 send_back_parameter: 1 send_back_parameter_type: 0 num_samples: 0 cost: 0 batch_status: 3
I1025 01:58:37.992873    71 ParameterClient2.cpp:280] prepareSendData sparse in thread, serverId: 0 tid 0 blockId 8
I1025 01:58:37.992893    71 ParameterClient2.cpp:280] prepareSendData sparse in thread, serverId: 0 tid 0 blockId 18
I1025 01:58:37.992897    71 ParameterClient2.cpp:280] prepareSendData sparse in thread, serverId: 0 tid 0 blockId 28
I1025 01:58:37.992899    71 ParameterClient2.cpp:280] prepareSendData sparse in thread, serverId: 0 tid 0 blockId 38
I1025 01:58:37.992902    71 ParameterClient2.cpp:280] prepareSendData sparse in thread, serverId: 0 tid 0 blockId 48
...
I1025 01:58:37.993465    77 ParameterClient2.cpp:280] prepareSendData sparse in thread, serverId: 6 tid 6 blockId 84
I1025 01:58:37.993469    77 ParameterClient2.cpp:280] prepareSendData sparse in thread, serverId: 6 tid 6 blockId 94
I1025 01:58:37.993535    71 ParameterClient2.cpp:166] sendParallel, tid: 0 numMyClients 1 numThreads 10
I1025 01:58:37.993538    72 ParameterClient2.cpp:166] sendParallel, tid: 1 numMyClients 1 numThreads 10
I1025 01:58:37.993541    74 ParameterClient2.cpp:166] sendParallel, tid: 3 numMyClients 1 numThreads 10
I1025 01:58:37.993547    71 ParameterClient2.cpp:174] #### before recv, i: 8
I1025 01:58:37.993548    72 ParameterClient2.cpp:174] #### before recv, i: 9
I1025 01:58:37.993541    78 ParameterClient2.cpp:166] sendParallel, tid: 7 numMyClients 1 numThreads 10
I1025 01:58:37.993576    77 ParameterClient2.cpp:166] sendParallel, tid: 6 numMyClients 1 numThreads 10
I1025 01:58:37.993587    79 ParameterClient2.cpp:166] sendParallel, tid: 8 numMyClients 1 numThreads 10
I1025 01:58:37.993553    74 ParameterClient2.cpp:174] #### before recv, i: 1
I1025 01:58:37.993597    78 ParameterClient2.cpp:174] #### before recv, i: 5
I1025 01:58:37.993599    79 ParameterClient2.cpp:174] #### before recv, i: 6
I1025 01:58:37.993558    73 ParameterClient2.cpp:166] sendParallel, tid: 2 numMyClients 1 numThreads 10
I1025 01:58:37.993538    75 ParameterClient2.cpp:166] sendParallel, tid: 4 numMyClients 1 numThreads 10
I1025 01:58:37.993597    77 ParameterClient2.cpp:174] #### before recv, i: 4
I1025 01:58:37.993616    75 ParameterClient2.cpp:174] #### before recv, i: 2
I1025 01:58:37.993569    80 ParameterClient2.cpp:166] sendParallel, tid: 9 numMyClients 1 numThreads 10
I1025 01:58:37.993613    73 ParameterClient2.cpp:174] #### before recv, i: 0
I1025 01:58:37.993558    76 ParameterClient2.cpp:166] sendParallel, tid: 5 numMyClients 1 numThreads 10
I1025 01:58:37.993628    80 ParameterClient2.cpp:174] #### before recv, i: 7
I1025 01:58:37.993633    76 ParameterClient2.cpp:174] #### before recv, i: 3
I1025 01:58:38.435159    57 ParameterClient2.cpp:228] request: trainer_id: 8 update_mode3 send_back_parameter: 1 send_back_parameter_type: 0 num_samples: 100 cost: 0 batch_status: 3
I1025 01:58:38.435159    58 ParameterClient2.cpp:228] request: trainer_id: 8 update_mode3 send_back_parameter: 0 send_back_parameter_type: 0 num_samples: 100 cost: 0 batch_status: 3
I1025 01:58:38.435195    57 ParameterClient2.cpp:228] request: trainer_id: 8 update_mode3 send_back_parameter: 1 send_back_parameter_type: 0 num_samples: 100 cost: 0 batch_status: 3
I1025 01:58:38.435205    58 ParameterClient2.cpp:228] request: trainer_id: 8 update_mode3 send_back_parameter: 0 send_back_parameter_type: 0 num_samples: 100 cost: 0 batch_status: 3
I1025 01:58:38.435209    57 ParameterClient2.cpp:228] request: trainer_id: 8 update_mode3 send_back_parameter: 1 send_back_parameter_type: 0 num_samples: 100 cost: 0 batch_status: 3
I1025 01:58:38.435211    58 ParameterClient2.cpp:228] request: trainer_id: 8 update_mode3 send_back_parameter: 0 send_back_parameter_type: 0 num_samples: 100 cost: 0 batch_status: 3
I1025 01:58:38.435214    57 ParameterClient2.cpp:228] request: trainer_id: 8 update_mode3 send_back_parameter: 1 send_back_parameter_type: 0 num_samples: 100 cost: 0 batch_status: 3
I1025 01:58:38.435215    58 ParameterClient2.cpp:228] request: trainer_id: 8 update_mode3 send_back_parameter: 0 send_back_parameter_type: 0 num_samples: 100 cost: 0 batch_status: 3
I1025 01:58:38.435217    57 ParameterClient2.cpp:228] request: trainer_id: 8 update_mode3 send_back_parameter: 1 send_back_parameter_type: 0 num_samples: 100 cost: 0 batch_status: 3
I1025 01:58:38.435220    58 ParameterClient2.cpp:228] request: trainer_id: 8 update_mode3 send_back_parameter: 0 send_back_parameter_type: 0 num_samples: 100 cost: 0 batch_status: 3
I1025 01:58:38.435222    57 ParameterClient2.cpp:228] request: trainer_id: 8 update_mode3 send_back_parameter: 1 send_back_parameter_type: 0 num_samples: 100 cost: 0 batch_status: 3
I1025 01:58:38.435225    58 ParameterClient2.cpp:228] request: trainer_id: 8 update_mode3 send_back_parameter: 0 send_back_parameter_type: 0 num_samples: 100 cost: 0 batch_status: 3
I1025 01:58:38.435227    57 ParameterClient2.cpp:228] request: trainer_id: 8 update_mode3 send_back_parameter: 1 send_back_parameter_type: 0 num_samples: 100 cost: 0 batch_status: 3
I1025 01:58:38.435232    58 ParameterClient2.cpp:228] request: trainer_id: 8 update_mode3 send_back_parameter: 0 send_back_parameter_type: 0 num_samples: 100 cost: 0 batch_status: 3
I1025 01:58:38.435237    57 ParameterClient2.cpp:228] request: trainer_id: 8 update_mode3 send_back_parameter: 1 send_back_parameter_type: 0 num_samples: 100 cost: 0 batch_status: 3
I1025 01:58:38.435241    58 ParameterClient2.cpp:228] request: trainer_id: 8 update_mode3 send_back_parameter: 0 send_back_parameter_type: 0 num_samples: 100 cost: 0 batch_status: 3
I1025 01:58:38.435241    57 ParameterClient2.cpp:228] request: trainer_id: 8 update_mode3 send_back_parameter: 1 send_back_parameter_type: 0 num_samples: 100 cost: 0 batch_status: 3
I1025 01:58:38.435247    58 ParameterClient2.cpp:228] request: trainer_id: 8 update_mode3 send_back_parameter: 0 send_back_parameter_type: 0 num_samples: 100 cost: 0 batch_status: 3
I1025 01:58:38.435252    57 ParameterClient2.cpp:228] request: trainer_id: 8 update_mode3 send_back_parameter: 1 send_back_parameter_type: 0 num_samples: 100 cost: 0 batch_status: 3
I1025 01:58:38.435256    58 ParameterClient2.cpp:228] request: trainer_id: 8 update_mode3 send_back_parameter: 0 send_back_parameter_type: 0 num_samples: 100 cost: 0 batch_status: 3
I1025 01:58:38.435319    75 ParameterClient2.cpp:280] prepareSendData sparse in thread, serverId: 4 tid 4 blockId 2
I1025 01:58:38.435331    75 ParameterClient2.cpp:280] prepareSendData sparse in thread, serverId: 4 tid 4 blockId 12
I1025 01:58:38.435336    75 ParameterClient2.cpp:280] prepareSendData sparse in thread, serverId: 4 tid 4 blockId 22
I1025 01:58:38.435340    75 ParameterClient2.cpp:280] prepareSendData sparse in thread, serverId: 4 tid 4 blockId 32
I1025 01:58:38.435344    75 ParameterClient2.cpp:280] prepareSendData sparse in thread, serverId: 4 tid 4 blockId 42
I1025 01:58:38.435348    75 ParameterClient2.cpp:280] prepareSendData sparse in thread, serverId: 4 tid 4 blockId 52
...
I1025 01:58:38.437079    74 ParameterClient2.cpp:166] sendParallel, tid: 3 numMyClients 1 numThreads 10
I1025 01:58:38.437093    74 ParameterClient2.cpp:174] #### before recv, i: 1
I1025 01:58:38.437126    77 ParameterClient2.cpp:166] sendParallel, tid: 6 numMyClients 1 numThreads 10
I1025 01:58:38.437077    75 ParameterClient2.cpp:166] sendParallel, tid: 4 numMyClients 1 numThreads 10
I1025 01:58:38.437081    76 ParameterClient2.cpp:166] sendParallel, tid: 5 numMyClients 1 numThreads 10
I1025 01:58:38.437134    77 ParameterClient2.cpp:174] #### before recv, i: 4
I1025 01:58:38.437167    72 ParameterClient2.cpp:166] sendParallel, tid: 1 numMyClients 1 numThreads 10
I1025 01:58:38.437170    78 ParameterClient2.cpp:166] sendParallel, tid: 7 numMyClients 1 numThreads 10
I1025 01:58:38.437081    80 ParameterClient2.cpp:166] sendParallel, tid: 9 numMyClients 1 numThreads 10
I1025 01:58:38.437150    75 ParameterClient2.cpp:174] #### before recv, i: 2
I1025 01:58:38.437180    80 ParameterClient2.cpp:174] #### before recv, i: 7
I1025 01:58:38.437172    72 ParameterClient2.cpp:174] #### before recv, i: 9
I1025 01:58:38.437209    71 ParameterClient2.cpp:166] sendParallel, tid: 0 numMyClients 1 numThreads 10
I1025 01:58:38.437213    71 ParameterClient2.cpp:174] #### before recv, i: 8
I1025 01:58:38.437163    76 ParameterClient2.cpp:174] #### before recv, i: 3
I1025 01:58:38.437249    73 ParameterClient2.cpp:166] sendParallel, tid: 2 numMyClients 1 numThreads 10
I1025 01:58:38.437255    73 ParameterClient2.cpp:174] #### before recv, i: 0
I1025 01:58:38.437178    78 ParameterClient2.cpp:174] #### before recv, i: 5
I1025 01:58:38.437134    79 ParameterClient2.cpp:166] sendParallel, tid: 8 numMyClients 1 numThreads 10
I1025 01:58:38.437306    79 ParameterClient2.cpp:174] #### before recv, i: 6
I1025 01:58:38.636719    87 ParameterClient2.cpp:166] sendParallel, tid: 6 numMyClients 1 numThreads 10
I1025 01:58:38.636740    87 ParameterClient2.cpp:174] #### before recv, i: 4
I1025 01:58:38.644503    89 ParameterClient2.cpp:166] sendParallel, tid: 8 numMyClients 1 numThreads 10
I1025 01:58:38.644520    89 ParameterClient2.cpp:174] #### before recv, i: 6
I1025 01:58:38.649602    90 ParameterClient2.cpp:166] sendParallel, tid: 9 numMyClients 1 numThreads 10
I1025 01:58:38.649615    90 ParameterClient2.cpp:174] #### before recv, i: 7
I1025 01:58:38.650900    85 ParameterClient2.cpp:166] sendParallel, tid: 4 numMyClients 1 numThreads 10
I1025 01:58:38.650910    85 ParameterClient2.cpp:174] #### before recv, i: 2
I1025 01:58:38.659765    83 ParameterClient2.cpp:166] sendParallel, tid: 2 numMyClients 1 numThreads 10
I1025 01:58:38.659776    83 ParameterClient2.cpp:174] #### before recv, i: 0
I1025 01:58:38.669888    88 ParameterClient2.cpp:166] sendParallel, tid: 7 numMyClients 1 numThreads 10
I1025 01:58:38.669898    88 ParameterClient2.cpp:174] #### before recv, i: 5
I1025 01:58:38.703678    82 ParameterClient2.cpp:166] sendParallel, tid: 1 numMyClients 1 numThreads 10
I1025 01:58:38.703691    82 ParameterClient2.cpp:174] #### before recv, i: 9
I1025 01:58:38.715457    84 ParameterClient2.cpp:166] sendParallel, tid: 3 numMyClients 1 numThreads 10
I1025 01:58:38.715476    84 ParameterClient2.cpp:174] #### before recv, i: 1
I1025 01:58:38.758709    81 ParameterClient2.cpp:166] sendParallel, tid: 0 numMyClients 1 numThreads 10
I1025 01:58:38.758720    81 ParameterClient2.cpp:174] #### before recv, i: 8
I1025 01:58:38.780829    86 ParameterClient2.cpp:166] sendParallel, tid: 5 numMyClients 1 numThreads 10
I1025 01:58:38.780840    86 ParameterClient2.cpp:174] #### before recv, i: 3

Some of the pservers fail at:

I1025 01:58:17.467772    82 ParameterServer2.cpp:564] pserver: getParameter
I1025 01:58:18.902704    83 LightNetwork.cpp:326] worker started, peer = 192.168.27.222
I1025 01:58:18.926435    84 LightNetwork.cpp:326] worker started, peer = 192.168.27.222
I1025 01:58:20.928249    84 ParameterServer2.cpp:564] pserver: getParameter
I1025 01:58:35.682245    85 LightNetwork.cpp:326] worker started, peer = 192.168.139.150
I1025 01:58:35.705991    86 LightNetwork.cpp:326] worker started, peer = 192.168.139.150
I1025 01:58:37.707690    86 ParameterServer2.cpp:564] pserver: getParameter
F1025 01:58:52.261445    48 SocketChannel.cpp:101] Check failed: len > 0  peer=192.168.24.151 curIov=22 iovCnt=89 iovs[curIov].base=0x7fa1b09d75ca iovs[curIov].iov_len=10870
*** Check failure stack trace: ***
    @           0xa5904d  google::LogMessage::Fail()
    @           0xa5b398  google::LogMessage::SendToLog()
    @           0xa58b5b  google::LogMessage::Flush()
    @           0xa5c26e  google::LogMessageFatal::~LogMessageFatal()
    @           0x884a04  paddle::SocketChannel::writev()
    @           0x885b98  paddle::SocketChannel::writeMessage()
    @           0x8794cc  _ZZZN6paddle11ProtoServer25registerServiceFunctionExINS_15SendDataRequestEEEvRKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEESt8functionIFvRKT_St10unique_ptrINS_9MsgReaderESt14default_deleteISG_EESB_IFvRKN6google8protobuf11MessageLiteERKSt6vectorI5iovecSaISQ_EEEEEEENKUlSJ_SB_IFvSU_EEE_clESJ_S10_ENKUlSO_SU_E_clESO_SU_
    @           0x86e4ba  paddle::ParameterServer2::sendParameter()
    @           0x876c5a  std::_Function_handler<>::_M_invoke()
    @           0x87a3de  _ZNSt17_Function_handlerIFvSt10unique_ptrIN6paddle9MsgReaderESt14default_deleteIS2_EESt8functionIFvRKSt6vectorI5iovecSaIS8_EEEEEZNS1_11ProtoServer25registerServiceFunctionExINS1_20SendParameterRequestEEEvRKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEES6_IFvRKT_S5_S6_IFvRKN6google8protobuf11MessageLiteESC_EEEEEUlS5_SE_E_E9_M_invokeERKSt9_Any_dataOS5_OSE_
    @           0x88648a  paddle::ProtoServer::handleRequest()
    @           0x88412f  paddle::SocketWorker::run()
    @     0x7fa235f1ac80  (unknown)
    @     0x7fa2363ef6ba  start_thread
    @     0x7fa2356803dd  clone
    @              (nil)  (unknown)
/usr/local/bin/paddle: line 96:    27 Aborted                 ${DEBUGGER} $PADDLE_BIN_PATH/paddle_pserver_main ${@:2}
typhoonzero (Contributor, Author) commented:

Update:

The above error may be due to a cluster network problem, or may simply be a bug. Training with 1 pserver and 1 trainer on the deep+wide model seems OK (I shrank the wide feature size to 100; the actual size may be around 1e11).

Issue #5077 should also be fixed.
