This issue records my work on debugging the training of a large CTR model with distributed sparse remote parameter updating.
Background
In CTR model training, the LR part of the model can use a very large feature, so the model is too big to store on a single trainer even in the "sparse row format". We therefore need to store this part of the model evenly across the pservers, so that each trainer only fetches the rows it needs during prefetch.
See here for some details. This feature should be re-written in the refactored code.
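For intuition, here is a minimal sketch of the intended behavior. This is not Paddle code; the sharding rule and all names (`pserver_for_row`, `get_rows`, etc.) are made up for illustration:

```python
NUM_PSERVERS = 10

def pserver_for_row(row_id):
    # Simple mod sharding: row i lives on pserver i % NUM_PSERVERS,
    # so the huge table is spread evenly and no single machine holds it all.
    return row_id % NUM_PSERVERS

def prefetch(batch_row_ids, rpc_clients):
    """Fetch only the rows this batch touches, grouped per pserver."""
    rows_by_server = {}
    for row_id in set(batch_row_ids):
        rows_by_server.setdefault(pserver_for_row(row_id), []).append(row_id)
    local_cache = {}
    for server_id, row_ids in rows_by_server.items():
        # One sparse "get rows" RPC per pserver that owns part of the batch.
        for row_id, value in rpc_clients[server_id].get_rows(row_ids):
            local_cache[row_id] = value
    return local_cache
```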
Records
Using the V1 CTR model config (wide part)
Start 10 pservers and 20 trainers. Trainer command args:
/usr/local/bin/paddle_trainer --port=7164 --nics=eth0 --ports_num=1 --ports_num_for_sparse=1 --num_passes=1 --trainer_count=1 --saving_period=1 --log_period=20 --local=0 --rdma_tcp=tcp --config=train.py --use_gpu=0 --trainer_id=8 --save_dir= --pservers=...... --num_gradient_servers=20 --loadsave_parameters_in_pserver=1 --use_old_updater=1 -v 100
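For context, the wide part of such a V1 config looks roughly like the sketch below. The exact helper usage is from my memory of the `paddle.trainer_config_helpers` API, so treat it as an approximation; the key point is `sparse_update=True` on the wide parameter, which enables the sparse remote updating path together with `--loadsave_parameters_in_pserver=1`:

```python
from paddle.trainer_config_helpers import *

# Shrunk from the real feature space (~1e11) so it also runs locally.
wide_dim = 100

wide_input = data_layer(name='wide_input', size=wide_dim)
label = data_layer(name='label', size=2)

# sparse_update=True marks this parameter for sparse remote updating:
# the trainer only holds/updates the rows it touched in a batch, while
# the full table stays on the pservers.
prediction = fc_layer(
    input=wide_input,
    size=2,
    act=SoftmaxActivation(),
    param_attr=ParamAttr(sparse_update=True))

outputs(classification_cost(input=prediction, label=label))
```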
The trainers then get stuck calling "add gradient" (the prefetch itself is OK), and eventually fail with "timeout". Some logs:
Tips: updatemode 3 is PSERVER_UPDATE_MODE_ADD_GRADIENT, and updatemode 6 is PSERVER_UPDATE_MODE_GET_PARAM_SPARSE.
Some of the pservers fail at:

The above errors may be due to a cluster network problem or just a bug. Training with 1 pserver and 1 trainer on the deep+wide model seems OK (I shrank the wide feature size to 100, though the actual size could be 1e11).
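To make the failing step concrete, here is a sketch of the per-batch RPC flow implied by the two update modes above. The mode values come from the tip; the client object and method names are hypothetical:

```python
PSERVER_UPDATE_MODE_ADD_GRADIENT = 3
PSERVER_UPDATE_MODE_GET_PARAM_SPARSE = 6

def train_one_batch(batch, sparse_client, model):
    row_ids = batch.touched_row_ids()

    # 1) Prefetch (mode 6): pull just the parameter rows this batch reads.
    #    This step succeeds in the runs above.
    rows = sparse_client.call(PSERVER_UPDATE_MODE_GET_PARAM_SPARSE, row_ids)
    model.load_sparse_rows(rows)

    sparse_grads = model.forward_backward(batch)

    # 2) Push (mode 3): add the sparse gradients back on the pservers.
    #    This is the call the trainers get stuck in before the timeout.
    sparse_client.call(PSERVER_UPDATE_MODE_ADD_GRADIENT, sparse_grads)
```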