
distributed training pserver is not tracking number of trainers #7465

Closed
putcn opened this issue Jan 11, 2018 · 6 comments

@putcn
Contributor

putcn commented Jan 11, 2018

This issue is related to #7422.

The number of trainers is set to 2 when transpiling a program for distributed training. When started with only one trainer, the training process gets stuck at the 1st pass, which is expected.
But when the 1st trainer is terminated and started again, the training keeps going without getting stuck at the 2nd pass.
It looks like the pserver program is not tracking the number of trainers.

@Yancey1989
Contributor

It looks like the pserver program is not tracking the number of trainers.

Yes, in the latest implementation the pserver waits until it has received enough gradient variables, but it does not track which trainer those gradient variables come from.
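
To make that concrete, here is a minimal Python sketch (hypothetical names, not the actual Fluid pserver code) of the counting scheme described above: the pserver blocks until it has received the expected total number of gradient variables, without recording which trainer sent them, so duplicates from a restarted trainer still count toward the total.

```python
import threading

# Hypothetical sketch of the pserver-side counting barrier described above;
# not the real Fluid implementation.
class GradCountBarrier:
    def __init__(self, num_trainers, grads_per_trainer):
        self.expected = num_trainers * grads_per_trainer
        self.received = []                 # (var_name, tensor) pairs
        self.cond = threading.Condition()

    def on_recv(self, var_name, tensor, trainer_id=None):
        # trainer_id is ignored: gradients are only counted, so a restarted
        # trainer that re-sends its gradients still pushes the count forward.
        with self.cond:
            self.received.append((var_name, tensor))
            if len(self.received) >= self.expected:
                self.cond.notify_all()

    def wait_for_batch(self):
        # The optimizer step runs as soon as `expected` gradients have
        # arrived, regardless of how many distinct trainers they came from.
        with self.cond:
            self.cond.wait_for(lambda: len(self.received) >= self.expected)
            batch, self.received = self.received, []
            return batch
```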

@helinwang
Contributor

helinwang commented Jan 15, 2018

@Yancey1989 If there are two trainers and 5 gradient variables per trainer, shouldn't it get stuck rather than keep running in the following case?

trainer 1 sends grads -> trainer 1 is killed -> trainer 1 is restarted -> trainer 1 sends grads again -> 10 grads received, parameters updated -> trainer 1 receives parameters -> trainer 1 sends grads -> only 5 grads received, stuck (trainer 1 has not received updated parameters, so it will not send grads again).

@Yancey1989
Contributor

Indeed, the case above will cause the training to get stuck, and I think it's a problem that should be solved.
In my opinion, the PServer should check all received variables, and maybe we could refine the Send/Recv Op as follows (see the sketch after the list):

  • Add a trainer ID for each trainer instance.
  • Include the trainer ID in the message proto between the Send/Recv Op.
  • Remove duplicate variables based on the trainer ID when the Recv Op receives them.
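
A hedged Python sketch of this proposal (hypothetical class and method names, not the actual Send/Recv Op code): each gradient message carries a trainer ID, and the receiving side keeps at most one entry per (trainer_id, var_name) per mini-batch, so a restarted trainer's re-sent gradients overwrite its earlier ones instead of counting twice.

```python
import threading

# Hypothetical sketch of the proposed refinement; not the actual Recv Op.
class DedupGradBarrier:
    def __init__(self, num_trainers, grads_per_trainer):
        self.expected = num_trainers * grads_per_trainer
        self.received = {}                 # (trainer_id, var_name) -> tensor
        self.cond = threading.Condition()

    def on_recv(self, trainer_id, var_name, tensor):
        with self.cond:
            # A duplicate send from a restarted trainer replaces its previous
            # entry instead of adding a new one, so the count only reaches
            # `expected` once every distinct trainer has reported in.
            self.received[(trainer_id, var_name)] = tensor
            if len(self.received) >= self.expected:
                self.cond.notify_all()

    def wait_for_batch(self):
        with self.cond:
            self.cond.wait_for(lambda: len(self.received) >= self.expected)
            batch, self.received = dict(self.received), {}
            return batch
```

With trainers=2 and 5 gradient variables per trainer, the restarted-trainer scenario above no longer unblocks the update: trainer 1's second send produces the same 5 keys, so the pserver keeps waiting for trainer 2.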

@typhoonzero
Contributor

Since we have not implemented fault tolerance in Fluid yet, just don't restart the failed trainer for now. On k8s it runs as a job without restarts, so just let it fail for now. We can reconsider this as part of the fault-tolerance design.

@helinwang
Contributor

I think this is fixed; sync SGD now seems to be working (e.g., if you close 1 of the 2 trainers, the training gets stuck).

@putcn
Contributor Author

putcn commented Jan 19, 2018

@helinwang great, let me close this one for now.

@putcn putcn closed this as completed Jan 19, 2018