
distributed training pserver is not tracking number of trainers #7465

Closed
putcn opened this issue Jan 11, 2018 · 6 comments

@putcn
Contributor

putcn commented Jan 11, 2018

This issue is related to #7422.

The number of trainers is set to 2 when transpiling a program for distributed training. When started with only one trainer, the training process gets stuck at the 1st pass, which is expected.
But when the 1st trainer is terminated and started again, the training keeps going without getting stuck at the 2nd pass.
It looks like the pserver program is not tracking the number of trainers.

@Yancey1989
Contributor

It looks like the pserver program is not tracking the number of trainers.

Yes, in the latest implementation the pserver waits until it has received enough gradient variables, but it does not track which trainer those gradient variables come from.
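
To make that concrete, here is a minimal Python sketch (hypothetical names, not the actual Fluid pserver code) of the counting scheme described above: the pserver blocks until it has received the expected total number of gradient variables, without recording which trainer sent them, so duplicates from a restarted trainer still count toward the total.

```python
import threading

# Hypothetical sketch of the pserver-side counting barrier described above;
# not the real Fluid implementation.
class GradCountBarrier:
    def __init__(self, num_trainers, grads_per_trainer):
        self.expected = num_trainers * grads_per_trainer
        self.received = []                 # (var_name, tensor) pairs
        self.cond = threading.Condition()

    def on_recv(self, var_name, tensor, trainer_id=None):
        # trainer_id is ignored: gradients are only counted, so a restarted
        # trainer that re-sends its gradients still pushes the count forward.
        with self.cond:
            self.received.append((var_name, tensor))
            if len(self.received) >= self.expected:
                self.cond.notify_all()

    def wait_for_batch(self):
        # The optimizer step runs as soon as `expected` gradients have
        # arrived, regardless of how many distinct trainers they came from.
        with self.cond:
            self.cond.wait_for(lambda: len(self.received) >= self.expected)
            batch, self.received = self.received, []
            return batch
```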

@helinwang
Contributor

helinwang commented Jan 15, 2018

@Yancey1989 If there are two trainers and 5 gradient variables per trainer, shouldn't it get stuck rather than keep running in the following case?

trainer 1 sends grads -> trainer 1 is killed -> trainer 1 is restarted -> trainer 1 sends grads again -> 10 grads received, parameters updated -> trainer 1 receives parameters -> trainer 1 sends grads -> only 5 grads received, stuck (trainer 1 has not received updated parameters, so it will not send grads again).

@Yancey1989
Contributor

Indeed, the case above will cause the training to get stuck, and I think it's a problem that should be solved.
In my opinion, the PServer should check all received variables, and maybe we could refine the Send/Recv Op as follows (see the sketch after the list):

  • Add a trainer ID for each trainer instance.
  • Include the trainer ID in the message proto between the Send/Recv Op.
  • Remove duplicate variables based on the trainer ID when the Recv Op receives them.
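
A hedged Python sketch of this proposal (hypothetical class and method names, not the actual Send/Recv Op code): each gradient message carries a trainer ID, and the receiving side keeps at most one entry per (trainer_id, var_name) per mini-batch, so a restarted trainer's re-sent gradients overwrite its earlier ones instead of counting twice.

```python
import threading

# Hypothetical sketch of the proposed refinement; not the actual Recv Op.
class DedupGradBarrier:
    def __init__(self, num_trainers, grads_per_trainer):
        self.expected = num_trainers * grads_per_trainer
        self.received = {}                 # (trainer_id, var_name) -> tensor
        self.cond = threading.Condition()

    def on_recv(self, trainer_id, var_name, tensor):
        with self.cond:
            # A duplicate send from a restarted trainer replaces its previous
            # entry instead of adding a new one, so the count only reaches
            # `expected` once every distinct trainer has reported in.
            self.received[(trainer_id, var_name)] = tensor
            if len(self.received) >= self.expected:
                self.cond.notify_all()

    def wait_for_batch(self):
        with self.cond:
            self.cond.wait_for(lambda: len(self.received) >= self.expected)
            batch, self.received = dict(self.received), {}
            return batch
```

With trainers=2 and 5 gradient variables per trainer, the restarted-trainer scenario above no longer unblocks the update: trainer 1's second send produces the same 5 keys, so the pserver keeps waiting for trainer 2.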

@typhoonzero
Contributor

Since we have not implemented fault tolerance in Fluid yet, just don't restart the failed trainer for now. On k8s it runs as a job without restarts, so just let it fail for now. We can reconsider this as part of the fault-tolerance design.

@helinwang
Contributor

I think this is fixed; sync SGD now seems to be working (e.g., if you close 1 of the 2 trainers, the training gets stuck).

@putcn
Contributor Author

putcn commented Jan 19, 2018

@helinwang great, let me close this one for now.

@putcn putcn closed this as completed Jan 19, 2018