distributed training pserver is not tracking number of trainers #7465
Comments
Yes, in the latest implementation the pserver waits until it has received enough gradient variables, but it does not care which trainer those gradient variables come from.
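To make that behavior concrete, here is a minimal Python sketch of such a count-based barrier. This is illustrative only (the real logic lives in the pserver's C++ listen_and_serv code); the class and method names are assumptions, not Paddle's API.

```python
import threading

class CountingBarrier:
    """Releases the parameter update once `expected` gradient variables have
    arrived, regardless of which trainer sent them (a sketch, not Paddle code)."""

    def __init__(self, expected):
        self.expected = expected      # e.g. trainers * grad_vars_per_trainer
        self.received = 0
        self.generation = 0
        self.cond = threading.Condition()

    def on_gradient(self, trainer_id, grad):
        # trainer_id is accepted but never checked -- which is exactly the point
        # discussed here: duplicate gradients from a restarted trainer count
        # toward the same total as gradients from a different trainer.
        with self.cond:
            gen = self.generation
            self.received += 1
            if self.received >= self.expected:
                # "Batch complete": run the optimizer, broadcast parameters,
                # then open the barrier for the next step.
                self.received = 0
                self.generation += 1
                self.cond.notify_all()
            else:
                while gen == self.generation:
                    self.cond.wait()
```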
@Yancey1989 If there are two trainers with 5 gradient variables per trainer, shouldn't it get stuck rather than keep running in the following case: trainer 1 sends its gradients -> trainer 1 is killed -> trainer 1 is restarted -> trainer 1 sends its gradients again -> 10 gradients received, parameters updated -> trainer 1 receives the parameters -> trainer 1 sends its gradients -> only 5 gradients received, stuck (trainer 1 has not received new parameters, so it will not send gradients again).
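Spelling that timeline out as a toy count (assuming, as above, 2 trainers and 5 gradient variables each, so the pserver waits for 10 arrivals per step; this just simulates the counting and is not Paddle code):

```python
EXPECTED = 10  # 2 trainers x 5 gradient variables, fixed at transpile time

step1 = ["t1"] * 5 + ["t1_restarted"] * 5   # duplicates from the restarted trainer
step2 = ["t1"] * 5                          # next step: still only trainer 1 alive

def step_completes(events, expected=EXPECTED):
    return len(events) >= expected          # the pserver only counts arrivals

print("step 1 completes:", step_completes(step1))  # True  -> spurious parameter update
print("step 2 completes:", step_completes(step2))  # False -> pserver and trainer wait forever
```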
Indeed, the case above will cause the training to get stuck, and I think it's a problem that should be solved.
Since we have not implemented fault tolerance in Fluid yet, just don't restart a failed trainer for now. On Kubernetes the training runs as a job without restarts, so let it fail for now; we can revisit the fault-tolerance design afterwards.
I think this is fixed; sync SGD now seems to be working (e.g., if you close 1 of the 2 trainers, the training gets stuck as expected).
@helinwang great, let me close this one for now.
This issue is related to #7422.
The number of trainers is set to 2 when transpiling a program for distributed training. When the job is started with only one trainer, the training process gets stuck at the 1st pass, which is expected.
But when that 1st trainer is terminated and started again, the training keeps going instead of getting stuck at the 2nd pass.
It looks like the pserver program is not tracking the number of trainers.
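For reference, this is roughly how such a job is set up with the distribute transpiler. It is only a sketch: the endpoint value is made up, and the exact transpile() arguments are an assumption, since the signature changed across Fluid versions.

```python
import paddle.fluid as fluid

pserver_endpoints = "127.0.0.1:6170"   # hypothetical single pserver endpoint
trainers = 2                           # fixed at transpile time ...
trainer_id = 0                         # ... but only this one trainer is launched

t = fluid.DistributeTranspiler()
t.transpile(trainer_id=trainer_id,
            pservers=pserver_endpoints,
            trainers=trainers)

# The pserver program derives "how many gradient variables to wait for" from
# `trainers`, not from which trainer processes are actually alive.
pserver_prog = t.get_pserver_program(pserver_endpoints)
trainer_prog = t.get_trainer_program()
```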