Petals doesn't deal with server failure properly #587
Comments
@oldcpple Thank you! (and sorry for the delayed response) Your reasoning looks sound. If you still have bandwidth, would you kindly provide a minimal example where this assert fails? (so we can reproduce it for CI tests) That would ensure that, once fixed, this error does not resurface. If you don't, please reply as such, and I will continue trying. Also, if you prefer implementing the fix in your own PR, you are most welcome to do that. If that's inconvenient, we will add you as "co-authored by" once we are certain that this fix does not have negative side-effects.
Discussed this with @borzunov (core author): he also believes that you are correct and the problem was introduced during one of our refactors. He hopes to make a pass this weekend (please note that this is not a binding promise, we are unfortunately still overwhelmed with research routine).
@justheuristic Thanks for your reply! Since you mentioned that your tests ended with correct results, we guess it might be a model-specific problem (but we haven't verified this). Note that we met this problem with StableBeluga2, and we found another problem with this model: the batch size seems to be fixed. For example, we first sent a request with 2 prompts (batch size = 2), and inference proceeded fine. For the following requests, when we changed the batch size to 10 or anything other than 2, similar "Broken input cache" exceptions occurred. We tested a few times and confirmed that the batch size the model can process must match the batch size of the first inference request. However, we believe this is a problem specific to StableBeluga2, since everything was fine in our test on another model, bloom-7b.
For the record: the issue certainly exists and I can confirm that it breaks fault tolerance on my side. We have not yet fixed it, as the issue keeps bouncing between myself and @borzunov promising to "certainly fix this next weekend night". We hope to fix it and post an update, but everything takes unreasonably long at the moment. We are sorry.
Hi there, we'd like to report our findings from testing Petals' fault tolerance.
We note that the current implementation of the method step in the class _ServerInferenceSession from inference_session.py contains logic along the following lines:
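(A sketch reconstructed from the description below; the exact code in inference_session.py may differ slightly across Petals versions.)

```python
# Inside _ServerInferenceSession.step; `torch` is already imported by the module.
n_input_tokens = inputs.shape[1]
if self.history is None:
    self.history = inputs
elif self.history.shape[1] == self._position:
    # append only the newly arrived tokens to the running input cache
    self.history = torch.cat([self.history, inputs[:, -n_input_tokens:]], dim=1)
assert self.history.shape[1] == self._position + n_input_tokens, "Broken input cache"
```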
Here, the attributes self.history and self._position are initialized as None and 0, respectively, when an object of the class _ServerInferenceSession is created. The problem is that when a server fails, Petals replaces it with another server that serves the same blocks. However, the new server session is freshly initialized when it joins the inference session, so its position attribute is still 0.
In the method _update_sequence of the class InferenceSession, the new server session's history is assigned the history of the failed server session:

```python
updated_sessions[0].history = self._server_sessions[server_idx].history
```
And during decoding, n_input_tokens will always be 1. Thus, the assertion

```python
assert self.history.shape[1] == self._position + n_input_tokens
```

will almost always fail after a failover, throwing a "Broken input cache" exception: the inherited history already contains all previously processed tokens, while the replacement session's _position is still 0.
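To make the mismatch concrete, here is an illustration with assumed numbers (say the failed server had already processed 10 tokens before crashing):

```python
history_len = 10     # inherited from the failed session via _update_sequence
position = 0         # the replacement session was just created
n_input_tokens = 1   # one token per decoding step
# 10 != 0 + 1, so the check fails and "Broken input cache" is raised
assert history_len == position + n_input_tokens, "Broken input cache"
```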
One possible solution is as follows:
Delete the assert statement so that no exception is thrown during the recovery process. Then, change the last few lines of the method step along these lines:
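(A sketch of our change; self.recover is the new attribute described below, and the surrounding lines may differ from the upstream source.)

```python
# At the end of _ServerInferenceSession.step, after `outputs` is received:
if self.recover:
    # The replacement server re-ran the entire inherited history, so its
    # output covers all past tokens; keep only the current step's tokens.
    outputs[0] = outputs[0][:, -n_input_tokens:]
    self.recover = False
self._position += n_input_tokens
return outputs[0]
```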
Here, self.recover is a newly defined attribute of the class _ServerInferenceSession, indicating whether this server session is recovering from a failed one; it is initialized as False and set to True in the method _update_sequence. This change tackles the following problem: if we simply delete the assert statement, outputs[0] in the recovered session will be a tensor of shape [1, num_history_tokens, 8192] instead of the expected [1, 1, 8192].
After testing dozens of examples, we believe this change works properly when dealing with server failures: the final outputs in cases where some server fails are the same as those where no server fails.
Please check whether these problems exist on your side. Many thanks.