Stop the generation if the eod is reached #526
Comments
This is a known issue, please refer to issue #487.
I used the tmp/fix_gpt_earlystop branch, but I still have the same problem.
Please provide your reproduction steps.
The issue is fixed in MR #584 and merged into the main branch. Sorry for the late fix.
Hello, I wonder if #584 also applies to GPT-J? I am testing inference with tritonserver's fastertransformer backend and a converted GPT-J model, and the time it takes is proportional to request_output_len. If I request 30 tokens, it takes about 150 ms, and the last 5 tokens are EOS tokens.
@byshiue thanks for fixing this in parallel GPT 😌 Do gptneox and gptj have the same bug? When I tested gptneox I hit the same bug... Do you have plans to fix them?
@byshiue Thanks for fixing the problem, but I found that when running Bloom on a single GPU it still does not stop after generating the EOS token; it keeps generating until the maximum length. I added a check in ParallelGpt.cc:

 PUSH_RANGE("result sampling and stop check");
 dynamic_decode_layer_->forward(&dynamic_decode_output_tensors, &dynamic_decode_input_tensors);
+// check whether every sequence in the sub-batch has finished
+cudaD2Hcpy(h_finished_buf_, finished_buf_, batch_size * beam_width);
+uint sum = 0;
+for (uint i = 0; i < batch_size * beam_width; i++) {
+    sum += (int)h_finished_buf_[i];
+}
+if (sum == batch_size * beam_width) {
+    subbatch_should_stop = true;
+}
 *generation_should_stop_ &= subbatch_should_stop;
 POP_RANGE;
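For reference, below is a minimal standalone sketch of the same host-side early-stop idea outside FasterTransformer. It is only an illustration: the decoding step is mocked with a cudaMemset, and the names (d_finished, h_finished, batch_size, beam_width) simply mirror the snippet above.

// Sketch: stop the decoding loop once every sequence in the batch has finished.
#include <cuda_runtime.h>
#include <cstdio>
#include <vector>

int main() {
    const int batch_size = 4;
    const int beam_width = 1;
    const int n = batch_size * beam_width;

    // Per-sequence "finished" flags living on the device, all false initially.
    bool* d_finished = nullptr;
    cudaMalloc(&d_finished, n * sizeof(bool));
    cudaMemset(d_finished, 0, n * sizeof(bool));

    std::vector<char> h_finished(n, 0);  // host copy (char so it can be memcpy'd)

    for (int step = 0; step < 1000; ++step) {
        // ... one decoding step would update d_finished here; we simulate all
        // sequences emitting EOD at step 3 ...
        if (step == 3) {
            cudaMemset(d_finished, 1, n * sizeof(bool));
        }

        // Device-to-host copy of the flags (the role cudaD2Hcpy plays above).
        cudaMemcpy(h_finished.data(), d_finished, n * sizeof(bool),
                   cudaMemcpyDeviceToHost);

        int num_finished = 0;
        for (int i = 0; i < n; ++i) {
            num_finished += h_finished[i] ? 1 : 0;
        }
        if (num_finished == n) {
            printf("all sequences finished at step %d, stopping early\n", step);
            break;  // the equivalent of setting subbatch_should_stop = true
        }
    }

    cudaFree(d_finished);
    return 0;
}

The trade-off is one small device-to-host copy and synchronization per step, which is usually negligible next to a full decoder forward pass.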
I played with examples/cpp/gpt/gpt_example.cc and found that the generation of tokens doesn't finish when the first EOD is reached. This is my gpt_config.ini.
I use a GPT-2 model that was converted into the FT format.
For example, if I set
and feed tokens [4919, 389] as input,
I receive this log output
and this result (103 sensible tokens at the beginning and 27 EOD tokens in the tail).
The whole computation took 149.00 ms.
BUT
when I set
I receive this log output
and this result (103 sensible tokens at the beginning, the same as in the previous run, which is expected because the seed is fixed, and 899 EOD tokens in the tail).
The whole computation took 810.24 ms.
The question is: why does the generation of EOD tokens consume time?
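A rough back-of-the-envelope check, assuming each decoding step costs about the same whether it produces a sensible token or an EOD, suggests the total time tracks the number of requested steps rather than the number of sensible tokens:

// Illustrative only: per-step cost estimated from the two runs reported above.
#include <cstdio>

int main() {
    // run 1: 103 sensible tokens + 27 EOD = 130 generated tokens in 149.00 ms
    // run 2: 103 sensible tokens + 899 EOD = 1002 generated tokens in 810.24 ms
    const double ms_per_step_run1 = 149.00 / (103 + 27);   // ~1.15 ms per step
    const double ms_per_step_run2 = 810.24 / (103 + 899);  // ~0.81 ms per step
    printf("run 1: %.2f ms per step\n", ms_per_step_run1);
    printf("run 2: %.2f ms per step\n", ms_per_step_run2);
    // Both runs spend time of the same order of magnitude per decoding step,
    // i.e. the loop keeps running a full forward pass for every requested
    // position even after EOD, which is exactly what the early-stop check in
    // MR #584 is meant to avoid.
    return 0;
}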