
A GPT model based on the triton-with-ft always generates a sequence with a length of request_max_output_len instead of ending generation with the eos_id. #577

Closed
songkq opened this issue Apr 24, 2023 · 2 comments

Comments


songkq commented Apr 24, 2023

@byshiue Hi, could you please give some advice on this issue?
A GPT model served via triton-with-ft always generates a sequence of length request_max_output_len. Generation does not stop even when the eos_id token is produced. Once request_max_output_len is set, the elapsed_time stays the same regardless of the length of the input query and of output_sequence_length.

model: nemo-megatron-gpt-5B

import time

import numpy as np


def build_request_data(query, request_max_output_len, eos_id):

    request_data = []
    request = np.array([query]).astype(np.uint32)
    request_len = np.array([[len(query)]]).astype(np.uint32)
    request_output_len = np.array([[request_max_output_len]]).astype(np.uint32)
    top_k = np.array([[4]]).astype(np.uint32)
    top_p = np.array([[0.9]]).astype(np.float32)
    temperature = np.array([[0.9]]).astype(np.float32)
    end_ids = eos_id * np.ones([request.shape[0], 1]).astype(np.uint32)

    request_data.append(fill_input('input_ids', request))
    request_data.append(fill_input('input_lengths', request_len))
    request_data.append(fill_input('request_output_len', request_output_len))
    request_data.append(fill_input('runtime_top_k', top_k))
    request_data.append(fill_input('runtime_top_p', top_p))
    request_data.append(fill_input('temperature', temperature))
    request_data.append(fill_input('end_id', end_ids))
    
    return request_data


inputs = build_request_data(self.query, self.output_seq_len, self.eos_id)  # note: the eos_id argument is required; self.eos_id is assumed here
print("set request")
start_time = time.time()
results = self.client_.infer(model_name=self.model_name_, inputs=inputs, compression_algorithm='gzip')
elapsed_time = time.time() - start_time
print("get request")
print(f"[debug] elapsed_time = {elapsed_time:.2f} s")
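Since generation runs to request_max_output_len regardless of eos_id, one client-side workaround is to truncate the returned token sequence at the first end token before detokenizing. A minimal numpy sketch (the trim_at_eos helper and output_ids name are illustrative, not part of the FasterTransformer API):

```python
import numpy as np

def trim_at_eos(output_ids, eos_id):
    """Truncate a generated token sequence at the first occurrence of eos_id."""
    ids = np.asarray(output_ids)
    eos_positions = np.where(ids == eos_id)[0]
    if eos_positions.size == 0:
        return ids  # no end token produced; keep the full sequence
    return ids[: eos_positions[0]]  # drop eos_id and everything after it
```

This does not reduce the server-side latency (the model still decodes request_max_output_len steps), but it prevents padding tokens after eos_id from leaking into the decoded text.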
Collaborator

byshiue commented Apr 24, 2023

This is a known issue; please refer to issue #487.

@songkq songkq changed the title A GPT model based on the triton-with-ft always generate a sequence with a length of request_max_output_len instead of ending generation with the eos_id. A GPT model based on the triton-with-ft always generates a sequence with a length of request_max_output_len instead of ending generation with the eos_id. Apr 24, 2023
Author

songkq commented Apr 24, 2023

@byshiue Thanks.

@songkq songkq closed this as completed Apr 24, 2023