
A GPT model based on the triton-with-ft always generates a sequence with a length of request_max_output_len instead of ending generation with the eos_id. #577

Closed
songkq opened this issue Apr 24, 2023 · 2 comments

Comments


songkq commented Apr 24, 2023

@byshiue Hi, could you please give some advice on this issue?
A GPT model served via triton-with-ft always generates a sequence of length request_max_output_len. Generation does not stop even when the eos_id token is produced. Once request_max_output_len is set, the elapsed_time stays the same regardless of the length of the input query and of output_sequence_length.

model: nemo-megatron-gpt-5B

import time

import numpy as np


def build_request_data(query, request_max_output_len, eos_id):

    request_data = []
    request = np.array([query]).astype(np.uint32)
    request_len = np.array([[len(query)]]).astype(np.uint32)
    request_output_len = np.array([[request_max_output_len]]).astype(np.uint32)
    top_k = np.array([[4]]).astype(np.uint32)
    top_p = np.array([[0.9]]).astype(np.float32)
    temperature = np.array([[0.9]]).astype(np.float32)
    end_ids = eos_id * np.ones([request.shape[0], 1]).astype(np.uint32)

    request_data.append(fill_input('input_ids', request))
    request_data.append(fill_input('input_lengths', request_len))
    request_data.append(fill_input('request_output_len', request_output_len))
    request_data.append(fill_input('runtime_top_k', top_k))
    request_data.append(fill_input('runtime_top_p', top_p))
    request_data.append(fill_input('temperature', temperature))
    request_data.append(fill_input('end_id', end_ids))
    
    return request_data


inputs = build_request_data(self.query, self.output_seq_len, self.eos_id)  # note: the eos_id argument is required; self.eos_id is assumed here
print("set request")
start_time = time.time()
results = self.client_.infer(model_name=self.model_name_, inputs=inputs, compression_algorithm='gzip')
elapsed_time = time.time() - start_time
print("get request")
print(f"[debug] elapsed_time = {elapsed_time:.2f} s")
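Since generation runs to request_max_output_len regardless of eos_id, one client-side workaround is to truncate the returned token sequence at the first end token before detokenizing. A minimal numpy sketch (the trim_at_eos helper and output_ids name are illustrative, not part of the FasterTransformer API):

```python
import numpy as np

def trim_at_eos(output_ids, eos_id):
    """Truncate a generated token sequence at the first occurrence of eos_id."""
    ids = np.asarray(output_ids)
    eos_positions = np.where(ids == eos_id)[0]
    if eos_positions.size == 0:
        return ids  # no end token produced; keep the full sequence
    return ids[: eos_positions[0]]  # drop eos_id and everything after it
```

This does not reduce the server-side latency (the model still decodes request_max_output_len steps), but it prevents padding tokens after eos_id from leaking into the decoded text.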
Collaborator

byshiue commented Apr 24, 2023

This is a known issue; please refer to issue #487.

@songkq songkq changed the title A GPT model based on the triton-with-ft always generate a sequence with a length of request_max_output_len instead of ending generation with the eos_id. A GPT model based on the triton-with-ft always generates a sequence with a length of request_max_output_len instead of ending generation with the eos_id. Apr 24, 2023
Author

songkq commented Apr 24, 2023

@byshiue Thanks.

@songkq songkq closed this as completed Apr 24, 2023