
set max_input_len, max_output_len=4096, but the actual input cannot reach this level #88

Closed
callmezhangchenchenokay opened this issue Oct 24, 2023 · 2 comments
Labels: bug (Something isn't working), triaged (Issue has been triaged by maintainers)

@callmezhangchenchenokay

Baichuan2-13B-Chat


python examples/baichuan/build.py --model_version v2_13b --max_input_len=4096 --max_output_len=4096 --model_dir ./models/Baichuan2-13B-Chat/ --dtype float16 --use_gemm_plugin float16 --use_gpt_attention_plugin float16 --use_weight_only --output_dir ./models/tmp/baichuan_v2_13b/trt_engines/fp16+flashattention+int8+4096/1-gpu/

The model conversion succeeded.


python examples/baichuan/run.py --model_version v2_13b --max_output_len=4096 --tokenizer_dir=./models/Baichuan2-13B-Chat/ --engine_dir=./models/tmp/baichuan_v2_13b/trt_engines/fp16+flashattention+int8+4096/1-gpu/

[screenshot of the runtime error]

I modified run.py as follows:

```python
input = """***"""  # *** is a long prompt; print('input_tokens:', len(input_tokens[0])) shows the error only appears once the input exceeds roughly 2000 tokens, even though max_output_len is 4096
parser.add_argument('--input_text', type=str, default=input)
```

@byshiue
Collaborator

byshiue commented Oct 24, 2023


That's because the default dynamic shared memory size is only 46 KB, which is not enough when the total length exceeds about 6k in a sampling kernel. You can try fixing this issue by adding

        if (smem_size >= 46 * 1024)
        {
            // Opt the kernel in to a dynamic shared memory size above the default limit
            cudaError_t res = cudaFuncSetAttribute(batchApplyRepetitionPenalty<T, RepetitionPenaltyType::Additive>,
                cudaFuncAttributeMaxDynamicSharedMemorySize, smem_size);
        }

before this function call: https://github.com/NVIDIA/TensorRT-LLM/blob/release/0.5.0/cpp/tensorrt_llm/kernels/samplingPenaltyKernels.cu#L323. Remember to reinstall TensorRT-LLM after making the change.
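For reference, here is a minimal standalone sketch (not TRT-LLM code; the kernel and sizes are illustrative) of the same technique: opting a kernel in to more than the default 48 KB of dynamic shared memory per block with cudaFuncSetAttribute. It assumes a GPU that supports the larger limit (Volta or newer).

```cpp
// Minimal sketch: raise a kernel's dynamic shared memory limit before launch.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void useSharedMemory(float* out, int n)
{
    extern __shared__ float buf[];  // dynamic shared memory, sized at launch
    int tid = threadIdx.x;
    if (tid < n) buf[tid] = static_cast<float>(tid);
    __syncthreads();
    if (tid < n) out[tid] = buf[tid] * 2.0f;
}

int main()
{
    const int n = 256;
    const int smem_size = 64 * 1024;  // 64 KB, above the 48 KB default

    // Without this call, launching with 64 KB of dynamic shared memory
    // typically fails with an invalid-value launch error.
    cudaError_t err = cudaFuncSetAttribute(
        useSharedMemory, cudaFuncAttributeMaxDynamicSharedMemorySize, smem_size);
    if (err != cudaSuccess)
    {
        printf("cudaFuncSetAttribute failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    float* out = nullptr;
    cudaMalloc(&out, n * sizeof(float));
    useSharedMemory<<<1, n, smem_size>>>(out, n);
    err = cudaDeviceSynchronize();
    printf("launch status: %s\n", cudaGetErrorString(err));
    cudaFree(out);
    return 0;
}
```

On GPUs that do not support the opt-in, the launch in this sketch fails regardless, which is why the TRT-LLM patch above only applies the attribute when smem_size actually exceeds the default.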

@byshiue byshiue self-assigned this Oct 24, 2023
@byshiue byshiue added the triaged Issue has been triaged by maintainers label Oct 24, 2023
@byshiue
Collaborator

byshiue commented Oct 27, 2023

This issue is fixed by MR #148; you can try the latest main branch. Closing this bug. Feel free to reopen if needed.

@byshiue byshiue closed this as completed Oct 27, 2023
@byshiue byshiue added the bug Something isn't working label Oct 27, 2023