
set max_input_len, max_output_len=4096, but the actual input cannot reach this level #88

Closed
callmezhangchenchenokay opened this issue Oct 24, 2023 · 2 comments
Labels: bug (Something isn't working), triaged (Issue has been triaged by maintainers)

@callmezhangchenchenokay

Baichuan2-13B-Chat


python examples/baichuan/build.py --model_version v2_13b --max_input_len=4096 --max_output_len=4096 --model_dir ./models/Baichuan2-13B-Chat/ --dtype float16 --use_gemm_plugin float16 --use_gpt_attention_plugin float16 --use_weight_only --output_dir ./models/tmp/baichuan_v2_13b/trt_engines/fp16+flashattention+int8+4096/1-gpu/

The model conversion succeeded.


python examples/baichuan/run.py --model_version v2_13b --max_output_len=4096 --tokenizer_dir=./models/Baichuan2-13B-Chat/ --engine_dir=./models/tmp/baichuan_v2_13b/trt_engines/fp16+flashattention+int8+4096/1-gpu/

[screenshot of the runtime error]

I modified run.py as follows:

```python
input = """***"""  # *** is a long prompt; print('input_tokens:', len(input_tokens[0])) shows the error only appears once the input exceeds roughly 2000 tokens, even though max_output_len is 4096
parser.add_argument('--input_text', type=str, default=input)
```

@byshiue
Collaborator

byshiue commented Oct 24, 2023


That's because the default dynamic shared memory size is only 46 KB, which is not enough when the total length exceeds about 6k in a sampling kernel. You can try fixing this issue by adding

        if (smem_size >= 46 * 1024)
        {
            // Opt the kernel in to a dynamic shared memory size above the default limit
            cudaError_t res = cudaFuncSetAttribute(batchApplyRepetitionPenalty<T, RepetitionPenaltyType::Additive>,
                cudaFuncAttributeMaxDynamicSharedMemorySize, smem_size);
        }

before this function call: https://github.com/NVIDIA/TensorRT-LLM/blob/release/0.5.0/cpp/tensorrt_llm/kernels/samplingPenaltyKernels.cu#L323. Remember to reinstall TensorRT-LLM after making the change.
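For reference, here is a minimal standalone sketch (not TRT-LLM code; the kernel and sizes are illustrative) of the same technique: opting a kernel in to more than the default 48 KB of dynamic shared memory per block with cudaFuncSetAttribute. It assumes a GPU that supports the larger limit (Volta or newer).

```cpp
// Minimal sketch: raise a kernel's dynamic shared memory limit before launch.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void useSharedMemory(float* out, int n)
{
    extern __shared__ float buf[];  // dynamic shared memory, sized at launch
    int tid = threadIdx.x;
    if (tid < n) buf[tid] = static_cast<float>(tid);
    __syncthreads();
    if (tid < n) out[tid] = buf[tid] * 2.0f;
}

int main()
{
    const int n = 256;
    const int smem_size = 64 * 1024;  // 64 KB, above the 48 KB default

    // Without this call, launching with 64 KB of dynamic shared memory
    // typically fails with an invalid-value launch error.
    cudaError_t err = cudaFuncSetAttribute(
        useSharedMemory, cudaFuncAttributeMaxDynamicSharedMemorySize, smem_size);
    if (err != cudaSuccess)
    {
        printf("cudaFuncSetAttribute failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    float* out = nullptr;
    cudaMalloc(&out, n * sizeof(float));
    useSharedMemory<<<1, n, smem_size>>>(out, n);
    err = cudaDeviceSynchronize();
    printf("launch status: %s\n", cudaGetErrorString(err));
    cudaFree(out);
    return 0;
}
```

On GPUs that do not support the opt-in, the launch in this sketch fails regardless, which is why the TRT-LLM patch above only applies the attribute when smem_size actually exceeds the default.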

@byshiue byshiue self-assigned this Oct 24, 2023
@byshiue byshiue added the triaged Issue has been triaged by maintainers label Oct 24, 2023
@byshiue
Collaborator

byshiue commented Oct 27, 2023

This issue is fixed by MR #148; you can try the latest main branch. Closing this bug. Feel free to reopen if needed.

@byshiue byshiue closed this as completed Oct 27, 2023
@byshiue byshiue added the bug Something isn't working label Oct 27, 2023