I launched a TGI server on an A100 GPU machine and served the Mistral-Nemo-Instruct-2407-GPTQ model.
As shown in the config, I set max_input_tokens to 8192 and max_total_tokens to 10240. But when I sent a message containing more than 8192 tokens, it was not truncated. The error log is shown below:
2024-10-11T11:27:58.527278Z ERROR chat_completions:async_stream:generate_stream: text_generation_router::infer: router/src/infer/mod.rs:105: `inputs` tokens + `max_new_tokens` must be <= 10240. Given: 9266 `inputs` tokens and 1000 `max_new_tokens`
My questions:
Will TGI automatically truncate the user input according to max_input_tokens?
Is there a parameter I can use to truncate the input to at most max_input_tokens tokens? (A sketch of what I have in mind follows below.)
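For context, here is a minimal sketch of the two workarounds I am considering. The host/port, model path and the 8192/1000 values come from my setup above; the truncate field on the /generate endpoint is my reading of the TGI docs (as far as I can tell it is not exposed on /v1/chat/completions), so for the chat endpoint I would have to truncate client-side:

# Sketch only: server-side truncation via /generate, and client-side truncation
# with the model tokenizer. Untested against this exact deployment.
import requests
from transformers import AutoTokenizer

TGI_URL = "http://localhost:80"  # placeholder for my container
MODEL_PATH = "/share/base_model/Mistral-Nemo-Instruct-2407-GPTQ"
MAX_INPUT_TOKENS = 8192
MAX_NEW_TOKENS = 1000

def generate_with_truncate(prompt: str) -> str:
    """Ask TGI to truncate the prompt via the /generate `truncate` parameter."""
    resp = requests.post(
        f"{TGI_URL}/generate",
        json={
            "inputs": prompt,
            "parameters": {
                "truncate": MAX_INPUT_TOKENS,   # my understanding: keeps only the last 8192 tokens
                "max_new_tokens": MAX_NEW_TOKENS,
            },
        },
        timeout=300,
    )
    resp.raise_for_status()
    return resp.json()["generated_text"]

def truncate_client_side(prompt: str, tokenizer=None) -> str:
    """Truncate the prompt locally with the model tokenizer before sending."""
    tokenizer = tokenizer or AutoTokenizer.from_pretrained(MODEL_PATH)
    ids = tokenizer(prompt, add_special_tokens=False)["input_ids"]
    if len(ids) > MAX_INPUT_TOKENS:
        ids = ids[-MAX_INPUT_TOKENS:]           # keep the most recent tokens
    return tokenizer.decode(ids)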
Thanks a lot for the help.
System Info
Docker
Runtime environment:
Target: x86_64-unknown-linux-gnu
Cargo version: 1.80.0
Commit sha: 169178b
Docker label: sha-169178b
nvidia-smi
Args {
    model_id: "/share/base_model/Mistral-Nemo-Instruct-2407-GPTQ",
    revision: None,
    validation_workers: 2,
    sharded: None,
    num_shard: None,
    quantize: Some(
        Gptq,
    ),
    speculate: None,
    dtype: None,
    trust_remote_code: false,
    max_concurrent_requests: 128,
    max_best_of: 2,
    max_stop_sequences: 4,
    max_top_n_tokens: 5,
    max_input_tokens: Some(
        8192,
    ),
    max_input_length: None,
    max_total_tokens: Some(
        10240,
    ),
    waiting_served_ratio: 0.3,
    max_batch_prefill_tokens: None,
    max_batch_total_tokens: None,
    max_waiting_tokens: 20,
    max_batch_size: None,
    cuda_graphs: None,
    hostname: "545eaf4c39af",
    port: 80,
    shard_uds_path: "/tmp/text-generation-server",
    master_addr: "localhost",
    master_port: 29500,
    huggingface_hub_cache: None,
    weights_cache_override: None,
    disable_custom_kernels: false,
    cuda_memory_fraction: 1.0,
    rope_scaling: None,
    rope_factor: None,
    json_output: false,
    otlp_endpoint: None,
    otlp_service_name: "text-generation-inference.router",
    cors_allow_origin: [],
    api_key: None,
    watermark_gamma: None,
    watermark_delta: None,
    ngrok: false,
    ngrok_authtoken: None,
    ngrok_edge: None,
    tokenizer_config_path: None,
    disable_grammar_support: false,
    env: false,
    max_client_batch_size: 4,
    lora_adapters: None,
    usage_stats: On,
}
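For reference, this is how I read the relationship between max_input_tokens, max_total_tokens and a request's max_new_tokens, based purely on the error message above (the exact check inside the router is my assumption):

# Illustration of the limit check implied by the error message; not TGI's actual code.
MAX_INPUT_TOKENS = 8192   # max_input_tokens in the Args above
MAX_TOTAL_TOKENS = 10240  # max_total_tokens in the Args above

def request_is_accepted(input_tokens: int, max_new_tokens: int) -> bool:
    return (
        input_tokens <= MAX_INPUT_TOKENS
        and input_tokens + max_new_tokens <= MAX_TOTAL_TOKENS
    )

print(request_is_accepted(9266, 1000))  # False: 9266 > 8192, and 9266 + 1000 = 10266 > 10240
print(request_is_accepted(8192, 1000))  # True: 8192 + 1000 = 9192 <= 10240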
Information
Tasks
Reproduction
Launch TGI on an A100 GPU with the arguments shown in System Info, then send a chat completion request whose prompt exceeds 8192 tokens with max_new_tokens set to 1000. Instead of the input being truncated, the request is rejected with the error shown above.
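A minimal sketch of the kind of request that triggers it (host/port and message content are placeholders; only the token counts matter):

# Hypothetical reproduction: a chat completion whose prompt tokenizes to more than 8192 tokens.
import requests

long_text = "word " * 12000  # long enough to exceed 8192 tokens after chat templating

resp = requests.post(
    "http://localhost:80/v1/chat/completions",
    json={
        "model": "tgi",
        "messages": [{"role": "user", "content": long_text}],
        "max_tokens": 1000,
        "stream": False,
    },
    timeout=300,
)
print(resp.status_code)  # not 200; the body contains the validation error shown above
print(resp.text)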
Expected behavior
Input tokens should be truncated.