[bug] Encountered an error in forwardAsync function: Assertion failed: mNextBlocks.empty() #2708

akhoroshev · 2025-01-21T14:13:58Z

The error occurs under load after some time.

[TensorRT-LLM][ERROR] Encountered an error in forwardAsync function: [TensorRT-LLM][ERROR] Assertion failed: mNextBlocks.empty() (/home/jenkins/agent/workspace/LLM/helpers/Build-x86_64/llm/cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:256)
1             0x415048 tensorrt_llm::common::throwRuntimeError(char const*, int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 71
2       0x7f07f479dd5e /home/askhoroshev/TensorRT-LLM/cpp/build/tensorrt_llm/libtensorrt_llm.so(+0x827d5e) [0x7f07f479dd5e]
3       0x7f07f6a8346d tensorrt_llm::batch_manager::kv_cache_manager::BlockManager::claimLeafBlock(std::shared_ptr<tensorrt_llm::batch_manager::kv_cache_manager::KVCacheBlock>, std::optional<int>, std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 141
4       0x7f07f6a835cf tensorrt_llm::batch_manager::kv_cache_manager::BlockManager::getFreeBlock(int, std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 223
5       0x7f07f6a84af7 tensorrt_llm::batch_manager::kv_cache_manager::BlockManager::loadOrAllocateBlocks(std::vector<tensorrt_llm::batch_manager::kv_cache_manager::BlockKey, std::allocator<tensorrt_llm::batch_manager::kv_cache_manager::BlockKey> > const&, int, tensorrt_llm::batch_manager::kv_cache_manager::GenerationRequest&, std::vector<tensorrt_llm::executor::RetentionPriorityAndDuration, std::allocator<tensorrt_llm::executor::RetentionPriorityAndDuration> > const&) + 951
6       0x7f07f6a86558 tensorrt_llm::batch_manager::kv_cache_manager::BlockManager::addSequence(tensorrt_llm::batch_manager::kv_cache_manager::GenerationRequest&, int, int, tensorrt_llm::batch_manager::LlmRequest&) + 712
7       0x7f07f6a881bd tensorrt_llm::batch_manager::kv_cache_manager::KVCacheManager::addSequence(unsigned long, int, int, tensorrt_llm::common::OptionalRef<tensorrt_llm::batch_manager::LlmRequest>) + 2445
8       0x7f07f6a3f34c tensorrt_llm::batch_manager::AllocateKvCache::operator()(tensorrt_llm::batch_manager::kv_cache_manager::BaseKVCacheManager&, std::vector<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest>, std::allocator<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest> > >&, std::vector<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest>, std::allocator<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest> > > const&, tensorrt_llm::runtime::ModelConfig const&, tensorrt_llm::common::OptionalRef<tensorrt_llm::batch_manager::kv_cache_manager::BaseKVCacheManager>) const + 300
9       0x7f07f6adaf07 tensorrt_llm::batch_manager::TrtGptModelInflightBatching::forwardAsync(std::__cxx11::list<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest>, std::allocator<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest> > > const&) + 1479
10      0x7f07f6b704a1 tensorrt_llm::executor::Executor::Impl::forwardAsync(std::__cxx11::list<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest>, std::allocator<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest> > >&) + 433
11      0x7f07f6b775bc tensorrt_llm::executor::Executor::Impl::executionLoop() + 956
12      0x7f07e0448930 /home/askhoroshev/TensorRT-LLM/cpp/tensorrt_llm/kernels/decoderMaskedMultiheadAttention/decoderXQAImplJIT/nvrtcWrapper/x86_64-linux-gnu/libtensorrt_llm_nvrtc_wrapper.so(+0x32e7930) [0x7f07e0448930]
13      0x7f0793adf1ca /lib64/libpthread.so.0(+0x81ca) [0x7f0793adf1ca]
14      0x7f0792e0b8d3 clone + 67
terminate called after throwing an instance of 'tensorrt_llm::common::TllmException'
  what():  [TensorRT-LLM][ERROR] Assertion failed: Encountered an error in forwardAsync function: [TensorRT-LLM][ERROR] Assertion failed: mNextBlocks.empty() (/home/jenkins/agent/workspace/LLM/helpers/Build-x86_64/llm/cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:256)
1             0x415048 tensorrt_llm::common::throwRuntimeError(char const*, int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 71
2       0x7f07f479dd5e /home/askhoroshev/TensorRT-LLM/cpp/build/tensorrt_llm/libtensorrt_llm.so(+0x827d5e) [0x7f07f479dd5e]
3       0x7f07f6a8346d tensorrt_llm::batch_manager::kv_cache_manager::BlockManager::claimLeafBlock(std::shared_ptr<tensorrt_llm::batch_manager::kv_cache_manager::KVCacheBlock>, std::optional<int>, std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 141
4       0x7f07f6a835cf tensorrt_llm::batch_manager::kv_cache_manager::BlockManager::getFreeBlock(int, std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 223
5       0x7f07f6a84af7 tensorrt_llm::batch_manager::kv_cache_manager::BlockManager::loadOrAllocateBlocks(std::vector<tensorrt_llm::batch_manager::kv_cache_manager::BlockKey, std::allocator<tensorrt_llm::batch_manager::kv_cache_manager::BlockKey> > const&, int, tensorrt_llm::batch_manager::kv_cache_manager::GenerationRequest&, std::vector<tensorrt_llm::executor::RetentionPriorityAndDuration, std::allocator<tensorrt_llm::executor::RetentionPriorityAndDuration> > const&) + 951
6       0x7f07f6a86558 tensorrt_llm::batch_manager::kv_cache_manager::BlockManager::addSequence(tensorrt_llm::batch_manager::kv_cache_manager::GenerationRequest&, int, int, tensorrt_llm::batch_manager::LlmRequest&) + 712
7       0x7f07f6a881bd tensorrt_llm::batch_manager::kv_cache_manager::KVCacheManager::addSequence(unsigned long, int, int, tensorrt_llm::common::OptionalRef<tensorrt_llm::batch_manager::LlmRequest>) + 2445
8       0x7f07f6a3f34c tensorrt_llm::batch_manager::AllocateKvCache::operator()(tensorrt_llm::batch_manager::kv_cache_manager::BaseKVCacheManager&, std::vector<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest>, std::allocator<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest> > >&, std::vector<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest>, std::allocator<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest> > > const&, tensorrt_llm::runtime::ModelConfig const&, tensorrt_llm::common::OptionalRef<tensorrt_llm::batch_manager::kv_cache_manager::BaseKVCacheManager>) const + 300
9       0x7f07f6adaf07 tensorrt_llm::batch_manager::TrtGptModelInflightBatching::forwardAsync(std::__cxx11::list<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest>, std::allocator<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest> > > const&) + 1479
10      0x7f07f6b704a1 tensorrt_llm::executor::Executor::Impl::forwardAsync(std::__cxx11::list<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest>, std::allocator<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest> > >&) + 433
11      0x7f07f6b775bc tensorrt_llm::executor::Executor::Impl::executionLoop() + 956
12      0x7f07e0448930 /home/askhoroshev/TensorRT-LLM/cpp/tensorrt_llm/kernels/decoderMaskedMultiheadAttention/decoderXQAImplJIT/nvrtcWrapper/x86_64-linux-gnu/libtensorrt_llm_nvrtc_wrapper.so(+0x32e7930) [0x7f07e0448930]
13      0x7f0793adf1ca /lib64/libpthread.so.0(+0x81ca) [0x7f0793adf1ca]
14      0x7f0792e0b8d3 clone + 67 (/home/askhoroshev/TensorRT-LLM/modules/executor_server/src/serverImpl.cpp:412)
1             0x415048 tensorrt_llm::common::throwRuntimeError(char const*, int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 71
2             0x41f1fa /home/askhoroshev/TensorRT-LLM/cpp/build/modules/executor_server/executor_server() [0x41f1fa]
3             0x4acb90 /home/askhoroshev/TensorRT-LLM/cpp/build/modules/executor_server/executor_server() [0x4acb90]
4       0x7f07e0448930 /home/askhoroshev/TensorRT-LLM/cpp/tensorrt_llm/kernels/decoderMaskedMultiheadAttention/decoderXQAImplJIT/nvrtcWrapper/x86_64-linux-gnu/libtensorrt_llm_nvrtc_wrapper.so(+0x32e7930) [0x7f07e0448930]
5       0x7f0793adf1ca /lib64/libpthread.so.0(+0x81ca) [0x7f0793adf1ca]
6       0x7f0792e0b8d3 clone + 67

config.json

{
    "version": "0.16.0.dev2024120300",
    "pretrained_config": {
        "architecture": "DeepseekForCausalLM",
        "dtype": "bfloat16",
        "vocab_size": 42064,
        "hidden_size": 2048,
        "num_hidden_layers": 28,
        "num_attention_heads": 16,
        "hidden_act": "swiglu",
        "logits_dtype": "float32",
        "norm_epsilon": 1e-05,
        "runtime_defaults": null,
        "position_embedding_type": "rope_gpt_neox",
        "num_key_value_heads": 8,
        "intermediate_size": 14336,
        "max_position_embeddings": 131072,
        "mapping": {
            "world_size": 1,
            "gpus_per_node": 8,
            "cp_size": 1,
            "tp_size": 1,
            "pp_size": 1,
            "moe_tp_size": 1,
            "moe_ep_size": 1
        },
        "quantization": {
            "quant_algo": null,
            "kv_cache_quant_algo": null,
            "group_size": 128,
            "smoothquant_val": 0.5,
            "clamp_val": null,
            "use_meta_recipe": false,
            "has_zero_point": false,
            "pre_quant_scale": false,
            "exclude_modules": null
        },
        "use_parallel_embedding": false,
        "embedding_sharding_dim": 0,
        "share_embedding_table": false,
        "head_size": 128,
        "qk_layernorm": false,
        "rotary_embedding_dim": 128,
        "return_context_hidden": false,
        "logits_type": "float32",
        "moe_intermediate_size": 1792,
        "rotary_base": 300000,
        "rotary_scaling": null,
        "moe": {
            "num_experts": 64,
            "shared_expert_intermediate_size": 3584,
            "top_k": 6,
            "normalization_mode": 0
        }
    },
    "build_config": {
        "max_input_len": 130048,
        "max_seq_len": 131072,
        "opt_batch_size": 8,
        "max_batch_size": 256,
        "max_beam_width": 1,
        "max_num_tokens": 4096,
        "opt_num_tokens": 256,
        "max_prompt_embedding_table_size": 0,
        "kv_cache_type": "PAGED",
        "gather_context_logits": false,
        "gather_generation_logits": false,
        "strongly_typed": true,
        "force_num_profiles": null,
        "profiling_verbosity": "layer_names_only",
        "enable_debug_output": false,
        "max_draft_len": 0,
        "speculative_decoding_mode": 1,
        "use_refit": false,
        "input_timing_cache": null,
        "output_timing_cache": "model.cache",
        "lora_config": {
            "lora_dir": [],
            "lora_ckpt_source": "hf",
            "max_lora_rank": 64,
            "lora_target_modules": [],
            "trtllm_modules_to_hf_modules": {}
        },
        "auto_parallel_config": {
            "world_size": 1,
            "gpus_per_node": 8,
            "cluster_key": "H100-PCIe",
            "cluster_info": null,
            "sharding_cost_model": "alpha_beta",
            "comm_cost_model": "alpha_beta",
            "enable_pipeline_parallelism": false,
            "enable_shard_unbalanced_shape": false,
            "enable_shard_dynamic_shape": false,
            "enable_reduce_scatter": true,
            "builder_flags": null,
            "debug_mode": false,
            "infer_shape": true,
            "validation_mode": false,
            "same_buffer_io": {
                "past_key_value_(\\d+)": "present_key_value_\\1"
            },
            "same_spec_io": {},
            "sharded_io_allowlist": [
                "past_key_value_\\d+",
                "present_key_value_\\d*"
            ],
            "fill_weights": false,
            "parallel_config_cache": null,
            "profile_cache": null,
            "dump_path": null,
            "debug_outputs": []
        },
        "weight_sparsity": false,
        "weight_streaming": false,
        "plugin_config": {
            "dtype": "bfloat16",
            "bert_attention_plugin": "auto",
            "gpt_attention_plugin": "auto",
            "gemm_plugin": "bfloat16",
            "gemm_swiglu_plugin": null,
            "fp8_rowwise_gemm_plugin": null,
            "smooth_quant_gemm_plugin": null,
            "qserve_gemm_plugin": null,
            "identity_plugin": null,
            "layernorm_quantization_plugin": null,
            "rmsnorm_quantization_plugin": null,
            "nccl_plugin": null,
            "lora_plugin": null,
            "weight_only_groupwise_quant_matmul_plugin": null,
            "weight_only_quant_matmul_plugin": null,
            "smooth_quant_plugins": true,
            "quantize_per_token_plugin": false,
            "quantize_tensor_plugin": false,
            "moe_plugin": "auto",
            "mamba_conv1d_plugin": "auto",
            "low_latency_gemm_plugin": null,
            "low_latency_gemm_swiglu_plugin": null,
            "context_fmha": true,
            "bert_context_fmha_fp32_acc": false,
            "paged_kv_cache": true,
            "remove_input_padding": true,
            "reduce_fusion": false,
            "user_buffer": false,
            "tokens_per_block": 64,
            "use_paged_context_fmha": true,
            "use_fp8_context_fmha": false,
            "multiple_profiles": false,
            "paged_state": false,
            "streamingllm": false,
            "manage_weights": false,
            "use_fused_mlp": true,
            "pp_reduce_scatter": false
        },
        "use_strip_plan": false,
        "max_encoder_input_len": 1024,
        "use_fused_mlp": true,
        "monitor_memory": false,
        "use_mrope": false
    }
}

Chunked context and context reuse are enabled.

Startup logs

[TensorRT-LLM][INFO] Initializing MPI with thread mode 3
[TensorRT-LLM][INFO] Initialized MPI
[TensorRT-LLM][INFO] Set logger level to INFO
[TensorRT-LLM][INFO] ExecutorServer on rank 0 starting...
[TensorRT-LLM][INFO] Engine version 0.16.0.dev2024120300 found in the config file, assuming engine(s) built by new builder API.
[TensorRT-LLM][INFO] Engine version 0.16.0.dev2024120300 found in the config file, assuming engine(s) built by new builder API.
[TensorRT-LLM][INFO] Refreshed the MPI local session
[TensorRT-LLM][INFO] MPI size: 1, MPI local size: 1, rank: 0
[TensorRT-LLM][INFO] Rank 0 is using GPU 0
[TensorRT-LLM][INFO] TRTGptModel maxNumSequences: 256
[TensorRT-LLM][INFO] TRTGptModel maxBatchSize: 256
[TensorRT-LLM][INFO] TRTGptModel maxBeamWidth: 1
[TensorRT-LLM][INFO] TRTGptModel maxSequenceLen: 131072
[TensorRT-LLM][INFO] TRTGptModel maxDraftLen: 0
[TensorRT-LLM][INFO] TRTGptModel mMaxAttentionWindowSize: (131072) * 28
[TensorRT-LLM][INFO] TRTGptModel enableTrtOverlap: 0
[TensorRT-LLM][INFO] TRTGptModel normalizeLogProbs: 1
[TensorRT-LLM][INFO] TRTGptModel maxNumTokens: 4096
[TensorRT-LLM][INFO] TRTGptModel maxInputLen: 131071  = maxSequenceLen - 1 since chunked context is enabled
[TensorRT-LLM][INFO] TRTGptModel If model type is encoder, maxInputLen would be reset in trtEncoderModel to maxInputLen: 131072 = maxSequenceLen.
[TensorRT-LLM][INFO] Capacity Scheduler Policy: GUARANTEED_NO_EVICT
[TensorRT-LLM][INFO] Context Chunking Scheduler Policy: None
[TensorRT-LLM][INFO] Loaded engine size: 38675 MiB
[TensorRT-LLM][INFO] Inspecting the engine to identify potential runtime issues...
[TensorRT-LLM][INFO] The profiling verbosity of the engine does not allow this analysis to proceed. Re-build the engine with 'detailed' profiling verbosity to get more diagnostics.
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 453.61 MiB for execution context memory.
[TensorRT-LLM][INFO] [MS] Running engine with multi stream info
[TensorRT-LLM][INFO] [MS] Number of aux streams is 1
[TensorRT-LLM][INFO] [MS] Number of total worker streams is 2
[TensorRT-LLM][INFO] [MS] The main stream provided by execute/enqueue calls is the first worker stream
[TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 0, GPU 38668 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 809.11 MB GPU memory for runtime buffers.
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 692.36 MB GPU memory for decoder.
[TensorRT-LLM][INFO] Memory usage when calculating max tokens in paged kv cache: total: 79.21 GiB, available: 38.82 GiB
[TensorRT-LLM][INFO] Number of blocks in KV cache primary pool: 5112
[TensorRT-LLM][INFO] Number of blocks in KV cache secondary pool: 0, onboard blocks to primary memory before reuse: true
[TensorRT-LLM][INFO] Max KV cache pages per sequence: 2048
[TensorRT-LLM][INFO] Number of tokens per block: 64.
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 34.95 GiB for max tokens in paged KV cache (327168).
[TensorRT-LLM][INFO] Enable MPI KV cache transport.
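For readers checking the KV cache sizing in the log above, here is a minimal sketch that reproduces the reported figures from the values in config.json. The per-token size formula (K and V each storing `num_key_value_heads * head_size` bfloat16 elements per layer) is an assumption about how the pool is sized, not something confirmed in this thread, and the variable names are illustrative only.

```python
# Values copied from config.json and the startup log in this issue.
tokens_per_block = 64    # plugin_config.tokens_per_block
primary_blocks = 5112    # "Number of blocks in KV cache primary pool"
max_seq_len = 131072     # build_config.max_seq_len
num_layers = 28          # pretrained_config.num_hidden_layers
num_kv_heads = 8         # pretrained_config.num_key_value_heads
head_size = 128          # pretrained_config.head_size
bytes_per_elem = 2       # bfloat16, since kv_cache_quant_algo is null

# Total tokens the primary pool can hold.
max_tokens = primary_blocks * tokens_per_block

# Blocks needed by one sequence of maximum length.
pages_per_seq = max_seq_len // tokens_per_block

# Assumption: each token stores one K and one V vector per layer.
bytes_per_token = 2 * num_layers * num_kv_heads * head_size * bytes_per_elem
pool_gib = max_tokens * bytes_per_token / 2**30

print(max_tokens)          # 327168
print(pages_per_seq)       # 2048
print(round(pool_gib, 2))  # 34.95
```

These match the logged "max tokens in paged KV cache (327168)", "Max KV cache pages per sequence: 2048", and "Allocated 34.95 GiB". By the same arithmetic, one maximum-length sequence would occupy 2048 of the 5112 primary blocks.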


nv-guomingz · 2025-01-22T14:24:50Z

Hi @Funatiq, could you please take a look at this issue?

akhoroshev · 2025-01-22T20:49:27Z

@nv-guomingz @Funatiq Could you also look at these issues?

#2626
#2494

akhoroshev · 2025-01-23T15:21:31Z

This also reproduces on version d93a2dde84ead:

[TensorRT-LLM][ERROR] Encountered an error in forwardAsync function: [TensorRT-LLM][ERROR] Assertion failed: mNextBlocks.empty() (/home/jenkins/agent/workspace/LLM/helpers/Build-x86_64/llm/cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:257)
1             0x413dcc tensorrt_llm::common::throwRuntimeError(char const*, int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 71
2       0x7f492ab87384 /home/askhoroshev/test_trtllm/cpp/build/tensorrt_llm/libtensorrt_llm.so(+0x81e384) [0x7f492ab87384]
3       0x7f492ce5b765 tensorrt_llm::batch_manager::kv_cache_manager::BlockManager::claimLeafBlock(std::shared_ptr<tensorrt_llm::batch_manager::kv_cache_manager::KVCacheBlock>, std::optional<int>, std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 117
4       0x7f492ce5b85c tensorrt_llm::batch_manager::kv_cache_manager::BlockManager::getFreeBlock(int, std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 220
5       0x7f492ce5cc67 tensorrt_llm::batch_manager::kv_cache_manager::BlockManager::loadOrAllocateBlocks(std::vector<tensorrt_llm::batch_manager::kv_cache_manager::BlockKey, std::allocator<tensorrt_llm::batch_manager::kv_cache_manager::BlockKey> > const&, int, tensorrt_llm::batch_manager::kv_cache_manager::GenerationRequest&, std::vector<tensorrt_llm::executor::RetentionPriorityAndDuration, std::allocator<tensorrt_llm::executor::RetentionPriorityAndDuration> > const&) + 919
6       0x7f492ce5e4f8 tensorrt_llm::batch_manager::kv_cache_manager::BlockManager::addSequence(tensorrt_llm::batch_manager::kv_cache_manager::GenerationRequest&, int, int, tensorrt_llm::batch_manager::LlmRequest&) + 712
7       0x7f492ce6016d tensorrt_llm::batch_manager::kv_cache_manager::KVCacheManager::addSequence(unsigned long, int, int, tensorrt_llm::common::OptionalRef<tensorrt_llm::batch_manager::LlmRequest>) + 2445
8       0x7f492ce0befb tensorrt_llm::batch_manager::AllocateKvCache::operator()(tensorrt_llm::batch_manager::kv_cache_manager::BaseKVCacheManager&, std::vector<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest>, std::allocator<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest> > >&, std::vector<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest>, std::allocator<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest> > > const&, tensorrt_llm::runtime::ModelConfig const&, tensorrt_llm::common::OptionalRef<tensorrt_llm::batch_manager::kv_cache_manager::BaseKVCacheManager>) const + 299
9       0x7f492ceb914f tensorrt_llm::batch_manager::TrtGptModelInflightBatching::forwardAsync(std::__cxx11::list<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest>, std::allocator<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest> > > const&) + 1471
10      0x7f492cf50431 tensorrt_llm::executor::Executor::Impl::forwardAsync(std::__cxx11::list<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest>, std::allocator<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest> > >&) + 433
11      0x7f492cf5840c tensorrt_llm::executor::Executor::Impl::executionLoop() + 972
12      0x7f49166db970 /home/askhoroshev/test_trtllm/cpp/tensorrt_llm/kernels/decoderMaskedMultiheadAttention/decoderXQAImplJIT/nvrtcWrapper/x86_64-linux-gnu/libtensorrt_llm_nvrtc_wrapper.so(+0x32e7970) [0x7f49166db970]
13      0x7f48c9d721ca /lib64/libpthread.so.0(+0x81ca) [0x7f48c9d721ca]
14      0x7f48c909e8d3 clone + 67
terminate called after throwing an instance of 'tensorrt_llm::common::TllmException'
  what():  [TensorRT-LLM][ERROR] Assertion failed: Encountered an error in forwardAsync function: [TensorRT-LLM][ERROR] Assertion failed: mNextBlocks.empty() (/home/jenkins/agent/workspace/LLM/helpers/Build-x86_64/llm/cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:257)
1             0x413dcc tensorrt_llm::common::throwRuntimeError(char const*, int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 71
2       0x7f492ab87384 /home/askhoroshev/test_trtllm/cpp/build/tensorrt_llm/libtensorrt_llm.so(+0x81e384) [0x7f492ab87384]
3       0x7f492ce5b765 tensorrt_llm::batch_manager::kv_cache_manager::BlockManager::claimLeafBlock(std::shared_ptr<tensorrt_llm::batch_manager::kv_cache_manager::KVCacheBlock>, std::optional<int>, std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 117
4       0x7f492ce5b85c tensorrt_llm::batch_manager::kv_cache_manager::BlockManager::getFreeBlock(int, std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 220
5       0x7f492ce5cc67 tensorrt_llm::batch_manager::kv_cache_manager::BlockManager::loadOrAllocateBlocks(std::vector<tensorrt_llm::batch_manager::kv_cache_manager::BlockKey, std::allocator<tensorrt_llm::batch_manager::kv_cache_manager::BlockKey> > const&, int, tensorrt_llm::batch_manager::kv_cache_manager::GenerationRequest&, std::vector<tensorrt_llm::executor::RetentionPriorityAndDuration, std::allocator<tensorrt_llm::executor::RetentionPriorityAndDuration> > const&) + 919
6       0x7f492ce5e4f8 tensorrt_llm::batch_manager::kv_cache_manager::BlockManager::addSequence(tensorrt_llm::batch_manager::kv_cache_manager::GenerationRequest&, int, int, tensorrt_llm::batch_manager::LlmRequest&) + 712
7       0x7f492ce6016d tensorrt_llm::batch_manager::kv_cache_manager::KVCacheManager::addSequence(unsigned long, int, int, tensorrt_llm::common::OptionalRef<tensorrt_llm::batch_manager::LlmRequest>) + 2445
8       0x7f492ce0befb tensorrt_llm::batch_manager::AllocateKvCache::operator()(tensorrt_llm::batch_manager::kv_cache_manager::BaseKVCacheManager&, std::vector<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest>, std::allocator<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest> > >&, std::vector<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest>, std::allocator<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest> > > const&, tensorrt_llm::runtime::ModelConfig const&, tensorrt_llm::common::OptionalRef<tensorrt_llm::batch_manager::kv_cache_manager::BaseKVCacheManager>) const + 299
9       0x7f492ceb914f tensorrt_llm::batch_manager::TrtGptModelInflightBatching::forwardAsync(std::__cxx11::list<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest>, std::allocator<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest> > > const&) + 1471
10      0x7f492cf50431 tensorrt_llm::executor::Executor::Impl::forwardAsync(std::__cxx11::list<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest>, std::allocator<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest> > >&) + 433
11      0x7f492cf5840c tensorrt_llm::executor::Executor::Impl::executionLoop() + 972
12      0x7f49166db970 /home/askhoroshev/test_trtllm/cpp/tensorrt_llm/kernels/decoderMaskedMultiheadAttention/decoderXQAImplJIT/nvrtcWrapper/x86_64-linux-gnu/libtensorrt_llm_nvrtc_wrapper.so(+0x32e7970) [0x7f49166db970]
13      0x7f48c9d721ca /lib64/libpthread.so.0(+0x81ca) [0x7f48c9d721ca]
14      0x7f48c909e8d3 clone + 67 (/home/askhoroshev/test_trtllm/modules/executor_server/src/serverImpl.cpp:411)
1             0x413dcc tensorrt_llm::common::throwRuntimeError(char const*, int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 71
2             0x415643 /home/askhoroshev/test_trtllm/cpp/build/modules/executor_server/executor_server() [0x415643]
3             0x46a050 /home/askhoroshev/test_trtllm/cpp/build/modules/executor_server/executor_server() [0x46a050]
4       0x7f49166db970 /home/askhoroshev/test_trtllm/cpp/tensorrt_llm/kernels/decoderMaskedMultiheadAttention/decoderXQAImplJIT/nvrtcWrapper/x86_64-linux-gnu/libtensorrt_llm_nvrtc_wrapper.so(+0x32e7970) [0x7f49166db970]
5       0x7f48c9d721ca /lib64/libpthread.so.0(+0x81ca) [0x7f48c9d721ca]
6       0x7f48c909e8d3 clone + 67

akhoroshev changed the title ~~[TensorRT-LLM][ERROR] Encountered an error in forwardAsync function: [TensorRT-LLM][ERROR] Assertion failed: mNextBlocks.empty()~~ [TensorRT-LLM][ERROR] Encountered an error in forwardAsync function: Assertion failed: mNextBlocks.empty() Jan 21, 2025

akhoroshev changed the title ~~[TensorRT-LLM][ERROR] Encountered an error in forwardAsync function: Assertion failed: mNextBlocks.empty()~~ [bug] Encountered an error in forwardAsync function: Assertion failed: mNextBlocks.empty() Jan 21, 2025

nv-guomingz added the Generic Runtime label Jan 22, 2025

nv-guomingz assigned Funatiq Jan 22, 2025

github-actions bot added triaged Issue has been triaged by maintainers Investigating labels Jan 22, 2025

MartinMarciniszyn assigned dcampora and unassigned Funatiq Jan 23, 2025
