feat: DIS-373 dynamo KVBM connector API integration with TRTLLM #2440
Conversation
… ryan/connector-dev
Signed-off-by: Ryan Olson <ryanolson@users.noreply.github.com> Co-authored-by: Olga Andreeva <124622579+oandreeva-nv@users.noreply.github.com>
…unexpected call to get_num_new_matched_tokens
fix fix fix
Force-pushed from 721abbe to d192dfc
.drt(drt.inner().clone())
.host_blocks_config(get_blocks_config(CPU_CACHE, CPU_CACHE_OVERRIDE))
.disk_blocks_config(get_blocks_config(DISK_CACHE, DISK_CACHE_OVERRIDE))
.bytes_per_block_overriden(bytes_per_block)
Supporting two separate code paths here seems a bit ugly. Might be simpler to remove the bytes_per_block arg in the bindings and do the double-barrier thing in both vLLM and TRTLLM
use std::collections::HashSet;
use anyhow;

pub trait Leader: Send + Sync + std::fmt::Debug {
Why have a trait here? AFAIK, we'd only have 1 implementation of this
I will import from leader.rs
Actually, I can't import it yet: there are some differences between the two traits, and some arguments and argument types don't quite match.
);

// the number of device matched tokens should be less than or equal to the number of tokens in the request
debug_assert!(num_computed_tokens % self.block_size == 0);
The TRTLLM KV cache connector can match partial blocks, so this assertion won't always work. See https://github.com/NVIDIA/TensorRT-LLM/blob/69574ad73078656ad0530559888552e3a0cd51e2/cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp#L395
To simplify things, we could just immediately return `(0, false)` if it's not divisible.
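The early-return suggested above could look like this. A minimal sketch, assuming a Python-level signature; the function and parameter names are illustrative, not the PR's actual API:

```python
def get_num_new_matched_tokens(num_computed_tokens: int,
                               block_size: int,
                               num_matched_tokens: int) -> tuple[int, bool]:
    # TRTLLM's KV cache connector can match partial blocks, so
    # num_computed_tokens is not guaranteed to be block-aligned.
    # Rather than asserting divisibility, report no new matched
    # tokens and skip async onboarding for that request.
    if num_computed_tokens % block_size != 0:
        return 0, False
    return num_matched_tokens, True
```

This keeps the fast path unchanged while avoiding the `debug_assert!` failure on partial-block matches.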
// return the number of external tokens that are ready for onboarding
// we always return true here as we always asynchronously onboard matched blocks
if let SlotState::OnboardStaged(num_external_tokens) = slot.state() {
Not specific to this implementation, but always onboarding asynchronously may not be worthwhile. Not something we need to worry about now, but certainly need to keep that in mind longer term
}
}

/// Note: TRTLLM will not provide any scheduler output data for requests that are onboarding. it is entirely
Yes. With my current trtllm implementation, we only include a request in the scheduler output if at least 1 token is scheduled. So if we onboard with async=False, it will be included, but with async=True, it won't be included.
if let Some(&num_external_tokens) = self.inflight_request_to_num_external_tokens.get(&request_id) {
    if num_external_tokens > 0 {
        let num_computed_tokens = block_ids.len() * self.block_size - num_external_tokens;
This doesn't seem right. Instead of block_ids.len() * block_size we should be accessing the context_current_position field on the LlmRequest object. block_ids is a list of all blocks allocated for the entire prefill, and is independent of the amount of device or connector cache hits.
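The fix the comment suggests, sketched in Python (the helper name is hypothetical; `context_current_position` is the `LlmRequest` field the reviewer points to):

```python
def compute_num_computed_tokens(context_current_position: int,
                                num_external_tokens: int) -> int:
    # block_ids covers every block allocated for the whole prefill,
    # so len(block_ids) * block_size overcounts whenever allocation
    # runs ahead of the device/connector cache hits. The request's
    # current context position already reflects how far prefill has
    # actually progressed, so subtract the externally matched tokens
    # from that instead.
    return context_current_position - num_external_tokens
```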
.get(request_id)
.unwrap_or(&0);

slot.apply_scheduler_output(&[], &[], new_req.num_computed_tokens, scheduled_tokens)?;
Why are we ignoring the new token and block ids here?
raw_event_handles,
)

def bind_connector_meta(self, metadata: object):
Probably not great to be overriding bind_connector_meta, given that it isn't an abstract method to be overridden.
.map_err(to_pyerr)
}

pub fn build_connector_meta(&mut self, metadata: Vec<u8>) -> PyResult<()> {
Is this a typo? Should this be bind_connector_meta?
class DynamoKVBMConnectorWorker(KvCacheConnectorWorker):
    def __init__(self, executor_config: ExecutorConfig, **kwargs):
TRTLLM will never instantiate this with kwargs.
}

fn compute_num_blocks(num_blocks_config: &KvbmLeaderNumBlocksConfig, bytes_per_block: usize) -> usize {
    if num_blocks_config.is_overriden {
is_overriden should be a computed property that checks if num_blocks_overriden is 0. Alternatively, you could remove is_overriden entirely and just have a get_num_blocks method.
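A Python sketch of the suggested refactor (field names are assumptions based on the Rust config above, not the PR's actual definition):

```python
from dataclasses import dataclass


@dataclass
class KvbmLeaderNumBlocksConfig:
    # 0 means "not overridden": derive the block count from the
    # configured cache size instead of using an explicit override.
    num_blocks_overriden: int = 0
    cache_size_in_bytes: int = 0

    @property
    def is_overriden(self) -> bool:
        # Computed from the override value, so the two can never
        # disagree (the bug the reviewer is guarding against).
        return self.num_blocks_overriden != 0

    def get_num_blocks(self, bytes_per_block: int) -> int:
        # Alternative from the comment: fold the branch into a single
        # accessor so callers never inspect is_overriden themselves.
        if self.is_overriden:
            return self.num_blocks_overriden
        return self.cache_size_in_bytes // bytes_per_block
```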
.unwrap();

tracing::info!("Leader barrier synced with {} workers", config.world_size);
let mut bytes_per_block = worker_data
We should be summing this across all workers and using that sum directly as our bytes-per-block value. With TP this approach would happen to work, but we shouldn't assume every worker reports the same size.
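The aggregation the comment asks for, as a sketch (the function name is illustrative; `worker_data` is the per-worker values gathered at the leader barrier):

```python
def total_bytes_per_block(worker_data: list[int]) -> int:
    # With tensor parallelism each worker holds only its shard of a
    # block, so the leader's logical bytes-per-block is the sum of the
    # per-worker sizes, not any single worker's value. Summing also
    # stays correct if shards are unevenly sized, which taking
    # worker_data[0] (or asserting they're all equal) would not.
    return sum(worker_data)
```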
Force-pushed from d5bc4e5 to 0aad185
Overview:
dynamo KVBM connector API integration with TRTLLM
Details:
The PR is based on the ongoing changes from NVIDIA/TensorRT-LLM#6488, which added TRT-LLM connector API compatibility.
Changes in this PR:
Worker → Leader: send bytes_per_block.
Leader → Worker: send num_host_blocks and num_disk_blocks.
The only big change is that when the leader calls update_state_after_alloc, no num_external_tokens is passed to the function. For now, a HashMap is used to track each request’s num_external_tokens when calling get_num_new_matched_tokens.
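The HashMap workaround described above can be sketched in Python (class and method names are illustrative, not from the PR):

```python
class ExternalTokenTracker:
    """Record num_external_tokens when get_num_new_matched_tokens runs,
    so update_state_after_alloc can recover it later; TRTLLM does not
    pass it to that call."""

    def __init__(self) -> None:
        self._by_request: dict[str, int] = {}

    def record(self, request_id: str, num_external_tokens: int) -> None:
        self._by_request[request_id] = num_external_tokens

    def take(self, request_id: str) -> int:
        # Pop rather than get, so finished requests don't leak entries;
        # a request with no recorded match simply yields 0.
        return self._by_request.pop(request_id, 0)
```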
Issues:
Where should the reviewer start?
Related Issues: (use one of the action keywords Closes / Fixes / Resolves / Relates to)