30 changes: 29 additions & 1 deletion lib/bindings/python/rust/lib.rs
Contributor:
There is already support for tensor-based models via register_llm, but it might take some tweaking to skip the tokenizer bits when given a tensor-based model. I'm not sure whether a whole new register_model function is needed, or whether register_llm should just be renamed to register_model with some kind of LLM/tokenizer/HF-related flag as an argument.

In general, the Python bindings are our public-facing APIs and will be more sticky once released.

CC @GuanLuo @grahamking
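To make the flag-based alternative concrete, here is a toy Python sketch. Every name in it (register_model, load_tokenizer, the return shape) is invented for discussion, not the real dynamo API:

```python
import asyncio

# Hypothetical: register_llm renamed to register_model, with the
# tokenizer/HuggingFace work gated behind an explicit flag.
async def register_model(model_input, model_type, endpoint, model_path,
                         model_name=None, *, load_tokenizer=True, **kwargs):
    name = model_name or model_path
    if not load_tokenizer:
        # Tensor-based path: no tokenizer, no HuggingFace download.
        return {"name": name, "tokenizer": None}
    return {"name": name, "tokenizer": "loaded-from-hf"}

result = asyncio.run(
    register_model("Tensor", "TensorBased", None, "tensor", load_tokenizer=False)
)
```

One argument for the flag over a new function is that callers migrate by adding a keyword rather than switching entry points.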

Contributor Author:

Yeah, skipping all of the HuggingFace and config file download steps is the main motivation here. Right now, to deploy a tensor-based model with register_llm you still have to pass a dummy Hugging Face model that Dynamo will download and then do nothing with.

No strong opinion on supporting this via a new function vs. renaming the old one and adding an argument.
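The dummy-model workaround can be illustrated with a toy sketch (stub names only, recording arguments; the real binding is the Rust function below):

```python
import asyncio

calls = []

async def register_llm(model_input, model_type, endpoint, model_path,
                       model_name=None, **kwargs):
    # Toy stand-in that just records how it is invoked.
    calls.append({"model_path": model_path, "model_name": model_name})

# Before this change: tensor-based registration still needed a real HF repo id
# that would be downloaded and then ignored.
asyncio.run(register_llm("Tensor", "TensorBased", None,
                         "Qwen/Qwen3-0.6B", "tensor"))

# After this change: model_path doubles as the display name and nothing is fetched.
asyncio.run(register_llm("Tensor", "TensorBased", None, "tensor"))
```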

Contributor Author:

Modified some of the test files to illustrate the change here.

@@ -276,6 +276,8 @@ fn register_llm<'p>(
ModelInput::Tensor => llm_rs::model_type::ModelInput::Tensor,
};

let is_tensor_based = model_type.inner.supports_tensor();

let model_type_obj = model_type.inner;

let inner_path = model_path.to_string();
@@ -323,7 +325,33 @@ fn register_llm<'p>(
.or_else(|| Some(source_path.clone()));

pyo3_async_runtimes::tokio::future_into_py(py, async move {
// Resolve the model path (local or fetch from HuggingFace)
// For TensorBased models, skip HuggingFace downloads and register directly
if is_tensor_based {
let model_name = model_name.unwrap_or_else(|| source_path.clone());
let mut card = llm_rs::model_card::ModelDeploymentCard::with_name_only(&model_name);
card.model_type = model_type_obj;
card.model_input = model_input;
card.user_data = user_data_json;

if let Some(cfg) = runtime_config {
card.runtime_config = cfg.inner;
}

// Register the Model Deployment Card via discovery interface
let discovery = endpoint.inner.drt().discovery();
let spec = rs::discovery::DiscoverySpec::from_model(
endpoint.inner.component().namespace().name().to_string(),
endpoint.inner.component().name().to_string(),
endpoint.inner.name().to_string(),
&card,
)
.map_err(to_pyerr)?;
discovery.register(spec).await.map_err(to_pyerr)?;

return Ok(());
}

// For non-TensorBased models, resolve the model path (local or fetch from HuggingFace)
let model_path = if fs::exists(&source_path)? {
PathBuf::from(&source_path)
} else {
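The control flow of the new branch can be mirrored in a self-contained Python sketch (hypothetical names; the real types are the Rust ModelDeploymentCard and discovery interface above):

```python
import os
from dataclasses import dataclass
from pathlib import Path

@dataclass
class ModelCard:
    # Minimal stand-in for ModelDeploymentCard::with_name_only.
    display_name: str
    model_type: str = "TensorBased"

def resolve_registration(source_path, is_tensor_based, model_name=None):
    # Tensor-based models: register a name-only card; never touch disk or HF.
    if is_tensor_based:
        return ModelCard(display_name=model_name or source_path)
    # Everything else: resolve a local path, or fall back to an HF fetch
    # (elided in this sketch).
    if os.path.exists(source_path):
        return Path(source_path)
    raise NotImplementedError("HuggingFace fetch elided in this sketch")
```

Note the early return mirrors the Rust code: the tensor-based path never reaches the filesystem check.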
4 changes: 4 additions & 0 deletions lib/bindings/python/src/dynamo/_core.pyi
@@ -1077,6 +1077,10 @@ async def register_llm(
Providing only one of these parameters will raise a ValueError.
- `lora_name`: The served model name for the LoRA model
- `base_model_path`: Path to the base model that the LoRA extends

For TensorBased models (using ModelInput.Tensor), HuggingFace downloads are skipped
and a minimal model card is registered directly. Use model_path as the display name
for these models.
"""
...

7 changes: 2 additions & 5 deletions lib/bindings/python/tests/test_tensor.py
@@ -34,15 +34,12 @@ async def test_register(runtime: DistributedRuntime):

assert model_config == runtime_config.get_tensor_model_config()

# [gluo FIXME] register_llm will attempt to load a LLM model,
# which is not well-defined for Tensor yet. Currently provide
# a valid model name to pass the registration.
# Use register_llm for tensor-based backends (skips HuggingFace downloads)
await register_llm(
ModelInput.Tensor,
ModelType.TensorBased,
endpoint,
"Qwen/Qwen3-0.6B",
"tensor",
"tensor", # model_path (used as display name for tensor-based models)
runtime_config=runtime_config,
)

9 changes: 9 additions & 0 deletions lib/llm/src/model_card.rs
@@ -385,6 +385,15 @@ impl ModelDeploymentCard {
return Ok(());
}

// For TensorBased models, config files are not used - they handle everything in the backend
if self.model_type.supports_tensor() {
tracing::debug!(
display_name = %self.display_name,
"Skipping config download for TensorBased model"
);
return Ok(());
}

let ignore_weights = true;
let local_path = crate::hub::from_hf(&self.display_name, ignore_weights).await?;

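The guard added to ModelDeploymentCard reduces to a small predicate, sketched here with toy names (is_local stands for the existing local-checkout early return):

```python
def should_fetch_config(is_local, supports_tensor):
    # Mirrors the early returns above: local checkouts already have their
    # config files, and tensor-based models handle config in the backend.
    if is_local:
        return False
    if supports_tensor:
        return False
    return True
```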
7 changes: 2 additions & 5 deletions tests/frontend/grpc/echo_tensor_worker.py
@@ -53,15 +53,12 @@ async def echo_tensor_worker(runtime: DistributedRuntime):
)
assert model_config == retrieved_model_config

# [gluo FIXME] register_llm will attempt to load a LLM model,
# which is not well-defined for Tensor yet. Currently provide
# a valid model name to pass the registration.
# Use register_llm for tensor-based backends (skips HuggingFace downloads)
await register_llm(
ModelInput.Tensor,
ModelType.TensorBased,
endpoint,
"Qwen/Qwen3-0.6B",
"echo",
"echo", # model_path (used as display name for tensor-based models)
runtime_config=runtime_config,
)
