feat: add local inference provider with llama.cpp backend and HuggingFace model management#6933

Merged
jh-block merged 66 commits into main from local-models-candle on Feb 19, 2026
Conversation

@DOsinga (Collaborator) commented Feb 3, 2026

Summary

Adds a local inference provider that enables running language models directly on-device using llama.cpp, with full integration across CLI, server, and desktop UI.

Core Changes

  • Local inference provider (crates/goose/src/providers/local_inference/): New provider with modular architecture split into submodules:

    • inference_native_tools: Inference path for models with native tool-calling support
    • inference_emulated_tools: Inference path for models without native tool support, using text-based `$` command and fenced `execute` block detection
    • inference_engine: Shared inference primitives — context management, sampling, and token generation
    • tool_parsing: Parsing tool calls from model output (JSON and XML formats)
    • hf_models: HuggingFace Hub model search and GGUF file discovery
    • local_model_registry: Local model download and lifecycle management
  • Server routes (crates/goose-server/src/routes/local_inference.rs): REST endpoints for model download, listing, deletion, and status

  • CLI integration (crates/goose-cli/): New local-models subcommand for managing downloaded models, plus provider selection support

  • Desktop UI (ui/desktop/src/components/):

    • LocalModelSetup.tsx: First-run setup flow for downloading models
    • LocalInferenceSettings.tsx: Settings panel for model management
    • HuggingFaceModelSearch.tsx: Search and download models from HuggingFace
    • ModelSettingsPanel.tsx: Per-model configuration
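The text-based `$` command detection in the emulated tool path can be sketched roughly as follows; the function name and exact rules here are hypothetical, not the actual inference_emulated_tools implementation:

```rust
// Hypothetical sketch of text-based `$` command detection for models without
// native tool calling: scan model output for lines beginning with "$ " and
// treat the remainder of each such line as a shell command to run as a tool call.
fn extract_shell_commands(model_output: &str) -> Vec<String> {
    model_output
        .lines()
        .filter_map(|line| {
            // Allow leading indentation before the "$ " marker.
            line.trim_start()
                .strip_prefix("$ ")
                .map(|cmd| cmd.trim().to_string())
        })
        // Drop markers with nothing after them.
        .filter(|cmd| !cmd.is_empty())
        .collect()
}

fn main() {
    let output = "I'll list the files first.\n$ ls -la\nThen check the branch:\n$ git status\n";
    let cmds = extract_shell_commands(output);
    assert_eq!(cmds, vec!["ls -la".to_string(), "git status".to_string()]);
    println!("{cmds:?}");
}
```

The native-tool path skips this step entirely; it relies on the model emitting structured tool calls, which tool_parsing then decodes from JSON or XML.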

Additional Changes

  • Tiny model system prompt for resource-constrained local models
  • Session name generation support for local inference conversations
  • Download progress display improvements

Known Limitations (deferred to future iterations)

  • Memory management is best-effort: estimate_max_context_for_memory uses 50% of available memory for KV cache, which is a rough heuristic. available_inference_memory_bytes picks the max of any single accelerator device, which may not be correct for multi-GPU setups or unified memory architectures. On unified-memory Macs, free memory fluctuates with system load, so a model that fits during estimation may OOM during inference. There is currently no fallback or recovery for OOM — it will crash the process.

  • Model unloading strategy is naive: When loading a new model, all other models are unloaded. This is fine for single-model usage but the code structure (HashMap of model slots) suggests multi-model was considered. A smarter eviction strategy (e.g. LRU) could be added if multi-model support is needed.

  • Featured models list is hardcoded: The FEATURED_MODELS constant in local_model_registry.rs is a static list of HuggingFace model specs. There is no mechanism to update it without a code change — it will go stale as newer/better models are released. A future iteration could fetch this list from a remote config or allow user customization.
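To make the memory limitation above concrete, the 50% heuristic amounts to roughly the following. This is a simplified sketch with hypothetical names and an illustrative bytes-per-token figure, not the actual estimate_max_context_for_memory:

```rust
// Simplified sketch of the 50%-of-available-memory heuristic for sizing the
// KV cache. The function name and the per-token KV cost are illustrative only;
// the real cost depends on model architecture, quantization, and KV cache dtype.
fn estimate_max_context_tokens(available_bytes: u64, kv_bytes_per_token: u64) -> u64 {
    // Reserve half of available memory for the KV cache (rough heuristic);
    // the rest is left for weights, activations, and the OS.
    let kv_budget = available_bytes / 2;
    kv_budget / kv_bytes_per_token
}

fn main() {
    // e.g. 8 GiB reported free and 128 KiB of KV cache per token
    let tokens = estimate_max_context_tokens(8 * 1024 * 1024 * 1024, 128 * 1024);
    assert_eq!(tokens, 32_768);
    println!("max context: {tokens} tokens");
}
```

On unified-memory machines the "available" figure is itself a moving target, which is why the limitation above notes that a model that fits during estimation may still OOM during inference.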

Copilot AI review requested due to automatic review settings February 3, 2026 23:36

Copilot AI left a comment


Pull request overview

This PR introduces support for running local LLMs via Candle-backed inference, including backend provider integration, REST endpoints for managing model downloads, and desktop UI to configure and select local models.

Changes:

  • Add a new local provider in the core goose crate that loads quantized GGUF models with architecture-aware handling (Llama, Phi, Phi-3) and custom chat templates, plus support for streaming responses.
  • Expose new server routes and OpenAPI definitions for listing local models, starting/canceling downloads, querying download progress, and deleting downloaded models, reusing the shared download manager.
  • Extend the desktop UI and API client to manage local LLM downloads, display progress, select an active local model, and surface dedicated guidance when the local provider is selected.

Reviewed changes

Copilot reviewed 18 out of 18 changed files in this pull request and generated 2 comments.

Summary per file:

  • ui/desktop/src/components/settings/models/subcomponents/SwitchModelModal.tsx: Adds a special informational panel when the local provider is selected, guiding users to the Models settings to download local models instead of showing generic provider error UI.
  • ui/desktop/src/components/settings/models/ModelsSection.tsx: Embeds the new LocalInferenceSettings card into the Models settings page so users can manage local LLMs alongside cloud providers.
  • ui/desktop/src/components/settings/localInference/LocalInferenceSettings.tsx: New React component for listing available local LLMs, starting/canceling downloads with progress visualization, and persisting the selected local model and provider in config.
  • ui/desktop/src/components/settings/dictation/LocalModelManager.tsx: Adjusts the dictation local model manager’s “show all / recommended only” logic and toggle label to match the new local LLM UI behavior.
  • ui/desktop/src/api/types.gen.ts: Extends generated types with LocalLlmModel, LocalModelResponse, ModelTier, and request/response types for the /local-inference/models endpoints.
  • ui/desktop/src/api/sdk.gen.ts: Adds typed client functions (listLocalModels, downloadLocalModel, getLocalModelDownloadProgress, cancelLocalModelDownload, deleteLocalModel) for the new local inference endpoints.
  • ui/desktop/src/api/index.ts: Re-exports the new local inference client functions and types so the UI can consume them through the existing API barrel.
  • ui/desktop/openapi.json: Updates the desktop-copied OpenAPI spec with the new /local-inference/models paths, schemas for LocalLlmModel, LocalModelResponse, and ModelTier, and ties them into the existing components section.
  • crates/goose/src/providers/mod.rs: Registers the new local_inference provider module in the providers namespace.
  • crates/goose/src/providers/local_inference.rs: Implements the LocalInferenceProvider with model metadata, recommendation logic, GGUF loading for Llama/Phi/Phi-3, prompt templating, and both non-streaming and streaming completion APIs.
  • crates/goose/src/providers/init.rs: Adds LocalInferenceProvider to the provider registry so it can be created and used like other built-in providers.
  • crates/goose/src/dictation/download_manager.rs: Generalizes the download manager to optionally set an arbitrary config key/value on successful download completion, so it can be reused for both dictation and local LLM models.
  • crates/goose/Cargo.toml: Registers new examples (candle_quantized, test_local_provider) to exercise the local quantized model and provider behavior.
  • crates/goose-server/src/routes/utils.rs: Treats the local provider as always “configured” in provider status checks, bypassing API key requirements since local models are configured via file downloads.
  • crates/goose-server/src/routes/mod.rs: Wires the new local_inference route module into the main router so its endpoints are served.
  • crates/goose-server/src/routes/local_inference.rs: Defines REST handlers for listing local models, starting downloads of model and tokenizer files, querying combined download progress, canceling downloads, and deleting model/tokenizer files.
  • crates/goose-server/src/routes/dictation.rs: Updates dictation model downloads to use the new generalized download manager signature and set the dictation config key on download completion.
  • crates/goose-server/src/openapi.rs: Registers the new local inference routes and schemas (LocalModelResponse, LocalLlmModel, ModelTier) with utoipa so they appear in the server’s OpenAPI output.

Comment on lines 27 to 42
```rust
fn convert_error(e: anyhow::Error) -> ErrorResponse {
    let error_msg = e.to_string();

    if error_msg.contains("not configured") || error_msg.contains("not found") {
        ErrorResponse {
            message: error_msg,
            status: StatusCode::PRECONDITION_FAILED,
        }
    } else if error_msg.contains("already in progress") {
        ErrorResponse {
            message: error_msg,
            status: StatusCode::BAD_REQUEST,
        }
    } else {
        ErrorResponse::internal(error_msg)
    }
}
```

Copilot AI Feb 3, 2026


convert_error maps errors containing "not found" to HTTP 412 PRECONDITION_FAILED, but the OpenAPI annotations and generated TypeScript types for the local inference download-cancel route document a 404 "Download not found" error. A missing download therefore currently produces a status that does not match the declared API; updating this branch to return 404 for missing downloads would align server behavior with the documented contract and client expectations.
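A minimal sketch of the suggested mapping, using plain numeric codes in place of axum's StatusCode (branch order and messages here are illustrative, not the final handler):

```rust
// Sketch of the suggested fix: check "not found" first and map it to 404 so
// the server matches the documented contract, instead of folding it into 412.
// Plain u16 codes stand in for the StatusCode type used by the real route.
fn status_for_error(error_msg: &str) -> u16 {
    if error_msg.contains("not found") {
        404 // matches the documented "Download not found" response
    } else if error_msg.contains("not configured") {
        412 // PRECONDITION_FAILED: provider/model not set up yet
    } else if error_msg.contains("already in progress") {
        400 // BAD_REQUEST: duplicate download request
    } else {
        500 // fallback: internal error
    }
}

fn main() {
    assert_eq!(status_for_error("download not found"), 404);
    assert_eq!(status_for_error("provider not configured"), 412);
    assert_eq!(status_for_error("download already in progress"), 400);
    assert_eq!(status_for_error("disk full"), 500);
}
```

Returning the documented 404 would also let generated clients distinguish "nothing to cancel" from a precondition failure.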

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Copilot AI review requested due to automatic review settings February 4, 2026 11:14

Copilot AI left a comment


Pull request overview

Copilot reviewed 18 out of 18 changed files in this pull request and generated 1 comment.

Comment on lines 95 to 97
```rust
// Download tokenizer file (set config and provider when this completes)
// We'll set GOOSE_PROVIDER to "local" after the tokenizer download completes
// This is handled in the download_manager callback
```

Copilot AI Feb 4, 2026


The comment here says that GOOSE_PROVIDER will be set to "local" when the tokenizer download completes, but the download_model call only passes LOCAL_LLM_MODEL_CONFIG_KEY as the config key, so GOOSE_PROVIDER/GOOSE_MODEL are never updated in this path; either the comment should be corrected or the code should also set those config values if that behavior is required.

Suggested change

```diff
- // Download tokenizer file (set config and provider when this completes)
- // We'll set GOOSE_PROVIDER to "local" after the tokenizer download completes
- // This is handled in the download_manager callback
+ // Download tokenizer file and update the local LLM model config when this completes.
+ // This is handled in the download_manager callback using LOCAL_LLM_MODEL_CONFIG_KEY.
```

Copilot AI review requested due to automatic review settings February 6, 2026 12:36

Copilot AI left a comment


Pull request overview

Copilot reviewed 37 out of 38 changed files in this pull request and generated 7 comments.

Comment on lines +97 to +99
```rust
// Special override
if metadata.name == "local" {
    return true;
```

Copilot AI Feb 6, 2026


The hardcoded check for provider name "local" bypasses the normal configuration check. While this may be intentional for local models that don't require auth, it should verify that at least one model is downloaded before returning true. Otherwise, the UI may show the provider as configured when no models are available for use.

Suggested change

```diff
- // Special override
- if metadata.name == "local" {
-     return true;
+ // Special override for local provider: only consider it configured if at least one model is available.
+ if metadata.name == "local" {
+     if let Ok(loaded_provider) = load_provider(metadata.name.as_str()) {
+         if !loaded_provider.models.is_empty() {
+             return true;
+         }
+     }
```
```rust
) -> Option<String> {
    if let Ok(provider_guard) = self.provider.try_lock() {
        if let Some(provider) = provider_guard.as_ref() {
            if provider.get_model_config().context_limit() < 9 * 1024 * 1024 {
```

Copilot AI Feb 6, 2026


The context limit check has an incorrect threshold. Line 1485 compares context_limit() against 9 * 1024 * 1024 (roughly 9.4 million tokens), but context_limit() returns a token count, not bytes. This threshold is unrealistically high - even the largest models have ~200K token limits. It should likely be 9 * 1024 (9K tokens) instead.

Suggested change

```diff
- if provider.get_model_config().context_limit() < 9 * 1024 * 1024 {
+ if provider.get_model_config().context_limit() < 9 * 1024 {
```

Comment on lines 647 to 649
```rust
eprintln!(
    "DEBUG: Loading platform extension '{}': {}",
    name, provider_state
```

Copilot AI Feb 6, 2026


Debug logging using eprintln! should be removed or converted to tracing::debug!. These debug statements were likely added during development and should not be left in production code.

Suggested change

```diff
- eprintln!(
-     "DEBUG: Loading platform extension '{}': {}",
-     name, provider_state
+ tracing::debug!(
+     "Loading platform extension '{}': {}",
+     name,
+     provider_state
```

Comment on lines 30 to 34
```rust
let context_limit = context.get_context_limit();
eprintln!(
    "DEBUG: TodoClient::new - context_limit from provider: {:?}",
    context_limit
);
```

Copilot AI Feb 6, 2026


Debug logging using eprintln! should be removed or converted to tracing::debug!. These debug statements were likely added during development and should not be left in production code.

Comment on lines 112 to 123
```rust
eprintln!(
    "DEBUG: AppsManagerClient::new - model: {:?}, context_limit: {:?}",
    model_name, context_limit
);

match context.require_min_context(10_000, EXTENSION_NAME) {
    Ok(_) => eprintln!("DEBUG: AppsManagerClient context check PASSED"),
    Err(e) => {
        eprintln!("DEBUG: AppsManagerClient context check FAILED: {}", e);
        return Err(e.to_string());
    }
}
```

Copilot AI Feb 6, 2026


Debug logging using eprintln! should be removed or converted to tracing::debug!. These debug statements were likely added during development and should not be left in production code.

```rust
crate::prompt_template::render_template("tiny_model_system.md", &context)
    .unwrap_or_else(|e| {
        // Fallback if template fails to load
        eprintln!("WARNING: Failed to load tiny_model_system.md: {:?}", e);
```

Copilot AI Feb 6, 2026


Error logging using eprintln! should use tracing::error! or tracing::warn! instead for consistency with the rest of the codebase.

Suggested change

```diff
- eprintln!("WARNING: Failed to load tiny_model_system.md: {:?}", e);
+ tracing::warn!("Failed to load tiny_model_system.md: {:?}", e);
```

Comment on lines 78 to 102
```typescript
const pollDownloadProgress = (modelId: string) => {
  const interval = setInterval(async () => {
    try {
      const response = await getLocalModelDownloadProgress({ path: { model_id: modelId } });
      if (response.data) {
        const progress = response.data;
        setDownloads((prev) => new Map(prev).set(modelId, progress));

        if (progress.status === 'completed') {
          clearInterval(interval);
          await loadModels(); // Refresh model list
          // Auto-select the model that was just downloaded
          await selectModel(modelId);
        } else if (progress.status === 'failed') {
          clearInterval(interval);
          await loadModels();
        }
      } else {
        clearInterval(interval);
      }
    } catch {
      clearInterval(interval);
    }
  }, 500);
};
```

Copilot AI Feb 6, 2026


Memory leak: interval is not stored and cannot be cleared on component unmount. Store the interval ID in a ref or state and clear it in useEffect cleanup to prevent the polling from continuing after the component unmounts.

Douwe Osinga and others added 2 commits February 6, 2026 14:02
And only allow selection of models that have been downloaded
Copilot AI review requested due to automatic review settings February 6, 2026 13:25

Copilot AI left a comment


Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

* origin/main: (54 commits)
  chore: strip posthog for sessions/models/daily only (#7079)
  tidy: clean up old benchmark and add gym (#7081)
  fix: use command.process_group(0) for CLI providers, not just MCP (#7083)
  added build notify (#6891)
  test(mcp): add image tool test and consolidate MCP test fixtures (#7019)
  fix: remove Option from model listing return types, propagate errors (#7074)
  fix: lazy provider creation for goose acp (#7026) (#7066)
  Smoke tests: split compaction test and use debug build (#6984)
  fix(deps): trim bat to resolve RUSTSEC-2024-0320 (#7061)
  feat: expose AGENT_SESSION_ID env var to extension child processes (#7072)
  fix: add XML tool call parsing fallback for Qwen3-coder via Ollama (#6882)
  Remove clippy too_many_lines lint and decompose long functions (#7064)
  refactor: move disable_session_naming into AgentConfig (#7062)
  Add global config switch to disable automatic session naming (#7052)
  docs: add blog post - 8 Things You Didn't Know About Code Mode (#7059)
  fix: ensure animated elements are visible when prefers-reduced-motion is enabled (#7047)
  Show recommended model on failture (#7040)
  feat(ui): add session content search via API (#7050)
  docs: fix img url (#7053)
  Desktop UI for deleting custom providers (#7042)
  ...
Copilot AI review requested due to automatic review settings February 9, 2026 12:49

Copilot AI left a comment


Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

The shell tool returns two Content::text items: one for the assistant and
one for the user. For short outputs both contain identical text. The
callback in code_execution_extension was joining all text content without
filtering by audience, causing the output to appear twice.

Filter to only include content targeted at the assistant (or with no
audience restriction) before joining text results.
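The filter this commit describes can be sketched as follows, with simplified hypothetical types standing in for the MCP content types used by the real code:

```rust
// Simplified sketch of filtering tool output by audience before joining:
// keep only text content aimed at the assistant, or with no audience set.
#[derive(PartialEq)]
enum Audience {
    Assistant,
    User,
}

struct TextContent {
    text: String,
    audience: Option<Vec<Audience>>,
}

fn join_for_assistant(contents: &[TextContent]) -> String {
    contents
        .iter()
        // No audience annotation means the content is for everyone; otherwise
        // require the assistant to be in the audience list.
        .filter(|c| {
            c.audience
                .as_ref()
                .map_or(true, |a| a.contains(&Audience::Assistant))
        })
        .map(|c| c.text.as_str())
        .collect::<Vec<_>>()
        .join("\n")
}

fn main() {
    let contents = vec![
        TextContent { text: "out".into(), audience: Some(vec![Audience::Assistant]) },
        TextContent { text: "out".into(), audience: Some(vec![Audience::User]) },
    ];
    // Without the filter both identical copies would be joined; with it, one remains.
    assert_eq!(join_for_assistant(&contents), "out");
}
```

With this filter in place, the identical short outputs carried by the assistant-facing and user-facing items are joined only once.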
* origin/main: (30 commits)
  docs: GCP Vertex AI org policy filtering & update OnboardingProviderSetup component (#7125)
  feat: replace subagent and skills with unified summon extension (#6964)
  feat: add AGENT=goose environment variable for cross-tool compatibility (#7017)
  fix: strip empty extensions array when deeplink also (#7096)
  [docs] update authors.yaml file (#7114)
  Implement manpage generation for goose-cli (#6980)
  docs: tool output optimization (#7109)
  Fix duplicated output in Code Mode by filtering content by audience (#7117)
  Enable tom (Top Of Mind) platform extension by default (#7111)
  chore: added notification for canary build failure (#7106)
  fix: fix windows bundle random failure and optimise canary build (#7105)
  feat(acp): add model selection support for session/new and session/set_model (#7112)
  fix: isolate claude-code sessions via stream-json session_id (#7108)
  ci: enable agentic provider live tests (claude-code, codex, gemini-cli) (#7088)
  docs: codex subscription support (#7104)
  chore: add a new scenario (#7107)
  fix: Goose Desktop missing Calendar and Reminders entitlements (#7100)
  Fix 'Edit In Place' and 'Fork Session' features (#6970)
  Fix: Only send command content to command injection classifier (excluding part of tool call dict) (#7082)
  Docs: require auth optional for custom providers (#7098)
  ...
Save uses tempfile + rename for atomic writes and fs2 advisory
file locks for cross-process safety, preventing corruption from
concurrent writes or crashes mid-write.
Avoids potential deadlock when spawn_blocking thread pool is
saturated and another task holds the lock waiting for a slot.
…al inference

Was awkwardly nested under dictation despite being used by both
dictation and local inference. cleanup_partial_downloads already
handles missing directories gracefully via if-let on read_dir.
@jamadeo (Collaborator) left a comment

Very nice! I'm relying a bit on the copilot review here as well but this looks great to me.

```rust
) -> Option<String> {
    if let Ok(provider_guard) = self.provider.try_lock() {
        if let Some(provider) = provider_guard.as_ref() {
            if provider.get_model_config().context_limit() < 9 * 1024 * 1024 {
```

A collaborator commented:

Is the intention here to omit this content for models with small context windows?

Copilot AI review requested due to automatic review settings February 19, 2026 17:09

Copilot AI left a comment


Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

- Fix undefined 'featured' variable, use response.data instead
- Remove onFallbackRequest prop not in AppRendererProps
- Clean up unused imports and state
- Change MIN_CONTEXT_FOR_MOIM from 9MB (unreachable) to 32K tokens
- Remove embedding models from expected model list (filtered by tool_call)
Copilot AI review requested due to automatic review settings February 19, 2026 17:52

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

jh-block added this pull request to the merge queue Feb 19, 2026
Merged via the queue into main with commit ddd35f6 Feb 19, 2026
19 checks passed
jh-block deleted the local-models-candle branch February 19, 2026 18:43
A collaborator commented:

Were these changes to the MCP App renderer intentional? This PR removed @alexhancock's work that was merged in #7039.

A collaborator replied:

fixed via #7366

aharvard added a commit that referenced this pull request Feb 19, 2026
PR #6933 (local inference provider) was based on a stale branch that
didn't include #7039's sampling changes to McpAppRenderer.tsx. When it
merged, it silently reverted all client-side sampling plumbing:

- RequestHandlerExtra and JSONRPCRequest imports
- SamplingCreateMessageParams/Response type imports
- apiHost/secretKey state + initialization useEffect
- handleFallbackRequest callback (sampling/createMessage handler)
- onFallbackRequest prop on AppRenderer

The server-side route (sampling.rs), types.ts, and chat.html all
survived — only the McpAppRenderer.tsx wiring was lost.

This restores the exact code that #7039 added and #6933 removed.
katzdave added a commit that referenced this pull request Feb 19, 2026
* 'main' of github.com:block/goose:
  docs: agent variable (#7365)
  docs: pass env vars to shell (#7361)
  docs: update sandbox topic (#7336)
  feat: add local inference provider with llama.cpp backend and HuggingFace model management (#6933)
michaelneale added a commit that referenced this pull request Feb 19, 2026
* main: (46 commits)
  chore(deps): bump hono from 4.11.9 to 4.12.0 in /ui/desktop (#7369)
  Include 3rd-party license copy for JavaScript/CSS minified files (#7352)
  docs for reasoning env var (#7367)
  docs: update skills detail page to reference Goose Summon extension (#7350)
  fix(apps): restore MCP app sampling support reverted by #6933 (#7366)
  feat: TUI client of goose-acp (#7362)
  docs: agent variable (#7365)
  docs: pass env vars to shell (#7361)
  docs: update sandbox topic (#7336)
  feat: add local inference provider with llama.cpp backend and HuggingFace model management (#6933)
  Docs: claude code uses stream-json (#7358)
  Improve link confirmation modal (#7333)
  fix(ci): deflake smoke tests for Google models (#7344)
  feat: add Cerebras provider support (#7339)
  fix: skip whitespace-only text blocks in Anthropic message (#7343)
  fix(goose-acp): heap allocations (#7322)
  Remove trailing space from links (#7156)
  fix: detect low balance and prompt for top up (#7166)
  feat(apps): add support for MCP apps to sample (#7039)
  Typescript SDK for ACP extension methods (#7319)
  ...
tlongwell-block added a commit that referenced this pull request Feb 20, 2026
* origin/main: (21 commits)
  feat(ui): show token counts directly for "free" providers (#7383)
  Update creator note (#7384)
  Remove display_name from local model API and use model ID everywhere (#7382)
  fix(summon): stop MOIM from telling models to sleep while waiting for tasks (#7377)
  Completely pointless ascii art (#7329)
  feat: add Neighborhood extension to the Extensions Library (#7328)
  feat: computer controller overhaul, adding peekaboo (#7342)
  Add blog post: Gastown Explained: How to Use Goosetown for Parallel Agentic Engineering (#7372)
  docs: type-to-search goose configure lists (#7371)
  docs: search conversation history (#7370)
  fix: stderr noise (#7346)
  chore(deps): bump hono from 4.11.9 to 4.12.0 in /ui/desktop (#7369)
  Include 3rd-party license copy for JavaScript/CSS minified files (#7352)
  docs for reasoning env var (#7367)
  docs: update skills detail page to reference Goose Summon extension (#7350)
  fix(apps): restore MCP app sampling support reverted by #6933 (#7366)
  feat: TUI client of goose-acp (#7362)
  docs: agent variable (#7365)
  docs: pass env vars to shell (#7361)
  docs: update sandbox topic (#7336)
  ...

# Conflicts:
#	Cargo.lock
bavadim added a commit to redsquad-tech/is-goose that referenced this pull request Feb 21, 2026
… HuggingFace model management (block#6933)"

This reverts commit ddd35f6.
bavadim added a commit to redsquad-tech/is-goose that referenced this pull request Feb 21, 2026
…kend and HuggingFace model management (block#6933)""

This reverts commit b9bd830.