feat: add local inference provider with llama.cpp backend and HuggingFace model management (#6933)
Conversation
Pull request overview
This PR introduces support for running local LLMs via Candle-backed inference, including backend provider integration, REST endpoints for managing model downloads, and desktop UI to configure and select local models.
Changes:
- Add a new `local` provider in the core `goose` crate that loads quantized GGUF models with architecture-aware handling (Llama, Phi, Phi-3) and custom chat templates, plus support for streaming responses.
- Expose new server routes and OpenAPI definitions for listing local models, starting/canceling downloads, querying download progress, and deleting downloaded models, reusing the shared download manager.
- Extend the desktop UI and API client to manage local LLM downloads, display progress, select an active local model, and surface dedicated guidance when the `local` provider is selected.
Reviewed changes
Copilot reviewed 18 out of 18 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| `ui/desktop/src/components/settings/models/subcomponents/SwitchModelModal.tsx` | Adds a special informational panel when the local provider is selected, guiding users to the Models settings to download local models instead of showing generic provider error UI. |
| `ui/desktop/src/components/settings/models/ModelsSection.tsx` | Embeds the new LocalInferenceSettings card into the Models settings page so users can manage local LLMs alongside cloud providers. |
| `ui/desktop/src/components/settings/localInference/LocalInferenceSettings.tsx` | New React component for listing available local LLMs, starting/canceling downloads with progress visualization, and persisting the selected local model and provider in config. |
| `ui/desktop/src/components/settings/dictation/LocalModelManager.tsx` | Adjusts the dictation local model manager's "show all / recommended only" logic and toggle label to match the new local LLM UI behavior. |
| `ui/desktop/src/api/types.gen.ts` | Extends generated types with LocalLlmModel, LocalModelResponse, ModelTier, and request/response types for the /local-inference/models endpoints. |
| `ui/desktop/src/api/sdk.gen.ts` | Adds typed client functions (listLocalModels, downloadLocalModel, getLocalModelDownloadProgress, cancelLocalModelDownload, deleteLocalModel) for the new local inference endpoints. |
| `ui/desktop/src/api/index.ts` | Re-exports the new local inference client functions and types so the UI can consume them through the existing API barrel. |
| `ui/desktop/openapi.json` | Updates the desktop-copied OpenAPI spec with the new /local-inference/models paths, schemas for LocalLlmModel, LocalModelResponse, and ModelTier, and ties them into the existing components section. |
| `crates/goose/src/providers/mod.rs` | Registers the new local_inference provider module in the providers namespace. |
| `crates/goose/src/providers/local_inference.rs` | Implements the LocalInferenceProvider with model metadata, recommendation logic, GGUF loading for Llama/Phi/Phi-3, prompt templating, and both non-streaming and streaming completion APIs. |
| `crates/goose/src/providers/init.rs` | Adds LocalInferenceProvider to the provider registry so it can be created and used like other built-in providers. |
| `crates/goose/src/dictation/download_manager.rs` | Generalizes the download manager to optionally set an arbitrary config key/value on successful download completion, so it can be reused for both dictation and local LLM models. |
| `crates/goose/Cargo.toml` | Registers new examples (candle_quantized, test_local_provider) to exercise the local quantized model and provider behavior. |
| `crates/goose-server/src/routes/utils.rs` | Treats the local provider as always "configured" in provider status checks, bypassing API key requirements since local models are configured via file downloads. |
| `crates/goose-server/src/routes/mod.rs` | Wires the new local_inference route module into the main router so its endpoints are served. |
| `crates/goose-server/src/routes/local_inference.rs` | Defines REST handlers for listing local models, starting downloads of model and tokenizer files, querying combined download progress, canceling downloads, and deleting model/tokenizer files. |
| `crates/goose-server/src/routes/dictation.rs` | Updates dictation model downloads to use the new generalized download manager signature and set the dictation config key on download completion. |
| `crates/goose-server/src/openapi.rs` | Registers the new local inference routes and schemas (LocalModelResponse, LocalLlmModel, ModelTier) with utoipa so they appear in the server's OpenAPI output. |
```rust
fn convert_error(e: anyhow::Error) -> ErrorResponse {
    let error_msg = e.to_string();

    if error_msg.contains("not configured") || error_msg.contains("not found") {
        ErrorResponse {
            message: error_msg,
            status: StatusCode::PRECONDITION_FAILED,
        }
    } else if error_msg.contains("already in progress") {
        ErrorResponse {
            message: error_msg,
            status: StatusCode::BAD_REQUEST,
        }
    } else {
        ErrorResponse::internal(error_msg)
    }
}
```
`convert_error` maps errors containing "not found" to HTTP 412 PRECONDITION_FAILED, but the OpenAPI annotations and generated TypeScript types for the local inference download-cancel route document a 404 "Download not found" error, so a missing download currently produces a status that does not match the declared API. Updating this branch to return 404 for missing downloads will align server behavior with the documented contract and client expectations.
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
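A minimal sketch of the remapping the review suggests, with `ErrorResponse` reduced to a plain struct and status codes as bare `u16` values for illustration (the real handler takes an `anyhow::Error` and uses axum's `StatusCode`):

```rust
// Illustrative stand-in for the server's ErrorResponse type.
#[derive(Debug)]
struct ErrorResponse {
    message: String,
    status: u16,
}

impl ErrorResponse {
    fn internal(message: String) -> Self {
        ErrorResponse { message, status: 500 }
    }
}

// Map error messages to statuses, checking "not found" first so a missing
// download yields 404 as documented, instead of falling into the 412 branch.
fn convert_error(error_msg: String) -> ErrorResponse {
    if error_msg.contains("not found") {
        ErrorResponse { message: error_msg, status: 404 }
    } else if error_msg.contains("not configured") {
        ErrorResponse { message: error_msg, status: 412 }
    } else if error_msg.contains("already in progress") {
        ErrorResponse { message: error_msg, status: 400 }
    } else {
        ErrorResponse::internal(error_msg)
    }
}
```

The key change is only the branch order and the 404 status; the other mappings stay as the original code had them.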
```rust
// Download tokenizer file (set config and provider when this completes)
// We'll set GOOSE_PROVIDER to "local" after the tokenizer download completes
// This is handled in the download_manager callback
```
The comment here says that GOOSE_PROVIDER will be set to "local" when the tokenizer download completes, but the download_model call only passes LOCAL_LLM_MODEL_CONFIG_KEY as the config key, so GOOSE_PROVIDER/GOOSE_MODEL are never updated in this path; either the comment should be corrected or the code should also set those config values if that behavior is required.
Suggested change:

```diff
-// Download tokenizer file (set config and provider when this completes)
-// We'll set GOOSE_PROVIDER to "local" after the tokenizer download completes
-// This is handled in the download_manager callback
+// Download tokenizer file and update the local LLM model config when this completes.
+// This is handled in the download_manager callback using LOCAL_LLM_MODEL_CONFIG_KEY.
```
```rust
// Special override
if metadata.name == "local" {
    return true;
```
The hardcoded check for provider name "local" bypasses the normal configuration check. While this may be intentional for local models that don't require auth, it should verify that at least one model is downloaded before returning true. Otherwise, the UI may show the provider as configured when no models are available for use.
Suggested change:

```diff
-// Special override
-if metadata.name == "local" {
-    return true;
+// Special override for local provider: only consider it configured if at least one model is available.
+if metadata.name == "local" {
+    if let Ok(loaded_provider) = load_provider(metadata.name.as_str()) {
+        if !loaded_provider.models.is_empty() {
+            return true;
+        }
+    }
```
```rust
) -> Option<String> {
    if let Ok(provider_guard) = self.provider.try_lock() {
        if let Some(provider) = provider_guard.as_ref() {
            if provider.get_model_config().context_limit() < 9 * 1024 * 1024 {
```
Context limit check has incorrect calculation. Line 1485 compares context_limit() against 9 * 1024 * 1024 (9 million tokens), but context_limit() returns token count, not bytes. This threshold is unrealistically high - even the largest models have ~200K token limits. This should be 9 * 1024 (9K tokens) instead.
Suggested change:

```diff
-if provider.get_model_config().context_limit() < 9 * 1024 * 1024 {
+if provider.get_model_config().context_limit() < 9 * 1024 {
```
```rust
eprintln!(
    "DEBUG: Loading platform extension '{}': {}",
    name, provider_state
```
Debug logging using eprintln! should be removed or converted to tracing::debug!. These debug statements were likely added during development and should not be left in production code.
Suggested change:

```diff
-eprintln!(
-    "DEBUG: Loading platform extension '{}': {}",
-    name, provider_state
+tracing::debug!(
+    "Loading platform extension '{}': {}",
+    name,
+    provider_state
```
```rust
let context_limit = context.get_context_limit();
eprintln!(
    "DEBUG: TodoClient::new - context_limit from provider: {:?}",
    context_limit
);
```
Debug logging using eprintln! should be removed or converted to tracing::debug!. These debug statements were likely added during development and should not be left in production code.
```rust
eprintln!(
    "DEBUG: AppsManagerClient::new - model: {:?}, context_limit: {:?}",
    model_name, context_limit
);

match context.require_min_context(10_000, EXTENSION_NAME) {
    Ok(_) => eprintln!("DEBUG: AppsManagerClient context check PASSED"),
    Err(e) => {
        eprintln!("DEBUG: AppsManagerClient context check FAILED: {}", e);
        return Err(e.to_string());
    }
}
```
Debug logging using eprintln! should be removed or converted to tracing::debug!. These debug statements were likely added during development and should not be left in production code.
```rust
crate::prompt_template::render_template("tiny_model_system.md", &context)
    .unwrap_or_else(|e| {
        // Fallback if template fails to load
        eprintln!("WARNING: Failed to load tiny_model_system.md: {:?}", e);
```
Error logging using eprintln! should use tracing::error! or tracing::warn! instead for consistency with the rest of the codebase.
Suggested change:

```diff
-eprintln!("WARNING: Failed to load tiny_model_system.md: {:?}", e);
+tracing::warn!("Failed to load tiny_model_system.md: {:?}", e);
```
```tsx
const pollDownloadProgress = (modelId: string) => {
  const interval = setInterval(async () => {
    try {
      const response = await getLocalModelDownloadProgress({ path: { model_id: modelId } });
      if (response.data) {
        const progress = response.data;
        setDownloads((prev) => new Map(prev).set(modelId, progress));

        if (progress.status === 'completed') {
          clearInterval(interval);
          await loadModels(); // Refresh model list
          // Auto-select the model that was just downloaded
          await selectModel(modelId);
        } else if (progress.status === 'failed') {
          clearInterval(interval);
          await loadModels();
        }
      } else {
        clearInterval(interval);
      }
    } catch {
      clearInterval(interval);
    }
  }, 500);
};
```
Memory leak: interval is not stored and cannot be cleared on component unmount. Store the interval ID in a ref or state and clear it in useEffect cleanup to prevent the polling from continuing after the component unmounts.
And only allow selection of models that have been downloaded
* origin/main: (54 commits)
  chore: strip posthog for sessions/models/daily only (#7079)
  tidy: clean up old benchmark and add gym (#7081)
  fix: use command.process_group(0) for CLI providers, not just MCP (#7083)
  added build notify (#6891)
  test(mcp): add image tool test and consolidate MCP test fixtures (#7019)
  fix: remove Option from model listing return types, propagate errors (#7074)
  fix: lazy provider creation for goose acp (#7026) (#7066)
  Smoke tests: split compaction test and use debug build (#6984)
  fix(deps): trim bat to resolve RUSTSEC-2024-0320 (#7061)
  feat: expose AGENT_SESSION_ID env var to extension child processes (#7072)
  fix: add XML tool call parsing fallback for Qwen3-coder via Ollama (#6882)
  Remove clippy too_many_lines lint and decompose long functions (#7064)
  refactor: move disable_session_naming into AgentConfig (#7062)
  Add global config switch to disable automatic session naming (#7052)
  docs: add blog post - 8 Things You Didn't Know About Code Mode (#7059)
  fix: ensure animated elements are visible when prefers-reduced-motion is enabled (#7047)
  Show recommended model on failture (#7040)
  feat(ui): add session content search via API (#7050)
  docs: fix img url (#7053)
  Desktop UI for deleting custom providers (#7042)
  ...
The shell tool returns two Content::text items: one for the assistant and one for the user. For short outputs both contain identical text. The callback in code_execution_extension was joining all text content without filtering by audience, causing the output to appear twice. Filter to only include content targeted at the assistant (or with no audience restriction) before joining text results.
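The fix described above can be sketched as follows, with `Content`, `Role`, and the audience field simplified into stand-in types (the actual types come from the MCP content model in the goose crates):

```rust
// Simplified stand-ins for the MCP content types used by the shell tool.
#[derive(Clone, Copy, PartialEq)]
enum Role {
    Assistant,
    User,
}

struct Content {
    text: String,
    // None means no audience restriction (visible to everyone).
    audience: Option<Vec<Role>>,
}

/// Join only the text items targeted at the assistant, or with no audience
/// restriction. Items addressed solely to the user are skipped, so the
/// duplicated user-facing copy of a short shell output is not joined twice.
fn join_for_assistant(items: &[Content]) -> String {
    items
        .iter()
        .filter(|c| {
            c.audience
                .as_ref()
                .map(|a| a.contains(&Role::Assistant))
                .unwrap_or(true)
        })
        .map(|c| c.text.as_str())
        .collect::<Vec<_>>()
        .join("\n")
}
```

With this filter, the assistant-targeted and unrestricted items come through once, and the user-only duplicate is dropped.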
* origin/main: (30 commits)
  docs: GCP Vertex AI org policy filtering & update OnboardingProviderSetup component (#7125)
  feat: replace subagent and skills with unified summon extension (#6964)
  feat: add AGENT=goose environment variable for cross-tool compatibility (#7017)
  fix: strip empty extensions array when deeplink also (#7096)
  [docs] update authors.yaml file (#7114)
  Implement manpage generation for goose-cli (#6980)
  docs: tool output optimization (#7109)
  Fix duplicated output in Code Mode by filtering content by audience (#7117)
  Enable tom (Top Of Mind) platform extension by default (#7111)
  chore: added notification for canary build failure (#7106)
  fix: fix windows bundle random failure and optimise canary build (#7105)
  feat(acp): add model selection support for session/new and session/set_model (#7112)
  fix: isolate claude-code sessions via stream-json session_id (#7108)
  ci: enable agentic provider live tests (claude-code, codex, gemini-cli) (#7088)
  docs: codex subscription support (#7104)
  chore: add a new scenario (#7107)
  fix: Goose Desktop missing Calendar and Reminders entitlements (#7100)
  Fix 'Edit In Place' and 'Fork Session' features (#6970)
  Fix: Only send command content to command injection classifier (excluding part of tool call dict) (#7082)
  Docs: require auth optional for custom providers (#7098)
  ...
Save uses tempfile + rename for atomic writes and fs2 advisory file locks for cross-process safety, preventing corruption from concurrent writes or crashes mid-write.
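A minimal sketch of the tempfile-plus-rename half of this pattern using only the standard library; the fs2 advisory locking is omitted, and `save_atomically` is an illustrative name rather than the actual function:

```rust
use std::fs;
use std::io::Write;
use std::path::Path;

fn save_atomically(path: &Path, contents: &[u8]) -> std::io::Result<()> {
    // Write to a sibling temp file in the same directory, so the final rename
    // stays on one filesystem (rename is only atomic within a filesystem).
    let tmp = path.with_extension("tmp");
    {
        let mut f = fs::File::create(&tmp)?;
        f.write_all(contents)?;
        // Flush to disk before the rename makes the new contents visible.
        f.sync_all()?;
    }
    // On POSIX, rename atomically replaces the destination: readers observe
    // either the old file or the new one, never a half-written file.
    fs::rename(&tmp, path)
}
```

A crash before the rename leaves the original file untouched; only the temp file is orphaned. The advisory lock in the real code additionally serializes writers across processes, which the rename alone does not.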
Avoids potential deadlock when spawn_blocking thread pool is saturated and another task holds the lock waiting for a slot.
…al inference Was awkwardly nested under dictation despite being used by both dictation and local inference. cleanup_partial_downloads already handles missing directories gracefully via if-let on read_dir.
jamadeo left a comment:
Very nice! I'm relying a bit on the copilot review here as well but this looks great to me.
```rust
) -> Option<String> {
    if let Ok(provider_guard) = self.provider.try_lock() {
        if let Some(provider) = provider_guard.as_ref() {
            if provider.get_model_config().context_limit() < 9 * 1024 * 1024 {
```
Is the intention here to omit this content for models with small context windows?
crates/goose/src/providers/local_inference/inference_emulated_tools.rs
Replace config_key/config_value params with a generic on_complete callback, so the download manager doesn't need to know about config.
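A hedged sketch of that refactor; `Download`, `OnComplete`, and `finish_download` are illustrative names, not the actual download-manager API:

```rust
// A boxed one-shot callback: the caller decides what happens on success.
type OnComplete = Box<dyn FnOnce(&str) + Send>;

struct Download {
    model_id: String,
    on_complete: Option<OnComplete>,
}

fn finish_download(mut d: Download) -> String {
    // ... file download would happen here ...

    // On success, invoke the caller-supplied callback. The local-inference
    // route can set its model config key here, dictation can set its own key,
    // and the manager itself never needs to know about config at all.
    if let Some(cb) = d.on_complete.take() {
        cb(&d.model_id);
    }
    d.model_id
}
```

Compared with threading `config_key`/`config_value` parameters through, the callback keeps the download manager decoupled from the config layer and lets each caller run arbitrary completion logic.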
Nothing reads this environment variable.
- Fix undefined 'featured' variable, use response.data instead
- Remove onFallbackRequest prop not in AppRendererProps
- Clean up unused imports and state
- Change MIN_CONTEXT_FOR_MOIM from 9MB (unreachable) to 32K tokens
- Remove embedding models from expected model list (filtered by tool_call)
Were these changes to the MCP App renderer intentional? This PR removed @alexhancock's work that was merged in #7039.
PR #6933 (local inference provider) was based on a stale branch that didn't include #7039's sampling changes to McpAppRenderer.tsx. When it merged, it silently reverted all client-side sampling plumbing: - RequestHandlerExtra and JSONRPCRequest imports - SamplingCreateMessageParams/Response type imports - apiHost/secretKey state + initialization useEffect - handleFallbackRequest callback (sampling/createMessage handler) - onFallbackRequest prop on AppRenderer The server-side route (sampling.rs), types.ts, and chat.html all survived — only the McpAppRenderer.tsx wiring was lost. This restores the exact code that #7039 added and #6933 removed.
* main: (46 commits)
  chore(deps): bump hono from 4.11.9 to 4.12.0 in /ui/desktop (#7369)
  Include 3rd-party license copy for JavaScript/CSS minified files (#7352)
  docs for reasoning env var (#7367)
  docs: update skills detail page to reference Goose Summon extension (#7350)
  fix(apps): restore MCP app sampling support reverted by #6933 (#7366)
  feat: TUI client of goose-acp (#7362)
  docs: agent variable (#7365)
  docs: pass env vars to shell (#7361)
  docs: update sandbox topic (#7336)
  feat: add local inference provider with llama.cpp backend and HuggingFace model management (#6933)
  Docs: claude code uses stream-json (#7358)
  Improve link confirmation modal (#7333)
  fix(ci): deflake smoke tests for Google models (#7344)
  feat: add Cerebras provider support (#7339)
  fix: skip whitespace-only text blocks in Anthropic message (#7343)
  fix(goose-acp): heap allocations (#7322)
  Remove trailing space from links (#7156)
  fix: detect low balance and prompt for top up (#7166)
  feat(apps): add support for MCP apps to sample (#7039)
  Typescript SDK for ACP extension methods (#7319)
  ...
* origin/main: (21 commits)
  feat(ui): show token counts directly for "free" providers (#7383)
  Update creator note (#7384)
  Remove display_name from local model API and use model ID everywhere (#7382)
  fix(summon): stop MOIM from telling models to sleep while waiting for tasks (#7377)
  Completely pointless ascii art (#7329)
  feat: add Neighborhood extension to the Extensions Library (#7328)
  feat: computer controller overhaul, adding peekaboo (#7342)
  Add blog post: Gastown Explained: How to Use Goosetown for Parallel Agentic Engineering (#7372)
  docs: type-to-search goose configure lists (#7371)
  docs: search conversation history (#7370)
  fix: stderr noise (#7346)
  chore(deps): bump hono from 4.11.9 to 4.12.0 in /ui/desktop (#7369)
  Include 3rd-party license copy for JavaScript/CSS minified files (#7352)
  docs for reasoning env var (#7367)
  docs: update skills detail page to reference Goose Summon extension (#7350)
  fix(apps): restore MCP app sampling support reverted by #6933 (#7366)
  feat: TUI client of goose-acp (#7362)
  docs: agent variable (#7365)
  docs: pass env vars to shell (#7361)
  docs: update sandbox topic (#7336)
  ...

  # Conflicts:
  #	Cargo.lock
… HuggingFace model management (block#6933)" This reverts commit ddd35f6.
…kend and HuggingFace model management (block#6933)"" This reverts commit b9bd830.
Summary
Adds a local inference provider that enables running language models directly on-device using llama.cpp, with full integration across CLI, server, and desktop UI.
Core Changes
- **Local inference provider** (`crates/goose/src/providers/local_inference/`): New provider with modular architecture split into submodules:
  - `inference_native_tools`: Inference path for models with native tool-calling support
  - `inference_emulated_tools`: Inference path for models without native tool support, using text-based `$ command` and `` ```execute `` block detection
  - `inference_engine`: Shared inference primitives: context management, sampling, and token generation
  - `tool_parsing`: Parsing tool calls from model output (JSON and XML formats)
  - `hf_models`: HuggingFace Hub model search and GGUF file discovery
  - `local_model_registry`: Local model download and lifecycle management
- **Server routes** (`crates/goose-server/src/routes/local_inference.rs`): REST endpoints for model download, listing, deletion, and status
- **CLI integration** (`crates/goose-cli/`): New `local-models` subcommand for managing downloaded models, plus provider selection support
- **Desktop UI** (`ui/desktop/src/components/`):
  - `LocalModelSetup.tsx`: First-run setup flow for downloading models
  - `LocalInferenceSettings.tsx`: Settings panel for model management
  - `HuggingFaceModelSearch.tsx`: Search and download models from HuggingFace
  - `ModelSettingsPanel.tsx`: Per-model configuration

Additional Changes
Known Limitations (deferred to future iterations)
- **Memory management is best-effort:** `estimate_max_context_for_memory` uses 50% of available memory for KV cache, which is a rough heuristic. `available_inference_memory_bytes` picks the max of any single accelerator device, which may not be correct for multi-GPU setups or unified memory architectures. On unified-memory Macs, free memory fluctuates with system load, so a model that fits during estimation may OOM during inference. There is currently no fallback or recovery for OOM; it will crash the process.
- **Model unloading strategy is naive:** When loading a new model, all other models are unloaded. This is fine for single-model usage, but the code structure (a `HashMap` of model slots) suggests multi-model was considered. A smarter eviction strategy (e.g. LRU) could be added if multi-model support is needed.
- **Featured models list is hardcoded:** The `FEATURED_MODELS` constant in `local_model_registry.rs` is a static list of HuggingFace model specs. There is no mechanism to update it without a code change, so it will go stale as newer/better models are released. A future iteration could fetch this list from a remote config or allow user customization.
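As a rough illustration of the 50%-of-memory KV-cache heuristic mentioned above (the function name mirrors `estimate_max_context_for_memory`, but this exact signature and the per-token byte figure used in the example are assumptions for the sketch, not the actual implementation):

```rust
/// Estimate how many context tokens fit when half of available memory is
/// reserved for the KV cache. This is the rough heuristic described in the
/// limitations section, not a guarantee: on unified-memory systems the
/// "available" figure can shrink between estimation and inference.
fn estimate_max_context_for_memory(
    available_bytes: u64,
    kv_bytes_per_token: u64,
) -> u64 {
    let kv_budget = available_bytes / 2; // 50% heuristic
    kv_budget / kv_bytes_per_token
}
```

For example, with 16 GiB reported available and an assumed 512 KiB of KV cache per token, the budget is 8 GiB, which works out to 16,384 tokens of context.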