Welcome! This repo contains everything you need to follow along during the Local AI Workshop. By the end, you'll have a fully local LLM setup including:
- Running models with Ollama
- Integrating models with VS Code
- Building a local agent server with a simple Web + CLI interface
If you'd like to follow along, please complete the following setup before the session.
If you don't already have Homebrew:

```bash
/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"
```

We'll use uv, a superfast Python package and virtual environment manager.
Install uv:
```bash
brew install uv
```

Once the repo is cloned, check the relevant README.md in each service for service-specific setup.
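Each service is its own uv project with a `pyproject.toml`, so setup will typically look something like the sketch below; this is an illustrative outline only, and the entry point shown is an assumption, so defer to each service's README.

```bash
# Rough per-service setup with uv (check each service's README for specifics).
cd agent-web-interface   # or agent-cli, fastmcp-util-server, mcp-sentiment-server
uv sync                  # create the virtual environment and install dependencies from pyproject.toml
uv run main.py           # run the service; the actual entry point may differ per service
```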
Please install the following extensions:
- Continue - Highly configurable Agent interface
- Cline (optional) - Heavy-duty task and workflow agent interface
- GitHub Copilot (optional)
Install Ollama:
```bash
brew install ollama
```

Start the Ollama daemon:

```bash
ollama serve
```

You could also install the GUI version, but you will be using the command line anyway to add models. Additionally, running the server via the command line gives access to more logs.
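As a quick sanity check that the daemon is up (assuming the default port of 11434), you can query it from another terminal:

```bash
# The root endpoint responds with "Ollama is running" when the server is up.
curl http://localhost:11434
# List models currently loaded into memory (none is expected at this point).
ollama ps
```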
Models tend to be large and may take time to download.
The models I use most often:
```bash
ollama pull qwen3:8b
ollama pull qwen2.5-coder:1.5b
ollama pull qwen2.5-coder:7b
ollama pull nomic-embed-text
```

This is my current go-to set and is based on having 18GB of unified memory - if you have less you will need to scale down. I have also managed to get qwen3:14b running for some tasks.
Take a look online for recommendations or the model specs, or read the next section.
Some important hyperparameters of models include:
- Number of parameters
- Context window size
- Quantization level
These factors determine memory and/or VRAM usage, and therefore which models you can run. Turning one down can give you headroom to increase another.
Consider your use case requirements when making a decision. The VRAM estimator can help with calculations. Ollama also provides a rough model selection guide.
For model selection, also consider the model's intended purpose and whether it has tool-calling capabilities. You can search for models on the Ollama website.
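You can also inspect the specs of a model you have already pulled; `ollama show` prints details such as the parameter count, context length, and quantization level (qwen3:8b below is just an example):

```bash
# Print architecture, parameter count, context length, quantization, etc. for a local model.
ollama show qwen3:8b
```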
Generally speaking, my experience and recommendations are as follows:
- qwen3:8b & qwen3:14b
- The only open source Ollama models that work in Chat, Plan and Agent modes
- Good for plan and act approaches
- Sometimes need to be told to think concisely and to avoid thought loops
- Struggle with full context windows
- Be careful with registering too many rules and/or tools as they will be included in every system prompt.
- Benefit from frequent session resets
- Are very receptive to Markdown
- qwen2.5-coder:1.5b
- Nice small model for autocompletes - you may need to adjust timeouts and other settings for them to succeed
- qwen2.5-coder:7b
- Strong performing coder and relatively fast.
- Unlike qwen3, it doesn't have a thinking stage, which means you are less likely to get the LLM's thoughts in your edits
- Good for edits and applies
- nomic-embed-text
- Recommended by Ollama, continue.dev and others as a replacement for the default embeddings used for the @codebase index
Once cloned, this repo will include at the root:
- `agent-cli/`: Command-line interface tools for interacting with the agent. Contains `.python-version`, `README.md`, `package.json`, `pyproject.toml`, and an `agent/` directory with `PROMPT.md` and `agent.json`.
- `agent-web-interface/`: Web-based interface or API for managing the agent. Includes `agent_config.py`, `host.py`, `main.py`, `.python-version`, `README.md`, and `pyproject.toml`.
- `fastmcp-util-server/`: Backend server for utility functions related to fastmcp. Contains `main.py`, `.python-version`, `README.md`, `pyproject.toml`, and `server-info.json`.
- `mcp-sentiment-server/`: Server for sentiment analysis or machine learning tasks (MCP). Includes `main.py`, `.python-version`, `Dockerfile`, `README.md`, and a `pyproject.toml`.
- `.continue/`: Configuration directory for the Continue extension, defining MCP Components.
- `.taskmaster/`: Taskmaster-ai workspace for managing tasks and workflows. Contains `state.json`, `templates/`, `tasks/`, and `docs/`.
- `.cursor/`: Likely related to code navigation or UI components (e.g., cursor-based tools). Contains `mcp.json` and a `rules/` directory with files like `cursor_rules.mdc`, `self_improve.mdc`, and workflow definitions.
- `.clinerules/`: Configuration directory for Cline rules (code formatting/linting). Contains `documentation_guidelines.md` and `expand-requirements-to-mcp-prompt.md`.
- `.env.example`: Sample environment variables for development setup.
- `docker-compose.yml`: Docker configuration for containerized services (not complete).
- `.clineignore`: File specifying files to ignore for Cline tools.
- `Makefile`: Build automation for compiling and packaging the project.
- `memory.jsonl`: Stores memory data for `@modelcontextprotocol/server-memory` (can be configured).
- `README.md`: This file (current location).
- `GOTCHAS.md`: Notes on common pitfalls, setup issues, or best practices.
- `docs/definitions.md`: Definitions and explanations of key concepts.
- `docs/report.md`: Report or analysis documentation.
- `TODO`: List of tasks or improvements to be addressed.
This repository is designed to give examples of how to implement MCP Components and Capabilities, along with Agents to use them, and integrate this within VS Code. Some of the below and other aspects are inspired by https://huggingface.co/mcp-course.
The three Components in MCP are:
- Host
- Client
- Server
The user-facing AI application that end-users interact with directly. Examples include Anthropic's Claude Desktop, AI-enhanced IDEs like Cursor, inference libraries like the Hugging Face Python SDK, or custom applications built in libraries like LangChain or smolagents. Hosts initiate connections to MCP Servers and orchestrate the overall flow between user requests, LLM processing, and external tools.
Some examples of hosts within this repository are:
- `agent-web-interface`: a simple chat host with a web interface
- `agent-cli`: an extremely lightweight host with a CLI
- `continue`: VS Code extension, configured via `.continue`
- `cline`: VS Code extension
A component within the host application that manages communication with a specific MCP Server. Each Client maintains a 1:1 connection with a single Server, handling the protocol-level details of MCP communication and acting as an intermediary between the Hostโs logic and the external Server.
Clients usually sit within a Host, so the implementation details may not be evident, but you will notice that Hosts often expect configuration of MCP Servers and LLMs, and this configuration is used by the Client.
Some examples of clients configurations within this repository are:
- `agent-web-interface/agent_config.py`
- `agent-cli/agent/agent.json`
- `.continue/assistants` and `.continue/mcpServers`
Cline also exposes similar configuration through its own settings UI.
An external program or service that exposes capabilities (Tools, Resources, Prompts) via the MCP protocol.
Some examples of local Servers defined within this repository are:
- `mcp-sentiment-server`: a simple Gradio SSE MCP server with a sentiment-analysis tool (which is not very good)
- `fastmcp-util-server`: a modern fastmcp HTTP-streaming MCP server with local util tools
Some examples of external Servers defined within this repository are:
- `@modelcontextprotocol/server-memory`: Knowledge Graph memory for an agent
- `taskmaster-ai`: Task management tool for agents
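These external Servers are normally launched on demand by the Client over STDIO, but as a quick sanity check you can start one manually; it will simply sit waiting for MCP messages on stdin.

```bash
# Fetch and run the Knowledge Graph memory server over STDIO (CTRL-C to exit).
npx -y @modelcontextprotocol/server-memory
```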
MCP defines the following Capabilities:
- Tools: Executable functions that the AI model can invoke to perform actions or retrieve computed data
- Resources: Read-only data sources that provide context without significant computation.
- Prompts: Pre-defined templates or workflows that guide interactions between users, AI models, and the available capabilities.
- Sampling: Server-initiated requests for the Client/Host to perform LLM interactions, enabling recursive actions where the LLM can review generated content and take further action.
I think that `.continue` best exhibits the configuration of these capabilities:
- `.continue/mcpServers` exposes Tools
- `.continue/docs` and various `.md` files in the repo act as Resources that provide context
- `.continue/prompts` and `.continue/rules` are Prompts that initiate and constrain, respectively, interactions with the AI models
The following might be useful for debugging.
It is incredibly important to monitor your memory usage and adjust your prompts/models appropriately. Use the Activity Monitor program to view memory usage via the Memory tab. This also helps you to identify if any models are secretly running in the background.
Almost all interactions will be viewable via the Developer Console. To open it and view debug logs, which contain extra information:

- cmd+shift+P
- Search for and then select "Developer: Toggle Developer Tools"
- Select the Console tab
- Click the dropdown at the top that says "Default levels" and select "Verbose"
- Read the console logs
Continue allows us to configure data stores in which all, or a selection of, agent events are stored.
For example, this config sends all events to a directory in this repo:
```yaml
data:
  - name: Local Data Bank
    destination: file:///Users/conor.fehilly/Documents/repos/mcp-examples/.continue/data_stores
    schema: 0.2.0
    level: all
```

There will be a jsonl file for each event type:

- `chatFeedback.jsonl`
- `chatInteraction.jsonl`
- `editInteraction.jsonl`
- `editOutcome.jsonl`
- `tokensGenerated.jsonl`
- `toolUsage.jsonl`
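If you want to inspect these events, something like the following works, assuming the destination configured above and that you have jq installed (plain cat works too):

```bash
# Show the last few tool-usage events from the local data store (path depends on your destination).
find .continue/data_stores -name "toolUsage.jsonl" -exec tail -n 5 {} \; | jq .
```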
Ensure you run Ollama via the command line:
```bash
ollama serve
```

This ensures you can view the server logs easily.
You can check for running models via:
```bash
ollama ps
```

Sometimes stopping execution via Hosts isn't enough and the model will still be running. In this case you can stop running models via Ollama, for example:
```bash
ollama stop qwen3:14b
```

And if that doesn't work, press CTRL-C in the terminal running the Ollama server.

As described at https://docs.continue.dev/troubleshooting, to enable the Continue Console:
- cmd+shift+P -> Search for and then select "Continue: Enable Console" and enable it
- cmd+shift+P -> Reload the window
- cmd+shift+P -> Search for "Continue: Focus on Continue Console View" to open the Continue Console
MCP Inspector provides an easy way to interact with MCP servers via a Web UI. You should use the same transports/entry commands for the MCP server as you have configured for the Host you are using.
You can run it via:
```bash
npx @modelcontextprotocol/inspector
```

Alternatively, fastmcp-util-server has MCP Inspector support by default, but only over STDIO, which you can access by:
```bash
cd fastmcp-util-server
uv run fastmcp dev main.py:mcp
```

MCP Inspector will open in your browser on http://localhost:6274/ by default.
In some situations the following can improve performance.
For Ollama, Continue recommends using nomic-embed-text as the embeddings model:
```yaml
- name: Nomic Embed Text
  provider: ollama
  model: nomic-embed-text
  roles:
    - embed
  apiBase: http://0.0.0.0:11434
```

Before launching the Ollama server run:
```bash
export OLLAMA_FLASH_ATTENTION=1
```

This may have been switched on by default depending on your model. When on, the attention mechanism changes to one which uses less VRAM and is quicker.
This controls how the context cache is quantised:
- `f16` - high precision and memory usage (default).
- `q8_0` - 8-bit quantization, uses approximately 1/2 the memory of f16 with a very small loss in precision; this usually has no noticeable impact on the model's quality (recommended if not using f16).
- `q4_0` - 4-bit quantization, uses approximately 1/4 the memory of f16 with a small-medium loss in precision that may be more noticeable at higher context sizes.
Quantisation reduces the context cache memory footprint and VRAM usage, enabling larger context usage. Set this to the type of quantisation you want:
```bash
export OLLAMA_KV_CACHE_TYPE="q8_0"
```

OLLAMA_FLASH_ATTENTION must be set for this to take effect.
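Putting the two settings together, a typical launch might look like this sketch (the cache type is just an example; omit it to stay on f16):

```bash
# Enable flash attention and 8-bit KV-cache quantisation, then start the server.
export OLLAMA_FLASH_ATTENTION=1
export OLLAMA_KV_CACHE_TYPE="q8_0"
ollama serve
```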
For more details on when and how you should or should not use quantisation, read this article, which also has an interactive VRAM estimator.
Context size, and how much of that context is being used, have quite a few impacts on performance - both on tokens per second and the content that is generated.
Smaller context size will mean fewer tokens can be processed in a single interaction, which can lead to faster response times but may limit the model's ability to understand longer or more complex conversations. This is particularly important in scenarios where the user expects concise, direct answers rather than nuanced, multi-turn dialogues.
With a larger context size you can process more tokens in a single interaction, which can improve the model's ability to understand and generate more coherent, contextually rich responses but may increase latency and resource usage. This is beneficial for tasks requiring deep reasoning, multi-step problem-solving, or maintaining context across extended interactions, but may not be optimal for real-time or low-latency applications.
You should consider the following when deciding on context size:
- Available VRAM: Larger context sizes require more memory, so ensure your system can handle the chosen model's requirements.
- Use Case: Prioritize speed for rapid, straightforward queries or richer context for complex tasks.
- Quantization: Pair larger context sizes with lower-precision quantization (e.g., q4_0) to balance performance and resource constraints.
- Model Capabilities: Some models are optimized for specific context lengths; consult the model's documentation for recommended settings.
- User Experience: Balance response quality with interaction speed based on your application's goals.
By carefully tuning context size alongside quantization and flash attention settings, you can optimize both performance and output quality for your specific workload.
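One way to experiment locally is to build a variant of a model with a larger context window via an Ollama Modelfile; the sketch below is illustrative, and the 16384 value and qwen3-16k name are arbitrary examples (your Host may also expose a per-model context length setting, which is often the simpler route).

```bash
# Sketch: create a qwen3:8b variant with a 16k context window (values are examples).
cat > Modelfile <<'EOF'
FROM qwen3:8b
PARAMETER num_ctx 16384
EOF
ollama create qwen3-16k -f Modelfile
ollama run qwen3-16k
```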
Continue CLI Docs
Need help before we start? Reach out to me on gchat.